Inner workings of 'copy' - is the server queried for each file?

Hi,

I’m looking into using Rclone for personal backup with B2 integration. I’m wondering how the rclone copy command detects that a particular file has already been uploaded, or that an uploaded file has changed. Does it send a query to the server, or is this information cached internally? This is important because B2 charges for transactions, and AFAIK retrieving information about a file is a transaction. If I have many files and want to sync often, I will be charged even if no files have actually changed!

Regardless, this looks like a great tool!

If you do a normal rclone copy local remote: then on B2 the directory listing is sufficient to answer all the questions rclone might have about whether a file exists or not - it doesn’t query the server for each file.

You can verify this for yourself by adding the -v --dump-headers flags to rclone - then you can see exactly which HTTP requests it makes. Or use --dump-bodies for even more info!
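For example, something like this (the local path and remote name here are just placeholders):

rclone copy -v --dump-headers /path/to/files b2remote:mybucket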

Not all remotes play as nicely as b2 - for instance s3 and swift need to do an additional query per object to read the modification time.

Thank you for the quick response! Your answer seems to make sense when I look at the B2 API. b2_list_file_names lists 1000 files per query (including modification time like you said). So I should be getting one query/transaction per 1K files. Might be worth putting this into the B2 docs.

Yes that sounds right.

It might indeed. Fancy sending me a pull request?

Sure, I just need to find some time to play around with it. I want to check how this works when encryption is used. Should happen this week.

Off topic: the encryption docs don’t make it clear whether the encrypted file is stored in memory or on disk. Should we expect any problems when encrypting large files, or large directories?

The encryption streams everything, so the files are never held in memory, or on disk, except for B2!

B2 needs to know the SHA1 hash of a file before uploading it (for non-chunked uploads). For transfers from local disk or compatible remotes (like onedrive & acd which use SHA1) the file is streamed. However, when uploading from an incompatible remote which can’t provide SHA1 sums in advance (like crypt), the b2 code will stream the file to a temporary file on disk, calculate the SHA1, then upload the file - but only for files smaller than --b2-upload-cutoff; files larger than this will be chunked and uploaded without a temporary file. The B2 remote will store temporary files in your OS’s normal place for temporary files, or wherever the TMPDIR environment variable points.
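To make that small-file path concrete, here is a minimal Go sketch of the idea - not the actual rclone code, just the general pattern of spooling the stream to a temporary file while hashing it, so the SHA1 can be declared before the upload starts:

package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// spoolAndHash copies src to a temporary file (created under TMPDIR or the
// OS default temp directory) and returns the temp file rewound to the start
// together with its SHA1 hex digest, ready for an upload that must declare
// the checksum up front.
func spoolAndHash(src io.Reader) (*os.File, string, error) {
	tmp, err := os.CreateTemp("", "b2-upload-")
	if err != nil {
		return nil, "", err
	}
	h := sha1.New()
	// Tee the stream so the hash is computed while spooling to disk.
	if _, err := io.Copy(tmp, io.TeeReader(src, h)); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return nil, "", err
	}
	if _, err := tmp.Seek(0, io.SeekStart); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return nil, "", err
	}
	return tmp, hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	tmp, sum, err := spoolAndHash(strings.NewReader("example encrypted payload"))
	if err != nil {
		panic(err)
	}
	defer os.Remove(tmp.Name())
	defer tmp.Close()
	fmt.Println("sha1:", sum) // the real code would now upload from tmp, sending sum as the checksum
}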

That should go in the docs too!

I created a config with B2, along with a directory with a few files to upload, and copied the files to B2. Then I copied them again to see which requests are sent.

/b2api/v1/b2_authorize_account
/b2api/v1/b2_create_bucket
/b2api/v1/b2_list_buckets
/b2api/v1/b2_list_file_names

B2 counts this as 4 class C transactions. Is calling create_bucket before list_buckets intentional? Maybe this is universal across many providers, but for B2 it would seem better to first list_buckets and then create if necessary.

I’m having problems with copying to an encrypted B2 config:

myfile.txt: Failed to copy: Sha1 did not match data received (400 bad_request)

rclone tries to create the bucket since it needs one anyway, and B2 will give an error if it already exists. Unfortunately, with the B2 API, list_buckets is the only way to turn a bucket name into a bucket ID. Make an issue about this if you want and I’ll look into optimising it.

Have you tried the latest beta? http://beta.rclone.org/v1.33-72-ga4a44a4/ - I think I fixed that problem already.

This explains why you need to call list_buckets after creating a bucket. I think it would be safe to assume that the bucket will exist most of the time. Currently rclone always tries to create the bucket, and then gets its id via list_buckets. You could change this to call list_buckets first: if the bucket exists we already have the id; if it doesn’t, we create it and list again. I think this change is marginally beneficial - we get one fewer request when the bucket exists, but one more if it doesn’t.
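Roughly what I have in mind, as a Go sketch - listBuckets and createBucket here are just made-up stand-ins for the real b2_list_buckets / b2_create_bucket calls, backed by an in-memory map so the example runs:

package main

import "fmt"

// Pretend server state: bucket name -> bucket ID.
var remote = map[string]string{"existing-bucket": "id-1234"}

func listBuckets() map[string]string { return remote }

func createBucket(name string) { remote[name] = "id-" + name }

// bucketID resolves a bucket name to an ID: one list call when the bucket
// already exists, a create plus a second list when it doesn't.
func bucketID(name string) string {
	if id, ok := listBuckets()[name]; ok {
		return id // common case: one request
	}
	createBucket(name)
	return listBuckets()[name] // rare case: three requests instead of two
}

func main() {
	fmt.Println(bucketID("existing-bucket")) // list only
	fmt.Println(bucketID("new-bucket"))      // list, create, list
}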

Thanks, no problems when using this build.

If you’d like to see that change then make an issue at https://github.com/ncw/rclone/issues/new

One more issue before I update the docs.

When copying a directory to B2, with no changes in the directory, 4 requests are sent:

b2_authorize_account
b2_create_bucket
b2_list_buckets
b2_list_file_names

However, when we set up an encrypted B2 remote, we get 7 requests for the same scenario:

b2_authorize_account
b2_list_buckets
b2_list_file_names
b2_authorize_account
b2_create_bucket
b2_list_buckets
b2_list_file_names

Seems like a bug?

Hmm, yes, that is a bit of an inefficiency to do with rclone checking whether you were asking for a file or a directory.

When you do

rclone copy secret:object

rclone has to work out whether object is a file or a directory. It is doing this in an inefficient way at the moment by recreating the Fs. If you make an issue I’ll look at fixing it.

I’m working on updating the docs. One thing I want to get straight is what happens when we use B2+Crypt and chunked uploads. You wrote that:

files larger than this (--b2-upload-cutoff) will be chunked and uploaded without a temporary file.

However, the B2 API requires a checksum for each part. So I would expect each chunk to be copied to a temp file so that we can calculate the checksum of each part. Is this correct?

The parts for a chunked upload are held in memory, not on disk. The chunks are 96 MB by default so fairly large.
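In rough Go terms (just a sketch, not the actual rclone code), the pattern is to read each part into a RAM buffer and hash it there before it would be sent as one upload-part request. The chunk size is tiny here for demonstration; the default mentioned above is 96 MB:

package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// uploadParts reads src in chunkSize pieces, hashing each piece in memory.
// A real uploader would send buf[:n] with its SHA1 as one part of a large
// file upload instead of just printing it.
func uploadParts(src io.Reader, chunkSize int) error {
	buf := make([]byte, chunkSize)
	for part := 1; ; part++ {
		n, err := io.ReadFull(src, buf)
		if n > 0 {
			sum := sha1.Sum(buf[:n])
			fmt.Printf("part %d: %d bytes, sha1 %s\n", part, n, hex.EncodeToString(sum[:]))
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // last (possibly short) part done
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	// 25 bytes with a 10-byte chunk size -> parts of 10, 10 and 5 bytes.
	if err := uploadParts(strings.NewReader(strings.Repeat("x", 25)), 10); err != nil {
		panic(err)
	}
}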

How about the number of transactions for a sync operation with B2? I have 300,000 files on my root file system, for about 10 GB of data. If I run one sync operation, how many chargeable transactions should I expect? What about Google Cloud Storage, which also charges for transactions?
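My rough guess from the numbers earlier in the thread: 300,000 files at 1,000 per b2_list_file_names call would be about 300 class C transactions for the listing alone, plus the handful of per-run requests shown above - but I’d like to confirm that before I put it in the docs.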