Rclone copy to Azure efficiency

tterry · March 13, 2020, 9:46pm

Hi

I'm looking into using rclone copy for a one-way sync from a local mounted drive up to Azure Blob Storage.

Since I have a very large number of files, my question is this: does the rclone client call out to Azure for every file to get the md5sum in order to decide whether to upload, or does it keep some kind of local cache of such values?

Thanks,
TT

asdffdsa · March 13, 2020, 11:54pm

hello and welcome to the forum

have you read this?
https://rclone.org/commands/rclone_copy/

thestigma · March 14, 2020, 2:28am

If you want to sync you should probably use the sync command rather than the copy command.
If you are unsure about the differences I recommend you read the documentation first. The gist of it is that copy will never delete anything unless it is to overwrite a file with the same name in the same place.
sync will delete any files that no longer exist on the source (thus making an exact clone of your data on the other side).
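
To make the difference concrete, here is a minimal Python sketch of how the two modes could plan their work. It is an illustration only, not rclone's actual code: the `plan` function and the "fingerprint" values (standing in for whatever attributes are compared) are hypothetical.

```python
def plan(source, dest, mode):
    """Return (to_upload, to_delete) for a one-way transfer.

    source and dest map relative path -> fingerprint (a hypothetical
    stand-in for the attributes being compared).
    mode is "copy" or "sync".
    """
    # Both modes upload anything that is missing or differs on the destination.
    to_upload = sorted(p for p, fp in source.items() if dest.get(p) != fp)
    # Only sync removes destination files that no longer exist on the source.
    if mode == "sync":
        to_delete = sorted(p for p in dest if p not in source)
    else:
        to_delete = []
    return to_upload, to_delete


source = {"a.txt": "v1", "b.txt": "v2"}
dest = {"b.txt": "v1", "old.txt": "v1"}

print(plan(source, dest, "copy"))  # uploads a.txt and b.txt, deletes nothing
print(plan(source, dest, "sync"))  # same uploads, and also deletes old.txt
```

In both modes b.txt is overwritten because it differs; only sync touches old.txt.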

rclone can cache file attributes - but probably not in the way you mean.
It would not be very safe to sync based on old, unverified cached data, as the files on the cloud could have changed since last time.

But no - rclone does not need to ping every file and ask for its attributes.
While I am not very familiar with Azure in particular, basically all cloud storage uses listings (and from what I can see AzureBlob does also). rclone will ask for the listing data from the server (which is just a little text that contains all names, hashes and other attribute data). This listing will be for many files at once, usually all files in a logical folder per request. The process is multithreaded as far as your settings allow and the server can handle. If the cloud system supports fast-list (also called recursive listing) then rclone can ask the server for a recursive listing of whole folder trees all at once. This greatly increases efficiency. It can easily be 15-20x faster than not using it (it does use some extra memory though, but only as much as it takes to store these listing texts).

The result is that you can pretty easily get a lot of file-attributes (which rclone then makes choices based upon) from relatively few API calls, and at a pretty good speed. If your folder hierarchy is relatively flat then it will be even more efficient. In AzureBlob it does up to 5000 files in a single listing request by default (configurable).
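
As a rough back-of-the-envelope model of the above, the Python sketch below estimates how many listing requests a run might need. It assumes one request per chunk of up to 5000 entries, at least one request per folder when listing folder by folder, and that a recursive listing is paged with the same chunk size. The function name and these simplifications are assumptions for illustration, not a description of rclone's internals.

```python
import math


def list_requests(files_per_dir, chunk=5000, recursive=False):
    """Estimate the number of listing requests.

    files_per_dir: list with the file count of each folder.
    chunk: maximum entries returned per listing request.
    recursive: True models a single recursive listing of the whole tree.
    """
    if recursive:
        # One paged listing over every file in the tree.
        return max(1, math.ceil(sum(files_per_dir) / chunk))
    # One paged listing per folder; even an empty folder costs one request.
    return sum(max(1, math.ceil(n / chunk)) for n in files_per_dir)


tree = [200] * 1000  # 1000 folders with 200 files each, 200,000 files total
print(list_requests(tree))                  # folder-by-folder listing
print(list_requests(tree, recursive=True))  # recursive listing
```

With this example tree the model gives 1000 requests folder by folder versus 40 with a recursive listing, which is why a deep tree of small folders benefits the most and a flat layout is already efficient.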

It would be a good idea to read through the documentation page for AzureBlob if you haven't already:
https://rclone.org/azureblob/

Can I ask what the background for the question is?
Are you worried about what it will cost you to sync based on operation-type? Or is the worry how it will perform? Knowing your motivation will make it easier for me to give you pertinent advice and recommendations.

ncw · March 14, 2020, 9:54am

Rclone will, by default, check the size and the modification time of each file to see if it needs to be uploaded - this is quick to read locally and no cache is needed. When rclone uploads a file to Azure it sets the modification time, and this is read back in the directory listings rclone requests.
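
The default check described above can be sketched in Python roughly as follows. This is a simplified illustration rather than rclone's real implementation: `needs_upload` and its `window` tolerance parameter are hypothetical names, and the remote values are assumed to come from a directory listing.

```python
from datetime import datetime, timedelta, timezone


def needs_upload(local_size, local_mtime, remote, window=timedelta(seconds=1)):
    """Decide whether a local file should be uploaded.

    remote is None when the listing has no entry for the file,
    otherwise a (size, mtime) tuple taken from the listing.
    window is a hypothetical tolerance for timestamp precision.
    """
    if remote is None:
        return True  # not on the destination yet
    remote_size, remote_mtime = remote
    if local_size != remote_size:
        return True  # size differs, so the content must differ
    # Same size: upload only if the modification times disagree.
    return abs(local_mtime - remote_mtime) > window


t = datetime(2020, 3, 13, tzinfo=timezone.utc)
print(needs_upload(10, t, None))                          # missing remotely
print(needs_upload(10, t, (10, t)))                       # unchanged
print(needs_upload(10, t + timedelta(hours=1), (10, t)))  # modified later
```

Nothing here needs a hash or a per-file request: the local values come from the filesystem and the remote ones from the listing.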

Using --fast-list with azureblob will use fewer transactions but will use more memory locally.

Note that on S3 and Swift, reading the modification time does take an extra transaction but it doesn't on Azure Blob or GCS. I should probably put this in a column in the overview as it is quite important information for optimization.

thestigma · March 14, 2020, 3:52pm

Hmm, that is something I didn't know anyone did. Nice to be aware of.

There is really no way around carefully reading (and understanding) the pricing documentation for the backend you use, as the specifics can vary a lot.

If the OP needs help understanding how the operation types defined in the AzureBlob pricing documentation relate to rclone, then I will be happy to help with that. But generally the listing-type operations are in the cheapest tier of operations. You often pay per tens of thousands of operations in that tier (based on experience with Google Cloud Storage).
