When does rclone compute a file hash?

thestigma · December 4, 2019, 5:03pm

Well, as you said this is complicated and varied based on backend and scenario...

But generally most backends support some sort of hashing. Most commonly, the server calculates the hash for you as the file is read in on upload, and it is then permanently stored for that file. This normally isn't something rclone has to do - it is just a function of the filesystem the server uses, as some advanced filesystems automatically generate and store hashes for all files.

On your end, you probably do not run a filesystem that stores hashes. These exist for private use too - but they aren't common to see outside of fairly specialized servers. If you use one - you probably know about it already...

This means you need to calculate the hash (and read the entire file) if you need to know it for a comparison. This can be a "costly" operation in terms of time and disk I/O - so when rclone does comparisons for a sync, for example, hashing is not the default, in the interest of performance. Size+modtime is typically good enough for basic comparisons anyway, and it is the basis on which rclone decides whether it actually needs to upload new files etc. Otherwise a frequent sync on a large collection of data would be very burdensome on the disk even when nothing changed most of the time - at least on a mechanical HDD (much less of an issue on an SSD, which is both much faster and also won't really be worn out at all by reading). You can force hash-checking with --checksum if needed.
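
To make the cost difference concrete, here is a rough Python sketch of the two kinds of comparison. This is not rclone's actual code (rclone is written in Go), and the function names, the MD5 choice and the 1-second modtime tolerance are all just assumptions for illustration:

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Read the whole file in chunks and return its MD5 hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_unchanged(path, remote_size, remote_mtime, tolerance=1.0):
    """Cheap check: only metadata is read, never the file contents."""
    st = os.stat(path)
    same_size = st.st_size == remote_size
    same_mtime = abs(st.st_mtime - remote_mtime) <= tolerance
    return same_size and same_mtime


def matches_by_hash(path, remote_size, remote_md5):
    """Expensive check: the entire file must be read to get the hash."""
    if os.stat(path).st_size != remote_size:
        return False  # sizes differ, so hashing would be wasted work
    return file_md5(path) == remote_md5
```

The first check costs one stat call per file no matter how big the file is, while the second has to pull every byte off the disk - which is why it only makes sense to pay for it when you explicitly ask.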

Any time you already have to read the entire file anyway though - calculating the hash can be done almost for free (such as for the upload of a file). The only extra cost is a fairly trivial amount of CPU, so I think rclone generally tends to do this to verify a successful transfer.
@ncw Could you verify this for me perhaps? ^
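
As a sketch of why hashing during a transfer is nearly free: the data is already passing through memory chunk by chunk, so the hash can be updated along the way without any extra disk read. Again, this is only an illustration in Python with a made-up function name (copy_with_hash), not how rclone itself is implemented:

```python
import hashlib
import io


def copy_with_hash(src, dst, chunk_size=64 * 1024):
    """Copy src to dst, hashing each chunk as it passes through."""
    digest = hashlib.md5()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)  # extra CPU only, no extra read from disk
        dst.write(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # In-memory streams stand in for the local file and the upload.
    source = io.BytesIO(b"some file contents")
    upload = io.BytesIO()
    local_hash = copy_with_hash(source, upload)
    # The result could then be compared with the hash the server reports.
    print(local_hash)
```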

--track-renames is one of those functions that requires a comparison on the hash.
The server will have the hash stored on its end, so nothing more is needed there.
On our local end, we will need to calculate it (and thus read every file to be compared fully). Unless you happen to use an advanced filesystem that stores hashes (which is unlikely) this info will not be stored, so there is unfortunately not a lot of room to "be smart" about this.
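
The basic idea of matching renames by hash can be sketched like this. It is a simplified illustration, not rclone's actual algorithm; find_renames and the path-to-hash dictionaries are hypothetical, and it assumes the local hashes have already been computed by reading each file:

```python
def find_renames(local_hashes, remote_hashes):
    """Pair files that vanished from the remote listing with new local
    files that have the same hash.

    Both arguments map path -> hash.
    Returns a list of (old_remote_path, new_local_path) tuples.
    """
    # Files on the remote that no longer exist locally under that path.
    missing_remote = {p: h for p, h in remote_hashes.items()
                      if p not in local_hashes}
    # Files that exist locally but not on the remote under that path.
    new_local = {p: h for p, h in local_hashes.items()
                 if p not in remote_hashes}

    # Index the missing remote files by hash for quick lookup.
    by_hash = {}
    for path, h in sorted(missing_remote.items()):
        by_hash.setdefault(h, []).append(path)

    renames = []
    for path, h in sorted(new_local.items()):
        candidates = by_hash.get(h)
        if candidates:
            # Same content under a new name: treat it as a rename.
            renames.append((candidates.pop(0), path))
    return renames
```

Anything that ends up paired can be moved on the server instead of being uploaded again, but note that building local_hashes is exactly the "read every file fully" cost described above.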

A natural question may be "why doesn't rclone just store the hashes for me?". The answer is that this is a non-trivial problem in terms of data integrity and implementation. A filesystem can manage this because it has very good low-level control and can detect any changes to any file and make sure hashes are corrected as needed on the fly. Rclone is just a user program, and it would be extremely challenging to do this job robustly.
This is definitely a case of "use the right tool for the job" - and that means that if this is really important to you, you should consider using a filesystem that actually has this feature - like ZFS or similar.
It may be tempting to have rclone be a "do everything" program, but that just isn't a viable strategy in the long-term.

This was more "how and why" than concrete examples, but hopefully it gave some insights.
If you want to follow up with a more specific question, feel free to ask and I will answer if I can.