When does rclone compute a file hash?

jwink3101 · December 4, 2019, 3:26pm

This is really just out of curiosity, but under what conditions does rclone compute the hash of a file when syncing from a local backend?

I 100% recognize that there are tons and tons of different scenarios so this may depend on the backend too. But let's say we have a bucket-based backend like B2 or S3.

For example, if I do a vanilla sync or copy, does it hash all files after the initial transfer? Or does it compare other attributes and then only hash new files?

What if it is the same thing but with --track-renames? In that case, is it smart about still not rehashing what is present on both sides?

What are some common cases (again, I know there are probably millions of permutations between different flags) where hashes are computed vs. not?

thestigma · December 4, 2019, 5:03pm

Well, as you said this is complicated and varied based on backend and scenario...

But generally most backends support some sort of hashing. The most common setup is that the server calculates the hash for you when the file is uploaded. It is then permanently stored for that file. This normally isn't something rclone has to do - it is just a function of the filesystem the server uses, as some advanced filesystems automatically generate and store hashes on all files.

On your end, you probably do not run a filesystem that stores hashes. These exist for private use too - but they aren't common to see outside of fairly specialized servers. If you use one - you probably know about it already...

This means you need to calculate the hash (and read the entire file) if you need to know it for a comparison. This can be a "costly" operation in terms of time and disk I/O, so rclone does not hash by default when it compares files for a sync; size+modtime is typically good enough for basic comparisons anyway. That is the basis on which it decides whether you actually need to upload new files etc. Otherwise a frequent sync on a large collection of data would be very burdensome on the disk even when nothing changed most of the time - at least on a mechanical HDD (much less of an issue on an SSD, which is both much faster and won't really be worn out at all by reading). You can force hash-checking with --checksum if needed.
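To make the cost difference concrete, here is a minimal Python sketch of that decision (rclone itself is written in Go, and this is not its code). The `RemoteInfo` type, the `needs_transfer` function, the use of SHA-1, and the one-second modtime tolerance are all assumptions made for illustration. The point is only that the default path uses `stat` data, while the checksum path has to read the whole local file.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteInfo:
    """Hypothetical record of what a remote listing returns without a download."""
    size: int
    modtime: float            # seconds since the epoch
    sha1: Optional[str]       # hash stored by the server, if it keeps one


def local_sha1(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a local file. This reads every byte, which is the expensive part."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_transfer(path: str, remote: Optional[RemoteInfo],
                   use_checksum: bool = False, tolerance: float = 1.0) -> bool:
    """Decide whether a local file should be uploaded (illustrative only)."""
    if remote is None:
        return True                       # not on the remote at all
    st = os.stat(path)                    # cheap: no file contents are read
    if st.st_size != remote.size:
        return True                       # size differs, no hash needed
    if use_checksum and remote.sha1 is not None:
        # Checksum mode: size + hash, modtime is ignored.
        return local_sha1(path) != remote.sha1
    # Default mode: size + modtime, the file is never opened.
    return abs(st.st_mtime - remote.modtime) > tolerance
```

In this sketch an unchanged file costs one `stat` call per sync in the default mode, but a full read in checksum mode.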

Any time you already have to read the entire file anyway though - calculating the hash can be done almost for free (such as for the upload of a file). The only extra cost is a fairly trivial amount of CPU, so I think rclone generally tends to do this to verify a successful transfer.
@ncw Could you verify this for me perhaps? ^
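The "almost for free" point can be illustrated with a small Python sketch: the hash is updated with the same chunks that are being written to the destination, so the file is read only once. `copy_and_hash` is a hypothetical helper, not an rclone function, and SHA-1 is just an example hash type.

```python
import hashlib
from typing import BinaryIO


def copy_and_hash(src: BinaryIO, dst: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Copy src to dst and return the SHA-1 of the bytes that were copied."""
    h = hashlib.sha1()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)     # extra CPU only, no extra disk read
        dst.write(chunk)    # stands in for the upload itself
    return h.hexdigest()
```

The returned digest could then be compared with whatever hash the server reports for the uploaded object.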

--track-renames is one of those functions that requires a comparison on the hash.
The server will have the hash stored on its end, so nothing more is needed there.
On our local end, we will need to calculate it (and thus read every file to be compared fully). Unless you happen to use an advanced filesystem that stores hashes (which is unlikely) this info will not be stored, so there is unfortunately not a lot of room to "be smart" about this.

A natural question may be "why doesn't rclone just store the hashes for me?". The answer is that this is a non-trivial problem in terms of data integrity and implementation. A filesystem can manage this because it has very good low-level control and can detect any changes to any file and make sure hashes are corrected as needed on the fly. Rclone is just a user program, and it would be extremely challenging to do this job robustly.
This is definitely a case of "use the right tool for the job" - and that means that if this is really important to you, you should consider using a filesystem that actually has this feature - like ZFS or similar.
It may be tempting to have rclone be a "do everything" program, but that just isn't a viable strategy in the long-term.

This was more "how and why" than concrete examples, but hopefully it gave some insights.
If you want to follow up with a more specific question, feel free to ask and I will answer if I can.

jwink3101 · December 4, 2019, 5:45pm

I understand the hash storage issue. I use it in my own sync tool (which has optional rclone support). But for --track-renames, I can imagine a situation where you don't need the hash of every file on the source (you still need it on the dest).

If a source file matches a dest file based on the other checks, then it shouldn't need to rehash. However, if there is a "new" source file (which was just a move/rename), then it should be hashed and compared to the hashes at the dest.

Is that what rclone does?

Obviously if local is the destination, then you again do need to know the hashes.

I will play with some local-to-local copies and the -vv flags to try to deduce what is happening though that is not the same as what should happen.

thestigma · December 4, 2019, 10:20pm

Hmm, I suppose you are right that this could be one way to do it.
I have to admit I do not know, but based on experience I suspect it just hashes everything.
It should be very easy to test: if you re-sync an unchanged set of files with --track-renames, then nothing would need to be hashed locally if this were the case. I suspect it hashes everything though, but I have not specifically tested this scenario.

Perhaps this is a requirement that I am not aware of due to how --track-renames does file-mapping before the sync starts when you use this flag. I know very little about the specifics of that mapping process.

Since you know how to code, it may be easiest to just check the source on GitHub for specifics like this. Or we can call in @ncw and see if he can chime in on it.

Perhaps this might be a good feature improvement to make an issue on:

Not directly related - but I think you will find this project interesting as well:

github.com/rclone/rclone

crypt: adding metadata (including hashes) to crypt files

opened 12:42PM - 26 Oct 19 UTC

ncw

enhancement Remote: Crypt thinking metadata

# Proposal

Add a fixed size metadata block at the start of crypt files. This… should store metadata (eg checksums, original file name, original modification date). (Maybe this block should be at the end as we will definitely have hashes by then when uploading?)

Crypt (v0) currently has a small 32 byte header

- 8 bytes magic string `RCLONE\x00\x00`
- 24 bytes Nonce (IV)

I'd propose increasing this for crypt (v1) to 1k, 2k or 4k (not decided which)

- 8 bytes magic string `RCLONE\x00\x01`
- 24 bytes Nonce (IV)
- 24 bytes Nonce (IV) for the metadata just in case we ever want to overwrite it
- 1024 - 24 - 24 - 8 bytes of secretbox encrypted data (or maybe 2048 or 4096)

Storing the hashes for the crypted data would solve the "crypt has no checksums" problem. Storing other metadata is useful for upcoming metadata storage features.

A fixed size block makes it very easy to seek in crypt files, and allows calculation of the length of the file without having to read it which is a significant advantage

This is backwards incompatible with crypt v0 files. It might be possible to design the remote so it could read both v0 and v1 files. If file name encryption is in use then we could potentially have a different suffix other than `.bin` for v1 files. Otherwise the listings may get the wrong sizes for a mixture of v0 and v1 files. This can probably be worked around to make syncs work correctly but `rclone mount` will almost certainly have problems.

---

An idea [from the forum](https://forum.rclone.org/t/crypt-hash-possible/17101/10) could also keep this data in a sidecar file for backwards compatibility and have a process which downloads the data to create the hash and the sidecar file.

Basically an updated crypt format that keeps metadata (including hash of original file) in the file-header.
This opens up a lot of new possibilities when it comes to syncing, especially around the big current limitation of being unable to use hash-checking at all between an unencrypted and a crypted location.

ncw · December 5, 2019, 3:24pm

That is what rclone does. It only hashes "left over" files in the destination after the sync - ones that would have been deleted otherwise.

Roughly speaking --track-renames works by doing --delete-before but instead of deleting those files it checks to see if they match any in the source (by size+hash) and if so renames them. Then it does a normal sync.
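As a rough illustration of that description (not a port of the real sync.go logic), the matching step might look like the Python sketch below. The `Listing` type and `plan_renames` function are hypothetical. Hashes are passed as callables so that the sketch only computes them for files that are candidates for a rename: destination files with no counterpart in the source, and source files with no counterpart in the destination that share a size with one of them.

```python
from typing import Callable, Dict, List, Tuple

# path -> (size, function that returns the file's hash when called)
Listing = Dict[str, Tuple[int, Callable[[], str]]]


def plan_renames(source: Listing, dest: Listing) -> List[Tuple[str, str]]:
    """Return (old_dest_path, new_dest_path) pairs for files that only moved."""
    # Files on one side only. Paths present on both sides are left to the
    # normal sync and are never hashed here.
    leftover_dest = [p for p in dest if p not in source]
    new_in_source = [p for p in source if p not in dest]

    # Index the new source files by size so hashing is limited to candidates.
    by_size: Dict[int, List[str]] = {}
    for p in new_in_source:
        by_size.setdefault(source[p][0], []).append(p)

    renames: List[Tuple[str, str]] = []
    for old in leftover_dest:
        size, dest_hash = dest[old]
        candidates = by_size.get(size, [])
        for new in candidates:
            if source[new][1]() == dest_hash():   # size matched, now compare hashes
                renames.append((old, new))
                candidates.remove(new)            # each source file is used once
                break
    return renames
```

Anything in the destination that is still unmatched afterwards would be deleted, and the normal sync would then run on the result.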

It is a slightly complicated algorithm but you can check out the code here if you want: rclone/fs/sync/sync.go at master · rclone/rclone · GitHub