Faster, non-cryptographic hashing algorithm for faster file comparison

MD5 and SHA-1 were designed as cryptographic hashing algorithms. They are fairly fast for small files, but comparing large numbers of large files can become prohibitively slow, so I'd like to request an implementation of a faster, non-cryptographic hash algorithm meant for file comparison. The fastest I've found is xxHash3.

Considering that the collision risk of xxHash3 is acceptably low even in combinatorial situations (where each hash is compared to all other hashes), the risk of collision in a one-to-one scenario such as file comparison should be negligible (~3.14 collisions per billion comparisons).

I'm not a programmer, so I'm not sure how difficult it would be to include in rclone, but the GitHub page seems well-maintained and the developer seems to be actively supporting it, which is a good sign.

Please let me know what you think.

Rclone matches the hashes it uses with the hashes the cloud provider uses.

Compared to transferring stuff over the network, MD5 is very quick.

Unfortunately, I don't think there are any cloud providers which use xxHash3.

It could be used for local -> local transfers - rclone will use MD5 by default.

if you want xxhash3 for local, i have been using this for many years.

[image attachment not preserved]

What about rclone check --download or rclone move -c from a service that doesn't include hashing? Those are two cases where the file would have to be hashed on both ends.

Thanks! Maybe I can mount my drive and then use this.

not sure what you are trying to accomplish, what is your use case?

what service are you using that does not support checksums?

I'm downloading/syncing files from Mega

rclone check --download should work.

so mega does not perform a checksum and/or save a checksum as metadata on upload?
for each upload, you are forced to re-download all the data to perform a checksum?

if you do not mind, why did you choose mega over other cloud providers?

and using a rclone mount would not be a good solution.

  1. need to use fuse compatibility layer.
  2. the slow download speed is the problem, not the fast checksum calculated by rclone

Wondering what the use case is for requesting faster hashing, since disk I/O bandwidth is usually saturated before the CPU reaches 100% utilisation?

There's probably a way to play with the --checkers flag to increase parallelism for checksum calculation?
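To make the parallelism idea concrete, here is a minimal Go sketch of hashing several inputs with a bounded number of workers. It is not rclone's code: hashAll and its workers parameter are hypothetical names, and workers only loosely plays the role that --checkers plays in rclone. The inputs are in-memory byte slices so the example stays self-contained.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"sync"
)

// hashAll computes the MD5 of every entry in files using at most
// `workers` goroutines. Hypothetical helper, for illustration only.
func hashAll(files map[string][]byte, workers int) map[string]string {
	if workers < 1 {
		workers = 1 // avoid blocking forever with no workers
	}
	names := make(chan string)
	results := make(map[string]string, len(files))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range names {
				sum := md5.Sum(files[name])
				mu.Lock() // the results map is shared between workers
				results[name] = hex.EncodeToString(sum[:])
				mu.Unlock()
			}
		}()
	}

	// Feed the work queue, then wait for the workers to drain it.
	for name := range files {
		names <- name
	}
	close(names)
	wg.Wait()
	return results
}

func main() {
	files := map[string][]byte{
		"a.txt": []byte("hello"),
		"b.txt": []byte("world"),
	}
	for name, sum := range hashAll(files, 4) {
		fmt.Println(name, sum)
	}
}
```

More workers only help while the CPU is the bottleneck; once the disk or network is saturated, extra parallelism adds nothing.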

If you want to see whether it makes any difference, you can experiment with rclone's existing hashes.

$ rclone hashsum
Supported hashes are:
  * MD5
  * SHA-1
  * Whirlpool
  * CRC-32
  * DropboxHash
  * MailruHash
  * QuickXorHash

Of those, CRC-32 will be the fastest, so let's try reading a 1GB file off the SSD in my 4-year-old laptop, once with cat and once with each hashsum.

        cat         CRC-32      MD5         SHA-1
Time    2.53s       2.57s       3.24s       3.17s
Speed   404 MiB/s   398 MiB/s   316 MiB/s   323 MiB/s

So using a faster checksum will help a bit, but not a huge amount as MD5 is already pretty fast.

If there is a go package for xxhash then it would be very easy to integrate.

Here is a Go implementation:

I agree with your explanation though; disk/network speed will almost always be the limiting factor or close to it. The only cases where xxhash stands out are when the comparison happens entirely in RAM, or when downloading doesn't start until a hash is complete, but I bet rclone is probably pretty good at downloading and hashing together, so this should probably be a low priority.
