Hasher: trust cached sum

Hello,
I am using Hasher on a remote that doesn't support any kind of checksum. It works fine as a database for keeping different sums, and I know this isn't what it's intended for, but it would be useful to have a way to simply check against the hasher's cached sums instead of updating them during transfers.
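For context, my setup looks roughly like this; the remote names are just placeholders and the underlying backend is one without hash support (FTP here is only an example):

```
# rclone.conf (names are examples)
[myremote]
type = ftp
host = ftp.example.com

[hashed]
type = hasher
remote = myremote:
hashes = md5,sha1
max_age = off
```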

Here is an example:
A file is corrupted during the upload to the remote, while the cached sum is the correct one computed locally. I then download the file from the remote again.
Expected behaviour -> after a given number of retries, the transfer fails because the two hashes differ.
Actual behaviour -> no retries; once the file is fully read, the cached hash gets updated with the corrupted one.

I am aware of the workaround using the chunker, and I know the hasher isn't meant for this, but it's the closest thing to what I need. Maybe this could be a new flag or a new key in the config section?
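For anyone finding this later, the chunker workaround I mean is, as I understand it, something like the following (names are placeholders): hash_type = md5all makes the chunker store an MD5 for every file in sidecar metadata on the remote, and an oversized chunk_size keeps files from actually being split.

```
[chunked]
type = chunker
remote = myremote:
chunk_size = 1P
hash_type = md5all
```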


Yep, a must-have feature.

The problem is that a corrupted file's hash simply gets updated in the database.

Following the related issue: fs/local: implement hashsum cache · Issue #949 · rclone/rclone · GitHub

Yes, this post is in the "Feature" category already.
Do you think it'd be better to ask the same thing on GitHub, too?

In your example, how does the hasher db end up with the correct sum made locally? Do you mean importing it manually with rclone backend import?
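For reference, I mean something like this, with placeholder paths and remote names:

```
# load externally generated sums into the hasher's db, one hash type per call
rclone backend import hashed:path/to/dir sha1 /path/to/SHA1SUMS
```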

Also, how does the corrupted upload happen without triggering a "corrupted on transfer" error?

I think it wouldn't be too hard to add a flag for this, but I'm not sure I fully understand the desired behavior. Is the idea that with this flag, the bolt db becomes essentially read-only, and can only be updated manually with rclone backend import?

I might be wrong, but as far as my understanding of the hasher goes, while uploading to the remote through the hasher the file is hashed locally and an entry is added to the bolt db.

Since the remote I am using doesn't support checksums, a file might be corrupted during the upload even if it passes the size check, the same way it has happened to me before with FTP transfers or while trying the --ignore-checksum flag on remotes such as gdrive.

So yes, pretty much a read-only database, since the entries already in there won't get updated while this potential new flag is in use. New entries would still be added while uploading through the hasher, though.
Likewise, files on the remote without a hash in the database would get one added on their first read, as the hasher normally does.
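To sketch what I have in mind (the option name is made up and does not exist in rclone today), the hasher section might grow a key like this:

```
[hashed]
type = hasher
remote = myremote:
hashes = md5,sha1
# hypothetical, proposed option, not implemented:
# on download, verify against the cached sum instead of overwriting it
trust_cached_sum = true
```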

PS. I have txt files with the hashes of all the files I have already uploaded to my remote, so yes, I was going to use backend import.
On that note, since those txts come from a couple of different cloud services, the hash types I had to create through rclone hashsum are different.
As it is right now, I would create the hasher with all the types I have (e.g. hashes = md5,sha1,quickxor), and since I can only import one type at a time, the hasher would add the missing hashes once the file is read.
With the new flag I am asking for, I think there should be one more step: it should first check against the one hash that was imported and only then add the other ones.
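Concretely, the workflow I have in mind would look something like this; remote names, paths and hash types are just examples:

```
# sum files created earlier from the original cloud services
rclone hashsum md5 gdrive:folder --output-file MD5SUMS
rclone hashsum quickxor onedrive:folder --output-file QUICKXORSUMS

# import them into the hasher db, one hash type per call
rclone backend import hashed:folder md5 ./MD5SUMS
rclone backend import hashed:folder quickxor ./QUICKXORSUMS
```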

Sorry if I didn't make myself clear.
To put it really simply: just like on any transfer between local and gdrive, for example, rclone would check that the hashes match.
Here it would be the same, but the hash is read from the db since the remote doesn't support checksums.
If they don't match, the file is transferred again until the given number of retries runs out.
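As far as I understand, I can already do this comparison by hand with rclone check, since the hasher serves sums from its cache when they are fresh (remote names are placeholders); what I'm asking for is that it happens automatically during the transfer:

```
# compare local files against the sums the hasher has cached
rclone check /local/folder hashed:folder
```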

This is sometimes true, but it depends. If the source does not support all the required hash types, or if it's a backend with "slow" hashes (like local), hasher hashes the file on the fly while it's uploading. Otherwise, it uses the hashes provided by the source.

When it hashes on the fly, this is during the upload, not before, so if your concern is data corruption during the upload, I'm not sure this would 100% protect you. Depending on where in the chain the corruption happened, it's possible that hasher would just be hashing the corrupted file, and then storing that bad hash.

That said, rclone has some built-in measures to detect this, so if the source and dest hashes differ after the upload, it should complain and retry.

It sounds like what you want is read-only when downloading but not when uploading. Do I have that right?

This is making more sense to me now, although I still think there's a possible garbage-in-garbage-out risk with the uploading part, in the event that the hashing is downstream of the corruption. But it could still be useful for detecting any bit-rot that might have occurred after the upload.

This is probably the safest method in terms of ensuring the hashes were generated independently from the upload. But on the other hand, you won't really know if there's a problem until you want to download the file.

I wonder if there should be an additional fully-read-only mode for both upload and download? (i.e. only use manually imported golden sums) Would that be useful, or overkill?

I do think it would be worth opening a feature request issue on GitHub, to keep track of this idea.

I didn't know the local backend would be considered slow, but the odds of data corruption while hashing on the fly should be pretty low, just my wild guess.

My concern is all about network stability, so at that point the data should have gone through the hashing process already, right?

Kind of.
As it is now, while uploading a file to a remote, the hasher will store its hash in the database.
If the hash of the file changes afterwards (be it network issues after the sum was taken or anything else on the remote's side) and you download it again, the hasher will check its hash against the one in the database and update the entry with the new hash (-vv shows as much).
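If you want to watch this happen, you can print the cached entries before and after the download (remote name is a placeholder):

```
# dump the hash records currently stored in the hasher's bolt db
rclone backend dump hashed:
```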

Here comes the feature I am asking for: instead of simply updating the entry, rclone would retry the download until the given number of --retries is reached (since the issue might just as well be in the download from the remote to local) and then fail the transfer with a "hashes differ" error.

This way the entry in the database won't be updated, and you can see from the logs that the file changed from the one you had at the beginning.

PS. In my opinion, it would probably work better as a key in the remote's section of rclone.conf rather than as a flag.

This works if the remote backend supports hashes, but the cause of my concern was working with a remote that doesn't support them.
From my understanding, rclone should fall back to a size check in that case.

I like this. It's not really what I'm looking for right now, but I'm sure I or someone else might be grateful to have it in the future, too. I'm always up for more ways to keep track of data integrity.

Thanks a lot for your patience on this matter, I will open a feature request there in a couple of days.

It is considered "slow" as far as hashes are concerned because it has to read the whole file and calculate new hashes every time, as opposed to some other backends which "store" a hash in metadata.

I think we are saying the same thing in different ways.

Backend-specific flags automatically create an equivalent config parameter by default, so both options would be available.

In general that's true, although there are some exceptions such as crypt, which does the equivalent of cryptcheck after a transfer to verify integrity, despite otherwise lacking hash support.
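For reference, the standalone command looks like this, with placeholder names (first the plaintext source, then the crypted remote):

```
# verify the crypted remote against the unencrypted source
rclone cryptcheck /local/folder cryptremote:folder
```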

Cool, makes sense.

By the way, you may already know this, but another way to achieve the kind of strict integrity check you want is to do rclone check --download right after your copy or sync. --download will read the whole file on both src and dst and make sure it's 100% identical, even if the remote doesn't support hashes. It is probably the most foolproof method of detecting any errors that happened during the upload.
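For example, with placeholder names:

```
rclone copy /local/folder hashed:backup
# re-read every file on both sides and compare the full contents,
# independent of any hash support on the remote
rclone check /local/folder hashed:backup --download
```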

Yes, sorry, I repeated myself quite a bit.

I noticed this, but I wasn't aware it was actually a thing, thank you.

Yes, since there's no way to know if the files I stored in my remote are fine, I use that flag every now and then to check everything I upload against the txts I previously created through rclone hashsum.
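In case it helps anyone else, one way to check the remote directly against those txt sum files is rclone checksum (hash type, file name and remote are placeholders here); --download makes it read the actual contents, which matters since the remote has no hashes of its own:

```
# verify remote files against a previously generated SUM file
rclone checksum sha1 ./SHA1SUMS myremote:backup --download
```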