Crypt remote hash and other metadata

Since I started using a crypt remote, I've been facing some problems, all of which come down to the lack of hashes for crypt.
For example, on upload some files somehow ended up with byte flips - sometimes just a few bytes, sometimes many. I can hunt for these with cryptcheck, but it could be caught at upload time if crypt files had hashes too.
Also, after cryptcheck, sync/copy does not detect that those files are different, since with byte flips the size stays exactly the same. (BTW, how can I force rclone to copy a file even when it detects it as identical, as in the situation above? Deleting manually and copying again is a bit tiresome.)

I saw this topic about a crypt file V2 format, but it seems this improvement has been forgotten: Crypt hash possible? - #12 by ncw
I think this is crucial for the crypt remote, because without it crypt just isn't reliable. It could be useful for non-crypt remotes too, since other metadata that the remote doesn't support could be stored there - e.g. 1fichier doesn't support original file creation times, only modification times.

What's the status of this feature?

https://rclone.org/docs/#i-ignore-times
"unconditionally upload all files regardless of the state of files on the destination"

There is an open issue, discussed in detail at https://github.com/rclone/rclone/issues/3667

Not a perfect solution, but some of what you're looking for can be achieved with the Hasher backend:

If you create a hasher remote that wraps a crypt remote, you can essentially use hashes with crypt. Consider setting --hasher-auto-size to a very high value (larger than your largest file) so that it will always recompute hashes on the fly, like local does.

The tradeoff, of course, is that this may use lots of data / storage (calculating a checksum requires downloading the entire file.)
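A rough sketch of what that could look like in rclone.conf - the remote names, the wrapped cloud remote, and the exact option values here are placeholders:

    [mycrypt]
    type = crypt
    remote = cloud:encrypted
    # password etc. set via rclone config

    [myhasher]
    type = hasher
    remote = mycrypt:
    hashes = md5
    max_age = off
    # very large auto_size so hashes are computed and stored for every file
    auto_size = 1P

Then point your commands at myhasher: instead of mycrypt:.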

Another option is to add a chunker remote to your mix of remotes - not for chunking, but to utilise its hash storage feature - and use settings like:

You can even use chunker to force md5/sha1 support in any other remote at expense of sidecar meta objects by setting e.g. hash_type=sha1all to force hashsums and chunk_size=1P to effectively disable chunking.
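A sketch of the corresponding config (the section name and wrapped remote are placeholders):

    [mychunker]
    type = chunker
    remote = 1fichier:backup
    hash_type = sha1all
    chunk_size = 1P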

I use it myself. The downside is that a sidecar file is created for every file stored, effectively doubling the number of objects on the remote. It is not a problem for me though.

I think this most likely indicates some serious issue with your system - a faulty disk or RAM - unless you use software which modifies files without changing any metadata. The remote hash is stored during upload, so if cryptcheck later shows discrepancies it means your source data has changed.

The problem is that the hash db would need to be maintained across all the machines that use rclone :confused:

For 1fichier the sidecar files are a problem, as it has some harsh anti-spam rules: it accepts roughly 1 transfer per 4 seconds, otherwise it blocks the client for 30 seconds, so lots of small files are not good. Actually a "packer", the opposite of chunker, would be nice - something that packs smaller files into bigger files on the remote :smiley:

There's no problem with the system - only rclone crypt has a few problems; no problems with non-crypt remotes, games, etc.

I'm trying out hasher, but I have a configuration like this: a crypt remote in the cloud plus two local-disk crypt backups, all using the same crypt, and it would be better to have a "pre-seed" and use the same hashes for the same files.
As I see from the .bolt DB files, the hash lists are tied to the remote set in a specific hasher remote. What's the best way to do this?

BTW, there is a flaw: in the hasher db the filenames are the decrypted ones :slight_smile:

They kind of have to be, because hasher is blind to the encrypted names. Hasher can only see one layer down, into the Crypt remote -- not two layers down, into the cloud remote that crypt is wrapping. But I agree it's not ideal to be storing all the filenames in plaintext locally :confused:

FYI it is possible to use Hasher without using the database feature -- you can disable it with --hasher-max-age 0 (but make sure you're using the new v1.65.1 as it fixes a nasty bug with this.)
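For example (the remote name is a placeholder):

    rclone hashsum MD5 myhasher: --hasher-max-age 0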

A few months ago I experimented with making a --crypt-sum flag to force crypt checksum support by generating checksums on the fly. It worked, but I didn't submit a PR because I figured the Hasher workaround was close enough, and also this would be impractical if you have several TBs of crypted data on cloud remotes (as I do). But your point about the decrypted filenames with hasher is a good one, and perhaps it's enough of a reason to revisit crypt providing hashes itself instead of requiring a (second) overlay. If you're interested, here was my branch that implements this. It essentially reverses the approach of cryptcheck (instead of re-encrypting the source, it decrypts the dest), which also means it still works when you don't have the source (for example, if you wanted to compare two different crypt remotes with different passwords).

One option would be to make a separate hasher remote for each crypt remote. If the crypt remotes all have the same files, you could use a sumfile from one to pre-seed the others, as the checksums are of the decrypted file, which should be identical. (Personally I've never tried pre-seeding, as it kind of defeats the purpose of checksums for me... which is to prove that two files are identical, not just assume this.)
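If you do want to try pre-seeding, a rough sketch using hasher's SUM-file import command could look like this - remote names and file names are placeholders, and the exact syntax is worth double-checking against the hasher docs:

    # export checksums from the first hasher remote
    rclone hashsum MD5 hasher1: --output-file files.md5

    # pre-seed the second hasher remote's database from that sum file
    rclone backend import hasher2: md5 files.md5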

Another option (depending on how the data was copied) is to wrap just one of the crypt remotes with hasher (to verify that encrypted vs. decrypted match) and then compare the encrypted checksums from this one to the other two (which should be possible without hasher, assuming the base remotes all support hashes.) An important caveat here is that this will only work if the files were copied in their encrypted form, due to the way that crypt nonces work. If the files were transferred in their decrypted form (i.e. if they were re-encrypted for each remote), then the base (encrypted) checksums would not be expected to match, as the nonce would be different.

I am curious -- what would your ideal solution look like? Every time I've thought about this problem in the past, I've never been able to think of a good way out of this paradox:

  • In order to generate a checksum, you need to have access to the whole decrypted file
  • If you get around that by storing a pre-generated checksum in metadata, you aren't really verifying anything -- at most it tells you what the data used to be, not what it is now. (For example, it wouldn't catch any bit-rot that occurred while in the cloud provider's possession)
  • If you compare only encrypted checksums, you aren't catching any errors that may have happened during encryption

Crypt hashes are tricky indeed!

One reason not to store hashes of the unencrypted data is that it is an info leak. Let's say you were storing a known file with a known hash. The provider could potentially work out that you were storing that known file by finding its hash as metadata on your encrypted file.

Likely some/most/all providers just treat a checksum as metadata stored at upload time. Some providers, I'm sure, check that metadata from time to time to find bitrot, but not many recompute the checksum on the fly when you ask for it, as that is expensive. The only backends which do that are the local backend and the sftp backend, as far as I know.

This will detect bitrot when you come to download the file - the metadata hash will not match the hash rclone calculates.

Rclone does this already when uploading and downloading files via crypt. It is a bit behind the scenes, but if your provider supports a hash, then when uploading rclone will work out what the hash of the encrypted data should be and check it at the end of the upload. A similar process happens for downloads.

Your point about errors not being caught during encryption is an interesting one.

If your computer had some bad RAM which introduced errors in the data before it was encrypted, then the file would apparently upload fine, the checksum would not catch it, and the file would download fine too.

However rclone cryptcheck would notice the problem.
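For example (paths are placeholders):

    rclone cryptcheck /path/to/local remote:crypt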

I'll also note that rclone splits the data into 64k chunks for crypt, and each of these chunks has a very strong hash in it, so if any of them gets corrupted you will get an error on download.

Fascinating info, thank you! It makes me think that perhaps I ought to be using the --download option in check more than I currently do...
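i.e. something like (paths are placeholders):

    rclone check /path/to/source remote:path --download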

I was actually looking at this recently for something related to bisync. One thing that troubles me a bit is the way it silently ignores blank hashes, even if it doesn't expect them to be blank (for example, here and here). I think the intent was to allow comparison with something like Google Docs, where the lack of a hash is expected on a remote that otherwise supports them -- but it looks to me like it's letting unexpected blanks through too.

This came up because while I was testing the --compare PR, I noticed that Google Drive in particular will often (but not every time) return a blank MD5 for a recently uploaded file. My guess is that the file sits in some server-side async queue for processing for a short while after uploading (just a guess -- could be wrong). If I'm right, it seems like there's a possible (but unlikely) scenario where a file is corrupted on transfer but not detected immediately because the hash is blank. A subsequent cryptcheck would probably spot this, as I've not yet seen a hash that stays blank forever (but that does require keeping a copy of the original file after uploading). It would also probably be spotted on download (but what if that's 10 years from now...)

It seems to me that there probably ought to at least be an INFO log (if not ERROR) for unexpectedly blank checksums... it's actually yet another project I started tinkering with but then decided to spare you from for the time being (as I know the last thing you need right now is more PRs from me! :rofl: )

Also, to be clear -- this has never actually happened to me, it's purely tinfoil-hat speculation on my part :slightly_smiling_face:

This was also a fascinating read. FWIW, I'd vote for putting the metadata in the file header (not a sidecar), and including BOTH versions of the hash (the decrypted original and the encrypted one used by cryptcheck.*) That way rclone can easily read the original back when needed, while also raising an error if the hash reported by the remote does not match our expected value for the other one. This assumes that nothing but rclone can edit the file -- but since this is crypt, that is kind of already the case.

(*I realize one potential problem with this is that the file would need to somehow contain its own checksum...I'm not sure how possible that is, at least without crazy amounts of computing power!)

The unsupported metadata use case is also quite interesting... it would help with the length limit issue I ran into recently on the xattrs ticket! Is this something you're looking for help with? I'd potentially be interested in working on it.

How can I make hasher hash only those files that don't yet have hashes?

That should be the default behavior. What are your settings for --hasher-max-age and --hasher-auto-size?

You can force a refresh with:

rclone hashsum MD5 remote:path

(replace MD5 if you are using a different hash type)

That is the ultimate paranoia rclone check.

Unfortunately quite a few backends don't have hashes for each file. For example S3 doesn't have hashes for files uploaded with multi-part upload which weren't uploaded with rclone.

The convention in rclone is to ignore blank hashes because of this.

I didn't know it did that. The hashes supported by drive changed relatively recently - I wonder if this is part of that change.

Maybe the google drive backend should sleep for 1 second and retry if it gets a blank hash or something like that.

Hmm, possibly, but I'd worry about this being too noisy with some backends.

Just the kind of speculation we need when thinking about data integrity :slight_smile:

I don't think that is possible, alas! You'd need to effectively break a crypto hash to do that. Nobody knows how to do that even for MD5 which is considered broken in other ways. (In a nutshell you can construct two messages which have the same MD5 but you can't alter an existing message to have a given MD5).

Storing metadata is a good idea. There are rather a lot of tradeoffs though which is why that issue has never got off the ground.

Another thing that would be very useful with an rclone metadata store: storing file creation, modification, or even access times even if the remote doesn't support these.

Yet another question: if I use union with two folders, where one is a hasher(crypt) folder and the other is neither hashed nor crypted, how should I set up hasher? A separate hasher for each of these folders, or one hasher for the whole union?

I agree! The thought has also occurred to me that perhaps there should be a new overlay backend just for storing this kind of otherwise-unsupportable metadata, regardless of encryption. Maybe a similar concept to hasher with a local db but for other metadata, or else maybe a sequestered directory of sidecars on the remote, somewhat similar to chunker. Every option has some pros and cons.

I think it kind of depends on what your other remote is. If it already has fast checksums, there's no real advantage to wrapping it with hasher, so you may want to use it just for the crypt upstream. But if it has slow hashes (like local), or the wrong hash type, then there could be an advantage.

I am not yet very familiar with union, so maybe someone who knows it better can chime in about any union-specific quirks to watch out for, but I'm currently wrapping hasher in a combine remote and it works quite well. I found a few small surprises, but easily solvable. My setup is probably excessive, but one of my remotes is now 4-levels deep going from combine -> hasher -> crypt -> drive. It's rclone-Inception :joy:
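For the curious, a rough sketch of that kind of nesting in rclone.conf - all names and paths here are placeholders:

    [gdrive]
    type = drive
    # ... drive credentials ...

    [secret]
    type = crypt
    remote = gdrive:encrypted

    [hashed]
    type = hasher
    remote = secret:
    hashes = md5

    [combined]
    type = combine
    upstreams = crypted=hashed: plain=gdrive:plain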

I think sidecar files are OK for non-crypted files, but for crypt the files are "packaged" anyway, so the metadata should be stored in the package - though this could also be optional for non-crypt files.

And maybe a metadata storage/cache file in every folder, containing metadata for each of its subfolders and files.

I think you win the highly sought after "most nested remote award" here :slight_smile:

And if anyone tries to take my title, I'll add an alias! :rofl:
