Crypt remote hash and other metadata

They kind of have to be, because Hasher is blind to the encrypted names. Hasher can only see one layer down, into the crypt remote -- not two layers down, into the cloud remote that crypt is wrapping. But I agree it's not ideal to be storing all the filenames in plaintext locally :confused:

FYI it is possible to use Hasher without the database feature -- you can disable it with --hasher-max-age 0 (but make sure you're on the new v1.65.1, as it fixes a nasty bug with this).
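As a rough sketch, with made-up remote names (a Hasher remote wrapping a crypt remote), the setup looks something like this:

```
# rclone.conf -- names are illustrative, not from this thread
[mycrypt]
type = crypt
remote = gdrive:encrypted
# (crypt passwords omitted)

[myhasher]
type = hasher
remote = mycrypt:
hashes = md5
```

Then hashes get computed on the fly without being persisted to the local database:

```
rclone hashsum MD5 myhasher: --hasher-max-age 0
```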

A few months ago I experimented with adding a --crypt-sum flag to force crypt checksum support by generating checksums on the fly. It worked, but I didn't submit a PR because I figured the Hasher workaround was close enough, and also because this would be impractical with several TBs of crypted data on cloud remotes (as I have). But your point about Hasher storing the decrypted filenames is a good one, and perhaps it's reason enough to revisit having crypt provide hashes itself instead of requiring a (second) overlay. If you're interested, here was my branch that implements this. It essentially reverses the approach of cryptcheck (instead of re-encrypting the source, it decrypts the dest), which also means it still works when you don't have the source (like if you wanted to compare two different crypt remotes with different passwords).
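For contrast, stock cryptcheck needs the plaintext source on one side (paths and names here are illustrative):

```
# standard cryptcheck: re-encrypts the source to compare
rclone cryptcheck /path/to/plaintext secret:
```

With the experimental flag from that branch (not in any released rclone), crypt would advertise hash support itself, so presumably any checksum-aware command could be pointed at two crypt remotes directly -- again with made-up names, something like:

```
rclone check secret1: secret2: --crypt-sum
```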

One option would be to make a separate hasher remote for each crypt remote. If the crypt remotes all have the same files, you could use a sumfile from one to pre-seed the others, as the checksums are of the decrypted files, which should be identical. (Personally I've never tried pre-seeding, as it kind of defeats the purpose of checksums for me... which is to prove that two files are identical, not just assume this.)
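If you did want to try pre-seeding, it would look roughly like this (remote names are made up, and the import syntax is worth double-checking against the Hasher docs):

```
# generate a sumfile from the first hasher-wrapped crypt remote
rclone hashsum MD5 --output-file seed.md5 hasher1:

# pre-seed the second wrapper's database from that sumfile
rclone backend import hasher2: md5 seed.md5
```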

Another option (depending on how the data was copied) is to wrap just one of the crypt remotes with hasher (to verify that the encrypted and decrypted data match) and then compare the encrypted checksums from that one to the other two (which should be possible without hasher, assuming the base remotes all support hashes). An important caveat: this will only work if the files were copied in their encrypted form, due to the way that crypt nonces work. If the files were transferred in their decrypted form (i.e. re-encrypted for each remote), then the base (encrypted) checksums would not be expected to match, as the nonce would be different for each.
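Concretely, with made-up remote names, that would be something like:

```
# 1) decrypted checksums for ONE remote, via its hasher wrapper,
#    to compare against whatever sums you have for the originals
rclone hashsum MD5 hasher1:

# 2) then compare the underlying encrypted objects directly, using
#    the base remotes' native hashes -- no hasher needed here
rclone check gdrive1:encrypted gdrive2:encrypted
rclone check gdrive1:encrypted gdrive3:encrypted
```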

I am curious -- what would your ideal solution look like? Every time I've thought about this problem in the past, I've never been able to think of a good way out of this paradox:

  • In order to generate a checksum, you need to have access to the whole decrypted file
  • If you get around that by storing a pre-generated checksum in metadata, you aren't really verifying anything -- at most it tells you what the data used to be, not what it is now. (For example, it wouldn't catch any bit-rot that occurred while in the cloud provider's possession)
  • If you compare only encrypted checksums, you aren't catching any errors that may have happened during encryption