crypt and how sync works

as per the docs:

> Hashes are not stored for crypt. However the data integrity is protected by an extremely strong crypto authenticator.

  1. so when i do a rclone sync, how are the files compared?

  2. is there a reason why hashes are not stored for crypt? i imagine that the hash might give away something about the encrypted file.


Modtime + size (both combined, although with a backend-determined "precision margin" for the modtime component, since backends store timestamps to different levels of precision).
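The comparison above can be sketched in a few lines of Python (a minimal illustration of the idea only, not rclone's actual Go code; the function name and 1-second default precision are my own choices for the example):

```python
from datetime import datetime, timedelta

def needs_transfer(src_size, src_mtime, dst_size, dst_mtime,
                   precision=timedelta(seconds=1)):
    """Sketch of the default sync comparison: a file counts as changed
    if the sizes differ, or if the modtimes differ by more than the
    destination backend's timestamp precision."""
    if src_size != dst_size:
        return True
    return abs(src_mtime - dst_mtime) > precision

t = datetime(2020, 1, 1, 12, 0, 0)
# 300 ms of drift is inside a 1 s precision window -> no re-transfer
print(needs_transfer(1024, t, 1024, t + timedelta(milliseconds=300)))  # False
# a size mismatch always triggers a transfer
print(needs_transfer(1024, t, 2048, t))  # True
```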

We can't control the server-side generated hash - not on most backends anyway. It is simply whatever the stored file hashes to. We could store the original hash in the file itself though (for that, see further down in the post).

If there are no compatible hashes between the two systems then it falls back to size+modtime. This is usually pretty accurate, all things considered.

Of course the core problem here is that the hash of the original file will not be the same as the hash of the encrypted file that is generated server-side. And since the server can't decrypt to look at the underlying file, we are at an impasse.

Not having comparable hashes is not purely a crypt issue per se. If you sync two encrypted volumes (directly, not through the crypt remote) then these can be hash-compared just fine. But any time you compare non-encrypted to encrypted - or two encrypted systems where files aren't necessarily always originating from one source - then you have a problem.

What further complicates things is the nonce, the "random seed" for the encryption that ensures the encrypted output is not the same each time, for security/obfuscation reasons. This means we cannot simply encrypt locally and hash that to compare with the encrypted file on the server. The two will not match even if the file inside is identical.
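The nonce effect can be demonstrated with a toy XOR stream cipher (illustration only - rclone crypt actually uses NaCl secretbox, not this construction; `toy_encrypt` and the keystream scheme are invented for the example):

```python
import hashlib, os

def toy_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """Toy XOR stream cipher: keystream blocks are SHA-256(key|nonce|counter).
    NOT rclone's real cipher - just enough to show the role of the nonce."""
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(plaintext):
        keystream.extend(hashlib.sha256(
            key + nonce + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(p ^ k for p, k in zip(plaintext, keystream))

key = b"k" * 32
data = b"identical file contents"

# Fresh random nonces: same plaintext, but different ciphertexts/hashes.
h1 = hashlib.md5(toy_encrypt(key, os.urandom(24), data)).hexdigest()
h2 = hashlib.md5(toy_encrypt(key, os.urandom(24), data)).hexdigest()
print(h1 != h2)  # True (with overwhelming probability)

# Reusing the remote file's nonce reproduces the same ciphertext,
# so the hashes match - this is the trick cryptcheck relies on.
nonce = os.urandom(24)
h3 = hashlib.md5(toy_encrypt(key, nonce, data)).hexdigest()
h4 = hashlib.md5(toy_encrypt(key, nonce, data)).hexdigest()
print(h3 == h4)  # True
```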

What can be done is to download the nonce, encrypt locally using that same nonce, and hash the result. Then the hashes will match (if the files were identical underneath, obviously). Having a function that can automatically do this has been suggested. It actually already exists in the form of rclone cryptcheck, but there's no flag you can use to apply this technique in a copy/move/sync. This is something that probably should be added...
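For reference, the command looks like this (paths are placeholders; `remote:path` here must be a crypt remote):

```
rclone cryptcheck /path/to/local remote:path
```

It checks the unencrypted files against their encrypted counterparts using the nonce trick described above, without downloading the full files' plaintext.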

Lastly, let me inform you that I've had some chats with Nick on this already and we have come to the agreement that it would be wise to bake the original hash into the crypt format itself (and potentially other metadata as well). This would allow us to easily access the original file's hash and compare based on that. It's not going to be quite as fast as grabbing all that info from a listing, but it will surely be a worthwhile compromise in return for the ability to use --checksum, --track-renames and much more between any two remotes, regardless of encryption - even regardless of whether several different crypt keys are in play.

That hash will have to be generated locally, but because that data will reside within the crypt structure it will be inherently protected against failure by the data-integrity of the format. Thus there shouldn't be any way for the locally calculated hash to "not be true" because it got corrupted on transfer somehow.

An issue has been started on that topic here:

Hope this helped. I'm sure you have followups to this as usual :stuck_out_tongue:
Probably need to wait until tomorrow though...

i was hoping you would reply with your usual verbosity level set to HIGH.
for once, i have no followups, a credit to you.

i will say that given my paranoia level, i cannot use crypt until checksums can be checked during sync

Whaa? ... Satisfied from the first answer?
Who are you, and what have you done with the real Jojo? O_o

As I said, I think we will have a solution for this soon, but here are some workarounds you could consider in the meantime if data integrity is your utmost concern:

You could store the data you want to sync in rclone-crypt format locally, and then transfer that to the server directly (i.e. not using a crypt remote on the upload). Then you will be able to use checksums, and all functions that rely on them, to your heart's content. Although this probably requires a little reorganization, it's not really much of an issue: you can just have a crypt remote mounted into the directory where you already store the files, where they will still be normally readable as they were before.

It's not elegant, but it does work without issue.

You could make use of the chunker, as this actually stores the original hash too (to work around the fact that the generated hashes for each part of course won't correspond to the original). If that format is crypted afterwards then it will similarly be safe data-integrity-wise. The main problem I can see with this is that chunker was not designed for storing hashes but for splitting files, so it simply won't touch files under a specified size. The alternative of setting that threshold low enough to catch all files would work, but you'd end up with larger files having thousands of parts... so I don't think this is a great solution.
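A chunker remote wrapping a crypt remote might be configured along these lines (a hedged sketch - remote names and sizes are placeholders; check the chunker docs for the exact `hash_type` values your rclone version supports, e.g. `sha1all` stores a hash for every file at the cost of reading each file twice):

```
[mychunked]
type = chunker
remote = mycrypt:backups
chunk_size = 2G
hash_type = sha1all
```

You would then sync to `mychunked:` instead of `mycrypt:` directly.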

If you contacted the author, it would probably be an easy change to add a flag that didn't split but still processed all files. Only a slight tweak to the rules would be needed - no actual change in functionality - so I doubt it would be work-intensive.

Lastly, let me make a note that even though we can not currently access the original hash of a crypted remote file, it still has a hash. This hash should be very easily available to us during the upload process (it can be calculated as the data is read for transfer), and that should give us a 100% guarantee that the upload was successful and the file is healthy. I assume that this check is already performed under the hood, because I can not see any reason that it would not be... So if it is actually protection against transfer corruption you are mainly worried about, you may already have this excellent protection on any backend that supports any sort of server-side hashing.
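The "hash while uploading" idea is cheap because it costs no extra read pass - the data is hashed as it streams through. A minimal sketch (my own function names; real rclone does this in Go, and the hash algorithm depends on the backend):

```python
import hashlib, io

def upload_with_hash(src, dst, chunk_size=64 * 1024):
    """Stream data from src to dst, hashing it on the fly. The returned
    hash can be compared with the hash the server reports for the
    stored (encrypted) object to verify the transfer."""
    h = hashlib.md5()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
        dst.write(chunk)
    return h.hexdigest()

src = io.BytesIO(b"encrypted payload bytes")
dst = io.BytesIO()
local_hash = upload_with_hash(src, dst)
# Stand-in for the hash the backend computes server-side:
server_hash = hashlib.md5(dst.getvalue()).hexdigest()
print(local_hash == server_hash)  # True -> transfer verified healthy
```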

This does not solve the problem of being unable to compare-on-hash later or use functions like --track-renames, but that's really a different issue altogether, not related to data-integrity but to functionality and convenience.

@ncw Could you perhaps verify this for the record and for Jojo's sanity? :smiley:

Yes we check hashes of the encrypted data on both upload and download.



@ncw <3
I knew you were far too good to miss something like that :smiley:

@asdffdsa Happy Jojo? :smiley:
So that means it's basically mathematically impossible for an undetected corruption to happen on transfers to a backend that supports server-side hashing (as Wasabi does). The worst that can happen is rclone needs to re-transfer the file. I assume that is probably your main point of worry.

For the rest - i.e. the missing functionality of not being able to checksum-compare against the unencrypted hash - you have to either wait a bit for "crypt V3", or see if one of the suggested workarounds is acceptable for you as a temporary solution.

thanks for the clarification.

i will wait for crypt V3.2

i use rclone mostly to copy veeam backup files and 7zip files to the cloud

i need rclone to checksum the local files and compare to the cloud, as there is a chance the local file got corrupted, and to catch that in the log file when rclone re-uploads the file.

i thought i might use the crypt for other smaller files but my script can use 7zip to encrypt the files and filenames.

thanks much,

Sorry, i think I meant v2, not v3. AFAIK there hasn't been a major revision of it before.
Actually, this isn't really a major revision we are talking about either; it is just a new, improved header.
It may potentially even be partially backwards compatible. That will depend on the implementation specifics.

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.