Rclone and corrupted data

Several years back I made the mistake of backing up all of my data in 10 GB split-file archives with 7-Zip (all encrypted) and storing them on Blu-ray.

One (and only one) of the split files developed data corruption. The entire archive became unextractable.

I had another backup.

I know that rclone stores individual files, which makes it infinitely better than the above solution. But I am curious about how rclone deals with data corruption in individual files. Also, when encrypting file names and directory names, does rclone use a metadata file to know which is which, and if so, what happens if that metadata file gets corrupted?

IIRC rclone has the option for checksum verification, so that's nice.

As I understand it the encrypted filenames convert straight back to unencrypted names, so they aren't stored anywhere else.

I don't think rclone has any particular extra guards against corruption, except that you can force --checksum when using remotes that support it. That will guarantee that the files arrive 100% intact. The chance that data corrupts over time (data rot) on a hard drive is pretty minuscule to begin with, and I wouldn't be surprised if Google has a system to track and prevent this by refreshing files that haven't been modified in a very long time (that's just an assumption on my part, but Google's server systems are pretty advanced).

Of course, even if you don't use --checksum you still have basic error detection going on in the TCP layer and up, so the chance of a file corrupting "in-flight" should also be very low even then, but by all means use --checksum if you are a bit paranoid about it.

I guess the only way to ever be fully guaranteed against any and all corruption is to have a separate backup set, but I'd consider the risk to files stored via rclone to be at least no more (and probably less) than for any data you store locally. After all, Google uses server-grade hardware, error-correcting memory, storage redundancy and much more that is usually outside the scope (and need) of most normal users.

Thanks! I will be using gdrive for backups, but my primary usage is going to be encrypting a Ceph cluster with rclone. Ceph has replication and checks data integrity, but I'm still curious about how rclone deals with it.

Based on your reply it sounds like the data has to match exactly. What happens if there is a random bit/byte in the encrypted file that's corrupted? Will it fail to decrypt, or will rclone just skip over the section of the file that's corrupted, provided it's a very small section?

This is a pretty low-level technical question that I am probably not qualified to answer fully, but I would strongly expect that the decryption would fail - or at the very least the decrypted output would be garbled and useless.

I think this is basically unavoidable in any type of encryption though: the encrypted data needs to remain intact to be recoverable. Any error-safety has to be added on top of that to keep the file healthy.

I guess what you can potentially do as an alternative to an independent second backup (at least for long-term-storage data) is to look at a tool that can create parity files for your data. These will then be able to repair/restore partially damaged files even if they happen to be encrypted. The more parity data you have, the more corruption it takes before the file becomes unrecoverable, so for example 10-20% worth of parity data would give a lot of extra security.

WinRAR (and other similar programs) have this as a built-in feature, and you can spend some extra space to add such parity data as part of the archive itself so it can be self-repairing. If you aren't interested in setting up a whole system to automate generating parity data and just need an extra layer of safety on a smaller subset of your data, then packing your files into an archive with a decent-sized recovery record might be something you could look into. It's more of a hassle and thus maybe not ideal for data you frequently use, but for long-term archiving that probably isn't much of a concern.
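
To make the parity idea above concrete, here is a minimal sketch in Python. It uses plain XOR parity, which is an assumption chosen for brevity: it can rebuild only one block whose position is known to be lost, whereas real parity tools generally use stronger codes (such as Reed-Solomon) that can repair several damaged blocks. The function names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(blocks):
    """Build one parity block covering all data blocks."""
    return xor_blocks(blocks)


def recover(blocks, parity, missing_index):
    """Rebuild the block at missing_index from the survivors plus parity."""
    survivors = [b for i, b in enumerate(blocks) if i != missing_index]
    return xor_blocks(survivors + [parity])


# Three data blocks (these could just as well be encrypted bytes).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(data)

# Pretend block 1 was lost, then rebuild it from the rest.
damaged = list(data)
damaged[1] = None
restored = recover(damaged, parity, missing_index=1)
print(restored == data[1])
```

Because the parity is computed over raw bytes, it does not matter whether the blocks are plaintext or ciphertext, which is why this kind of repair still works on encrypted files.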

Well, I know that most files won't be usable in the event of data corruption. But I've observed first hand how image files can still be viewable even with corruption. It won't look as pretty and will be missing sections, but some/most of the image can be recovered.

It would be nice if this was possible with rclone. I can think of a few ways to go about implementing this, but I'm not sure if that would compromise data security in the event of a physical attack.

The structure of rclone with remotes wrapping around others means it is probably very doable to implement a "parity remote" which can automatically generate such recovery data on the way out and add it alongside the rest of the files being copied.

If you want to have a go at it, just make a pull request on GitHub and NCW will review the contribution, do integration testing and add it into the project. It sounds like something I might be interested in too :wink:

rclone is more the transport though.

If data gets corrupted on my Google Drive after I upload it, nothing that rclone does would really fix that, as it really is just moving the bits and bytes across.

You have options to validate data by using rclone check to make sure source matches destination.
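
As a rough sketch of how that validation could be scripted, the Python below wraps `rclone check`. The paths `/data` and `remote:backup` are placeholders, the helper names are made up, and the sketch assumes that `rclone check` exits with a non-zero status when it finds differences or errors.

```python
import shutil
import subprocess


def build_check_command(source, dest, one_way=False):
    """Build the argument list for `rclone check SOURCE DEST`."""
    command = ["rclone", "check", source, dest]
    if one_way:
        # Only check that files in the source exist and match in the destination.
        command.append("--one-way")
    return command


def run_check(source, dest, one_way=False):
    """Run the check and return True when rclone reports no differences."""
    if shutil.which("rclone") is None:
        raise RuntimeError("rclone was not found on PATH")
    result = subprocess.run(
        build_check_command(source, dest, one_way),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# Placeholder paths; nothing is executed here, the command is only printed.
print(" ".join(build_check_command("/data", "remote:backup")))
```

Running something like this on a schedule would at least tell you when source and destination have drifted apart, even though it cannot repair anything by itself.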

You'd have to fall back to general principles of having data in a few places and validating the data depending on how important it is.

As you stated, you could use rclone as a medium for multiple remotes with something on top to handle those types of challenges / issues.

The rclone crypt backend chunks files into 64 KiB blocks. Each of those blocks has a very strong checksum/authenticator, so rclone will notice any corruption there. Currently rclone gives up at that point; however, I did make a version of rclone which writes the corrupted output and carries on, for data recovery purposes.
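
The per-block idea can be illustrated with a small Python sketch. This is not rclone's actual file format or cipher: it does no encryption at all and uses HMAC-SHA256 as a stand-in authenticator, and all names (`seal`, `open_blocks`, `keep_going`) are made up. It only shows why a single flipped bit is caught in exactly one 64 KiB block, and what "give up" versus "carry on" could look like.

```python
import hashlib
import hmac

BLOCK_SIZE = 64 * 1024  # 64 KiB, matching the block size mentioned above
TAG_SIZE = 32           # length of an HMAC-SHA256 tag (stand-in authenticator)


def seal(data, key):
    """Split data into blocks and prefix each block with its tag."""
    sealed = []
    for start in range(0, len(data), BLOCK_SIZE):
        block = data[start:start + BLOCK_SIZE]
        tag = hmac.new(key, block, hashlib.sha256).digest()
        sealed.append(tag + block)
    return sealed


def open_blocks(sealed, key, keep_going=False):
    """Verify every block; return (data, indexes of blocks that failed)."""
    out = bytearray()
    bad = []
    for index, item in enumerate(sealed):
        tag, block = item[:TAG_SIZE], item[TAG_SIZE:]
        expected = hmac.new(key, block, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            if not keep_going:
                raise ValueError(f"block {index} failed authentication")
            bad.append(index)  # recovery mode: note it and keep the damaged bytes
        out += block
    return bytes(out), bad


key = b"hypothetical-demo-key"
payload = bytes(range(256)) * 1024  # 256 KiB, i.e. four blocks
sealed = seal(payload, key)

# Flip a single bit inside block 2.
tampered = bytearray(sealed[2])
tampered[TAG_SIZE + 10] ^= 0x01
sealed[2] = bytes(tampered)

recovered, bad = open_blocks(sealed, key, keep_going=True)
print(bad)
```

In strict mode the same input raises an error at the first bad block, which mirrors the "give up" behaviour; with `keep_going=True` the other three blocks come back intact and only the damaged block is suspect.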

If you aren't using crypt then rclone will check the checksum of the downloaded file vs the checksum the provider gives and report an error for that file.
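
For the non-crypt case, the comparison amounts to something like the following Python sketch. The hash algorithm depends on the provider; MD5 is used here purely as an assumed default, and `verify_download` is a made-up helper, not an rclone function.

```python
import hashlib
import os
import tempfile


def file_hash(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, provider_hash, algorithm="md5"):
    """Raise ValueError if the local file does not match the provider's hash."""
    actual = file_hash(path, algorithm)
    if actual != provider_hash.lower():
        raise ValueError(f"{path}: hash mismatch ({actual} != {provider_hash})")
    return True


# Demo with a temporary file standing in for a downloaded file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"example payload")
    demo_path = handle.name

expected = hashlib.new("md5", b"example payload").hexdigest()
ok = verify_download(demo_path, expected)

try:
    verify_download(demo_path, "0" * 32)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

os.remove(demo_path)
print(ok, mismatch_detected)
```

A check like this detects that a file is damaged but says nothing about where, which is the practical difference from the per-block authenticators in crypt.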
