A tale of warning: silently corrupted/unreadable data on remote


#1

Hello everyone,

TL;DR: apart from anything you upload to “the cloud”, keep multiple local, physical copies of your backup data and verify them all (local and remote) periodically.

This is just to tell my tale of a recent near-woe. I recently uploaded a volume with ~2TB and ~230K files to “the cloud” using rclone: first I uploaded almost all of it to ACD, then I migrated everything from ACD to GDrive, and finally I uploaded just the missing files (which had names too long for ACD, etc.) to GDrive. All of these steps used “rclone copy” commands, and the last one (from local to GDrive) was run twice, checking that the final “rclone copy” reported 0 files Transferred and 0 Errors (so supposedly the entire local volume was by then on the remote).

Having been burned in the past by not verifying everything, I then used “rclone mount” to mount the remote volume locally and verified it file by file against a locally generated .md5 file (generated using md5sum). To my surprise, no fewer than 88 remote files were corrupted/unreadable: 75 of them with the (supposedly already corrected) “failed to authenticate decrypted block - bad password” bug, and the other 15 were simply not found (!) on the remote.

So folks, if you are uploading any data to “the cloud” that you don’t want to lose (and if you did want to lose it, you wouldn’t be going to the trouble of uploading it in the first place, right?), do yourself a favor and check everything you uploaded before considering it safe, and then recheck it periodically. Oh, and also check your local copies periodically too.

As the famous meme says: “There is no cloud – it’s just someone else’s computer”, and like any computer it’s not only probable, it’s expected to go on the fritz from time to time and take your data with it. And rclone, wonderful as it is, only adds more complexity (and so more chance of bugs) to the mix.

EDIT: I opened an issue on Github about that problem, see https://github.com/ncw/rclone/issues/999

Cheers,

Durval.


#2

An interesting story, thanks for sharing.

If you try rclone check does that see the same missing files?

I see you’ve made an issue about this - I will comment further there.


#3

As I just posted at https://github.com/ncw/rclone/issues/999#issuecomment-271144656:
apparently it can’t see them; I can’t say as it seems “rclone check” only works with directories, not individual files, and when I point it to that directory, it returns “16 hashes could not be checked” but does not elaborate on that (for example, by listing which hashes could not be checked, even with “-v” specified).

Sorry for not including it on my original post above, just added it (and the correct path is https://github.com/ncw/rclone/issues/999, with plural “issues”).

Cheers,
Durval.


#4

In the case of a crypt remote, it won’t check any hashes. I was interested mostly in its report of missing files if you run it over the whole remote; I wanted to see if it noticed the 15 missing files.


#5

Hopefully we can get this soon: https://github.com/ncw/rclone/issues/1033


#6

Hello.
I am quite new to this forum, but I have been using rclone for quite a long time.
My only purpose is to back up my NAS and my computers to the cloud (ACD in my case) and keep the backup encrypted. We are talking about 4 TB of data.
I tested the restore process only at the beginning, in order to get an idea of how to perform it safely.
But then… after reading about your experience, I am very, very worried.

Where is the fault? In RClone? in ACD? in GDrive?
Anyway, regardless of where the fault lies, how can we safely check what we have in the cloud?

Regards.


#7

Hello Sonus,

If you have any data anywhere (no matter how it got there, via rclone or any other tool or means) and haven’t closely and conclusively verified it, I can basically guarantee that at least some of your data is corrupt.

We still don’t know. Basically my fault, given I haven’t had the leisure to continue the tests with @ncw since last week.

But if I had to guess, I would point the finger squarely at Google Drive. rclone is an extremely well written & maintained piece of software; I don’t think it could be corrupting any data except in very extreme and exceptional situations (e.g., GDrive behaving in undocumented and/or counter-intuitive/destructive ways in certain circumstances, etc.).

Just like you verify any kind of information: using hashes.
To be more precise:

  1. Calculate a hash for each and every local file that you will copy (or have already copied) to the cloud (google “find … -type f … -print0 | xargs -0 md5sum >hashfile.md5”; I’m pretty sure you will find many examples).
  2. Mount the remote cloud volume using “rclone mount”.
  3. Verify the calculated hashes on the mount point using “md5sum -c” plus the hash file generated in step 1.
  4. Investigate and fix any errors.
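
For anyone who prefers a script to the md5sum pipeline, steps 1 and 3 above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names (`build_manifest`, `verify_manifest`) are made up for this example, and the mount point is assumed to be an ordinary directory where “rclone mount” has already exposed the remote.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(local_root):
    """Step 1: map each file's path (relative to local_root) to its MD5."""
    root = Path(local_root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(manifest, mount_root):
    """Step 3: re-hash every manifest entry under mount_root.

    Returns three lists of relative paths: (missing, mismatched, unreadable).
    """
    root = Path(mount_root)
    missing, mismatched, unreadable = [], [], []
    for rel_path, expected in manifest.items():
        try:
            actual = md5_of_file(root / rel_path)
        except FileNotFoundError:
            missing.append(rel_path)       # the "not found" case
        except OSError:
            unreadable.append(rel_path)    # e.g. an I/O error while reading
        else:
            if actual != expected:
                mismatched.append(rel_path)
    return missing, mismatched, unreadable
```

A typical run would call `build_manifest` on the local volume once, keep the result, and then call `verify_manifest` against the mount point (a hypothetical path such as `/mnt/gdrive`) whenever a check is due. Anything in the three returned lists is a candidate for step 4.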

Remember: the only thing standing between you and complete data loss is your own diligence; keep at least three copies of everything and check them all periodically.

Cheers,

Durval.


#8

OK thanks, I will look for a script to do the job.

Meanwhile, since my fiber connection is 100/20 Mbps and it took more or less 20 days to upload 4 TB,
how long will the process of calculating the hashes take compared with a complete upload? Is it a matter of hours, or what?

Thanks a lot


#9

Hi Sonus,

The hash calculation is done locally (i.e., “md5sum”) with your local copy of the files, and should be as fast as your local disks. The hash verification (i.e., “md5sum -c”) is what is done remotely. As you have 5 times more download bandwidth than upload, verifying the hashes (which consists of downloading the data) should also be about 5 times as fast as the 20 days it took you to upload it, so we’re talking about around 4 days.

Of course it all depends on whether your data is concentrated in a few very large files or spread among a quintizillion very small files (the latter case would take much longer than the former, and could approach your original 20 days to upload), whether Amazon will put any throttle on you, etc.
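
As a rough sanity check on those numbers, here is the same estimate as a small calculation. It assumes decimal terabytes and a link that is fully saturated the whole time, which is a best case; per-file overhead and throttling are ignored.

```python
def transfer_days(size_tb, link_mbps):
    """Days needed to move size_tb terabytes over a link_mbps link at full rate."""
    bits = size_tb * 1e12 * 8            # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6)   # megabits per second -> bits per second
    return seconds / 86400               # seconds -> days


upload_days = transfer_days(4, 20)      # roughly 18.5 days on the 20 Mbps uplink
download_days = transfer_days(4, 100)   # roughly 3.7 days on the 100 Mbps downlink
print(f"upload: {upload_days:.1f} days, download: {download_days:.1f} days")
```

The ideal upload figure is close to the 20 days actually observed, which suggests the uplink was the bottleneck and that around 4 days is a reasonable lower bound for the download-based verification.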

Cheers,

Durval.


#10

OK.
Reading the rclone docs, I can see “MD5/SHA1 hashes checked at all times for file integrity”.
Does it mean that every time I perform a new upload, the uploaded file is checked with MD5?
If so, then file integrity should be guaranteed during a new upload.

If I am correct, then corruption can only occur afterwards… let me say in the following days/months/years.


#11

Excuse me,
can we use rclone check source: destination: for this purpose?


#12

Hi Durval,

Please look at the following posting I made earlier, about an issue I am still trying to figure out. This might be related:

It is very possible that the “other 15 files” that were not found on the remote are still actually at Google Drive, but Google is not giving back a complete list of the files when asked to by rclone… a possible bug in their API.

What happens when you search for these filenames in the Google Drive web interface?

In my case, the files were actually there, and depending on some other variables, they sometimes “go missing” when addressed through the GDrive API by other apps (not just rclone).

I have escalated this through Google Support for Apps (GSuite), and am still trying to get data, but the last word from Google was (basically) that they weren’t interested in my problem because it involves 3rd party apps.

The odrive folks have been somewhat helpful in tracking things down. I got sidelined on some other work, but will get back to this relatively soon.

– madison


Rclone not see all files in dir at Google Drive
#13

Oh, two other points:
Nick has been extremely helpful in trying to track this problem down as well!

And the other point is I haven’t seen any corruption on Google Drive after over a year of pushing files up there, and checking the md5sum of each and every file. This problem that I describe is a new one.


#14

Hello Sonus,

I don’t think so; if I’m understanding things right (@ncw, please correct me if I’m wrong), on an encrypted remote it would only be able to check the size and modification time, not the MD5 (and, on remotes which don’t support meaningful modification times like ACD, not even that, just the size).

In other words, “rclone check” would say everything is OK, but the contents of one or more files (which is what we’re interested in) could very well have been corrupted.
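
To illustrate why a size-only comparison is not enough, here is a tiny self-contained example. It is not how rclone works internally; it only shows that two byte strings of identical length can differ in content, which a size check cannot notice and a hash check can.

```python
import hashlib

original = b"backup payload, block 0001"
corrupted = b"backup payload, block 0OO1"  # same length, two bytes altered

# A size-only check sees no difference between the two...
same_size = len(original) == len(corrupted)

# ...while comparing MD5 digests exposes the altered content.
same_md5 = hashlib.md5(original).digest() == hashlib.md5(corrupted).digest()

print(f"same size: {same_size}, same MD5: {same_md5}")
```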

Cheers,

Durval.


#15

Yes that is correct.

A recent rclone will tell you how many files it couldn’t check the checksum for too.


#16

Hi @madison437 ,

Thanks for the heads-up. I just gave the issue you linked a read, and posted some questions there.

It is very possible that the “other 15 files” that were not found on the remote are still actually at Google Drive, but Google is not giving back a complete list of the files when asked to by rclone… a possible bug in their API.

This is an intriguing possibility. In fact, albeit not the worst problem here, I keep getting these in my more recent volume verifications using “md5sum -c” over “rclone mount”; I’ve investigated it further and it seems that so far these “not found” errors, contrary to the “failed to authenticate decrypted block - bad password” errors, are all transient: if I do a “rclone cat remote:remotedir/remotefile | md5sum -”, not only does the command succeed, but the printed checksum is also correct.

I’ve not tried to read these “not found” files over the “rclone mount” again (the above “rclone cat|md5sum -” is for me a better check), but I strongly suspect that, if I do, they will succeed and other files will then be reported as “not found”. I think this is the case because, after I “fixed” these files along with the “failed to authenticate decrypted block - bad password” ones by deleting them in the remote and copying them over again, I did another “md5sum -c” over the “rclone mount” and these files were not reported with any errors, but other files were (at which point I used the aforementioned “rclone cat|md5sum -” to verify those and, as they returned OK, I then moved on). I think these errors are serious, but not that serious, as the data does seem to be on Google Drive, at least in the specific cases I have gone through here.
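
For rechecking many files this way, the “rclone cat|md5sum -” pipeline could be wrapped in a small script. The sketch below is only one possible approach: the helper names are invented for this example, and it relies on nothing from rclone beyond the `rclone cat` subcommand already used above.

```python
import hashlib
import subprocess


def md5_of_command_output(command, chunk_size=1024 * 1024):
    """Run command and return (exit_code, hex MD5 of everything it wrote to stdout)."""
    digest = hashlib.md5()
    with subprocess.Popen(command, stdout=subprocess.PIPE) as process:
        # Stream stdout in chunks so large files never sit fully in memory.
        for chunk in iter(lambda: process.stdout.read(chunk_size), b""):
            digest.update(chunk)
    # Leaving the with-block waits for the process, so returncode is set here.
    return process.returncode, digest.hexdigest()


def recheck(remote_path, expected_md5):
    """Return True only if `rclone cat` exits cleanly and the streamed MD5 matches."""
    code, actual = md5_of_command_output(["rclone", "cat", remote_path])
    return code == 0 and actual == expected_md5
```

Calling `recheck` with a path such as `remote:remotedir/remotefile` and the checksum from the local .md5 file separates transient “not found” errors on the mount (where this returns True) from files that really are unreadable or corrupted on the remote.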

What really, really worries me here (and will have me doing these “md5sum -c” over my entire remote volumes for a long time) is the “failed to authenticate decrypted block - bad password” errors… these seem to indicate silent, file-level data corruption, and could lead to serious data loss if stumbled upon once the original data is lost or deleted (ie, on a restore from the Google Drive copy).

Cheers,
Durval.


#17

Hi Nick,

Thanks for the confirmation.

Humrmrmr… OK, so recent rclone keeps a specific counter for the files it can see in the source path and not on the remote path. This is good to know, thank you.

Cheers,
Durval.


#18

I meant that if, when checking src & dst files, rclone can’t verify that their checksums are the same, it will increment a counter.


#19

Hi,

Silent corruption is really bad…

What if we had some way to verify local: against remote: in an incremental way? Something like:

Rclone verify --dbpath=C:\4TB.db local: remote:

Then, each time rclone copy or rclone sync completes, we could run the above command to verify at least that there was no corruption during upload or download.

The local DB should store details like SHA-1 or MD5 hashes, the last state reached (so it can resume), etc.
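
To make the idea a bit more concrete, here is a minimal sketch of what the local side of such a database might look like, using Python and SQLite. Everything in it is hypothetical: no “rclone verify” command is described in this thread, the table layout and function names are invented for illustration, and only the hash cache is covered (comparing against the remote and storing resume state are left out).

```python
import hashlib
import sqlite3
from pathlib import Path


def open_db(db_path):
    """Open (or create) the hash database; the schema is an assumption of this sketch."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hashes ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, md5 TEXT)"
    )
    return conn


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_db(conn, local_root):
    """Hash only files that are new or whose size/mtime changed.

    Returns the relative paths that were (re)hashed on this run.
    """
    root = Path(local_root)
    rehashed = []
    for p in sorted(root.rglob("*")):
        if not p.is_file():
            continue
        rel = p.relative_to(root).as_posix()
        stat = p.stat()
        row = conn.execute(
            "SELECT size, mtime_ns FROM hashes WHERE path = ?", (rel,)
        ).fetchone()
        if row == (stat.st_size, stat.st_mtime_ns):
            continue  # unchanged since the last run, keep the stored hash
        conn.execute(
            "INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
            (rel, stat.st_size, stat.st_mtime_ns, md5_of_file(p)),
        )
        rehashed.append(rel)
    conn.commit()
    return rehashed
```

Skipping files whose size and modification time are unchanged is what makes the run incremental. The trade-off is that a local file silently corrupted without its metadata changing would keep its old stored hash, so an occasional full rehash would still be needed.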