Rclone mount silently returning corrupted information?

durval · August 22, 2017, 4:48pm

Hello everyone,

This is a continuation of my "md5sum -c on a very large tree" tribulations which I initially reported here; I decided to open a new topic as it seems this is a new (and potentially much more serious) problem.

That last "md5sum -c" never finished, because I tried to update my remote copy with an "rclone sync" which then erroneously deleted/removed some files (see here and here if you are curious about the whole story).

After the sync issue was resolved and I had an updated copy of that tree on the remote, I started "md5sum -c" again on an rclone mount of it, with the following options:

rclone mount egd: ~/egd --low-level-retries=1000 --dir-cache-time 10m --max-read-ahead 256k

I'm having two different kinds of issues:

1. It is running very slowly: I did some projections and, at the current speed, it would take about 204 days(!) to finish. This seems really excessive, not only compared with my previous experience (where it took about a week), but also bandwidth-wise (the machine where I'm running this has about 5-6MB/s (50-60Mbps) sustained bandwidth to Google Drive, as reported by single rclone copy commands, for example).
2. I'm back to getting "read: connection reset by peer" errors reported by rclone mount (v1.37), and back to getting errors in my md5sum -c verification. These are not I/O errors but silent (from the reading process's point of view) data corruption, as follows:

Output from the above rclone mount process:

2017/08/21 17:45:07 ERROR : REDACTED/REDACTED_20170816Z163008.md5: ReadFileHandle.Read error: low level retry 1/1000: read tcp 172.17.0.2:52728->216.58.210.1:443: read: connection reset by peer
2017/08/22 01:09:13 ERROR : REDACTED/REDACTED/REDACTED/REDACTED/REDACTED/REDACTED/REDACTED.tar.gz: ReadFileHandle.Read error: low level retry 1/1000: unexpected EOF
2017/08/22 02:44:43 ERROR : REDACTED/REDACTED_20170816Z163008.md5: ReadFileHandle.Read error: low level retry 1/1000: read tcp 172.17.0.2:52901->216.58.210.1:443: read: connection reset by peer

Interestingly enough, the first and last lines refer to the file containing the MD5 sums (which is being passed as an argument to md5sum -c); the second line lists a completely unrelated file.

On the md5sum -c side, I have the following:

./REDACTED1/REDACTED2/REDACTED3/REDACTED4/REDACTED6/REDACTED7/REDACTED8/REDACTED9: FAILED
./REDACTED1/REDACTED2/REDACTED3/REDACTED5/REDACTED10/REDACTED11/REDACTED12/REDACTED14/REDACTED15: FAILED
./REDACTED1/REDACTED2/REDACTED3/REDACTED5/REDACTED10/REDACTED11/REDACTED12/REDACTED14/REDACTED16: FAILED
./REDACTED1/REDACTED2/REDACTED3/REDACTED5/REDACTED10/REDACTED11/REDACTED12/REDACTED13/REDACTED17: FAILED
./REDACTED1/REDACTED2/REDACTED3/REDACTED5/REDACTED10/REDACTED11/REDACTED12/REDACTED13/REDACTED18: FAILED
./REDACTED1/REDACTED2/REDACTED3/REDACTED5/REDACTED10/REDACTED11/REDACTED12/REDACTED13/REDACTED19: FAILED
./REDACTED1/REDACTED2/REDACTED3/REDACTED5/REDACTED10/REDACTED11/REDACTED12/REDACTED13/REDACTED20: FAILED

Curiously enough, none of the seven files above reported as FAILED (i.e., with wrong MD5 sums) is the second file in the "rclone mount" log above (the one with the "unexpected EOF" error), so these are potentially separate, unrelated issues.
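The cross-check described above (whether any file that md5sum -c reports as FAILED also shows up in the rclone mount error log) can be sketched in Python. This is only an illustration: the function names are hypothetical, the regular expressions assume the exact line formats quoted in this post, and it assumes md5sum -c was run from the root of the mount so that stripping the leading "./" makes the two sets of paths comparable.

```python
import re

# Assumed format, taken from the log excerpt above:
# "<date> <time> ERROR : <path>: ReadFileHandle.Read error: ..."
LOG_RE = re.compile(r"^\S+ \S+ ERROR : (?P<path>.+?): ReadFileHandle\.Read error: ")

# Assumed format, taken from the md5sum -c excerpt above: "<path>: FAILED"
FAILED_RE = re.compile(r"^(?P<path>.+): FAILED$")


def paths_with_read_errors(log_lines):
    """Return the set of paths that appear in rclone read-error log lines."""
    found = set()
    for line in log_lines:
        match = LOG_RE.match(line.strip())
        if match:
            found.add(match.group("path"))
    return found


def failed_paths(md5sum_lines):
    """Return the set of paths that md5sum -c reported as FAILED."""
    found = set()
    for line in md5sum_lines:
        match = FAILED_RE.match(line.strip())
        if match:
            path = match.group("path")
            # Drop the "./" prefix so paths line up with the rclone log.
            found.add(path[2:] if path.startswith("./") else path)
    return found


def overlap(log_lines, md5sum_lines):
    """Paths that had a logged read error and also failed verification."""
    return paths_with_read_errors(log_lines) & failed_paths(md5sum_lines)
```

An empty result from overlap() matches what is described here: the checksum failures and the logged read errors involve different files.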

Anyway, it seems that now I'm (or rather, md5sum -c is) getting corrupted information out of the "rclone mount" reads... and with no indication of I/O errors from it (which I think would be a must, in case rclone mount finds an uncorrectable situation). In other words, if it was another process instead of md5sum, I would be getting totally silent data corruption...

I can run the rclone mount with "-v -v", but that would generate an impossible amount of output (potentially more than the available disk space I have on the machine running it, and would also certainly make the whole thing even slower), so I would like to delay that as much as possible.

Thanks in advance for any hints,
-- Durval.

ncw · August 30, 2017, 2:06pm

One thing to bear in mind is that rclone will retry the ReadFileHandle.Read so even though you see an error in the log, the subsequent retry may be fine.

What exactly does "FAILED" mean for md5sum -c? Does it mean the checksum didn't match, or there was some sort of IO error?

durval · September 4, 2017, 4:08pm

Hi Nick,

Yep, you are right; just went through it again and I can confirm that. So there's no issue with the "connection reset" thing, it's Google Drive closing the connection unexpectedly on their end, so rclone just logs it and retries.

It means md5sum was able to calculate the checksum (i.e., no I/O errors, no "file not found", etc.: the file was there, was read entirely, and the checksum was calculated), but the checksum is not the same as the one recorded in the ".md5" file. In other words, the data that md5sum -c read from the file on the rclone mount directory is not the same as the data in the local file that was used to calculate the checksum recorded in the ".md5" file when the latter was generated; that is why I was calling it "silent corruption".

I/O errors (or any other issues with reading the file) would be reported properly as such on the md5sum -c output, for example ": read error", etc.
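To make the distinction concrete, here is a minimal Python sketch of the same three outcomes: a checksum that matches, a checksum that was computed but differs, and a read that failed outright. It is not how md5sum is implemented; the function names and the result strings are hypothetical and chosen only for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Read the whole file in chunks and return its MD5 as a hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_file(path, expected_md5):
    """Classify one file as OK, FAILED (mismatch), or READ ERROR."""
    try:
        actual = md5_of_file(path)
    except OSError as exc:
        # The read itself failed: this is the visible, non-silent case.
        return f"READ ERROR ({exc.strerror})"
    # The read completed without error; only the content can differ now.
    return "OK" if actual == expected_md5.lower() else "FAILED"
```

The "FAILED" branch is the worrying one here: the read returned without any error, yet the bytes differ from what was originally checksummed.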

Cheers,
-- Durval.

ncw · September 6, 2017, 2:39pm

OK. I suspect something is going wrong in the retry logic when rclone needs to retry. Can you make a simple reproducer for this?
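One possible starting point for a reproducer, sketched in Python under stated assumptions: read the same file from the mount several times and compare the digests of each pass. The function names and the default of three passes are hypothetical. Repeated reads may also be served from a cache rather than from the remote, so agreement between passes does not prove the data is correct; only disagreement is informative.

```python
import hashlib
from collections import Counter


def md5_of_file(path, chunk_size=1024 * 1024):
    """Read the whole file in chunks and return its MD5 as a hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_read_digests(path, passes=3):
    """Hash the same file `passes` times; count how often each digest occurs."""
    return Counter(md5_of_file(path) for _ in range(passes))


def is_consistent(digests):
    """True if every pass over the file produced the same digest."""
    return len(digests) == 1
```

If is_consistent() returns False for a file on the mount, two reads of the same file returned different bytes without raising an error, which is the behaviour under suspicion. Comparing each digest against the value stored in the ".md5" file would then show which pass was wrong.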

durval · September 8, 2017, 9:46am

Hello @ncw,

I will try and let you know.

Cheers,
-- Durval.
