`rclone md5sum` on a large Encrypted GDrive dirtree is generating multiple errors

What is the problem you are having with rclone?

Using rclone md5sum to recursively produce MD5 hashes for all files in a very large Encrypted Google Drive dirtree
produces multiple errors like the following:

2023/02/18 15:24:43 INFO  : can't close account: file already closed
2023/02/18 15:24:43 ERROR : REDACTED10/REDACTED11/REDACTED12.dd.gz: failed to copy file to hasher: failed to authenticate decrypted block - bad password?

So far, after almost 2 weeks running non-stop, having produced the MD5 hashes for 4.3M files (about 1/3 of my total 13M+ files), it has generated 21 'sets' of the above 2 lines.

So far, it has also generated (only once each) two other errors:

2023/02/19 08:50:12 ERROR : REDACTED20/REDACTED21/REDACTED22.JPG: failed to open file REDACTED20/REDACTED21/REDACTED22.JPG: open file failed: googleapi: Error 400: Bad Request, failedPrecondition

and

2023/03/01 10:49:56 ERROR : REDACTED30/REDACTED31/REDACTED32/REDACTED33/REDACTED34/REDACTED35/REDACTED36/REDACTED37/REDACTED38.pdf: failed to open file REDACTED30/REDACTED31/REDACTED32/REDACTED33/REDACTED34/REDACTED35/REDACTED36/REDACTED37/REDACTED38.pdf: open file failed: googleapi: Error 401: Invalid Credentials, authError

The strange thing is, if I run a separate rclone md5sum for these files,

Run the command 'rclone version' and share the full output of the command.

rclone v1.61.1

  • os/version: debian 11.1 (64 bit)
  • os/kernel: 6.0.0-4-amd64 (x86_64)
  • os/type: linux
  • os/arch: amd64
  • go/version: go1.19.4
  • go/linking: static
  • go/tags: none

Which cloud storage system are you using? (eg Google Drive)

Encrypted remote on top of a Google Drive remote.

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone -v --rc --rc-addr 127.0.0.1:5570 --transfers=8 --checkers=32 --low-level-retries=1000 --retries=1000 --drive-chunk-size=64m md5sum --download --output-file ~/ENCRYPTED_GOOGLE_DRIVE.md5 ENCRYPTED_GOOGLE_DRIVE:

The rclone config contents with secrets removed.

[ENCRYPTED_GOOGLE_DRIVE]
type = crypt
remote = GOOGLE_DRIVE:GOOGLE_DRIVE_ROOT
filename_encryption = standard
password = REDACTED00
password2 = REDACTED01

[GOOGLE_DRIVE]
type = drive
client_id = REDACTED06
client_secret = REDACTED07
token = {"access_token":"REDACTED08","token_type":"Bearer","refresh_token":"REDACTED09","expiry":"2023-02-16T12:18:56.541006812-03:00"}
root_folder_id = REDACTED10

A log from the command with the -vv flag

This is basically impossible due to the low frequency of these errors and the very, very large debug output that would be produced (I probably don't have the disk space to store it all!)

Without the debug log, those look like network errors I'd imagine.

What's the goal for generating those sums? If you re-upload the files to a crypt again, they'll change and the sums won't match local sums.

These errors are caused by corrupted blocks in the crypted data.

Exactly where the corruption comes from is hard to say. It could be:

  • file corrupted on drive - in which case it would do the exact same thing every time
  • file corrupted in rclone (unlikely, as Go is a memory-safe language, but not impossible)
  • memory corruption in your computer
  • network data corruption that somehow evaded TCP and ethernet checksum detection

Do these files do the same thing every time or do they only sometimes give the corrupted error?
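One quick way to test that (a sketch; bad_files.txt is a hypothetical list of the affected decrypted paths, one per line) is to re-read each file a few times and see whether the error repeats:

while IFS= read -r f; do
  if rclone cat "ENCRYPTED_GOOGLE_DRIVE:$f" > /dev/null; then
    echo "OK  $f"
  else
    echo "BAD $f"   # failing on every run points at corruption stored on Drive
  fi
done < bad_files.txt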

I don't know what that means! Did a file change?

That is probably a race condition between the token renewer and the hashing. Could be clock skew between you and Google maybe?

I think there were some words missing here :slight_smile:

I would run memtest86 on your computer for 24 hours just to rule out memory problems - they are a lot more common than you might think!

These sums are being calculated on the source ENCRYPTED_GOOGLE_DRIVE: remote, with the intention of verifying its copy to another Encrypted Google Drive using rclone md5sum -C. If anything changes, it will mean copying errors that I will have to investigate and fix.

If you have two encrypted remotes, or even the same one, a file being re-uploaded or copied gets freshly encrypted each time, so its md5sum on the remote changes each time.

That's why I was trying to understand what the flow was.

Two copies of the same file:

2023/03/03 16:01:12 DEBUG : hosts: Sizes differ (src 281 vs dst 510)
2023/03/03 16:01:13 DEBUG : hosts: md5 = 856528923ca12ece54ad1c1cbd1efd51 OK
2023/03/03 16:01:13 INFO  : hosts: Copied (replaced existing)

and

2023/03/03 16:07:28 DEBUG : hosts: md5 = 1f54a6cc2106b48c3c20465ab3733859 OK
2023/03/03 16:07:28 INFO  : hosts: Copied (new)
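To make the mechanics concrete (a sketch; the remote names are from this thread, and the encrypted directory name printed by cryptdecode stands in as a placeholder below): crypt generates a fresh random nonce for every upload, so the encrypted bytes (and the md5sum the underlying remote stores) change on every copy, while hashing the decrypted data with --download stays stable:

rclone copyto /etc/hosts ENCRYPTED_GOOGLE_DRIVE:test/hosts1
rclone copyto /etc/hosts ENCRYPTED_GOOGLE_DRIVE:test/hosts2
# find the encrypted names of the two copies
rclone cryptdecode --reverse ENCRYPTED_GOOGLE_DRIVE: test/hosts1 test/hosts2
# hashing the encrypted objects: the two sums differ
rclone md5sum GOOGLE_DRIVE:GOOGLE_DRIVE_ROOT/ENCRYPTED_TEST_DIR
# hashing the decrypted data: the two sums are identical
rclone md5sum --download ENCRYPTED_GOOGLE_DRIVE:test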

It's actually a VPS, but the underlying hardware is AMD EPYC and while it's theoretically possible to run these with non-ECC RAM, it's not very plausible -- and I specifically requested this VPS to be located in ECC-equipped hardware.

I'd have to pay extra in order to be able to mount a non-VPS-provider image on this VPS, and in my experience Memtest86 gives a lot of false negatives (ie, fails to show memory errors in known-bad machines). I used to run mprime95 concurrently with dledford-memtest on my own (physical) hardware, and IME they always succeeded in detecting bad hardware, very often in hardware that passed Memtest86 with no errors.

I just became aware of this: GitHub - stressapptest/stressapptest: Stressful Application Test - userspace memory and IO test; it's recommended on the Archlinux wiki page on Stress-testing[1] as a "synthetic" test, meaning "explicitly designed to torture the hardware as much as possible" (that same page classifies Memtest86 as a "light" test). I just installed it and am running it; let's see if it flags any issues in the next 48 hours or so.

[1] Stress testing - ArchWiki
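For reference, a typical invocation looks something like this (a sketch; the values are illustrative, see stressapptest --help for the full flag list):

# -s: seconds to run (here ~48h), -M: megabytes of RAM to exercise
stressapptest -s 172800 -M 8192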

You're correct, I started this paragraph meaning to complete this phrase with the results of some additional tests but never got to them, and then forgot to delete the incomplete phrase :man_facepalming: Please accept my apologies for the wasted time.

The rclone md5sum is now about 45% done; in about 2-3 more weeks it should finish, and then I will try to generate the md5sums for the affected files individually. I will keep this topic posted on the results.

The flow is:

  1. rclone (...) copy ENCRYPTED_GOOGLE_DRIVE: COPY_OF_ENCRYPTED_GOOGLE_DRIVE: (already finished)

  2. rclone (...) md5sum ENCRYPTED_GOOGLE_DRIVE: (still running)

  3. rclone (...) md5sum -C (...) COPY_OF_ENCRYPTED_GOOGLE_DRIVE: (to be run immediately after #2 above).

IME only files with copy problems should be flagged by #3.
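In concrete commands, the flow is something like this (a simplified sketch; the sum-file name is illustrative, and --download is needed in #2/#3 because crypt remotes can't serve md5s natively):

# 1. copy the source crypt remote to the destination crypt remote
rclone copy ENCRYPTED_GOOGLE_DRIVE: COPY_OF_ENCRYPTED_GOOGLE_DRIVE:
# 2. hash the decrypted source data
rclone md5sum --download --output-file source.md5 ENCRYPTED_GOOGLE_DRIVE:
# 3. check the destination's decrypted data against those sums
rclone md5sum --download -C source.md5 COPY_OF_ENCRYPTED_GOOGLE_DRIVE: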

That command doesn't work though:

rclone md5sum gcrypt:
2023/03/03 17:08:24 ERROR : two/one/hosts: hash unsupported: hash type not supported
2023/03/03 17:08:24 ERROR : one/one/hosts: hash unsupported: hash type not supported
2023/03/03 17:08:24 ERROR : hosts: hash unsupported: hash type not supported
2023/03/03 17:08:24 Failed to md5sum with 6 errors: last error was: hash unsupported: hash type not supported

Are you running something else?

It does if you pass the option --download, as shown in my original post (I elided it in the 'pseudo-command' I posted last in order to keep things simple).

I had no idea what you "..."ed out; I saw the original command and was trying to compare, as I try not to assume things...

So you are downloading the files, recalculating the md5sums for them.

Got it.

Good luck!

Not me, rclone :slight_smile:

Actually, just plain "calculating" -- no "re" as there was no previous calculation for most of them. For some I do have separate ".md5" files in the same directory as the files and calculated in the host system before the directory was uploaded to the remote. I will eventually check those ".md5" files, but right now my priority is to do a general check in my new (destination) account ASAP before my old (source) EDU account bites the dust for good -- it's already in "read-only" mode.

Thanks! Looks like I'm going to need it -- I've been seeing a general "shitification" of Google in general, and the Drive service specifically, over the last few years, accelerating as of late, so I just hope I can take my data out of there before it completes that process :wink:

Rclone does what you ask it to do as it's not sentient yet...

Semantics, but on the remote the files do have md5sums: they were calculated when uploaded, compared, and stored, so you are recalculating. But we're splitting hairs at this point, as I see your point of view :slight_smile:

I'm super insanely happy I moved to Dropbox. I'd pay a bit more for their service if the price goes up, as there are no quotas and it just generally works all the time for me. I figured at some point Google would catch up and enforce things, but judging by the posts I've been reading lately, the quality of the service is just so bad.

I'm trying to avoid doing other things with this remote in parallel to this rclone md5sum so as not to 'rock the boat', but I just did an rclone cat on some of these files and they all aborted with the same exact error -- so these are probably permanent errors.

This is looking more and more to be the case.

And all these files were checked when they were first put into Google Drive, about 6 years ago -- therefore the corruption happened while they were stored in Google Drive.

It seems Google Drive has been quietly 'eating' my data... it hasn't destroyed much so far ('only' 19 files in the ~6.5M md5sum'ed so far, or about 0.0003%), but this is one more incentive to get my data the F out of there ASAP.

I don't know what that (Bad Request, failedPrecondition) means! Did a file change?

I was pretty sure this particular file had not changed since I stored it many years ago, and I'm even more sure now after checking the file's modTime in the underlying remote:

rclone cryptdecode --reverse ENCRYPTED_GOOGLE_DRIVE: REDACTED20/REDACTED21/REDACTED22.JPG
    REDACTED20/REDACTED21/REDACTED22.JPG        REDACTED101/REDACTED102/REDACTED103

rclone lsl GOOGLE_DRIVE:REDACTED10/REDACTED101/REDACTED102/REDACTED103
       8743504 2017-07-09 16:02:03.000000000 REDACTED103

That is, Google Drive itself reports the (base) file as unmodified since 2017... so we can affirm that the file has not changed during the execution of the above rclone md5sum that reported this failedPrecondition error.

The good news is that I just tried calculating that file's data again:

rclone md5sum --download ENCRYPTED_GOOGLE_DRIVE:REDACTED20/REDACTED21/REDACTED22.JPG
    9de005985fda7216b4985d2a37f57f90  REDACTED22.JPG

So the error isn't permanent. Perhaps just some piece of crap that was stuck in one of Google's 'tubes' but has since come loose?

Clock skew is very improbable on my end, as:

  1. the machine runs ntpd
  2. this ntpd has been in sync since December 10th; between then and the time the error occurred (Feb 19th) there were no logs of anything going on with ntpd, so it almost certainly remained in sync (yes, I log and save everything that happens on this machine at syslog level DEBUG).

Therefore the only chance for any clock skew would be in Google's end -- which I don't think is significantly probable either.

The good news is, just like the other single error above, this one also seems to have been transient:

rclone md5sum --download ENCRYPTED_GOOGLE_DRIVE:REDACTED30/REDACTED31/REDACTED32/REDACTED33/REDACTED34/REDACTED35/REDACTED36/REDACTED37/REDACTED38.pdf
     0fe0e04974dcf14a65e3e67375a1fa84  REDACTED38.pdf

So, yay! :wink: It seems Google has indeed only eaten 19 of my files so far :frowning:
(all of which I also have on two different local disks -- so they can easily be restored)

When this is over, I will come back and report a final tally (and also post a "PSA: Google Drive is silently corrupting files!" topic to warn the unwary).

First of all, I apologize if I sound like I'm splitting hairs -- I'm only trying to set the record straight for those who will eventually come a-googling.

What I meant is, the MD5 calculations are being done by rclone and not by me -- as it would be if I were downloading the files and calculating the hashes myself, eg with rclone cat FILE | md5sum -, as I've done a lot of times in the past -- including before rclone md5sum was implemented (yes, I've been using rclone for that long).

IMHO "Semantics" is very important, as it concerns the meaning itself of the words -- but I agree with you, let's move any further discussion about this to PMs as I think the record is straight enough by now.

You know, you are my personal hero re: alternatives to Google since, a few months back, you pointed me towards Cloudflare email routing -- which has been working just about perfectly for me since then. So your recommending Dropbox carries a ton of weight for me. OTOH, I've been reading all sorts of things about Dropbox -- and some seem to indicate it's not really 'unlimited' anymore. But this is getting too much off-topic -- I will create a new topic for that and tag you there.

Thanks!

There are still some of us who don't have any issues :wink:

I'm glad to hear it, and I hope it continues to work well for you. Just be aware that the ox in the slaughterhouse line, when another ox two or three places in front of it gets felled by the hammer, could also be thinking just about the same thing, ie "nothing has happened to me so far, so no reason to worry!" :slight_smile:

Hmm, that is very bad.

How big are these files (roughly)?

The crypt format is broken up into 64k chunks, and I have a version of rclone which will write zeroes for the chunks with errors but otherwise carry on, so we could investigate how much of the file is corrupted if you want.

If it is something like a video file then a 64k chunk lost will cause a minor visual artifact. If it is something less resilient then a 64k chunk missing might make the whole file useless.
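For reference, the documented crypt file format is a 32-byte header (8 bytes of magic plus a 24-byte nonce) followed by chunks of up to 64 KiB of plaintext, each carrying a 16-byte Poly1305 authenticator, so a plaintext offset can be mapped to its chunk with a little arithmetic (a sketch; the offset is just an example):

offset=123456789                     # example plaintext byte offset
chunk=$(( offset / 65536 ))          # which 64 KiB chunk it falls in
enc=$(( 32 + chunk * (16 + 65536) + 16 + offset % 65536 ))
echo "plaintext offset $offset is in chunk $chunk, at encrypted offset $enc"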

Good news on the retries.

I look forward to the final score!

Here's a sorted list of the sizes for these 'bad password' files so far (21 files now, up from 19 as 2 more have cropped up since my last message): http://durval.com/xfer-only/20230306_rclone_bad_password_file_sizes.txt

So they are sized from ~9.5MB all the way to ~10GB, with about half of them over 1GB.

Yeah, it would be nice to have a look at these files and try to understand what happened. For all of them so far I have local copies, so I can cmp -b the F'ed version against the good one and have an idea as to the extent/location of the damage.
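Something along these lines should locate the damage (a sketch; the file names are illustrative, and cmp -l may be handier than -b here since it lists every differing byte offset, which can then be folded into 64 KiB chunk numbers -- note cmp -l offsets are 1-based):

cmp -l damaged_copy good_local_copy |
  awk '{ print int(($1 - 1) / 65536) }' |
  sort -n | uniq -c    # number of differing bytes per damaged 64 KiB chunk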

If it is something like a video file then a 64k chunk lost will cause a minor visual artifact. If it is something less resilient then a 64k chunk missing might make the whole file useless.

Here's a | sort | uniq -c list of their extensions:

  1 avi
  2 gz
  1 JPG
  3 mkv
  5 MOV
  1 mp4
  1 pdf
  6 slob
  1 vmdk

So, 10 of them are video files which should be mostly viewable (as long as the error areas don't include any critical headers or markers within the files), and the others are less resilient formats which would be completely unusable if I didn't have local copies for them.

I will be sure to keep this topic posted!

And BTW, thanks again for all your great help, and for making rclone available in the first place.

I just ^C'ed stressapptest after a little over 67h running, and the result was:

^CLog: User exiting early (1409823927 seconds remaining)
Stats: Found 0 hardware incidents
Stats: Completed: 5766200320.00M in 241479.67s 23878.62MB/s, with 0 hardware incidents, 0 errors
Stats: Memory Copy: 5766200320.00M at 23878.65MB/s
Stats: File Copy: 0.00M at 0.00MB/s
Stats: Net Copy: 0.00M at 0.00MB/s
Stats: Data Check: 0.00M at 0.00MB/s
Stats: Invert Data: 0.00M at 0.00MB/s
Stats: Disk: 0.00M at 0.00MB/s

Status: PASS - please verify no corrected errors

So, I think it's reasonably safe to assume the current machine isn't experiencing any sort of memory corruption.

This has a new flag --crypt-pass-bad-blocks which will output blocks which couldn't be authenticated as 64k of 0s.

It's possible that if bytes have been added to or removed from the file it will output an endless stream of errors, but hopefully it will just be a small number of blocks.

v1.62.0-beta.6770.65ab5b70c.fix-crypt-badblocks on branch fix-crypt-badblocks (uploaded in 15-30 mins)
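Usage would be something like this (a sketch; the path is illustrative): any 64k block that fails authentication comes out as zeroes instead of aborting the transfer, so you can then cmp the result against your known-good local copy as planned.

rclone copyto --crypt-pass-bad-blocks ENCRYPTED_GOOGLE_DRIVE:path/to/damaged.file /tmp/damaged.file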

Looks good to me :slight_smile:
