OneDrive sync never completes with SHA-1 corrupted transfer with latest rclone

That's a great procedure @Ole, thanks :slightly_smiling_face:

I agree with the procedure and will do a test as soon as possible.

Unfortunately I am experiencing issues with my internet provider today; they seem to have gone into "weekend mode" and my connection keeps dropping.

I guess waiting for a stable connection will make the testing more reliable.


Hope they fix your internet asap!

@Malz Thinking more about it, you may want to add filters like these to the three commands:

--max-age=2022-01-01 --min-age=2022-03-31

The intent of --max-age is to reduce the test time by cutting a slice of your data.

The intent of --min-age is to make a (hopefully) stable slice that doesn't change between the sync and the two checks, and that will remain stable even if the binary compare is needed.
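For illustration, the three commands with these filters added might look somewhat like this - the local path and the crypt remote name ("secret:") are placeholders for your own setup:

rclone sync /local/data secret: --ignore-checksum --ignore-times --retries=1 --max-age=2022-01-01 --min-age=2022-03-31
rclone cryptcheck /local/data secret: --max-age=2022-01-01 --min-age=2022-03-31
rclone check /local/data secret: --download --max-age=2022-01-01 --min-age=2022-03-31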

Hope you can enjoy your weekend even without stable internet; this can easily wait :slight_smile:

Personally I also think this is a Microsoft backend issue.

Over the years I have very occasionally seen SHA-1 issues on upload, but they work on retry. Last Sunday, though, I saw 5 of them, and 1 file failed to upload after 3 attempts; yesterday I saw 2 more (both succeeded on retry, and those 2 files were under 100 MB in size!).

I have a tentative theory that Microsoft have some sort of eventually consistent backend and that the API call to get the checksum might have retrieved a value from the previous upload segment. During periods of heavy upload activity the backends may just take a little too long to sync up.

If the above tests work, then I wonder whether, on getting a checksum mismatch, it might be worth having rclone pause a short while and retry the checksum fetch, and only retry the whole upload if that fails?
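One way to poke at that theory by hand (just a sketch, not anything rclone does internally) would be to re-query the remote SHA-1 of a freshly uploaded file a few times with a pause in between, against the underlying (non-crypt) OneDrive remote, and see whether the reported value changes - the path below is a placeholder:

for i in 1 2 3; do
    # query the hash stored by the underlying OneDrive remote, not the crypt wrapper
    rclone sha1sum 'OneDrive:path/to/just-uploaded-file.bin'
    sleep 30
done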

I don't trust uploading large files (and some cloud providers can't handle them anyway), so I split my files into 5 GB chunks by piping the output of dump into split -b 5G - ${file}_ so I end up with things like

20220403._datadisk.0.gz_aa
20220403._datadisk.0.gz_ab
20220403._datadisk.0.gz_ac
20220403._datadisk.0.gz_ad
20220403._datadisk.0.gz_ae
20220403._datadisk.0.gz_af
20220403._datadisk.0.gz_ag
20220403._datadisk.0.gz_ah
20220403._datadisk.0.gz_ai
20220403._datadisk.0.gz_aj
20220403._datadisk.0.gz_ak
20220403._datadisk.0.gz_al
20220403._datadisk.0.gz_am
20220403._datadisk.0.gz_an
20220403._datadisk.0.gz_ao

So, in my case the issue is (normally) only annoying because it (normally) gets fixed on the first retry; it's only 5 GB being retried and not 75 GB!
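For reference, the chunking pipeline described above is roughly this - the dump and gzip stages stand in for whatever actually produces the backup stream, and /datadisk is a placeholder:

file=20220403._datadisk.0.gz
# level-0 dump to stdout, compress, then cut into 5 GB pieces named ${file}_aa, _ab, ...
dump -0 -f - /datadisk | gzip | split -b 5G - "${file}_"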

@sweh Thanks a lot, good to know and a very likely explanation; we will keep an eye on that.

If your hypothesis is correct, then we will be unable to find any (unexplainable) checksum errors in the above test, but may see the issue again if we remove --ignore-checksum in the sync (and perform it during local office/peak hours).

@automaton82 Do you still have the full log used for this extract?

If so, then I would like to see if you experienced three consecutive checksum errors on the exact same file. I don’t need the full log file, just the lines with "ERROR :" - that is something like grep "ERROR :" yourLogFile

I suspect that is unlikely to work: if I were writing the code, I'd cache the sha1sum in the object metadata, as it is quite expensive to calculate. So my guess would be that if it is wrong once it will always be wrong. We'll see!

Right, but the backend may not be immediately consistent. So we do a "push final segment" call, then a "get hash" call, and if those two calls get load-balanced to different backend servers then the second call might retrieve stale data... but some seconds later it might get into sync, or we might hit a server that has already been synced.

And today was even worse.

2022/04/03 12:45:19 ERROR : penfold/20220403._FastData.0.gz_ag: Failed to copy: corrupted on transfer: sha1 crypted hash differ "e43bee9ac8c35a64b7cc3c1ad62ac3014b12841c" vs "c33bffe167ee08f8d85d621dce5d776387a41745"
2022/04/03 12:52:30 ERROR : penfold/20220403._FastData.0.gz_ai: Failed to copy: corrupted on transfer: sha1 crypted hash differ "36c70e34d287bb1f2c37395d129ffe8d41c8a2fd" vs "6a6c54a5d59374a154a1f5eb5a9a72dce2cc26df"
2022/04/03 12:54:00 ERROR : penfold/20220403._FastData.0.gz_aj: Failed to copy: corrupted on transfer: sha1 crypted hash differ "26df6d138abb93e599be919f18f3ae503c8d30bb" vs "04aa84287a4ad134e6f281744e126eb4fba3c9e2"
2022/04/03 12:57:04 ERROR : penfold/20220403._FastData.0.gz_al: Failed to copy: corrupted on transfer: sha1 crypted hash differ "91835069dd197b79c3fc80eb1996f8a8c3e8a9dc" vs "19c425a9f997f616a0232ec8aee76665da2af21b"
2022/04/03 13:13:37 ERROR : Attempt 1/3 failed with 4 errors and: corrupted on transfer: sha1 crypted hash differ "91835069dd197b79c3fc80eb1996f8a8c3e8a9dc" vs "19c425a9f997f616a0232ec8aee76665da2af21b"              
2022/04/03 13:24:40 ERROR : penfold/20220403._FastData.0.gz_aj: Failed to copy: corrupted on transfer: sha1 crypted hash differ "666c89a04e43bef8502d7c27198da3f5fc72f5e5" vs "ba81222e40b8509b554697f9b03b9b4c59c80921"
2022/04/03 13:24:55 ERROR : penfold/20220403._FastData.0.gz_ag: Failed to copy: corrupted on transfer: sha1 crypted hash differ "fe9c8b9ebaf512cac3a8ef74bcf5b5a5e28981ab" vs "5008bfd43fa526566d221897d1d6bf7cbf2096f2"
2022/04/03 13:25:08 ERROR : Attempt 2/3 failed with 2 errors and: corrupted on transfer: sha1 crypted hash differ "fe9c8b9ebaf512cac3a8ef74bcf5b5a5e28981ab" vs "5008bfd43fa526566d221897d1d6bf7cbf2096f2"              
2022/04/03 13:35:59 ERROR : penfold/20220403._FastData.0.gz_ag: Failed to copy: corrupted on transfer: sha1 crypted hash differ "d976cd211051bbb71961716aa6786f98c6bfcd71" vs "1741774b18a839226697804c6b7e4186ac6ccbdb"
2022/04/03 13:36:11 ERROR : penfold/20220403._FastData.0.gz_aj: Failed to copy: corrupted on transfer: sha1 crypted hash differ "79abd71a54a5d852d539f604a98b72b737ad4e87" vs "f7a8a362f3323ef585e53f3450c9165fdbc54fd6"
2022/04/03 13:36:11 ERROR : Attempt 3/3 failed with 2 errors and: corrupted on transfer: sha1 crypted hash differ "79abd71a54a5d852d539f604a98b72b737ad4e87" vs "f7a8a362f3323ef585e53f3450c9165fdbc54fd6"              
2022/04/03 13:36:11 INFO  :
Transferred:      199.592Gi / 199.592 GiByte, 100%, 31.390 MiByte/s, ETA 0s     
Errors:                 2 (retrying may help)           
Checks:              2327 / 2327, 100%
Deleted:               18 (files), 0 (dirs)
Transferred:           47 / 47, 100%                    
Elapsed time:   1h57m22.0s

(I'm guessing the local checksum is different each time because it's a crypt remote).

Of course the "corrupted" file gets deleted, so I can't see what Microsoft really thinks the sum is.

% rclone ls OneDriveBackups:penfold | grep 20220403._Fast
...
5368709120 20220403._FastData.0.gz_ah
5368709120 20220403._FastData.0.gz_ai
5368709120 20220403._FastData.0.gz_ak
...

@sweh Interesting, I guess you can troubleshoot and test your theory with something like this:

rclone sync … --ignore-checksum --ignore-times --retries=1
rclone cryptcheck …
rclone check … --download

You can make this function return early to test that.

I keep wondering whether rclone should be doing this at all though. Maybe it should be a flag?

@ncw I would (naively) expect the file to be visible in the OneDrive recycle bin (or in the version history).

Is it deleted before being "committed" - or am I missing something?

It is deleted after it is committed, so it should be in the recycle bin or version history if that is how you've got things configured.

@sweh Based on the above it seems like you will be able to find the "corrupted" files in your OneDrive recycle bin using the web interface - unless you have a business account with a hard-delete setting that I am unaware of.

You can probably narrow the search significantly because you know the exact time of deletion from your rclone log. I guess your filenames are encrypted, so you may need rclone cryptdecode to pinpoint the right file(s).
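For example, something along these lines ("secret:" stands in for your crypt remote, and the file names are purely illustrative):

# decrypt an encrypted name as shown in the OneDrive web interface / recycle bin
rclone cryptdecode secret: p5slkfj0aax7examplename
# or go the other way and encrypt a known plain name so you can search for it
rclone cryptdecode --reverse secret: penfold/20220403._FastData.0.gz_ag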

I'm not encrypting filenames for this remote 'cos I wanted it to be easy to restore files in case of an issue. So I can find the deleted object easily enough. But I need to be careful 'cos a restore could overwrite the (now correctly sync'd) good copy of the file.

OK, I now have a mess in my backup area :slight_smile:

$ rclone ls OneDriveBackups:penfold | grep 20220403._Fast.*ag
5368709120 20220403._FastData.0.gz_ag
5368709120 20220403._FastData.0.gz_ag (1)
5368709120 20220403._FastData.0.gz_ag (2)
5368709120 20220403._FastData.0.gz_ag 1

Let's copy a couple of them, starting with the known good one:

$ rclone copy OneDriveBackups:penfold/20220403._FastData.0.gz_ag . 
$ sha1sum 20220403._FastData.0.gz_ag /BACKUP/penfold/20220403._FastData.0.gz_ag
3b6d0ccff3720afab3cf95a31109ab7617e1e79b  20220403._FastData.0.gz_ag
3b6d0ccff3720afab3cf95a31109ab7617e1e79b  /BACKUP/penfold/20220403._FastData.0.gz_ag

Good.

Let's try a "bad" one...

Well, that's interesting!

$ rclone copy OneDriveBackups:penfold/20220403._FastData.0.gz_ag\ 1 .
2022/04/04 15:04:54 ERROR : 20220403._FastData.0.gz_ag 1: Failed to copy: multipart copy: read failed: failed to authenticate decrypted block - bad password?
2022/04/04 15:04:54 ERROR : Attempt 1/3 failed with 1 errors and: multipart copy: read failed: failed to authenticate decrypted block - bad password?
2022/04/04 15:07:13 ERROR : 20220403._FastData.0.gz_ag 1: Failed to copy: multipart copy: read failed: failed to authenticate decrypted block - bad password?
2022/04/04 15:07:13 ERROR : Attempt 2/3 failed with 1 errors and: multipart copy: read failed: failed to authenticate decrypted block - bad password?
2022/04/04 15:08:26 ERROR : 20220403._FastData.0.gz_ag 1: Failed to copy: multipart copy: read failed: failed to authenticate decrypted block - bad password?
2022/04/04 15:08:26 ERROR : Attempt 3/3 failed with 1 errors and: multipart copy: read failed: failed to authenticate decrypted block - bad password?
2022/04/04 15:08:26 Failed to copy: multipart copy: read failed: failed to authenticate decrypted block - bad password?

Let's try another of the bad files!

$ rclone copy OneDriveBackups:penfold/20220403._FastData.0.gz_ag\ \(1\) .
2022/04/04 15:12:14 ERROR : 20220403._FastData.0.gz_ag (1): Failed to copy: multipart copy: read failed: failed to authenticate decrypted block - bad password?
2022/04/04 15:12:14 ERROR : Attempt 1/3 failed with 1 errors and: multipart copy: read failed: failed to authenticate decrypted block - bad password?
2022/04/04 15:12:27 ERROR : 20220403._FastData.0.gz_ag (1): Failed to copy: multipart copy: read failed: failed to authenticate decrypted block - bad password?
2022/04/04 15:12:27 ERROR : Attempt 2/3 failed with 1 errors and: multipart copy: read failed: failed to authenticate decrypted block - bad password?
2022/04/04 15:12:44 ERROR : 20220403._FastData.0.gz_ag (1): Failed to copy: multipart copy: read failed: failed to authenticate decrypted block - bad password?
2022/04/04 15:12:44 ERROR : Attempt 3/3 failed with 1 errors and: multipart copy: read failed: failed to authenticate decrypted block - bad password?
2022/04/04 15:12:44 Failed to copy: multipart copy: read failed: failed to authenticate decrypted block - bad password?

And the files are definitely not complete:

$ ls -l
total 8210492
-rw-r--r-- 1 sweh sweh 5368709120 Apr  3 03:09 20220403._FastData.0.gz_ag
-rw-r--r-- 1 sweh sweh 4223336448 Apr  4 15:12 20220403._FastData.0.gz_ag (1)
-rw-r--r-- 1 sweh sweh 4084137984 Apr  4 15:08 20220403._FastData.0.gz_ag 1

So the files are definitely corrupt in OneDrive!

Great testing - thank you.

So it looks like there is a corruption in the file such that crypt found a bad block. (The authenticator is very strong in crypt blocks.)

Can you download the encrypted file, which should download OK (I conjecture its SHA-1 will match the data), and then see what the corruption looks like? This will probably be quite difficult, but you have the offset of where it starts, as it is the size the partial file had reached when it failed. Hopefully it will be ASCII text saying "OneDrive destroyed the data here", but any data which doesn't look completely random will be the corruption.

I guess there could just be data missing, which would be impossible to spot.

So the question is, is there any way rclone could have caused this corruption, or are we sure it is a OneDrive bug?

Were these files uploaded straight through with no low-level retries, do we know?

$ rclone copy OneDriveSweharris2:Backups/penfold/20220403._FastData.0.gz_ag\ 1.bin .
$ rclone copy OneDriveSweharris2:Backups/penfold/20220403._FastData.0.gz_ag\ \(1\).bin .
$ rclone copy OneDriveSweharris2:Backups/penfold/20220403._FastData.0.gz_ag\ \(2\).bin .
$ sha1sum *
5008bfd43fa526566d221897d1d6bf7cbf2096f2  20220403._FastData.0.gz_ag (1).bin
c33bffe167ee08f8d85d621dce5d776387a41745  20220403._FastData.0.gz_ag (2).bin
1741774b18a839226697804c6b7e4186ac6ccbdb  20220403._FastData.0.gz_ag 1.bin

We can see these match the errors:

2022/04/03 13:24:55 ERROR : penfold/20220403._FastData.0.gz_ag: Failed to copy: corrupted on transfer: sha1 crypted hash differ "fe9c8b9ebaf512cac3a8ef74bcf5b5a5e28981ab" vs "5008bfd43fa526566d221897d1d6bf7cbf2096f2"

2022/04/03 12:45:19 ERROR : penfold/20220403._FastData.0.gz_ag: Failed to copy: corrupted on transfer: sha1 crypted hash differ "e43bee9ac8c35a64b7cc3c1ad62ac3014b12841c" vs "c33bffe167ee08f8d85d621dce5d776387a41745"

2022/04/03 13:35:59 ERROR : penfold/20220403._FastData.0.gz_ag: Failed to copy: corrupted on transfer: sha1 crypted hash differ "d976cd211051bbb71961716aa6786f98c6bfcd71" vs "1741774b18a839226697804c6b7e4186ac6ccbdb"

So Microsoft's hashes appear to be correct.

The files are all of the right size (including the "good" version for comparison).

-rw-r--r-- 1 sweh sweh 5370019872 Apr  3 03:09 20220403._FastData.0.gz_ag (1).bin
-rw-r--r-- 1 sweh sweh 5370019872 Apr  3 03:09 20220403._FastData.0.gz_ag (2).bin
-rw-r--r-- 1 sweh sweh 5370019872 Apr  3 03:09 20220403._FastData.0.gz_ag 1.bin
-rw-r--r-- 1 sweh sweh 5370019872 Apr  3 03:09 20220403._FastData.0.gz_ag.bin

So I ran strings on 20220403._FastData.0.gz_ag 1.bin and looked for any string longer than 30 characters (grep .....). And... well...

C6jzLl8yC0Necw7InUZ9k9vBtR/nwXJd3BPxqpN1mBZ1vY88z2XEnyGn9kbcDxPdLXUIr9OwICiNcf3dWPiPZ1/klN/AvP8n28
....
....
....
ps7zhrsWCdht11NRF2gLHqunFhPL5Fwhm6kbKGwg39e24TZQ++q2xjVIf5pRdeX8FNt7KQceFNZ/79K+G  

That block is around 6934 bytes long, all on one long line, at offset 248199391.

Similarly 20220403._FastData.0.gz_ag (1).bin had, at offset 16753810, a chunk of around 9961 characters, and 20220403._FastData.0.gz_ag (2).bin had, at offset 252537251, a chunk of 5860 bytes.
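For anyone wanting to repeat this, something like the following prints the printable runs of 30+ bytes together with their decimal offsets (not necessarily the exact command I used above):

strings -t d -n 30 '20220403._FastData.0.gz_ag 1.bin' | less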

I have no idea if there were low-level retries. But this looks like it might be base64 that was either double-encoded by rclone or not decoded by OneDrive.


That was very smart :slight_smile:

There is approximately zero chance of that occurring in crypted data (well, about 1 in 10^4174), so that is definitely a corruption.

It would be worth looking with a hex editor around that point to see if you can see any other clues - maybe there are some shorter strings like HTTP headers or something like that.
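A quick way to do that without a full hex editor might be something like this (the window size is arbitrary; the offset is the one reported above):

# print ~4 KiB around the reported offset as hex + ASCII
dd if='20220403._FastData.0.gz_ag 1.bin' bs=1 skip=$((248199391 - 2048)) count=4096 2>/dev/null | xxd | less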

I think we ascertained above that we don't need low-level retries to trigger this problem.

Rclone doesn't encode the data when sending to OneDrive - it just sends plain binary data, so I'm struggling to understand where that base64 string comes from.

OneDrive uses big base64 blobs for tokens, and it also uses them for path names when uploading, hence my request to look around the corruption and see what you can see.

The good news is that OneDrive's SHA-1s are correct, which means that rclone successfully detected the corruption in transfer and retried it.

Correct, based on the logs in the post above:

rclone_encrypted.log contains 2 corruptions prior to any low-level retries. (encryption, --transfers=1, corrupted file is 134 MB)

rclone_unencrytped.log contains several corruptions, with some low-level retries in between. The low-level retries do not affect the corrupted files. (no encryption, --transfers=1, corrupted files are 134 MB)

@Malz may be able to recover the corrupted files from the OneDrive recycle bin, if needed.