For the past week or so, whenever I rclone copy to googledrive and then instantly (a second later) cryptcheck the same folder I just uploaded... I get a mismatched hash for one of the files.

I am using rclone 1.53.1

This "bug" started around the same time I upgraded from rclone 1.51 to rclone 1.53.1. Of course I don't really consider it a bug at all. If anything the hashes were probably always mismatches and now 1.53.1 is better at finding that mismatch?

It is very odd though that I get one and only one mismatched hash per folder I rclone copy to googledrive... It doesn't matter if I upload 100 files or 1000 files, 100gb or 1000gb... I always seem to end up with a "hashes don't match local hash" error for exactly one file.

Another thing I personally did that could've triggered this: I started using --bwlimit for my cryptcheck. In the past I only bothered to bandwidth-limit my copy command, not my cryptcheck command.

Also, yes, I know running a check right after a copy is pointless because the copy itself performs its own check... so yeah, for YEARS I've done this for NO REASON AT ALL.... but now it's finally doing something :slight_smile: so yay me.

I figured --bwlimit would either work for cryptcheck or have no effect at all; it would be weird if this "error" were caused by --bwlimit.

I'm not sure how to go about getting a log file for this error though, because it only happens when I upload a large number of new files to googledrive over a span of many hours. In other words a verbose log file would be absolutely gigantic, and this "error" isn't repeatable in the sense that I can make it happen whenever I want. Roughly once every day or two it will happen, and it's happened 4 or 5 times now... so I believe it will keep happening, but it might take a dozen hours of uploading to see the error, it will only appear once, and when re-run everything works normally. (It's also possible I'm ddos'ing googledrive because I have 300 megabit internet speed... hence the bwlimit, which I now set to --bwlimit 30M but used to have at --bwlimit 40M.)

edit: and when I say the error isn't repeatable, I mean it's easily repeatable if I just re-run the cryptcheck, but if I delete the file off googledrive and re-run both the copy and then the cryptcheck, it doesn't reoccur... it seems almost pointless to post an error log for a cryptcheck that I 100% comprehend. What I'm more curious about is the error log from the copy command at the moment the corruption occurs, how such a thing is even possible, and how copy doesn't catch it with its post-transfer check?

Oh also this is the command:

D:\rclone-v1.53>rclone copy --bwlimit 30M -v --fast-list --drive-chunk-size 128M "D:\Stuff" "cleancachecrypt:D:\Stuff" && rclone cryptcheck --bwlimit 30M --one-way -v --fast-list --drive-chunk-size 128M "D:\Stuff" "cleancrypt:D:\Stuff"

I've been using this command (without the --bwlimit at least) for many years though so I doubt there's anything wrong with it (other than maybe the --bwlimit).

edit2: never mind, the --one-way cannot be causing the error, it's required for the command to work at all. Whoops, I'm tired :-p

Make sure there aren't any duplicates with rclone dedupe - duplicates can confuse rclone like this.
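
Something like this should do it (using the crypt remote and path from your command; dedupe's default interactive mode will ask before it changes anything):

rclone dedupe -v "cleancrypt:D:\Stuff"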

Can you try downloading a hash mismatch file with rclone and compare it with the original (run rclone md5sum on local, remote and locally downloaded file).
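
For example, something along these lines, with mismatched-file.ext standing in for whichever file failed:

rclone md5sum "D:\Stuff\mismatched-file.ext"
rclone md5sum "cleancrypt:D:\Stuff\mismatched-file.ext"

then download the file, run rclone md5sum on the downloaded copy as well, and compare all three.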

Is it possible that the local file was being modified while rclone was uploading it?

There can't be any duplicates unless the rclone copy command creates them because these files all have the date and time in them to prevent such a problem.

Also, that's smart - next time I'll run md5sum, although I 100% expect the md5sum is working properly and it's just that the file really is getting corrupted on google's side? I'll have to wait a few days (until I see this issue again) before I can try anything, because the second I submitted my question I stupidly deleted the corrupt file from the cloud and moved on.

The local files are actually moved into a temporary directory to prevent them from being modified; I just left the /temp parts out of the command for simplicity. That said, it's entirely possible Windows is accessing my files for no reason, without my permission, at random times.

the commands are using different remotes?
cleancachecrypt and cleancrypt

they are both different and not different :slight_smile:

edit: the cryptcheck command cannot be run against a cache remote at all, period. And cleancachecrypt is just a cache of cleancrypt.

you are using a cache backend?

I'm using the deprecated cache backend from an old rclone version, aka
this one: https://rclone.org/cache/
It's apparently deprecated and being phased out. It could be causing the copy error, although, as I said, it worked for many years right up until rclone 1.53. It's possible downgrading rclone would make this issue go away, but as I mentioned in my OP I worry that this issue always existed and previous versions of rclone were just not catching it, rather than not causing it.
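
For reference, the chain looks roughly like this in rclone.conf (the drive remote name and the path are simplified placeholders, not my literal config):

[gdrive]
type = drive
# OAuth credentials omitted

[cleancrypt]
type = crypt
remote = gdrive:backup
# passwords omitted

[cleancachecrypt]
type = cache
remote = cleancrypt: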

is there a specific reason you use a cache backend as a middle man for a local to crypt copy?

I want it to improve lsl speeds by keeping a copy of the directory information; I'm about 50% sure it works. Even if it doesn't work I'll probably keep using it, because that functionality is eventually going to be implemented in the VFS and then I'll swap over to the functional version.

rclone copy can create duplicates...

Actually I think that is probably not what is happening. I've investigated a few data corruption issues on drive, and I've never found that drive corrupted the file - Google have some smart engineers!

Hmm, I haven't heard of this problem and lots of people run cryptcheck after their copies, so if I had to point a finger I'd guess the cache backend...

I doubt that rclone copy is creating duplicates at the exact same moment in time - not that it can't do it in general - since in this case that's the only chance it has to do it.

If it does create a dupe then I'd expect it to match if you uploaded it in the same batch anyway...

Yeah, sorry to dangle the mystery and not be able to react immediately - I tried to get it to happen again today but it didn't. I will 100% remember, though, that if this happens again in the near future I should md5sum the local file, the remote file, and a downloaded copy of the remote file, to verify that any problem even exists in the first place.

Although, because I'm a terrible scientist, I've also stopped using --bwlimit with cryptcheck, under the superstitious belief that it was causing this issue.

Interesting,

md5sum on the remote returned "Failed to md5sum: hash type not supported". The local md5sum was 1525f0a2f3ee0bf79eee773f20b07d42.

cryptcheck said local was "ca2ad3e2234f69de28da5cb3112d006c" and remote was "9620b535ff3c5171f230c3b7b06d9408"

I guess the hash cryptcheck uses isn't md5sum.

This particular error was fabricated by me, though, when I accidentally rebooted in the middle of a long copy command. That copy command was transferring 4 files at once, so there's still no reason for just the 1 error - either 0 errors or 4 errors would make more sense.

edit: but clearly cryptcheck DOES use md5sum... I am confused why cryptcheck's local md5sum doesn't match a normal local md5sum... and why a normal remote md5sum returns an error while cryptcheck's remote md5sum just comes back wrong. Although this may have been a case of google simply ignoring partially transferred data during a ctrl-c style shutdown of rclone.

Can you see if you can download the file with rclone and compare the downloaded and the local file md5sums?

It is possible the file will not download properly.
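
Something along these lines, where D:\check is just a scratch folder and suspect-file.ext is whichever file cryptcheck flagged:

rclone copy -v "cleancrypt:D:\Stuff\suspect-file.ext" "D:\check"
rclone md5sum "D:\Stuff\suspect-file.ext"
rclone md5sum "D:\check\suspect-file.ext"

If the two md5sums differ then the data got mangled somewhere in the upload or download, rather than cryptcheck being wrong.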

Yes, should have thought of that, sorry! The crypt backend doesn't support hashes.

It is the hash of the encrypted data - cryptcheck reads the nonce from the remote file, encrypts the local file with it, and compares the hashes of the encrypted data, which is why its hashes don't match a plain md5sum of the unencrypted file.

Okay, the error happened again, and this time it was 2 errors in the same batch! Which is nuts, because every single other time it was only ever 1 error no matter how large the batch, and this batch of uploads was only 114 files.

cryptcheck says this:

"ef08af11d5b1c77b0a05f7972830debd" vs "5b5639556cb01e33b855a783eb196c3c"

This time I downloaded the file from the cloud and ran md5sum on the local file and on the newly downloaded copy from the cloud (the file that had just been uploaded and failed its cryptcheck)...

a5a469c33edccbf8cb4a1139204e84fc vs a0ab07bc86f9ce4c2d7609b48e67ef49

I guess that means the cloud copy is corrupt??? Oh, and during the download it said "Writing sparse files:" blah blah blah, a message I've never seen before. This is the first download I've done in 1.53 - usually I just upload, because I'm making a backup I'll almost never need. Does the sparse files message matter or mean anything at all?

Oh, and since I ran these md5sums during the cryptcheck... before it had finished... 2 more errors appeared... and I ctrl-c'd it, because I think running md5sum and cryptcheck on my machine at the same time produces nothing but errors. AKA I think this is a harddrive access speed issue? Perhaps? But even so, that makes it 100% a bug IMO. Although, cryptcheck is checking 8 files at a time, and I bet there's a flag I could use to check only 4 files at a time... and since going from 8 files to 9 files (since that's essentially what the md5sum did) increased the error rate dramatically, lowering the concurrent files on cryptcheck should reduce the error rate for me. Although this is still likely down to my faster internet? Now that my internet is about as fast as my hdd, the cloud side of the hash check for cryptcheck is much faster, which is what allows this "error"?

But wait... how can the cryptcheck error rate be increased by running too much at once if the downloaded md5sum doesn't match the local md5sum? That would mean it was the copy, not the cryptcheck, that had the error - and didn't report it? But... as corrupt files go... the corrupt file is flawless! And by flawless I mean it works, so it must only be off by some small number of bits? Hmm.

I was 100% sure the cryptcheck was what had the flaw, and I wrote the above based on that, but the md5sum PROVES that copy itself is what has the flaw/bug. So I'm clueless now. Unless md5sum is doing something silly like including the relative path in its calculation? Because obviously I put the downloaded, supposedly identical file in another folder...

Edit: I put the irrelevant nonsense in italics. Gonna run --transfers 2 and --checkers 4 from now on to see if that makes the error go away, even though I still don't comprehend the error.
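
So the command becomes roughly this - same as before, just with the new limits added and no --bwlimit on the cryptcheck anymore:

rclone copy --bwlimit 30M --transfers 2 --checkers 4 -v --fast-list --drive-chunk-size 128M "D:\Stuff" "cleancachecrypt:D:\Stuff" && rclone cryptcheck --checkers 4 --one-way -v --fast-list --drive-chunk-size 128M "D:\Stuff" "cleancrypt:D:\Stuff"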

Edit2: my hdd temperature gauge pinged 60C despite normally (even during rclone tasks) being 38-42C... I guess this is because I ran an md5 and cryptcheck simultaneously? Oh and I was running another batch tool that uses a ton of hdd at the same time too, an unrelated one.

Edit3: Current wild guess: the download copy command is what got corrupted, due to the insane hdd usage at the time, and this entire post is nonsense.

Edit4: Two errors on that cryptcheck - still this same batch that has been flawed this entire post - this time with only 4 concurrent checks. I wonder if that's because it's peak internet usage time? Or because my overall hdd usage is/was too high? It's strange that this error has gone from once every couple of days to infinitely repeatable while just trying to get a single fully valid cryptcheck to complete on these 100 files.

Edit5: Oh, wait, no - these are still probably the same errors from the original copy command, because this part of the alphabet is the part I hadn't reached yet when I hit ctrl-c during the original cryptcheck of the error-ridden copy.

Edit6: Finally completed the original command after 3 or 4 tries. Silly me, I had to delete all the imperfect cloud files to make it work, which proves, at least to me, in this moment, that cryptcheck is working fine and copy is somehow broken... although... these files still work even though their md5sum is different, so, what? Are they off by like 1 bit? Hmm. Could this be a windows10 sata driver issue? I know the sata controller drivers in windows10 are known to be buggy (I never had this issue on my old computer; I got a new computer about 4 months ago, although this bug only appeared about 1 month ago when I upgraded rclone to 1.53... but maybe the copy command has been flawed for 4 months and the cryptcheck md5sum has only been sensitive enough to notice this past month?)

Edit7: New set of 100 files, exactly one cryptcheck error. At least it's back to its only-one-error pattern. Fewer background programs were open this time though, so I'm not sure if that theory has merit :-/

This might deserve a double post. So I just uploaded another 70 files, using the previously discussed command. As I mentioned in edit7, cryptcheck found 1 hash mismatch, so I deleted that one file from the cloud and re-ran the command.

The upload copy command this time uploaded 1 and only 1 file - the one I deleted in the previous paragraph.

Cryptcheck found two hash mismatches though!!! I ran a complete cryptcheck, it found one bad file, I replaced that one bad file, and the next cryptcheck said two bad files...

This means I've got BOTH suppressed/hidden copy errors AND cryptcheck errors: I cryptchecked a file, cryptcheck said it was fine, then without altering that file at all I ran the cryptcheck again, and now it wasn't a match...

HDD temperature spiked to 48C though, when 42C is the highest rclone ever used to put it at. So I'm still on the theory that system resources are being taxed and rclone is lagging as a result and failing to read data... but... I'm now only uploading 2 at a time and checking 4 at a time instead of the default 4 and 8... And again, none of this was a problem back in rclone 1.51 a month ago...

Edit: The only wild guess I have left is... stop using cache. So I will - gonna run these commands without the cache remote for a while and see if the error comes back. After that the only idea I'd have left would be to downgrade rclone, but that might not "fix" whatever is wrong, it might just stop notifying me of a problem that still exists. If the error comes back regularly enough I'll start running -vv instead of -v and start up a log file. The error was extremely rare when it first showed up, but now it's so much more common that I think finding information in a giant log becomes a lot simpler.
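
If it does come back, the logging version of the command would look roughly like this (cache remote dropped as planned; the log file names are just whatever I end up picking):

rclone copy --bwlimit 30M --transfers 2 --checkers 4 -vv --log-file D:\rclone-copy.log --fast-list --drive-chunk-size 128M "D:\Stuff" "cleancrypt:D:\Stuff" && rclone cryptcheck --checkers 4 --one-way -vv --log-file D:\rclone-cryptcheck.log --fast-list "D:\Stuff" "cleancrypt:D:\Stuff"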

This inconsistency makes me think that the problem is a hardware problem with your computer - the fact that cryptcheck apparently gave two different answers on the same file is hard to explain otherwise.

RAM errors are reasonably common in modern computers. I would run memtest86 on your computer for 24 hours to check your RAM as a first step. RAM errors are at least reasonably easy to fix - buy more RAM!

The problem could also be your hard disk - you can run one of my programs stressdisk to check out your hard disk if the memtest86 runs clean.
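
Usage is roughly this - the directory just needs to be somewhere on the suspect disk with plenty of free space (see the stressdisk README for the exact details):

stressdisk run D:\stressdisk-tmp

It keeps writing and re-reading test files, comparing them as it goes, and reports any mismatches it finds.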

Actually, during this same exact period of time (but not before it) I've also had a lot of blue screens of death, some of them due to memory management problems.

Although google says that's actually almost always a vram/driver error, not a ram error... also I am using XMP ram at 3200mhz instead of the stock 2100mhz setting for non-xmp ram.

The weird thing though is I got a new computer in May, and all these problems started in August or so. I should definitely run memtest, that's a good suggestion. It is odd though - I've never had ram be both stable and unstable. I've had bad ram sticks before, but never ones that caused a crash once every week or two; always ones that either wouldn't turn on at all or caused a crash once every minute or two.

Edit: Is there any chance of a -vv log revealing more useful information? It would help a ton, for example, to know whether these were hdd or ram failures.

Edit2: One last glimmer of hope: I'm going to stop using bwlimit at all. This error started to occur almost at the same time as I started using that flag.

Edit3: Extremely small sample size, but I got rid of bwlimit and started -vv logging everything, and after another 1000 files or so, no more errors. I guess if this lasts another week or two I'll just conclude Edit2 was it, aka bwlimit was the weak link. If the problem comes back though, it's probably the ram.

Ram is more likely in my experience, but it could be either

I doubt this will help - if it does, I'd guess that somehow the access pattern triggers the hardware problem less often.