Introducing checksum to completed sync due to SnapRaid?

Nicoloks · May 30, 2020, 6:37am

Hi All,

So I did something a bit daft. A SnapRaid script I have was whining about there being heaps (as in over 100k) of files with a zero modification time. Without thinking, I foolishly ran the SnapRaid touch command to put a timestamp placeholder on these files so SnapRaid could better determine whether files were new or modified. As a result, my rclone script is now trying to upload multiple terabytes of data back up to my GoogleDrive for data that still resides on my NAS.

From my take, this is going to be an ongoing consideration for using SnapRaid on my NAS with rclone as my cloud backup solution. Two questions really:

Would using the checksum option with my sync command remove the dependence on filesize and modification time checks for rclone?
Assuming yes to above, how would I go about getting the checksums in place for files already on my GoogleDrive? I'm hoping I can start doing this without having to re-sync 28TB or so of data.

calisro · May 30, 2020, 12:49pm

3 main syncing methods

no flags - (size, modtime)
--size-only (size)
--checksum (size, checksum)

modifiers

--ignore-size makes all of the above skip the size check
--ignore-times - uploads unconditionally (no checks)

You can just use checksum and it should be fine. Drive already stores the checksums by default.
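
One way to picture the modes and modifiers listed above is as a single decision function. This is a simplified model in Python, not rclone's actual code: FileInfo, looks_changed and the mode strings are names made up for illustration, and the model leaves out details such as rclone fixing a timestamp in place instead of re-uploading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    size: int                  # bytes
    modtime: float             # seconds since the epoch
    md5: Optional[str] = None  # None when no hash is available


def looks_changed(src, dst, mode="default", ignore_size=False,
                  modify_window=0.001):
    """Return True if this simplified model treats src and dst as different.

    mode is one of "default", "size-only", "checksum", "ignore-times"
    (illustrative names standing in for the flags listed above).
    modify_window defaults to 1 ms, the threshold mentioned in this thread.
    """
    if mode == "ignore-times":
        return True  # upload unconditionally, no checks at all
    if not ignore_size and src.size != dst.size:
        return True  # every other mode checks size first
    if mode == "size-only":
        return False
    if mode == "checksum":
        if src.md5 is None or dst.md5 is None:
            return False  # no hash in common: behaves like size-only
        return src.md5 != dst.md5
    # default mode: size already matched, so compare modification times
    return abs(src.modtime - dst.modtime) > modify_window


local = FileInfo(size=100, modtime=1000.5, md5="aa")
remote = FileInfo(size=100, modtime=1000.0, md5="aa")
print(looks_changed(local, remote))                   # True: modtimes differ
print(looks_changed(local, remote, mode="checksum"))  # False: hashes match
```

In this model a file whose timestamp moved but whose content did not is flagged in the default mode and left alone under the checksum mode, which is the behaviour the question is asking about.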

ncw · May 30, 2020, 5:17pm

How do the files differ? Rclone should tell you if you run the sync with -vv.

Nicoloks · May 31, 2020, 6:02am

Thanks guys,

I'm even more lost now. I was sure it was the SnapRaid touch that was causing this. Anyway, I ran my sync script with the --dry-run and -vv switches and it found 376,000+ files with modified timestamps, however the differences were all under the 1ms threshold.

The only log entry of interest I could see was at the end:

2020/05/31 14:14:26 DEBUG : Encrypted drive 'GCrypt:/SecBackup': deleted 5451 directories

I'll run my backup again as normal tonight and see what happens, from my dry run there was nothing I saw which was flagged for upload.

ncw · May 31, 2020, 10:14am

Note that rclone fixes timestamps without having to re-upload the file on most cloud providers...

Nicoloks · May 31, 2020, 11:25am

Thanks again for the assistance. I'm really not sure what is going on here, it seems as if rclone believes a load of files have been deleted. I use the --backup-dir option with sync to ensure nothing ever gets truly deleted from the cloud side. Below is my complete command line I've been running on the same Windows 2012 server for almost spot on a year now.

rclone.exe sync M:\ GCrypt:/SecBackup --backup-dir=GCrypt:/SecArchive/%date:~-10,2%-%date:~-7,2%-%date:~-4,4% --buffer-size 16M --drive-chunk-size 64M --transfers 40 --checkers 2 --tpslimit=5 --bwlimit 3.5M --exclude-from .\excludes.txt -P --stats-log-level INFO --log-level DEBUG --log-file=.\log\Rclone_%time:~0,2%-%time:~3,2%_%date:~-10,2%-%date:~-7,2%-%date:~-4,4%.txt

Using the "Get size" feature on Rclone Browser I can see from my archive directories that from the 13th of this month my script started archiving off massive amounts of data. Of the 5 archive folders I can see, just shy of 3.7TB of data has been archived as per the --backup-dir option, however of the several dozen files I've spot checked they are 100% still on my NAS and are definitely no longer present in my main sync directory. Unless I'm mistaken, this means rclone thinks the files have been removed from the source path?

Two things I can think of have happened in this time:

I upgraded my NAS from OMV v3 to v4
I did actually delete a load (1.7TB max) of media files and tens of thousands of old email files from my NAS to reclaim some space.

Re-filtering my dry run log from earlier, I can see there seem to be a ton of entries for moving and copying files. A few things I notice:

At the very top of the log I can see all the new data that has been created on my NAS that has yet to be sync'd. These all have a single entry with a status of "Not copying as --dry-run"
There are 212k entries in total, 121k are for files that exist in directories that have been sync'd for some time and from what I can tell still exist on my NAS. I absolutely would never intentionally delete these files as they are 15yrs+ worth of photos and video from trips and family holidays and the like.
Looking through these 121k files I see there are 2 entries for each file. The first one says "Not moving as --dry-run", followed by "Not copying as --dry-run".
Scrolling towards the bottom of the log I see all the entries for the files I did actually delete from my NAS, which have a status of "Not moving into backup dir as --dry-run"

Any ideas on what I may have done to cause this?

ncw · May 31, 2020, 9:06pm

Did you rename a directory that had the files in? Or maybe it appears to have a different name now? That might do it.

You can try adding --track-renames to your command line then rclone will rename files that have been moved if that is the case.

Nicoloks · May 31, 2020, 10:15pm

Thanks, I'll give it a try. Do I need to have --checksum in place as well? I haven't used --track-renames before, but it sounds like something I really should have in place regardless, given the number of files I am syncing.

Edit
So I added in the --checksum option and I am seeing this log entry in a dry run.

NOTICE: Encrypted drive 'GCrypt:/SecBackup': --checksum is in use but the source and destination have no hashes in common; falling back to --size-only

Is there any way I can get rclone to generate the checksum on local and remote without having to upload again?

Animosity022 · June 1, 2020, 2:38am

You cannot do a checksum against a crypt remote as it's encrypted so the file hash is different each time.

You can use cryptcheck but that doesn't really hit your need though.
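
To see why a hash of the stored object is no use for comparison, here is a toy Python sketch. It is deliberately NOT rclone's crypt format and is not secure: toy_encrypt, toy_decrypt and keystream are made-up names, and the only property it shares with a real scheme is the assumption that a fresh random nonce is used for every encryption.

```python
import hashlib
import os


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive a toy keystream from key and nonce (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt with a fresh random nonce, stored in front of the ciphertext."""
    nonce = os.urandom(24)
    stream = keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    """Read the nonce back from the blob and undo the XOR."""
    nonce, body = blob[:24], blob[24:]
    stream = keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


key = b"not-a-real-key"
photo = b"same file contents" * 4

first = toy_encrypt(key, photo)
second = toy_encrypt(key, photo)

# Same plaintext, two uploads: the stored bytes (and so their MD5s) differ.
print(hashlib.md5(first).hexdigest() == hashlib.md5(second).hexdigest())
# Both still decrypt back to the identical original.
print(toy_decrypt(key, first) == toy_decrypt(key, second) == photo)
```

The MD5 that Drive stores is of the encrypted bytes, so it cannot be matched against an MD5 of the plain file on the NAS without first decrypting or re-encrypting with the same nonce.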

Nicoloks · June 1, 2020, 5:05am

Thanks for that clarification. I updated to rclone 1.52 and am currently doing a dry run with --track-renames set and --track-renames-strategy set to modtime rather than hash. Will see how it goes.

ncw · June 1, 2020, 8:49am

Sorry I forgot you were using crypt... You'll also need the --track-renames-strategy modtime flag so rclone checks the renames using the modtime rather than the hash.

Oh, I've just seen you've found that for yourself!

Nicoloks · June 1, 2020, 11:28pm

Thanks everyone again.

Not sure if it was a change from upgrading rclone 1.51 to 1.52, or from running with the --track-renames option. Regardless, in the last dry run I ran there were a LOT of entries showing a modified time difference greater than the 1ms rclone default threshold. I found this on the SnapRaid FAQ page:

Sets arbitrarely the sub-second timestamp of all the files that have it at zero.

This almost puts me back to my original thoughts of all these files being archived off because of the SnapRaid touch command. Supporting this, in my last dry run all of these new mod time differences greater than 1ms were under 1s.

I have now added --modify-window=1s to my sync script along with the --track-renames option and set a backup going last night. So far, after 12 hrs, it has sync'd 144GB, and 439 photo-related files totalling 1GB have been archived that I am not yet convinced should have been. I do notice that the sync script has reported 439 renamed files, which I suspect is connected. I'll need to look into the logs more closely in 4 days or so when it has completed the 1.185TB in the queue.

Question about rename tracking though. I take a LOAD of photos doing time lapse sequences, as in tens of thousands. I have several Sony cameras which all use the same naming convention of DSC12345. RAW files from these cameras will always be identical in size, and eventually the numbering will cycle around and I will have files with the same name and filesize. Mod time should still be different. Is there any significant risk in just relying on mod time in this scenario?

ncw · June 2, 2020, 8:01am

OK that makes perfect sense as a root cause of the problem - setting the sub second times would cause rclone to think the files have changed.

You are relying on the size and the modtime. So the files would have to be the same size and the same modtime to get confused. Rclone is ignoring the leaf name here.

Track renames only comes into play if you do actually rename or delete stuff - the way it works is that normally rclone would delete things that are surplus on the destination, but when using track renames it keeps a note of all the things it would delete and then tries to match them up with incoming files to save transferring them. So if you aren't in the habit of deleting stuff then there won't be any opportunity for track renames to do anything.

I suppose I'm just trying to say that it is only renamed/deleted files which will be checked and rclone checks the size and the modtime.

Do you have photos which might be the same size? What about the same modtime?
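
The matching described above can be sketched roughly as follows. This is a minimal Python illustration of the idea under the modtime strategy, not rclone's implementation: match_renames and the tuple layout are made up, the paths and sizes are invented examples, and modtimes are compared exactly here where a real tool would allow for a modify window.

```python
from collections import defaultdict


def match_renames(surplus_on_dest, incoming_from_source):
    """Pair would-be-deleted destination files with incoming source files.

    Both arguments are lists of (path, size, modtime) tuples.
    Returns (renames, uploads, deletes). The leaf name is ignored,
    only size and modtime are used as the matching key.
    """
    by_key = defaultdict(list)
    for path, size, modtime in surplus_on_dest:
        by_key[(size, modtime)].append(path)

    renames, uploads = [], []
    for path, size, modtime in incoming_from_source:
        candidates = by_key.get((size, modtime))
        if candidates:
            # Same size and modtime: move it on the remote, no upload needed.
            renames.append((candidates.pop(0), path))
        else:
            uploads.append(path)

    # Whatever was not claimed is still surplus (deleted, or moved to a
    # backup directory if one is configured).
    deletes = [p for paths in by_key.values() for p in paths]
    return renames, uploads, deletes


surplus = [
    ("2019/trip/DSC00001.ARW", 25_000_000, 1559347200.0),
    ("old/mail.eml", 4096, 1500000000.0),
]
incoming = [
    ("photos/2019/trip/DSC00001.ARW", 25_000_000, 1559347200.0),
    ("photos/new/DSC00001.ARW", 25_000_000, 1590969600.0),
]

renames, uploads, deletes = match_renames(surplus, incoming)
print(renames)  # [('2019/trip/DSC00001.ARW', 'photos/2019/trip/DSC00001.ARW')]
print(uploads)  # ['photos/new/DSC00001.ARW']
print(deletes)  # ['old/mail.eml']
```

In the example, two RAW files share a name and size but only the one with a matching modtime is treated as a rename, so a confusion would need both size and modtime to collide among files that are being moved or deleted in the same run.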

Nicoloks · June 12, 2020, 6:00am

ncw:

Do you have photos which might be the same size? What about the same modtime?

Uncompressed RAW photos from the same camera will always be the same size. Modtime should be different though. About the only scenario I can think of where this would present an issue is if I were to copy all my data from one NAS to another without preserving timestamps. Even then, the likelihood of two RAW photos from the same camera (same size) being copied at the same time is, I'd think, pretty slim.

Anyway, seems to be all sorted now. Lesson being, if you are using SnapRaid you need to set --modify-window=1s on your rclone sync command so that the SnapRaid touch command does not send rclone into a spin.
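
For anyone wanting to see the arithmetic behind that lesson, here is a small Python sketch. The function name and the sample timestamps are made up, and the strict less-than comparison is an assumption about how the window is applied rather than something confirmed in this thread.

```python
MS = 1_000_000          # one millisecond in nanoseconds
SECOND = 1_000_000_000  # one second in nanoseconds


def within_window(local_mtime_ns: int, remote_mtime_ns: int, window_ns: int) -> bool:
    """True if the two modtimes count as equal for the given window."""
    return abs(local_mtime_ns - remote_mtime_ns) < window_ns


# Remote copy was uploaded when the file's sub-second part was still zero.
remote = 1_590_000_000 * SECOND

# Hypothetical result of the touch: same second, arbitrary sub-second part.
local = remote + 437_000_000

print(within_window(local, remote, MS))      # False: looks modified at 1ms
print(within_window(local, remote, SECOND))  # True: ignored at 1s
```

Because the touch only fills in the sub-second part, the difference is always under one second, so a 1s window hides it while a genuine edit made more than a second later still shows up.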
