Could rclone support a local cache of checksums to accelerate local/remote comparison?

First of all, thanks for building this…it’s precisely the tool I’ve been looking for to back up my Linux machine to OneDrive!

I think a nice feature would be the ability to generate a cache of checksums for local files, which could be used to accelerate subsequent calls to rclone. This would make the following sequence of operations much faster by avoiding redundant checksum computations on the local side:

rclone check --checksum cloud:/path /local/path
[user thinks about why the files might be different, and decides what to do]
rclone copy some files
[repeat as necessary]
rclone check --checksum cloud:/path /local/path
[see where we stand now]

I’m thinking that this might be accomplished with a single command-line argument…something like --local-file-checksum-cache=path/to/database/file. If the file is missing, it’s created; in either case, it’s written out when rclone terminates. The checksum DB would cache each file’s size and mod-time along with its checksum, and when a checksum is needed for a local file, the cached one is used if the size and mod-time still match.
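To make the idea concrete, here’s a rough sketch in Go (rclone’s language) of the lookup logic described above. This is not rclone code; the JSON cache format, the file names, and all identifiers are invented for illustration. The key point is that a cached checksum is reused only when the file’s size and mod-time both still match, and that the cache is loaded at startup and written back out on exit:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// cacheEntry is one record in the hypothetical checksum DB:
// the file's size and mod-time, plus the checksum they validate.
type cacheEntry struct {
	Size    int64  `json:"size"`
	ModTime int64  `json:"modtime"` // Unix nanoseconds
	MD5     string `json:"md5"`
}

// checksumCache maps a local file path to its cached entry.
type checksumCache map[string]cacheEntry

// md5For returns the MD5 of path, reusing the cached value when the
// file's size and mod-time are unchanged, and recomputing it (and
// refreshing the cache entry) otherwise.
func (c checksumCache) md5For(path string) (string, error) {
	info, err := os.Stat(path)
	if err != nil {
		return "", err
	}
	e, ok := c[path]
	if ok && e.Size == info.Size() && e.ModTime == info.ModTime().UnixNano() {
		return e.MD5, nil // cache hit: no need to read the file at all
	}
	// Cache miss (new file, or size/mod-time changed): hash the contents.
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	sum := hex.EncodeToString(h.Sum(nil))
	c[path] = cacheEntry{Size: info.Size(), ModTime: info.ModTime().UnixNano(), MD5: sum}
	return sum, nil
}

// loadCache reads the DB file if it exists; a missing file just means
// an empty cache, matching the "created if missing" behaviour above.
func loadCache(dbPath string) checksumCache {
	c := checksumCache{}
	if data, err := os.ReadFile(dbPath); err == nil {
		_ = json.Unmarshal(data, &c)
	}
	return c
}

// saveCache writes the DB back out, as would happen when rclone exits.
func saveCache(dbPath string, c checksumCache) error {
	data, err := json.MarshalIndent(c, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(dbPath, data, 0o644)
}

func main() {
	const dbPath = "checksum-cache.json" // stand-in for the flag's argument
	cache := loadCache(dbPath)
	sum, err := cache.md5For("somefile.bin")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	} else {
		fmt.Println(sum)
	}
	if err := saveCache(dbPath, cache); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

With something like this wired into the local backend behind the proposed flag, the first rclone check --checksum run would populate the cache, and later runs would only re-hash files whose size or mod-time had changed in between — which is exactly where the repeated check/copy/check workflow above spends its time.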


That is a good idea and clearly put.

I think there are some issues on GitHub with similar ideas, but I like the idea of making this a local file system option only - that simplifies things enormously.

Do you fancy making an option for this on GitHub and seeing if you can collect up links to all the other issues with similar ideas?


Great idea, and it would speed things up a lot.


Sure, happy to help. But I’m not sure exactly what you mean by “making an option for this on GitHub”…


D*** you autocorrect :wink: I meant…

Do you fancy making an issue for this on GitHub and seeing if you can collect up links to all the other issues with similar ideas?


I just want to chime in and say how useful this would be. I use rclone to back up a 2 TB drive; even if only one byte on that drive has changed, the backup takes over three days. It runs on a 2007 MacBook that works fine as a backup server, but calculating checksums over that much data is slow. Various other backups I run would probably also become about ten times faster.

Yes. I did this search:

is:issue is:open checksum

Reading through the results, I found these items of interest:



Those issues seem related, and I think the design jediry suggested would work well to address what has been discussed previously. There was some discussion of creating individual MD5 files, but it seems much better to put all the MD5 data in one file.

The underlying issue is that Amazon does not support modification times, which means we have to rely on checksums. But repeatedly redoing that local checksum work on large, mostly static files seems like a big waste of time. It should be easy enough to store the checksums locally for future use, as jediry suggested.

OK. I just logged issue https://github.com/ncw/rclone/issues/949
