Copy/sync --checksum --download

Hi,

I'd like to request that rclone check's --download flag be made available to the checksum validation process for rclone copy and rclone sync.

The basic problem I'm trying to solve is a multi-stage synchronization with an SFTP server in the middle that does not permit shell logins (so no checksum support). Something like source --> server --> many remote nodes. The remote nodes use date+size to identify changed files.

Unfortunately, the source in my case is a CI system that creates fresh copies of the various files, so their created/last-modified times differ even when the contents are identical to the previous ones. File size alone isn't enough to ensure that the contents haven't changed. We don't want to just blindly re-upload, though, since that would change the modification time on the server and cause all of the many remote nodes to perform unnecessary copies.

Fortunately for me, the source-->server link is on a fast network, so we can tolerate the cost of running rclone check --download to identify changed files without server-side hashing support. Unfortunately, I end up having to run a multi-stage process: rclone check, followed by separate rclone copy and rclone delete commands with explicit file lists. It would be very convenient if the comparison operations performed by rclone sync could use all the features of rclone check.

Thanks,
-Will
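
For readers with the same problem, the multi-stage workaround described above might look roughly like the sketch below. It is not the original poster's script. The function name `sync_by_content` and the temporary file names are made up for illustration, and it assumes `rclone check` writes the `--differ`, `--missing-on-dst` and `--missing-on-src` report files even when they end up empty.

```shell
# Sketch only: content-based sync built from check + copy + delete.
# $1 = source directory, $2 = destination (e.g. SFTP_remote:path).
sync_by_content() (
    src=$1
    dst=$2
    work=$(mktemp -d) || exit 1
    trap 'rm -rf "$work"' EXIT

    # --download compares contents by fetching both sides, so the server
    # needs no hashing support. check exits non-zero when it finds
    # differences, which is the normal case here, hence "|| true".
    rclone check "$src" "$dst" --download \
        --differ "$work/differ" \
        --missing-on-dst "$work/new" \
        --missing-on-src "$work/stale" || true

    # Upload list: files whose contents changed plus files not yet on the server.
    cat "$work/differ" "$work/new" > "$work/upload" || exit 1

    # Skip each step when its list is empty, so nothing is touched needlessly.
    if [ -s "$work/upload" ]; then
        # --ignore-times transfers the listed files unconditionally.
        rclone copy "$src" "$dst" --files-from "$work/upload" --ignore-times
    fi
    if [ -s "$work/stale" ]; then
        # Remove files that no longer exist in the source.
        rclone delete "$dst" --files-from "$work/stale"
    fi
)
```

It would be called as something like `sync_by_content /ci/output SFTP_remote:builds`, where both paths are placeholders. Note that `|| true` also hides genuine check failures, so a production script would want to inspect the exit status more carefully.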

Until such functionality exists, you could make your life much easier by using the hasher overlay.

Thanks for the suggestion! I really appreciate that folks from the community step up with deeper knowledge of the tool.

For my use case, I think the hasher overlay wouldn't help much. rclone check can skip the hash check on size differences, whereas synchronizing the hasher overlay with rclone hashsum would end up downloading everything and hashing anyway. From my read of the documentation, it doesn't look like the hasher overlay will transparently download to check hashes during copy or sync if they aren't available in the database -- but maybe I'm wrong about that? Alas, I cannot rely on a persistent local cache, as my use case is for CI scripts which could be picked up and run on any random host in a fleet, and I also can't necessarily trust that the remote end hasn't been inadvertently mucked up by someone else (at least not yet).

Then use the chunker overlay, in a slightly creative way, to store the files' hashes together with the files on your remote.

[chunker_remote]
type = chunker
remote = SFTP_remote:
hash_type = sha1all
chunk_size = 1P

and interact with your SFTP server only through the chunker_remote overlay.
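
As a rough illustration of what "only through the overlay" means in practice, uploads would target `chunker_remote:` rather than `SFTP_remote:` directly. This is a sketch under the assumption that the chunker config above is in place and that chunker then reports SHA1 hashes from its stored metadata, as the suggestion implies. The function name `upload_via_chunker` and both paths are hypothetical.

```shell
# Sketch only: sync through the chunker overlay so that hashes stored
# alongside the files can be used for the --checksum comparison.
# $1 = source directory, $2 = path on the server.
upload_via_chunker() {
    rclone sync "$1" "chunker_remote:$2" --checksum
}
```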

It can. The trick is to set --hasher-auto-size to a very large value -- larger than your largest file.

Hasher can do this too, when using copy or sync with --checksum. You don't need to run hashsum first if you're using the --hasher-auto-size trick.
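
A minimal sketch of that trick, for anyone who wants to try it. The remote name `hashed_sftp`, the function names, and the `1P` value are illustrative assumptions; `SFTP_remote:` is the remote name used earlier in the thread. The only requirement stated above is that the auto-size value be larger than your largest file.

```shell
# Sketch only: wrap the SFTP remote in a hasher overlay (one-time setup).
# "hashed_sftp" is a made-up remote name.
setup_hasher_remote() {
    rclone config create hashed_sftp hasher remote=SFTP_remote:
}

# Sync by checksum through the overlay. With a very large
# --hasher-auto-size, hasher downloads files to compute any hashes
# it does not already have cached.
# $1 = source directory, $2 = path on the server.
sync_with_hasher() {
    rclone sync "$1" "hashed_sftp:$2" --checksum --hasher-auto-size 1P
}
```

Usage would be along the lines of `setup_hasher_remote` once per host, then `sync_with_hasher /ci/output builds` on each run, with both paths being placeholders.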

This is a very clever trick! Thanks for sharing.

Also, this may not be a fit for your use case, but the latest beta of bisync has this feature. :slightly_smiling_face:
--download-hash

Thank you! Maybe the hasher documentation could be updated to clarify this? As I read it, the Other operations section implied that the hash database will only be updated if a full transfer was explicitly requested, particularly as the hashsum command description explicitly documents how it uses auto_size but it is not discussed for other operations. Was the "other operations" section meant to be taken as a superset of the hashsum behaviour?

Documentation in an open source project... a never-ending story, and always open for improvements :) Feel free to contribute :)

I agree -- the documentation is misleading on this point. IMO, it is actually clearer in the code than the docs:

No problem! :slight_smile: docs: clarify hasher operation by willmmiles · Pull Request #7589 · rclone/rclone · GitHub
