Copy/sync --checksum --download

wmiles_sgl · January 19, 2024, 9:52pm

Hi,

I'd like to request that rclone check's --download flag be made available to the checksum validation process for rclone copy and rclone sync.

The basic problem I'm trying to solve is a multi-stage synchronization with an SFTP server in the middle that does not permit shell logins (so no checksum support). Something like source --> server --> many remote nodes. The remote nodes use date+size to identify changed files.

Unfortunately, the source in my case is a CI system that creates fresh copies of the various files, so they have different created/last-modified times even if the contents are identical to the previous ones. File size alone isn't enough to ensure that the contents haven't changed. We don't want to just blindly re-upload, though, since that would change the modification time on the server and cause all of the many remote nodes to perform unnecessary copies.

Fortunately for me, the source-->server link is on a fast network, so we can tolerate the cost of running rclone check --download to identify changed files without server-side hashing support. Unfortunately I end up having to run a multi-stage process of rclone check, followed by separate rclone copy, rclone delete commands with explicit file lists. It would be very convenient if the comparison operations performed by rclone sync could use all the features of rclone check.
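For illustration, the multi-stage process described above might look roughly like the sketch below. It is an approximation under stated assumptions, not the exact script in use: the function name, the two positional arguments, and the report file names are all made up for the example. It relies on the report-file options of rclone check (--differ, --missing-on-dst, --missing-on-src) and on --files-from for copy and delete.

```shell
# Sketch only: sync_via_check_download and the report file names are hypothetical.
# $1 = source (e.g. a local CI output directory), $2 = destination (e.g. an SFTP remote path).
sync_via_check_download() {
    src=$1
    dst=$2
    work=$(mktemp -d) || return 1

    # Pre-create the report files so later steps work even if a report is empty.
    : > "$work/differ.txt"
    : > "$work/missing-on-dst.txt"
    : > "$work/missing-on-src.txt"

    # Step 1: compare contents by downloading, since the server offers no hashes.
    # rclone check exits non-zero when it finds differences, so its status is
    # ignored here; a real script should tell "differences" from "failure".
    rclone check "$src" "$dst" --download \
        --differ "$work/differ.txt" \
        --missing-on-dst "$work/missing-on-dst.txt" \
        --missing-on-src "$work/missing-on-src.txt" || true

    # Step 2: upload only the files that changed or are new on the source.
    cat "$work/differ.txt" "$work/missing-on-dst.txt" > "$work/to-copy.txt"
    if [ -s "$work/to-copy.txt" ]; then
        rclone copy "$src" "$dst" --files-from "$work/to-copy.txt" --ignore-times || return 1
    fi

    # Step 3: remove files that no longer exist on the source.
    if [ -s "$work/missing-on-src.txt" ]; then
        rclone delete "$dst" --files-from "$work/missing-on-src.txt" || return 1
    fi

    rm -rf "$work"
}
```

It would be called as, say, sync_via_check_download /path/to/ci/output SFTP_remote:releases, where both arguments are placeholders. Files that are unchanged are never re-uploaded, so their modification times on the server stay as they were.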

Thanks,
-Will

kapitainsky · January 20, 2024, 4:58pm

Until such functionality exists, you could make your life much easier by using the hasher overlay.

wmiles_sgl · January 22, 2024, 5:29pm

Thanks for the suggestion! I really appreciate that folks from the community step up with deeper knowledge of the tool.

For my use case, I think the hasher overlay wouldn't help much. rclone check can skip the hash check on size differences, whereas synchronizing the hasher overlay with rclone hashsum would end up downloading and hashing everything anyway. From my read of the documentation, it doesn't look like the hasher overlay will transparently download to check hashes during copy or sync if they aren't available in the database -- but maybe I'm wrong about that? Alas, I cannot rely on a persistent local cache, as my use case is for CI scripts which could be picked up and run on any random host in a fleet, and I also can't necessarily trust that the remote end hasn't been inadvertently mucked up by someone else (at least not yet).

kapitainsky · January 22, 2024, 6:14pm

Then use the chunker overlay... in a slightly creative way, to store the files' hashes together with the files on your remote.

[chunker_remote]
type = chunker
remote = SFTP_remote:
hash_type=sha1all
chunk_size=1P

and interact with your SFTP server only using chunker_remote overlay.

nielash · January 22, 2024, 6:35pm

It can. The trick is to set --hasher-auto-size to a very large value -- larger than your largest file.

Hasher can do this too, when using copy or sync with --checksum. You don't need to run hashsum first if you're using the --hasher-auto-size trick.
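As a rough sketch of the --hasher-auto-size trick, assuming a hasher overlay wrapped around the SFTP remote: the remote name "hashed" and the function name are made up for the example, "SFTP_remote:" is borrowed from the chunker example earlier in the thread, and 1P is just an arbitrary value larger than any real file. The overlay is defined through RCLONE_CONFIG_* environment variables here so that nothing has to be added to rclone.conf on a CI host; an equivalent section in the config file should work the same way.

```shell
# Sketch only: hashed_sync and the remote name "hashed" are hypothetical.
# $1 = local source directory, $2 = path below the SFTP remote.
hashed_sync() {
    src=$1
    dst_path=$2

    # Define the hasher overlay on the fly via environment variables.
    # Files smaller than the auto size are downloaded and hashed on demand
    # when a checksum is requested and none is cached yet.
    RCLONE_CONFIG_HASHED_TYPE=hasher \
    RCLONE_CONFIG_HASHED_REMOTE=SFTP_remote: \
    rclone sync "$src" "hashed:$dst_path" --checksum --hasher-auto-size 1P
}
```

Called as, for example, hashed_sync /path/to/ci/output releases (both placeholders). Running it with --dry-run and -vv first is a sensible way to confirm that hashes are actually being computed before trusting it with a real sync.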

kapitainsky · January 22, 2024, 6:41pm

This is a very clever trick! Thanks for sharing.

nielash · January 22, 2024, 6:43pm

Also, may not be a fit for your use case, but the latest beta of bisync has this feature.
--download-hash

wmiles_sgl · January 22, 2024, 6:56pm

Thank you! Maybe the hasher documentation could be updated to clarify this? As I read it, the Other operations section implied that the hash database will only be updated if a full transfer was explicitly requested, particularly as the hashsum command description explicitly documents how it uses auto_size but it is not discussed for other operations. Was the "other operations" section meant to be taken as a superset of the hashsum behaviour?

kapitainsky · January 22, 2024, 7:00pm

Documentation in an open source project... a never-ending story, and always open for improvements :) Feel free to contribute :)

nielash · January 22, 2024, 7:05pm

I agree -- the documentation is misleading on this point. IMO, it is actually clearer in the code than the docs:

github.com/rclone/rclone/blob/783599114760d09684bc5ed44f4209813d127484/backend/hasher/object.go#L100-L101

	if f.autoHashes.Contains(hashType) && o.Size() < int64(f.opt.AutoSize) {
		_ = o.updateHashes(ctx)

github.com/rclone/rclone/blob/783599114760d09684bc5ed44f4209813d127484/backend/hasher/object.go#L110-L111

// updateHashes performs implicit "rclone hashsum --download" and updates cache.
func (o *Object) updateHashes(ctx context.Context) error {

wmiles_sgl · January 22, 2024, 7:30pm

No problem! docs: clarify hasher operation by willmmiles · Pull Request #7589 · rclone/rclone · GitHub
