Prevent sync of identical file with different time stamp

What is the problem you are having with rclone?

Our destination (a standard Linux XFS file system) contains many trees with files that are identical (in name and content) to files in other trees. After rclone sync runs, we use the hardlink command to consolidate all duplicates into a single physical file that is hard-linked across all trees. This is required and saves a HUGE amount of space. After hard-linking, the time stamps and permissions of these files all become identical (the last one "wins") and need to be treated as irrelevant in the context of another sync from the (sftp) source. The files on the source are all physically separate (but still identical), yet their time stamps can all differ. This causes them to be copied unnecessarily when the next sync is performed. How do we prevent this?

If the sizes of the source and destination files are different, then obviously the file needs to be copied. But if they are the same, is there an efficient way to tell whether their contents are identical and thus avoid the copy operation? Some of the files are very large, and the unnecessary copies waste a lot of time and bandwidth.
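To make the workflow concrete, it looks roughly like this (the hardlink invocation is shown in its plain form, since the available options differ between hardlink implementations):

# mirror the partner tree from the sftp source to local disk
rclone sync redhat:our_hashed_partner_dir /our/local/destination
# consolidate identical files in the destination into hard links; afterwards
# the duplicates share one inode, so their time stamps and permissions all match
hardlink /our/local/destination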

Run the command 'rclone version' and share the full output of the command.

rclone v1.57.0-DEV
- os/version: redhat 8.6
- os/kernel: 4.18.0-372.52.1.el8_6.ppc64le (ppc64le)
- os/type: linux
- os/arch: ppc64le
- go/version: go1.16.12
- go/linking: dynamic
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

No cloud storage. The source is Red Hat's business partner Linux system accessed via SFTP.

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone sync --log-file=logfile --log-level=INFO redhat:our_hashed_partner_dir /our/local/destination

The rclone config contents with secrets removed.

[redhat]
type = sftp
host = sftp.connect.redhat.com
user = our_user_name
key_file = /root/.ssh/rh_ecdsa

# Work around the fact that the sftp backend doesn't (yet) support --links
# Hopefully https://github.com/rclone/rclone/issues/5011 will fix this problem.
skip_links = true

md5sum_command = none
sha1sum_command = none

A log from the command with the -vv flag

I don't have such a log available, but the log I do have contains a large number of entries of the form:

2023/07/12 15:37:10 INFO  : path/to/file: Copied (replaced existing)

and a tail-end summary that typically looks like:

Transferred:       38.816 GiB / 38.816 GiB, 100%, 1.642 MiB/s, ETA 0s
Checks:           1186055 / 1186055, 100%
Transferred:        14551 / 14551, 100%
Elapsed time:     37m39.3s

As you can see, a very large amount of data is transferred even though very few files on the source (if any) have actually changed. I might be able to report back later with a -vv log run against a subset of the trees if needed.

Yes - checksums. Enable checksums on your sftp server and then use:

rclone sync src dst --checksum

Please note that if checksums are not available (as in your config), only size will be used:

--checksum Skip based on checksum (if available) & size, not mod-time & size

Using size only is obviously a bit risky, though.
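A quick way to check whether the remote can serve checksums at all is to ask for them directly (the path is illustrative):

rclone md5sum redhat:our_hashed_partner_dir --max-depth 1

If the server cannot produce MD5 sums, the hashes come back blank or rclone reports the hash type as unsupported.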

ON the sftp server or FOR it (as in my local config)? I do not control the sftp server, so if something has to be reconfigured there it could require some negotiating with the administrators. If this is the case, do you have any idea what I should ask for exactly? Will rclone tell me whether or not such a capability is enabled on the server?

Thanks for your help.

All in the link I included:

SFTP does not natively support checksums (file hash), but rclone is able to use checksumming if the same login has shell access, and can execute remote commands. If there is a command that can calculate compatible checksums on the remote system, Rclone can then be configured to execute this whenever a checksum is needed, and read back the results. Currently MD5 and SHA-1 are supported.

The sftp operator has to allow your session to execute hashing commands like md5sum or sha1sum.
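If the operator does allow it, the sftp section of your config presumably just needs the hash commands pointed at the server-side binaries instead of none, something like:

[redhat]
type = sftp
host = sftp.connect.redhat.com
user = our_user_name
key_file = /root/.ssh/rh_ecdsa
skip_links = true
# run these on the server to fetch checksums (requires shell access for the login)
md5sum_command = md5sum
sha1sum_command = sha1sum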

Checksums are the most reliable way to tell whether two files are identical.

Modtime and size are less safe, but usually acceptable.

Size only is a bit risky, but without checksums it is the only option remaining in your situation. I would never use it for important data.

The best option would be for your sftp provider to enable checksumming.

If not, you can try the rclone hasher overlay. It keeps track of every transferred file and remembers its hash in a local database. Unless you are willing to re-read all the data from the remote to populate that database (or ask the sftp operators to provide you with all the hashes so you can load them locally), the benefit will be incremental: every new file you upload (or read) gets a hash, so over time you will see fewer and fewer false uploads.
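A minimal hasher setup on top of the existing redhat: remote might look something like this (the remote name is made up and only the common options are shown; check the hasher docs for the full set):

[redhat-hashed]
type = hasher
remote = redhat:our_hashed_partner_dir
hashes = md5
# off = keep cached checksums indefinitely
max_age = off

You would then sync from redhat-hashed: instead of redhat: (with --checksum) so the overlay can record hashes as files are read or transferred.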

And one afterthought: for big datasets, sftp is simply the wrong tool for the job if you have to sync frequently. There is no really good way to make the sync both reliable and efficient. There are much better solutions, such as S3-based storage. I know that sometimes there is no choice and people have to patch bits and pieces together to make things work, but the biggest problem with sftp is that it does not scale well: if your data keeps growing, you will face more and more challenges.

Thank you for the help. We had already decided that the time stamps of the destination files were "not important enough" to retain, in favor of hard-linking the files to save space (in some cases there are hundreds of identical files in the tree, so the space saved is significant). The result was that rclone was essentially "thrashing", re-copying every file whose time stamp did not match, because every time such a file is re-written the time stamps of all the others change due to the hard linking. So instead of trying to use --checksum, I added the --update option to the command. After three or four runs of rclone sync --update, all instances of the affected files eventually settled on the time stamp of the latest instance, and no further redundant copies are occurring. Given the nature of the source tree, I don't expect any files to be changed without their time stamps also being renewed, so I think this is the best solution for us. Thank you for all your help understanding the --checksum option, even though I decided to pursue an alternate solution.
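For reference, the resulting command is just the original one with --update added:

rclone sync --update --log-file=logfile --log-level=INFO redhat:our_hashed_partner_dir /our/local/destination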

I agree that sftp has its problems, but the choice is out of our hands. Red Hat owns the source systems and decided to remove legacy ftp connections in favor of sftp, so our previous use of the rsync command is no longer possible in the current environment. I actually moved to rclone (I was not previously aware of it) after reviewing various other suggested tools with sftp support and not being happy with any of them. The rclone command works great for us, except that the --links option is not yet implemented in the sftp backend. Hopefully somebody will find a way to address sftp: please add support for symlinks · Issue #5011 · rclone/rclone · GitHub and get this working. Thanks!

I suppose other commands over the SSH system could help with native symlinks. I wonder if there's any utility in supporting good ol' tar over the link.
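For what it's worth, the classic pattern is something like the line below; tar preserves symlinks natively, which is the attraction. It does assume the login is allowed a real shell and remote command execution, which sftp-only endpoints typically are not (host, user and paths are illustrative):

ssh our_user_name@sftp.connect.redhat.com 'tar -C /remote/dir -cf - .' | tar -C /our/local/destination -xf -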
