Syncing timestamps when using --checksum

The current documentation for the --checksum flag states:

When using this flag, rclone won't update mtimes of remote files if they are incorrect as it would normally.

I'd like to propose a new flag --update-modtime to toggle this behavior.

Current behavior with --checksum:

  • Compare file size and checksum
  • If size or checksum differ, re-upload file
  • If size and checksum match, ignore file

Proposed behavior with --checksum --update-modtime:

  • Compare file size and checksum
  • If size or checksum differ, re-upload file
  • If size and checksum match, compare timestamp
    • If timestamp differs, update timestamp
    • If timestamp matches, ignore file

This would be useful for ensuring that both file contents (as indicated by the checksum) and timestamps are kept in sync.
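To make the two lists above concrete, here is a minimal sketch of the per-file decision. It is not rclone's actual code: the `FileInfo` type, the `checksum_mode_action` function and the sample values are hypothetical, and `update_modtime` stands in for the proposed (not yet existing) --update-modtime flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for what a sync tool knows about one file."""
    size: int        # bytes
    checksum: str    # e.g. an MD5 hex digest
    modtime: float   # seconds since the epoch


def checksum_mode_action(src: FileInfo, dst: FileInfo,
                         update_modtime: bool = False) -> str:
    """Return 'upload', 'update_modtime' or 'skip' for one file pair."""
    # Any difference in size or checksum means the contents differ.
    if src.size != dst.size or src.checksum != dst.checksum:
        return "upload"
    # Contents match: only the proposed flag looks at the timestamp.
    if update_modtime and src.modtime != dst.modtime:
        return "update_modtime"
    return "skip"


# Same contents, different timestamps (made-up values).
src = FileInfo(size=10, checksum="aaa", modtime=200.0)
dst = FileInfo(size=10, checksum="aaa", modtime=100.0)
print(checksum_mode_action(src, dst))                       # skip
print(checksum_mode_action(src, dst, update_modtime=True))  # update_modtime
```

With the flag off, the function reproduces the current --checksum behavior described above; with it on, the only extra work is one timestamp comparison for files whose contents already match.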

If you remove --checksum, rclone will compare the modtimes and just update the modtime on the remote. Is there a reason you are specifically using the --checksum flag?

felix@gemini:~$ rclone copy /etc/hosts GD: -vv
2021/06/08 15:31:17 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2021/06/08 15:31:17 DEBUG : rclone: Version "v1.55.1" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
2021/06/08 15:31:17 DEBUG : Creating backend with remote "/etc/hosts"
2021/06/08 15:31:17 DEBUG : fs cache: adding new entry for parent of "/etc/hosts", "/etc"
2021/06/08 15:31:17 DEBUG : Creating backend with remote "GD:"
2021/06/08 15:31:17 DEBUG : GD: detected overridden config - adding "{xDYYC}" suffix to name
2021/06/08 15:31:17 DEBUG : fs cache: renaming cache item "GD:" to be canonical "GD{xDYYC}:"
2021/06/08 15:31:17 DEBUG : hosts: Modification times differ by -560h49m39.623277573s: 2021-06-08 15:30:41.705277573 -0400 EDT, 2021-05-16 10:41:02.082 +0000 UTC
2021/06/08 15:31:17 DEBUG : hosts: MD5 = 38b847bf753f092fa1fac6e5e4018155 OK
2021/06/08 15:31:18 INFO  : hosts: Updated modification time in destination
2021/06/08 15:31:18 DEBUG : hosts: Unchanged skipping
2021/06/08 15:31:18 INFO  :
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks:                 1 / 1, 100%
Elapsed time:         1.1s

2021/06/08 15:31:18 DEBUG : 4 go routines active

Yes, in order to compare all checksums. By default, rclone only compares checksums if the timestamps differ.
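For contrast, the default comparison as described in this thread can be sketched roughly like this. It is a simplification under stated assumptions, not rclone's implementation: `FileInfo` and `default_mode_action` are hypothetical names, modtimes are compared for exact equality, and a checksum is assumed to be available on both sides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for what a sync tool knows about one file."""
    size: int        # bytes
    modtime: float   # seconds since the epoch
    checksum: str    # e.g. an MD5 hex digest


def default_mode_action(src: FileInfo, dst: FileInfo) -> str:
    """Return 'upload', 'update_modtime' or 'skip' for one file pair."""
    if src.size != dst.size:
        return "upload"
    if src.modtime == dst.modtime:
        # Size and modtime agree: the checksum is never looked at.
        return "skip"
    # Modtimes differ: fall back to the checksum to decide.
    if src.checksum == dst.checksum:
        return "update_modtime"
    return "upload"


src = FileInfo(size=10, modtime=100.0, checksum="aaa")
# Same size and modtime but different contents: missed in this mode.
changed = FileInfo(size=10, modtime=100.0, checksum="bbb")
print(default_mode_action(src, changed))  # skip
```

The last call shows the case --checksum is meant to catch: contents that differ while size and modtime happen to match.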

What's the reason to update the modtime?

In order to keep the timestamps in sync between local and remote.

I think you may be looking for --refresh-times

https://rclone.org/docs/#refresh-times

That doesn't work with checksum though.

felix@gemini:~$ rclone copy /etc/hosts GD: --checksum --refresh-times -vv
2021/06/10 08:55:12 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2021/06/10 08:55:12 DEBUG : rclone: Version "v1.55.1" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "--checksum" "--refresh-times" "-vv"]
2021/06/10 08:55:12 DEBUG : Creating backend with remote "/etc/hosts"
2021/06/10 08:55:12 DEBUG : fs cache: adding new entry for parent of "/etc/hosts", "/etc"
2021/06/10 08:55:12 DEBUG : Creating backend with remote "GD:"
2021/06/10 08:55:12 DEBUG : GD: detected overridden config - adding "{xDYYC}" suffix to name
2021/06/10 08:55:12 DEBUG : fs cache: renaming cache item "GD:" to be canonical "GD{xDYYC}:"
2021/06/10 08:55:13 DEBUG : hosts: MD5 = 38b847bf753f092fa1fac6e5e4018155 OK
2021/06/10 08:55:13 DEBUG : hosts: Size and MD5 of src and dst objects identical
2021/06/10 08:55:13 DEBUG : hosts: Unchanged skipping
2021/06/10 08:55:13 INFO  :
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks:                 1 / 1, 100%
Elapsed time:         0.4s

2021/06/10 08:55:13 DEBUG : 4 go routines active
felix@gemini:~$ rclone copy /etc/hosts GD:  -vv
2021/06/10 08:55:22 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2021/06/10 08:55:22 DEBUG : rclone: Version "v1.55.1" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
2021/06/10 08:55:22 DEBUG : Creating backend with remote "/etc/hosts"
2021/06/10 08:55:22 DEBUG : fs cache: adding new entry for parent of "/etc/hosts", "/etc"
2021/06/10 08:55:22 DEBUG : Creating backend with remote "GD:"
2021/06/10 08:55:22 DEBUG : GD: detected overridden config - adding "{xDYYC}" suffix to name
2021/06/10 08:55:22 DEBUG : fs cache: renaming cache item "GD:" to be canonical "GD{xDYYC}:"
2021/06/10 08:55:22 DEBUG : hosts: Modification times differ by -32.164842342s: 2021-06-10 08:55:10.399842342 -0400 EDT, 2021-06-10 12:54:38.235 +0000 UTC
2021/06/10 08:55:22 DEBUG : hosts: MD5 = 38b847bf753f092fa1fac6e5e4018155 OK
2021/06/10 08:55:23 INFO  : hosts: Updated modification time in destination
2021/06/10 08:55:23 DEBUG : hosts: Unchanged skipping
2021/06/10 08:55:23 INFO  :
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks:                 1 / 1, 100%
Elapsed time:         0.7s

2021/06/10 08:55:23 DEBUG : 4 go routines active
felix@gemini:~$

The idea is you use --refresh-times once to get the remote modtimes in sync, then sync normally.

Not sure if this would work for the OP's workflow.

No, this does not do what I want. --refresh-times only affects the behavior when comparing files with differing timestamps when there is no checksum available. This flag has no effect when using --checksum.

Running a sync by modtime followed by a sync by checksum (or vice versa) will ensure that all files and timestamps are in sync, but this is less efficient than the proposed solution:

Two-pass solution:

  • Walk directory structure twice
  • Check hashes twice for files with differing timestamps, once for other files
  • Check timestamps for all files

Proposed solution:

  • Walk directory structure once
  • Check hashes once for all files
  • Only check timestamps for files with matching checksums
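The comparison above can be turned into rough operation counts. This is only a back-of-the-envelope sketch: the function names and the example numbers are made up, and it assumes every file exists on both sides with the same size.

```python
def two_pass_cost(total_files: int, differing_modtime: int) -> dict:
    """Modtime sync followed by a --checksum sync (or vice versa)."""
    return {
        "walks": 2,
        # Files with differing timestamps are hashed in both passes,
        # all other files only in the checksum pass.
        "hash_checks": 2 * differing_modtime + (total_files - differing_modtime),
        "modtime_checks": total_files,
    }


def single_pass_cost(total_files: int, matching_checksum: int) -> dict:
    """The proposed --checksum --update-modtime behavior."""
    return {
        "walks": 1,
        "hash_checks": total_files,
        # Timestamps only matter once the contents are known to match.
        "modtime_checks": matching_checksum,
    }


# 1000 files, 50 of them with a wrong timestamp, all contents identical.
print(two_pass_cost(1000, 50)["hash_checks"])       # 1050
print(single_pass_cost(1000, 1000)["hash_checks"])  # 1000
```

The saving in hash checks is small when few timestamps differ; the larger difference is walking the directory structure once instead of twice.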

Maybe you could explain your use case for this? The sync routine docs are already very very complicated so I need a good reason for adding a new flag :slight_smile:

This reminds me of the recent talk about commutative and non-commutative (simply put, dangerous) sync flags (conditions): "feature: implement advanced change detection" (Issue #4810, rclone/rclone on GitHub).

My use case is syncing files from source to destination. After syncing, I want all files on the destination to have the same contents and modification times as the source.

In most cases, the default sync functionality will achieve this effect. However, this does not actually verify checksums for files with matching timestamps, so there's a possibility that some files will be missed. The type of sync I'm proposing can be thought of as a "paranoid" version of the default functionality.

Thank you for the clear explanation.

Going back to your original post you wrote

I guess I'd like to think about why rclone does that.

The --checksum flag was introduced following rsync's usage which I've reproduced here:

   -c, --checksum
          This changes the way rsync checks if the files have been changed
          and are in need of a transfer.  Without this option, rsync  uses
          a "quick check" that (by default) checks if each file’s size and
          time of last modification match between the sender and receiver.
          This  option changes this to compare a 128-bit checksum for each
          file that has a matching size.  Generating the  checksums  means
          that  both  sides  will expend a lot of disk I/O reading all the
          data in the files in the transfer (and  this  is  prior  to  any
          reading  that  will  be done to transfer changed files), so this
          can slow things down significantly.

          The sending side generates its checksums while it is  doing  the
          file-system  scan  that  builds the list of the available files.
          The receiver generates its checksums when  it  is  scanning  for
          changed files, and will checksum any file that has the same size
          as the corresponding sender’s file:  files with either a changed
          size or a changed checksum are selected for transfer.

          Note  that  rsync always verifies that each transferred file was
          correctly reconstructed on the  receiving  side  by  checking  a
          whole-file  checksum  that  is  generated  as the file is trans‐
          ferred, but that automatic after-the-transfer  verification  has
          nothing  to do with this option’s before-the-transfer "Does this
          file need to be updated?" check.

          For protocol 30 and  beyond  (first  supported  in  3.0.0),  the
          checksum used is MD5.  For older protocols, the checksum used is
          MD4.

Note that rsync doesn't make any mention of modification times here, and if I try rsync I find that it does update the modification time with the --checksum flag.

Originally, --checksum was introduced for remotes which didn't support modification times, hence the limitation.

So to retain rsync compatibility we should update the modtime when using the --checksum flag.

However to retain backwards compatibility with previous versions of rclone we don't want to update the modtime. Imagine someone has set up a large s3 to s3 sync with --checksum which is the most efficient way of doing it. Making rclone set modtimes might cause every file to have its modtime set in the destination which would cause lots of expensive COPY operations on s3.

We could use a flag as you suggested or we could use an existing flag

  --no-update-modtime    Don't update destination mod-time if files identical.

And set the default behaviour to set the modtime and call the change out in the release notes. This would then make the syncing consistent between modtime/size/checksum modes and consistent with rsync.

That would at least avoid having to add another flag, which would certainly confuse users!

--update-modtime Update the modtime if it is incorrect when using --checksum
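A small sketch may help compare the two options for existing command lines. It is illustrative only: the function and the design labels "opt_in" and "opt_out" are made up for this example, --update-modtime is the flag proposed in this thread rather than an existing one, and only runs that use --checksum are modelled.

```python
def checksum_run_updates_modtime(flags: set, design: str) -> bool:
    """Would a --checksum run fix an incorrect destination modtime?"""
    if design == "opt_in":
        # New flag: nothing changes unless the user asks for it.
        return "--update-modtime" in flags
    if design == "opt_out":
        # Changed default: updates happen unless explicitly disabled.
        return "--no-update-modtime" not in flags
    raise ValueError(f"unknown design: {design}")


existing_job = {"--checksum"}  # a command line written before any change
print(checksum_run_updates_modtime(existing_job, "opt_in"))   # False
print(checksum_run_updates_modtime(existing_job, "opt_out"))  # True
```

Under the opt-out design, an unchanged command line starts setting modtimes, which is the backwards-compatibility concern for large s3 to s3 syncs raised above.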

Thoughts? @nickgaya and @ivandeex ?

imho, rclone has made it this far as is, so if a change is needed, add a new flag.

the unintended consequences for S3 remotes, could get very expensive, very quickly.

thanks

To preserve backward compatibility, I think a new flag is needed. I proposed --update-modtime as it's essentially the opposite of --no-update-modtime.

Updating the default behavior of --checksum and using the existing --no-update-modtime flag would work, but it could lead to unexpected cost or performance impacts for users who weren't aware of the change, depending on the remote.

@ncw FYI, I have a working implementation on a branch, so if/when there's some agreement on the desired behavior I can put up a PR.

@ncw
I am always in favor of keeping backwards compatibility. But this case is special. Having one command with two opposite flags is nonsense. It will solve the case of this user but altogether make the sync command useless for others. Expect more confused users bombing this forum tomorrow. Do you remember Nero? This once good program was killed by uncontrolled feature bloat.

We have a ticket for writing new documentation about sync flags. I made an attempt some time ago but failed to wrap my head around it. You will not grok my pain until you try it yourself.

I think we should ask everyone who wants a new flag to donate this document first as an entry fee. I am not joking! Writing and reviewing this doc will let us keep the big picture synchronized across the team.

tl;dr My opinion is either use existing flag (breaking backwards compat) or reject the request until we have the doc.

and create a flag madness :-1:

unless user uses --no-update-modtime explicitly described in flag docs

changelog and release notes will contain a breaking change clause:

note for S3 users:
you will see slowdown with sync command.
this is unfortunate but by design.
if you are worried about decreased performance, please use --no-update-xxx-yyy
since rclone v1.zzz

tl;dr Let's admit we are in a zugzwang situation here. We don't have a solution that fits all.
I'd rather keep things as is on master, work on the sync docs, and let @nickgaya use their fork with whatever flags help fulfil their projects.

we agree,

@ncw was discussing breaking backwards compatibility and/or creating a new flag.
i simply commented that
IF a change is needed, then add a new flag and do not break backward compatibility.

as for the need for a new flag, i defer to ncw and yourself.