Local Filesystem Unexpected MD5 Calculation

What is the problem you are having with rclone?

I am attempting to use rclone to sync between two Windows network shares. The shares are for example \\server\from and \\server\to. I have these mapped as network drives T:\ and U:\ respectively.

When running a sync command between the two the command runs very slowly despite the network speed being quite high. What I found was that rclone is computing md5 hashes of the files.

It's not clear to me why the hashes are being computed. The logging shows that the modification times of the files do differ (by a somewhat surprising amount), but the documentation would suggest (at least to me) that unless the --checksum flag is supplied, only file size and modification time are used to detect file equality. (Documentation)

I can confirm that this is the cause of the slowdown by using either --modify-window=10s or --size-only. In both cases the command completes very quickly as expected. You can also see in the log copied below that an md5 was computed.

I suspect that if I just ran the command once the mod times would get updated and the next iteration would be fine. I'm also not debating the use of --checksum. My question is why the md5 was computed at all.

On the off chance that the -n flag does everything except destructive changes, I also tested supplying the --ignore-checksum flag, but the hashes were still computed.

Run the command 'rclone version' and share the full output of the command.

rclone v1.62.2

  • os/version: Microsoft Windows 10 Pro 22H2 (64 bit)
  • os/kernel: 10.0.19045.2728 Build 19045.2728.2728 (x86_64)
  • os/type: windows
  • os/arch: amd64
  • go/version: go1.20.2
  • go/linking: static
  • go/tags: cmount

Which cloud storage system are you using? (eg Google Drive)

Local filesystem to local filesystem (via network share)

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone sync t:\ u:\ --progress --stats=10s -n --filter-from BackupExternalFilter.txt -vv

The rclone config contents with secrets removed.


A log from the command with the -vv flag

2023/03/28 12:51:03 DEBUG : rclone: Version "v1.62.2" starting with parameters ["rclone" "sync" "t:\\" "u:\\" "--progress" "--stats=10s" "-n" "--filter-from" "BackupExternalFilter.txt" "-vv"]
2023/03/28 12:51:03 DEBUG : Creating backend with remote "t:\\"
2023/03/28 12:51:03 DEBUG : Using config file from "C:\\Users\\fjih\\AppData\\Roaming\\rclone\\rclone.conf"
2023/03/28 12:51:03 DEBUG : fs cache: renaming cache item "t:\\" to be canonical "//?/t:/"
2023/03/28 12:51:03 DEBUG : Creating backend with remote "u:\\"
2023/03/28 12:51:03 DEBUG : fs cache: renaming cache item "u:\\" to be canonical "//?/u:/"
2023-03-28 12:51:03 DEBUG : $RECYCLE.BIN: Excluded
2023-03-28 12:51:03 DEBUG : $RECYCLE.BIN: Excluded
2023-03-28 12:51:03 DEBUG : Docker: Excluded
2023-03-28 12:51:03 DEBUG : Docker: Excluded
2023-03-28 12:51:03 DEBUG : OneDriveTemp: Excluded
2023-03-28 12:51:03 DEBUG : Store: Excluded
2023-03-28 12:51:03 DEBUG : System Volume Information: Excluded
2023-03-28 12:51:03 DEBUG : OneDriveTemp: Excluded
2023-03-28 12:51:03 DEBUG : Store: Excluded
2023-03-28 12:51:03 DEBUG : System Volume Information: Excluded
2023-03-28 12:51:03 DEBUG : BackUp/Virtual Machines/20230205/Venom-Server/Virtual Hard Disks/Venom-Server.vhdx: Modification times differ by -897.8216ms: 2023-02-05 12:53:35.8978216 -0800 PST, 2023-02-05 12:53:35 -0800 PST
2023-03-28 12:51:03 DEBUG : BackUp/Virtual Machines/20190624/Win10HomeScratch/Snapshots/144FFB65-B2BD-444E-BEFF-6C506E603059.vmcx: Modification times differ by -408.5146ms: 2019-06-24 00:21:54.4085146 -0700 PDT, 2019-06-24 00:21:54 -0700 PDT
2023-03-28 12:51:03 DEBUG : BackUp/Virtual Machines/20190624/Win10HomeScratch/Snapshots/144FFB65-B2BD-444E-BEFF-6C506E603059.vmgs: Modification times differ by -88.5141ms: 2019-06-24 00:21:54.0885141 -0700 PDT, 2019-06-24 00:21:54 -0700 PDT
2023-03-28 12:51:03 DEBUG : BackUp/Virtual Machines/20220308/Venom-Server/Virtual Machines/733931D5-5E97-4E26-9B66-E724BE1767A0.VMRS: Modification times differ by -911.6332ms: 2022-03-08 23:12:46.9116332 -0800 PST, 2022-03-08 23:12:46 -0800 PST
2023-03-28 12:51:04 DEBUG : BackUp/Virtual Machines/20230205/Venom-Server/Virtual Machines/733931D5-5E97-4E26-9B66-E724BE1767A0.vmcx: Modification times differ by -225.5415ms: 2023-02-05 12:53:36.2255415 -0800 PST, 2023-02-05 12:53:36 -0800 PST
2023-03-28 12:51:03 DEBUG : BackUp/Virtual Machines/20190624/Win10HomeScratch/Snapshots/144FFB65-B2BD-444E-BEFF-6C506E603059.VMRS: Modification times differ by -43.515ms: 2019-06-24 00:21:54.043515 -0700 PDT, 2019-06-24 00:21:54 -0700 PDT
2023-03-28 12:51:03 DEBUG : BackUp/Virtual Machines/20220308/Venom-Server/Virtual Hard Disks/Venom-Server.vhdx: Modification times differ by -771.7332ms: 2022-03-08 23:12:46.7717332 -0800 PST, 2022-03-08 23:12:46 -0800 PST
2023-03-28 12:51:03 DEBUG : BackUp/Virtual Machines/20220308/Venom-Server/Virtual Machines/733931D5-5E97-4E26-9B66-E724BE1767A0.vmcx: Modification times differ by -911.6333ms: 2022-03-08 23:12:46.9116333 -0800 PST, 2022-03-08 23:12:46 -0800 PST
2023-03-28 12:51:04 DEBUG : BackUp/Virtual Machines/20190624/Win10HomeScratch/Snapshots/144FFB65-B2BD-444E-BEFF-6C506E603059.vmcx: md5 = 8de7ddcc879c706a1857882134bc53ed OK
2023-03-28 12:51:04 NOTICE: BackUp/Virtual Machines/20190624/Win10HomeScratch/Snapshots/144FFB65-B2BD-444E-BEFF-6C506E603059.vmcx: Skipped update modification time as --dry-run is set (size 51.119Ki)
2023-03-28 12:51:04 DEBUG : BackUp/Virtual Machines/20190624/Win10HomeScratch/Snapshots/144FFB65-B2BD-444E-BEFF-6C506E603059.vmcx: Unchanged skipping
2023-03-28 12:51:04 DEBUG : BackUp/Virtual Machines/20230205/Venom-Server/Virtual Machines/733931D5-5E97-4E26-9B66-E724BE1767A0.VMRS: Modification times differ by -225.5414ms: 2023-02-05 12:53:36.2255414 -0800 PST, 2023-02-05 12:53:36 -0800 PST
2023-03-28 12:51:04 DEBUG : BackUp/Virtual Machines/20220308/Venom-Server/Virtual Machines/733931D5-5E97-4E26-9B66-E724BE1767A0.VMRS: md5 = 2a1abfcaa1ce7019c2259f8fab43348f OK
2023-03-28 12:51:04 NOTICE: BackUp/Virtual Machines/20220308/Venom-Server/Virtual Machines/733931D5-5E97-4E26-9B66-E724BE1767A0.VMRS: Skipped update modification time as --dry-run is set (size 28Ki)
2023-03-28 12:51:04 DEBUG : BackUp/Virtual Machines/20220308/Venom-Server/Virtual Machines/733931D5-5E97-4E26-9B66-E724BE1767A0.VMRS: Unchanged skipping

hello and welcome to the forum,

that is default rclone behavior, to verify file transfers.

tho i get your point, why is rclone calculating the hash when using --dry-run?

Thank you for your response. I was just thinking about that and edited my original post.
Two things come to mind:

  1. The command I used included the -n flag for my initial testing. No files were actually transferred.
  2. In case the -n command does everything except destructive operations including verification (even if no file was transferred), I added the --ignore-checksum flag. Despite this, md5s were computed.

tl;dr, rclone is very flexible, figure out what behavior you need, then find the flags to do that.
might try to copy a single file without --dry-run

perhaps this is what is going on.

rclone is cloud focused.
most cloud providers store the md5 hash. cheap and usually easy to compare hashes.
and can be very expensive in time and money to re-copy a 100GiB file when a simple change of modtime would suffice.

if you have two files, same name but differ in timestamps by more than 1ns, then what is rclone to do?
rclone can dumbly just copy the file, overwrite the dest, which can be an expensive operation.
might be cheaper to check the source hash, the dest hash, if they are the same, then same file.
in this case, rclone can just update the modtime, not copy a 100GiB file for no legit reason.

sure, if you set --modify-window large enough to cover the difference in modtimes, rclone will treat the modtimes as equal, not compute the checksums, not compare the source hash to the dest hash.

The rationale for using rclone for local-to-local was to standardize my toolset. I plan on using it for cloud sync as well, so instead of having yet another tool, the idea is to use rclone for all my sync jobs.

I agree with your example, but it appears to me that either the documentation is incorrect, or the flexibility of rclone is not complete. The documentation says that it uses file size and mod time to check for equality unless --checksum is supplied, but it does not appear to be operating in this manner.

IMO, to answer what rclone should do in the example you provided, rclone should follow the commands from the user. The user specifies the type of equality checks that they want rclone to use. If the user does not want to potentially recopy a 100GiB file when it's just a modtime change, and the filesystems they are using support hash comparison, then --checksum is the correct flag to use. If the user does not specify --checksum, they are telling rclone that they only want to compare by file size and modtime. This should be honored.

To be fully flexible, the full set of options to define file equality should be available. Perhaps what would be best is a --equality-comparison option akin to the --track-renames-strategy. The user could supply --equality-comparison hash,modtime,size and pick what checks are used. A default of size,modtime would match the documented behavior, and the user could use size,hash to be equivalent with --checksum.

I should be clear here, I'm in no way down on rclone. It is an amazing offering, but the operation result I received for the command I supplied doesn't match expectation.

run a few test commands, copy single source file without --dry-run and then will know what to expect.
fwiw, rclone is what it is, i run a few tests, figure out what flags i need, i try to adjust to rclone.

not sure what you need help with?
what is your specific use-case that rclone cannot handle using existing flags and a little bit of testing?

you are welcome to start a new topic, using the feature template, and explain in detail.
i believe this has been discussed a few times in the forum, can search for that.
currently, there are over 700 open, unresolved issues at github......

How do I sync with only file size and modtime used as the equality comparison?

If I get a chance I'll pull down the source and see why the md5 is being calculated. I agree that we all adjust to the tool being used. I'm just trying to get clarity on what seems like a documentation issue.

the docs can always use a tweak. if you have a specific example?

https://rclone.org/commands/rclone_sync/
"testing by size and modification time"

https://rclone.org/docs/#c-checksum
"rclone will look at modification time and size of files to see if they are equal"

https://rclone.org/commands/rclone_copy/
"testing by size and modification time"

Thank you for your attempts to help me out here, but I think something is getting lost along the way.

I linked to the --checksum flag explicitly in my original post.

I am explicitly not providing the --checksum flag.

The documentation states, "Normally rclone will look at modification time and size of files to see if they are equal. If you set this flag then rclone will check the file hash and size to determine if files are equal."

So I am attempting to use the "Normally" case described.

However, I am still getting hashes computed instead of the "look at modification time and size of files to see if they are equal" behavior.

if the sizes are equal and the modtimes are not equal, then what is rclone to do?
the flags determine rclone's behavior.

i could be wrong, not the first time.
let's wait and see what others comment

What is happening is that the modification times differ by more than rclone is expecting, so it does a checksum of the source and the dest to see if they are the same.

The reason rclone thinks the modification times are too far out is, I suspect, that one of your volumes is formatted VFAT rather than NTFS. VFAT can only set modification times accurate to 1s.

So if you set --modify-window 1s or maybe 2s I think this problem will disappear.
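To illustrate why a 1s window would hide the differences in the log above, here is a minimal sketch of a modify-window comparison. The function name `within_window` is hypothetical and this is not rclone's code; the timestamps are taken from the Venom-Server.vhdx log line, truncated to microseconds because Python's `datetime` has no finer resolution.

```python
from datetime import datetime, timedelta

def within_window(src_mtime, dst_mtime, window):
    """Treat two modtimes as equal when they differ by no more than `window`."""
    return abs(dst_mtime - src_mtime) <= window

# Source keeps sub-second precision, destination is rounded to a whole second.
src = datetime(2023, 2, 5, 12, 53, 35, 897821)
dst = datetime(2023, 2, 5, 12, 53, 35)

print(within_window(src, dst, timedelta(microseconds=1)))  # False
print(within_window(src, dst, timedelta(seconds=1)))       # True
```

With the tight window the two files count as having different modtimes, which is the case that leads to the hash comparison discussed in this thread.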

Thanks Nick. Unfortunately, the thread has gotten more lengthy than expected, but the core question is not why the files are considered different. The core question is why the hash was computed to determine the files were different.

I know that the mod times are different and that using --modify-window lets me override that and say that they are not different.

My question is, how do I say that they are different based on the difference in modtime alone?

The documentation for --checksum states:

-c, --checksum
Normally rclone will look at modification time and size of files to see if they are equal. If you set this flag then rclone will check the file hash and size to determine if files are equal.

I interpreted this as (ignoring --size-only, --modify-window, and --checksum flags for clarity):

if (src.size != dst.size || src.modtime != dst.modtime)
    // files differ
    copy

What I'm coming to realize is that the functionality is actually:

if (src.size != dst.size || (src.modtime != dst.modtime && (!src.supports_hash || !dst.supports_hash || src.hash() != dst.hash())))
    // files differ
    copy

Where supports_hash checks for the same algorithms.
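For anyone who wants to play with that second reading, here is a small runnable sketch of it. It models the behavior as described in this post, not rclone's actual source; `FileInfo` and `needs_copy` are hypothetical names, and a missing hash (`None`) stands in for "the backend does not support a common hash".

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FileInfo:
    size: int
    modtime: float        # seconds since the epoch
    md5: Optional[str]    # None when no common hash is available

def needs_copy(src: FileInfo, dst: FileInfo) -> bool:
    """Return True when the file should be transferred."""
    if src.size != dst.size:
        return True
    if src.modtime == dst.modtime:
        return False
    # Same size, different modtime: fall back to the hash if both sides have one.
    if src.md5 is None or dst.md5 is None:
        return True
    return src.md5 != dst.md5

# Same size and hash, modtimes differ by about 0.4s (sizes are illustrative).
src = FileInfo(52346, 1561360914.4085146, "8de7ddcc879c706a1857882134bc53ed")
dst = FileInfo(52346, 1561360914.0, "8de7ddcc879c706a1857882134bc53ed")
print(needs_copy(src, dst))  # False
```

That `False` corresponds to the "Unchanged skipping" lines in the log: the hash matched, so only the modtime would have been updated.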

I attempted a check to see if this was the case even outside of local filesystems by performing a sync from local to OneDrive. My thought was to run rclone sync local onedrive: --modify-window=1ns. This would surely have mismatched modtimes. OneDrive supports MD5 hashing, so I expected there to be more md5 computations. Unfortunately my --modify-window setting was ignored, and the default 1s was used. As such, I could not verify it this way.

I can see the argument for falling back to hash if it's available. Definitely on filesystems that have it stored it's a cheap check. Arguably, for the local filesystem, if you're willing to do the copy you might also be willing to first do a hash of the files. It's an interesting discussion.

Which leaves me with the feeling that the documentation is, at the very least, misleading. I would have much preferred to see (apologies, I'm definitely not the best at wording for documentation, but I'll do my best):

-c, --checksum
Normally rclone will look at modification time and size of files to see if they are equal. If files differ only by modification time, rclone will attempt to compare the files by hash. If you set this flag then rclone will only check the file hash and size to determine if files are equal.

I hope this makes my original question clearer.

The thinking behind this logic is this:

We've detected the file is the same size but has a different modtime.

Let's try to avoid an expensive copy by comparing checksums first and if the checksums are the same, just update the modtime.

I take your point that on a local -> local copy this is perhaps counterproductive as making checksums for the source and the dest is probably about as expensive as copying the file.
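The trade-off can be seen in a short sketch. This is an illustration under my own assumptions, not rclone's implementation: `md5_of` and `sync_same_size_file` are hypothetical helpers, and the point is only that hashing reads both files in full before anything is decided.

```python
import hashlib
import os
import shutil

def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks; this reads the whole file once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def sync_same_size_file(src, dst, dry_run=False):
    """For files of equal size but different modtime, return the action taken."""
    if md5_of(src) == md5_of(dst):
        if not dry_run:
            st = os.stat(src)
            # Contents match, so only carry the source timestamps over.
            os.utime(dst, ns=(st.st_atime_ns, st.st_mtime_ns))
        return "update-modtime"
    if not dry_run:
        shutil.copy2(src, dst)  # copies data and timestamps
    return "copy"
```

On a remote that stores hashes, the two `md5_of` calls would be cheap lookups. Between two local paths they cost a full read of each file, which is roughly the I/O of the copy they are trying to avoid.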

This algorithm is explained several times in the rclone docs, but not where anyone would look for it, so I think there is definitely room for improvements in the docs.

Perhaps we could detect your initial problem of needing a --modify-window flag a bit better. We work out the modification times by writing a file in the temporary directory and reading back its modification time and I don't want to do this to the source & dest of the sync.
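The write-and-read-back probe described above could look roughly like the following. This is a sketch of the idea only, not rclone's code: `probe_mtime_precision_ns` is a hypothetical name, the candidate units are my own guesses at common granularities, and it assumes the filesystem truncates rather than rounds timestamps (it returns `None` when that assumption does not hold).

```python
import os
import tempfile

def probe_mtime_precision_ns(directory=None):
    """Set a modtime with every sub-second digit populated, read it back,
    and return the finest unit (in nanoseconds) that survived."""
    wanted_ns = 1_600_000_001_123_456_789  # odd second, non-zero nanoseconds
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        os.utime(path, ns=(wanted_ns, wanted_ns))
        got_ns = os.stat(path).st_mtime_ns
    finally:
        os.remove(path)
    # Candidate granularities: 1ns, 100ns, 1us, 1ms, 1s, 2s.
    for unit in (1, 100, 1_000, 1_000_000, 1_000_000_000, 2_000_000_000):
        if got_ns == wanted_ns - wanted_ns % unit:
            return unit
    return None

print(probe_mtime_precision_ns())
```

Passing the source or destination as `directory` would probe those filesystems directly, which is exactly the write Nick says he does not want to perform during a sync.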

However we could be a bit cleverer in the sync itself. If we detected that the difference in modtime was less than one second and that one of the modtimes was rounded to a second boundary whereas the other wasn't then we could output a warning.
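A minimal sketch of that warning heuristic, assuming nothing about how rclone would actually implement it. The function name `looks_like_precision_mismatch` is hypothetical, and Python's `datetime` only resolves microseconds, so "rounded to a second boundary" is tested as a zero microsecond field.

```python
from datetime import datetime, timedelta

def looks_like_precision_mismatch(src_mtime, dst_mtime):
    """True when the modtimes differ by under a second and exactly one of
    them sits on a whole-second boundary."""
    diff = abs(src_mtime - dst_mtime)
    src_whole = src_mtime.microsecond == 0
    dst_whole = dst_mtime.microsecond == 0
    return diff < timedelta(seconds=1) and src_whole != dst_whole

# Timestamps from the 144FFB65-...vmcx log line in the original post.
src = datetime(2019, 6, 24, 0, 21, 54, 408514)
dst = datetime(2019, 6, 24, 0, 21, 54)
if looks_like_precision_mismatch(src, dst):
    print("modtimes differ only by sub-second rounding; consider --modify-window 1s")
```

A genuine edit five seconds later, or two timestamps that both carry sub-second digits, would not trigger the warning.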

Alternatively we could detect if the file system we are reading/writing is a VFAT file system and set its precision accordingly. Do you know an easy way of doing this on Windows? Ideally we'd be able to read the precision of the timestamps.

Thanks Nick. Would you point me in the direction of the portion of the docs that outlines the algorithm? It's difficult to convey on a forum, but I promise this isn't snarky. Given the sensitive nature of data backups I want to fully understand the process of the tool used. Hence I've been running things with -n to test things out first.

Again, not snarky!, but here's where I was confused. In the event that I can help make the documentation better for the next user, I consider it a win for everyone.

"About rclone" says "It preserves timestamps and verifies checksums at all times".
The Synopsis of "rclone sync" says "Doesn't transfer files that are identical on source and destination, testing by size and modification time or MD5SUM"
"--checksum" says "If you set this flag then rclone will check the file hash and size to determine if files are equal"
"rclone check" does say "Checks the files in the source and destination match. It compares sizes and hashes (MD5 or SHA1)". However, if sync was using the same level of checking as described here, the flag --checksum would be redundant.

My initial problem wasn't that I needed a --modify-window setting, but rather that I was confused as to why the md5 was computed. Normally, I would have run my command with -n, seen that it considered a lot of files as different due to a large modtime discrepancy, and then found the --modify-window setting to address it if needed.

I think this whole thing comes down to two things:

First, a misunderstanding of how the algorithm decides files differ. My understanding didn't include the use of hashing in my scenario. I tried to RTFM, but there's a lot there (which is a great thing), and I perhaps missed some key information or interpreted it incorrectly.

Second, the highly protective nature of rclone wrt the data (an amazing thing). Rclone is attempting to do everything to ensure that it doesn't destructively modify the destination files by going for the extra hashing check. The issue here is that the cost of the hash isn't necessarily low, and there's no way to override this behavior from what I can tell. It would seem that a --no-checksum flag is needed here to explicitly disable the hash check. To maintain its protective nature, the hashing would stay enabled by default, but a user could explicitly say they do not want to perform the hash check.

It might be a breaking change, but I think the explicit flag I outlined in my earlier posts would be the most clear.
If we had a --equality-comparison setting that defined how file equality is determined, a user could explicitly decide what they want. It would follow the settings from --track-renames-strategy, so --equality-comparison=hash,modtime,size is the default, but in my use case I could have set modtime,size.
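To make the proposal concrete, here is one possible reading of it as code. Note that --equality-comparison does not exist in rclone; everything below (`parse_equality_comparison`, `files_equal`, the dict layout) is hypothetical and only illustrates how the three settings mentioned above could behave.

```python
VALID_CHECKS = {"size", "modtime", "hash"}

def parse_equality_comparison(value):
    """Parse a comma-separated list such as "modtime,size" into a set of checks."""
    checks = {part.strip().lower() for part in value.split(",") if part.strip()}
    if not checks:
        raise ValueError("at least one check is required")
    unknown = checks - VALID_CHECKS
    if unknown:
        raise ValueError(f"unknown checks: {sorted(unknown)}")
    return checks

def files_equal(src, dst, checks):
    """Decide equality using only the checks the user asked for."""
    if "size" in checks and src["size"] != dst["size"]:
        return False
    if "modtime" in checks and src["modtime"] == dst["modtime"]:
        return True
    if "hash" in checks:
        return src["md5"] == dst["md5"]
    # No hash requested: equal only if modtime was never part of the test.
    return "modtime" not in checks

src = {"size": 10, "modtime": 1.4, "md5": "aa"}
dst = {"size": 10, "modtime": 1.0, "md5": "aa"}
print(files_equal(src, dst, parse_equality_comparison("hash,modtime,size")))  # True
print(files_equal(src, dst, parse_equality_comparison("modtime,size")))       # False
print(files_equal(src, dst, parse_equality_comparison("size,hash")))          # True
```

Under this reading, `modtime,size` is the only setting that never touches a hash, which is the behavior I was expecting from the documentation.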

AFAIK, there's no way to detect the file time precision in Windows. I believe the precision is the same across all filesystems, but the accuracy just varies. I don't know that there's a Windows constant that defines the accuracy level per filesystem. Windows' own Robocopy has a /FFT flag to explicitly set the modtime granularity down to 2s.

Anyways, at this point I fear I'm just being pedantic or worse, so I'll let everyone get back to more important things :)

The problem with the rclone docs is that they are based around the flags. So all the modifications to the sync method are described, but the main sync method with no flags isn't!

If you search for checksum in the main docs page you'll find the places I'm referring to. The closest to the full explanation is in the --refresh-times flag.

I think we should probably have a dedicated syncing page which describes what happens normally and how all the flags (there are lots!) change it.

There isn't quite a flag to do that.

If there was it would go around here in the code

Yes I've had that idea before - it would be nice to be able to configure the equality in that way.

If we can work out the Precision of the fs then it would be a great boost for users as there are lots of questions on the forum about --modify-window and syncing to VFAT file systems.
