MD5 hashing local filesystem is very slow

What is the problem you are having with rclone?

I have a slow internet upload and have been consolidating years of disparate hard drives, uploading them to G Suite Google Drive for longer-term storage. To reduce the amount I have to upload, I exported all the MD5 hashes from Google Drive and wanted to delete any local files whose hash already exists there. I was going to do this programmatically, by writing something that parses the JSON files and compares the hashes. However, I'm finding that hashing files on the local machine is incredibly slow. I even tried lsf, since I can specify a single hash type there, unlike with lsjson.
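The dedupe step I have in mind looks roughly like this (a rough Python sketch; the helper names are mine, and I'm assuming the export has the per-entry `Hashes.MD5` shape that `rclone lsjson --hash` produces):

```python
import hashlib
import json

def md5_of_file(path, chunk_size=1 << 20):
    """Hash a local file in chunks so large files don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def remote_hashes(lsjson_path):
    """Collect the MD5 values from an `rclone lsjson --hash` export.

    Each entry looks like {"Path": ..., "Hashes": {"MD5": ...}, ...};
    entries without a Hashes field (e.g. directories) are skipped.
    """
    with open(lsjson_path) as f:
        entries = json.load(f)
    return {e["Hashes"]["MD5"] for e in entries if "Hashes" in e}

def already_uploaded(local_path, hashes):
    """True if this local file's MD5 already exists in the remote export."""
    return md5_of_file(local_path) in hashes
```

The slow part, as described below, is producing the local hashes in the first place.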

Is there anything I can do to speed this up? I tried using Get-FileHash in PowerShell and it was so much faster; I haven't actually got rclone to finish yet on a 30GB file.

What is your rclone version (output from rclone version)

rclone v1.51.0

  • os/arch: windows/amd64
  • go version: go1.13.7

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Windows 10 Pro N 1909 64bit

Which cloud storage system are you using? (eg Google Drive)

N/A

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone.exe lsf --hash MD5 --format hp --files-only c:\path...

A log from the command with the -vv flag (eg output from rclone -vv copy /tmp remote:tmp)

2020/04/28 11:50:15 DEBUG : rclone: Version "v1.51.0" starting with parameters ["rclone.exe" "lsf" "--hash" "MD5" "--format" "hp" "--files-only" "-vv" "c:\Users\...\Desktop\rclone-v1.51.0-windows-amd64"]
2020/04/28 11:50:15 DEBUG : Using config file from "C:\Users\...\.config\rclone\rclone.conf"

My guess is that rclone is using a single thread to produce multiple hash types and then discarding all but the MD5 at the end, and this is why it's slow: it's CPU-bottlenecked on a single core.

What are you copying to? A regular remote? A crypted remote?

I will be copying to a regular google drive remote.

I think you are overcomplicating it, but I'm not 100% sure.

If you are copying to a Google Drive remote using just rclone copy, it'll first check size/mod time and validate those match. If the file isn't there, it will transfer it and then checksum it to validate the copy arrived correctly on the destination.

felix@gemini:~$ rclone copy /etc/hosts GD: -vv
2020/04/28 08:00:36 DEBUG : rclone: Version "v1.51.0" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
2020/04/28 08:00:36 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2020/04/28 08:00:36 DEBUG : hosts: Need to transfer - File not found at Destination
2020/04/28 08:00:38 DEBUG : hosts: MD5 = c0c69fcfeb6162dd8065b5a5a61e70e4 OK
2020/04/28 08:00:38 INFO  : hosts: Copied (new)
2020/04/28 08:00:38 INFO  :
Transferred:   	       266 / 266 Bytes, 100%, 171 Bytes/s, ETA 0s
Transferred:            1 / 1, 100%
Elapsed time:         1.5s

2020/04/28 08:00:38 DEBUG : 6 go routines active
2020/04/28 08:00:38 DEBUG : rclone: Version "v1.51.0" finishing with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
felix@gemini:~$ rclone copy /etc/hosts GD: -vv
2020/04/28 08:00:40 DEBUG : rclone: Version "v1.51.0" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
2020/04/28 08:00:40 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2020/04/28 08:00:40 DEBUG : hosts: Size and modification time the same (differ by -269.312µs, within tolerance 1ms)
2020/04/28 08:00:40 DEBUG : hosts: Unchanged skipping
2020/04/28 08:00:40 INFO  :
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks:                 1 / 1, 100%
Elapsed time:         0.0s

2020/04/28 08:00:40 DEBUG : 5 go routines active
2020/04/28 08:00:40 DEBUG : rclone: Version "v1.51.0" finishing with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
felix@gemini:~$ rclone copy /etc/hosts GD: -vv --checksum
2020/04/28 08:02:35 DEBUG : rclone: Version "v1.51.0" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv" "--checksum"]
2020/04/28 08:02:35 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2020/04/28 08:02:35 DEBUG : hosts: MD5 = c0c69fcfeb6162dd8065b5a5a61e70e4 OK
2020/04/28 08:02:35 DEBUG : hosts: Size and MD5 of src and dst objects identical
2020/04/28 08:02:35 DEBUG : hosts: Unchanged skipping
2020/04/28 08:02:35 INFO  :
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks:                 1 / 1, 100%
Elapsed time:         0.0s

2020/04/28 08:02:35 DEBUG : 5 go routines active
2020/04/28 08:02:35 DEBUG : rclone: Version "v1.51.0" finishing with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv" "--checksum"]
felix@gemini:~$

When I check an md5sum with a few different commands, a 15GB file takes about 1 minute 40 seconds on a non-SSD disk, since it has to read all 15GB of data; it's disk-bound, not CPU-bound.

rclone md5sum takes the same amount of time for me on that file.

What does "so much faster" mean? How many seconds or minutes did it take? What command did you run to compare?

Measure-Command { Get-FileHash -Path .\test.tar -Algorithm MD5 }

Minutes           : 2
Seconds           : 13
Milliseconds      : 826
Ticks             : 1338268040
TotalDays         : 0.00154892134259259
TotalHours        : 0.0371741122222222
TotalMinutes      : 2.23044673333333
TotalSeconds      : 133.826804
TotalMilliseconds : 133826.804



Measure-Command { .\rclone.exe lsf --hash MD5 --format hp --files-only -vv test.tar }

Minutes           : 33
Seconds           : 36
Milliseconds      : 634
Ticks             : 20166344083
TotalDays         : 0.0233406760219907
TotalHours        : 0.560176224527778
TotalMinutes      : 33.6105734716667
TotalSeconds      : 2016.6344083
TotalMilliseconds : 2016634.4083

Measure-Command {.\rclone.exe md5sum -vv .\test.tar}

Minutes           : 2
Seconds           : 10
Milliseconds      : 33
Ticks             : 1300334860
TotalDays         : 0.00150501719907407
TotalHours        : 0.0361204127777778
TotalMinutes      : 2.16722476666667
TotalSeconds      : 130.033486
TotalMilliseconds : 130033.486

It's been 10 minutes so far; when it finishes I'll update this post.


Can you just run:

rclone md5sum test.tar and see if that takes the same amount of time for you?

It seems to be related to rclone lsf and how it generates the MD5, rather than the actual md5sum command, from my testing. I don't see anything in the -vv output that points me to anything, though, so @ncw might need to weigh in.

I started an rclone lsf md5sum and that still is not done yet; while that was going, I ran an md5sum by itself and it finished pretty quickly:

The md5sum was 3 seconds faster than the PowerShell Get-FileHash (see edit above). Unfortunately, isn't that a single hash for an entire folder? I would like to use lsjson, but you can't specify just MD5 for it and it generates all the hashes.

I only did one file in the lsf and it took about 20 minutes compared to about 90 seconds when running the md5sum solo so something is a bit off:

felix@gemini:/local/Movies/Alice Through the Looking Glass (2016)$ rclone lsf --hash MD5 --format hp --files-only -vv Alice\ Through\ the\ Looking\ Glass\ \(2016\).mkv
2020/04/28 08:41:18 DEBUG : rclone: Version "v1.51.0" starting with parameters ["rclone" "lsf" "--hash" "MD5" "--format" "hp" "--files-only" "-vv" "Alice Through the Looking Glass (2016).mkv"]
2020/04/28 08:41:18 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2020/04/28 08:41:18 DEBUG : Alice Through the Looking Glass (2016).en.forced.srt: Excluded
d163913508cfa57fdddbd168ca34a3d5;Alice Through the Looking Glass (2016).mkv
2020/04/28 09:02:36 DEBUG : 3 go routines active
2020/04/28 09:02:36 DEBUG : rclone: Version "v1.51.0" finishing with parameters ["rclone" "lsf" "--hash" "MD5" "--format" "hp" "--files-only" "-vv" "Alice Through the Looking Glass (2016).mkv"]

I cloned the repo and have stepped through the code. It looks like lsf actually calls lsjson, which computes each hash type sequentially: https://github.com/rclone/rclone/blob/master/fs/operations/lsjson.go#L155

For a start, it's probably worth rewriting that to read the file only once and feed all hash types from the same I/O stream. It could also be made multithreaded so it makes better use of the available CPU. Finally, an option to specify a single hash type would be great, so only the hash that's actually needed gets computed.
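The single-read idea is simple enough to sketch. Here it is in Python (illustrative only, since rclone itself is Go; the function name is mine): every hasher is fed from the same pass over the file, so the disk is read once no matter how many hash types are requested.

```python
import hashlib

def multi_hash(path, algorithms=("md5", "sha1"), chunk_size=1 << 20):
    """Read the file once and update every hasher from the same stream."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

In Go the equivalent trick would be wrapping the hashers in an io.MultiWriter and copying the file through it once.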

I will raise an issue on GitHub, but in the meantime I can just build a copy of rclone that only performs MD5 hashes on the file system.

I think you are right... the problem is the Whirlpool hash algorithm, which is very slow...

$ time md5sum 1G
cd573cfaace07e7949bc0c46028904ff  1G

real	0m2.160s
user	0m1.773s
sys	0m0.382s

$ time rclone hashsum MD5 1G
cd573cfaace07e7949bc0c46028904ff  1G

real	0m2.126s
user	0m2.039s
sys	0m0.454s

$ time rclone lsf --hash MD5 --format hp 1G
cd573cfaace07e7949bc0c46028904ff;1G

real	0m52.143s
user	0m50.705s
sys	0m3.831s

I'll reply further on the issue.
