I have a slow internet upload connection and have been consolidating years of disparate hard drives, uploading everything to G Suite Google Drive for longer-term storage. To reduce the amount I have to upload, I exported all the MD5 hashes from Google Drive and wanted to remove any local files whose hash already exists remotely. I was going to do this programmatically by writing something that parses the JSON files and compares the hashes. However, I'm finding that hashing files on the local machine is incredibly slow. I even tried lsf, as I can specify a single hash type there, unlike with lsjson.
Is there anything I can do to speed this up? I tried using Get-FileHash in PowerShell and it was so much faster; I haven't actually got rclone to finish yet with a 30GB file.
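For reference, the comparison step I had in mind is roughly the following. This is only a sketch: the JSON layout assumed here (a list of objects with `Path` and `Hashes.MD5` keys, as `rclone lsjson --hash` produces) and the function names are my own, not anything rclone ships.

```python
# Sketch: find local files whose MD5 already exists in a Google Drive
# hash export. Assumes the export is rclone lsjson-style JSON, i.e. a
# list of {"Path": ..., "Hashes": {"MD5": ...}} objects.
import hashlib
import json
from pathlib import Path


def load_drive_md5s(export_path):
    """Collect the MD5 values from an lsjson-style export file."""
    entries = json.loads(Path(export_path).read_text())
    return {e["Hashes"]["MD5"] for e in entries if "Hashes" in e}


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash in chunks so large files never need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def already_uploaded(local_dir, drive_md5s):
    """Yield local files whose hash is already present remotely."""
    for p in Path(local_dir).rglob("*"):
        if p.is_file() and md5_of_file(p) in drive_md5s:
            yield p
```

The slow part is `md5_of_file`, which is exactly the local hashing cost discussed below.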
What is your rclone version (output from rclone version)
rclone v1.51.0
os/arch: windows/amd64
go version: go1.13.7
Which OS you are using and how many bits (eg Windows 7, 64 bit)
Windows 10 Pro N 1909 64bit
Which cloud storage system are you using? (eg Google Drive)
N/A
The command you were trying to run (eg rclone copy /tmp remote:tmp)
rclone.exe lsf --hash MD5 --format hp --files-only c:\path...
A log from the command with the -vv flag (eg output from rclone -vv copy /tmp remote:tmp)
2020/04/28 11:50:15 DEBUG : rclone: Version "v1.51.0" starting with parameters ["rclone.exe" "lsf" "--hash" "MD5" "--format" "hp" "--files-only" "-vv" "c:\Users\...\Desktop\rclone-v1.51.0-windows-amd64"]
2020/04/28 11:50:15 DEBUG : Using config file from "C:\Users\...\.config\rclone\rclone.conf"
My guess is that rclone is using a single thread to produce multiple hash types and then discarding all but the MD5 at the end, and this is why it's slow: it's CPU-bottlenecked on a single core.
I think you are overcomplicating it, but I'm not 100% sure.
If you are copying to a Google Drive remote using just rclone copy, it first checks that size and modification time are the same. If the file isn't there, it checksums the file to validate that the copy on the destination was right.
felix@gemini:~$ rclone copy /etc/hosts GD: -vv
2020/04/28 08:00:36 DEBUG : rclone: Version "v1.51.0" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
2020/04/28 08:00:36 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2020/04/28 08:00:36 DEBUG : hosts: Need to transfer - File not found at Destination
2020/04/28 08:00:38 DEBUG : hosts: MD5 = c0c69fcfeb6162dd8065b5a5a61e70e4 OK
2020/04/28 08:00:38 INFO : hosts: Copied (new)
2020/04/28 08:00:38 INFO :
Transferred: 266 / 266 Bytes, 100%, 171 Bytes/s, ETA 0s
Transferred: 1 / 1, 100%
Elapsed time: 1.5s
2020/04/28 08:00:38 DEBUG : 6 go routines active
2020/04/28 08:00:38 DEBUG : rclone: Version "v1.51.0" finishing with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
felix@gemini:~$ rclone copy /etc/hosts GD: -vv
2020/04/28 08:00:40 DEBUG : rclone: Version "v1.51.0" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
2020/04/28 08:00:40 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2020/04/28 08:00:40 DEBUG : hosts: Size and modification time the same (differ by -269.312µs, within tolerance 1ms)
2020/04/28 08:00:40 DEBUG : hosts: Unchanged skipping
2020/04/28 08:00:40 INFO :
Transferred: 0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks: 1 / 1, 100%
Elapsed time: 0.0s
2020/04/28 08:00:40 DEBUG : 5 go routines active
2020/04/28 08:00:40 DEBUG : rclone: Version "v1.51.0" finishing with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv"]
felix@gemini:~$ rclone copy /etc/hosts GD: -vv --checksum
2020/04/28 08:02:35 DEBUG : rclone: Version "v1.51.0" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv" "--checksum"]
2020/04/28 08:02:35 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2020/04/28 08:02:35 DEBUG : hosts: MD5 = c0c69fcfeb6162dd8065b5a5a61e70e4 OK
2020/04/28 08:02:35 DEBUG : hosts: Size and MD5 of src and dst objects identical
2020/04/28 08:02:35 DEBUG : hosts: Unchanged skipping
2020/04/28 08:02:35 INFO :
Transferred: 0 / 0 Bytes, -, 0 Bytes/s, ETA -
Checks: 1 / 1, 100%
Elapsed time: 0.0s
2020/04/28 08:02:35 DEBUG : 5 go routines active
2020/04/28 08:02:35 DEBUG : rclone: Version "v1.51.0" finishing with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vv" "--checksum"]
felix@gemini:~$
When I check an md5sum with a few different commands, a 15GB file takes about 1 min 40 s on a non-SSD disk, since it has to read 15GB of data; it's disk-bound, not CPU-bound.
rclone md5sum takes the same amount of time for me on that file.
What does "so much faster" mean? How many seconds/minutes did it take? What command did you run to compare?
From my testing, it seems to be related to rclone lsf and how it generates the MD5, rather than the actual md5sum. I don't see anything in the -vv output to point me at the cause, though, so @ncw might need to weigh in.
I started an rclone lsf md5sum and it still isn't done; while it was running, I ran md5sum by itself and that finished pretty quickly:
The md5sum was 3 seconds faster than the PowerShell Get-FileHash (see edit above). Unfortunately, isn't that a single hash for an entire folder? I would like to use lsjson, but you can't specify MD5 for it and it generates all the hashes.
For a start, it's probably worth rewriting it to read each file only once and run all hash types over the same I/O stream. It could also be made multithreaded so it makes better use of the available CPU. Finally, an option to specify a single hash type would be great, so the work can be targeted and CPU usage reduced to only what's needed.
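As a rough illustration of the "read once, hash many" idea, here is a sketch in Python (not rclone's actual implementation, and the algorithm list is just an example): every chunk read from disk is fed to all the hash objects, so the file is only read a single time however many digests are wanted.

```python
# Sketch: compute several digests over one pass through the file.
# The algorithm list is illustrative, not rclone's actual set.
import hashlib


def multi_hash(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1024 * 1024):
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for h in hashers.values():  # one read, N digests
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

Each `update()` call here could also be dispatched to a worker thread per algorithm to spread the CPU cost across cores, which is the multithreading part of the suggestion.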
I will raise an issue on GitHub, but in the meantime I can just build a copy of rclone that only performs MD5 hashes on the file system.
I think you are right... the problem is the whirlpool hash algorithm, which is very slow...
$ time md5sum 1G
cd573cfaace07e7949bc0c46028904ff 1G
real 0m2.160s
user 0m1.773s
sys 0m0.382s
$ time rclone hashsum MD5 1G
cd573cfaace07e7949bc0c46028904ff 1G
real 0m2.126s
user 0m2.039s
sys 0m0.454s
$ time rclone lsf --hash MD5 --format hp 1G
cd573cfaace07e7949bc0c46028904ff;1G
real 0m52.143s
user 0m50.705s
sys 0m3.831s
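To see the per-algorithm cost in isolation, a rough throughput check like the following works. This is only a sketch: whirlpool is exposed through Python's hashlib only when the linked OpenSSL provides it, hence the availability guard, and the payload size is arbitrary.

```python
# Sketch: measure raw hashing throughput per algorithm, independent
# of disk I/O, by hashing an in-memory buffer.
import hashlib
import time


def throughput_mb_s(name, payload=b"\0" * (64 * 1024 * 1024)):
    """Return MB/s hashed for the named algorithm on an in-memory buffer."""
    h = hashlib.new(name)
    start = time.perf_counter()
    h.update(payload)
    elapsed = time.perf_counter() - start
    return len(payload) / (1024 * 1024) / elapsed


for name in ("md5", "sha1"):
    print(f"{name}: {throughput_mb_s(name):.0f} MB/s")
# whirlpool is not guaranteed to be present; guard before using it.
if "whirlpool" in hashlib.algorithms_available:
    print(f"whirlpool: {throughput_mb_s('whirlpool'):.0f} MB/s")
```

On builds where whirlpool is available, its MB/s figure should come out far below MD5's, matching the ~25x slowdown the timings above show for the all-hashes path.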