Uploading to remote and calculating local md5sum on the fly

Hi,
I need some advice to speed things up.
I upload very big files to Google Drive and I would like to calculate the md5sum of the local file while uploading, not prior to it. Right now I do it in two steps:

rclone md5sum test.txt > test.txt.md5
rclone copy test.txt REMOTE:
And then I can compare the local and remote checksums.

But I would like to save time by doing both concurrently and reading the file just once, because the files are big.
I mostly use Linux, in case there is a better shell alternative, but I could not find one.
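One shell-level way to do this single-read pipeline is `tee`, which duplicates a stream so one copy can be hashed while the other goes to the uploader. This is only a sketch: the file names are stand-ins, and writing to `/tmp/demo_dst.bin` stands in for the place where a real pipeline would feed `rclone rcat REMOTE:...` (a placeholder remote).

```shell
# Goal: read the source ONCE, hashing while "uploading".
# tee writes the stream to /tmp/demo_dst.bin (stand-in for the uploader)
# while md5sum hashes the very same stream in the main pipe.
printf 'hello' > /tmp/demo_src.bin            # tiny stand-in for a huge file

md5=$(tee /tmp/demo_dst.bin < /tmp/demo_src.bin | md5sum | awk '{print $1}')
echo "$md5  demo_src.bin" > /tmp/demo_src.bin.md5
echo "$md5"
```

With bash process substitution the destination can be a process instead of a file, e.g. `tee >(rclone rcat REMOTE:big.img) < big.img | md5sum`; the upload then runs concurrently with the hashing, though the exit status of the process-substituted rclone has to be checked separately (e.g. via a log file).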

Thanks,
Manuel F.

Hi Manuel,

It is my understanding that this already happens in rclone, so it shouldn't be necessary to do the extra md5 calculation and comparison in your script.

Here are a few quotes from https://rclone.org/:

Rclone really looks after your data. It preserves timestamps and verifies checksums at all times.

Features

  • Transfers
    • MD5, SHA1 hashes are checked at all times for file integrity

Do you have situations where this seems not to be the case?

Just to add some output: it already does that, and you can see it in the debug log.

 rclone copy /etc/hosts GD: -vvv
2022/03/18 08:55:54 DEBUG : Setting --config "/opt/rclone/rclone.conf" from environment variable RCLONE_CONFIG="/opt/rclone/rclone.conf"
2022/03/18 08:55:54 DEBUG : rclone: Version "v1.57.0" starting with parameters ["rclone" "copy" "/etc/hosts" "GD:" "-vvv"]
2022/03/18 08:55:54 DEBUG : Creating backend with remote "/etc/hosts"
2022/03/18 08:55:54 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2022/03/18 08:55:54 DEBUG : fs cache: adding new entry for parent of "/etc/hosts", "/etc"
2022/03/18 08:55:54 DEBUG : Creating backend with remote "GD:"
2022/03/18 08:55:54 DEBUG : Google drive root '': 'root_folder_id = 0AGoj85v3xeadUk9PVA' - save this in the config to speed up startup
2022/03/18 08:55:55 DEBUG : hosts: Need to transfer - File not found at Destination
2022/03/18 08:55:56 DEBUG : hosts: md5 = 0d603989599e838cf5fa5822d7257329 OK
2022/03/18 08:55:56 INFO  : hosts: Copied (new)
2022/03/18 08:55:56 INFO  :
Transferred:   	        184 B / 184 B, 100%, 183 B/s, ETA 0s
Transferred:            1 / 1, 100%
Elapsed time:         1.3s

2022/03/18 08:55:56 DEBUG : 6 go routines active
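If you want an explicit after-the-fact comparison as well, `rclone check` compares the two sides by checksum without transferring any file data. A sketch, with `REMOTE:` and the local path as placeholders, guarded so it is a no-op on machines without rclone installed:

```shell
# `rclone check` compares files between source and destination using
# checksums where the backend supports them (Google Drive stores MD5),
# transferring no file data. REMOTE: is a placeholder remote name.
checked="skipped"
if command -v rclone >/dev/null 2>&1; then
  # non-zero exit means differences or errors were found
  rclone check /path/to/local REMOTE: --one-way || true
  checked="ran"
fi
echo "$checked"
```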

Thank you, Ole & Animosity022,

I had not realized that debugging shows what I need, but I oversimplified my example: for most of my transfers I don't use "copy" but the newer "rcat", and that's where I need it the most. Maybe the checksum is not implemented there?

Thanks again
Manuel F.

cat /LOG/test.txt | rclone rcat -vvv SAR:kk/test.txt
2022/03/18 16:38:56 DEBUG : rclone: Version "v1.57.0" starting with parameters ["rclone" "rcat" "-vvv" "-P" "SAR:kk/test.txt"]
2022/03/18 16:38:56 DEBUG : Creating backend with remote "SAR:kk/"
2022/03/18 16:38:56 DEBUG : Using config file from "/root/.config/rclone/rclone.conf"
2022/03/18 16:38:57 DEBUG : fs cache: renaming cache item "SAR:kk/" to be canonical "SAR:kk"
2022-03-18 16:38:58 DEBUG : test.txt: Sending chunk 0 length 100000000
2022-03-18 16:39:01 DEBUG : test.txt: Size and modification time the same (differ by -506.926µs, within tolerance 1ms)
Transferred: 95.367 MiB / 95.367 MiB, 100%, 23.838 MiB/s, ETA 0s
Transferred: 1 / 1, 100%
Elapsed time: 4.6s
2022/03/18 16:39:01 DEBUG : 6 go routines active

You can't checksum on the fly from a pipe, which is what you are doing. You had only shared rclone copy, and those commands do checksum since they have a file input.

If you had a source file, you'd use copy/sync/move/etc.

If you are using a pipe, there generally isn't a single source file, as you'd be piping in from multiple inputs/files, which makes an md5sum useless.

Sure, in this case just use rclone copy,
but it does seem possible to use rcat and have rclone calculate the md5 hash:

type  d:\test\6Mi\file.txt   | rclone rcat gdrive:zork/file.txt -vv --stats=5h --checksum 
DEBUG : Setting --config "C:\\data\\rclone\\rclone.conf" from environment variable RCLONE_CONFIG="C:\\data\\rclone\\rclone.conf"
DEBUG : rclone: Version "v1.57.0" starting with parameters ["C:\\data\\rclone\\rclone" "rcat" "gdrive:zork/file.txt" "-vv" "--stats=5h" "--checksum"]
DEBUG : Creating backend with remote "gdrive:zork/"
DEBUG : Using config file from "C:\\data\\rclone\\rclone.conf"
DEBUG : Google drive root 'zork': 'root_folder_id = 0AIYnsu88uXytUk9PVA' - save this in the config to speed up startup
DEBUG : fs cache: renaming cache item "gdrive:zork/" to be canonical "gdrive:zork"
DEBUG : file.txt: Sending chunk 0 length 6291456
DEBUG : file.txt: md5 = fc873c046f6fdf1384d2cba806e69d6c OK
DEBUG : file.txt: Size and md5 of src and dst objects identical
INFO  : 
Transferred:   	        6 MiB / 6 MiB, 100%, 2.996 MiB/s, ETA 0s
Transferred:            1 / 1, 100%
Elapsed time:         3.3s

and without --checksum,

type  d:\test\6Mi\file.txt   | rclone rcat gdrive:zork/file.txt -vv --stats=5h 
DEBUG : Setting --config "C:\\data\\rclone\\rclone.conf" from environment variable RCLONE_CONFIG="C:\\data\\rclone\\rclone.conf"
DEBUG : rclone: Version "v1.57.0" starting with parameters ["C:\\data\\rclone\\rclone" "rcat" "gdrive:zork/file.txt" "-vv" "--stats=5h"]
DEBUG : Creating backend with remote "gdrive:zork/"
DEBUG : Using config file from "C:\\data\\rclone\\rclone.conf"
DEBUG : Google drive root 'zork': 'root_folder_id = 0AIYnsu88uXytUk9PVA' - save this in the config to speed up startup
DEBUG : fs cache: renaming cache item "gdrive:zork/" to be canonical "gdrive:zork"
DEBUG : file.txt: Sending chunk 0 length 6291456
DEBUG : file.txt: Size and modification time the same (differ by -295.1µs, within tolerance 1ms)
INFO  : 
Transferred:   	        6 MiB / 6 MiB, 100%, 2.971 MiB/s, ETA 0s
Transferred:            1 / 1, 100%
Elapsed time:         3.7s

rclone md5sum d:\test\6Mi\file.txt 
fc873c046f6fdf1384d2cba806e69d6c  file.txt

rclone lsf gdrive:zork --format="phs" 
file.txt;fc873c046f6fdf1384d2cba806e69d6c;6291456

The use case of one file wouldn't matter, as you'd just use rclone copy and not pipe anything.

Usually cat or rcat would be used when piping multiple things together. You can't MD5 that (or it won't match), because multiple inputs are being combined into a single file; the single file gets the MD5SUM on the remote, not the pieces making it up.

felix@gemini:~/test$ touch one
felix@gemini:~/test$ touch two
felix@gemini:~/test$ touch three
felix@gemini:~/test$ cat * | rclone rcat GD:test.tar -vvv
2022/03/18 16:09:22 DEBUG : Setting --config "/opt/rclone/rclone.conf" from environment variable RCLONE_CONFIG="/opt/rclone/rclone.conf"
2022/03/18 16:09:22 DEBUG : rclone: Version "v1.58.0" starting with parameters ["rclone" "rcat" "GD:test.tar" "-vvv"]
2022/03/18 16:09:22 DEBUG : Creating backend with remote "GD:"
2022/03/18 16:09:22 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2022/03/18 16:09:22 DEBUG : GD: Loaded invalid token from config file - ignoring
2022/03/18 16:09:22 DEBUG : Saving config "token" in section "GD" of the config file
2022/03/18 16:09:22 DEBUG : GD: Saved new token in config file
2022/03/18 16:09:22 DEBUG : Google drive root '': 'root_folder_id = 0AGoj85v3xeadUk9PVA' - save this in the config to speed up startup
2022/03/18 16:09:22 DEBUG : Google drive root '': File to upload is small (0 bytes), uploading instead of streaming
2022/03/18 16:09:23 DEBUG : test.tar: md5 = d41d8cd98f00b204e9800998ecf8427e OK
2022/03/18 16:09:23 INFO  : test.tar: Copied (new)
2022/03/18 16:09:23 DEBUG : 9 go routines active

If the OP is using cat with one file, I dunno, as that doesn't make sense to me from a use-case perspective.

Pipes are normally used to combine/stream many into one, in my experience.

Hi again,
As I said, some of my files are a few TB, and in the future maybe even bigger, so I'm testing "rcat/cat" as an alternative to "copy" after failures or losses.
Thanks jojo... for the hint, but I think that for small files/streams rcat reverts automatically to copy. Anyway, I will test it further, so any advice like yours is welcome.

Manuel F.

I don't understand why, if you have single files, you'd want to use cat/rcat instead of copy.

That isn't the case as I shared in my output above.

I think you are really waiting for resumable uploads: if you have a big file and something happens, it would resume.

Resume uploads · Issue #87 · rclone/rclone (github.com)

I don't blame you; it's mostly because of GDrive limits: first, the max of 400K files, and second, the 5 TB max file size (I'm close to it, but that's less important).
My contents are NTFS images (VHDX with ACLs). Also, yes, I fear failed connections and "resumability", but I'm happy with it as it is now using rcat/cat. Only checksums are missing to assure me of what I'm doing.

Thanks

I'm still not understanding why you are using cat/rcat for a single file, as you can't resume in that fashion and you lose checksumming since you are piping; pipes are meant for merging files in general.

If it's working for you, good luck; I wouldn't back up my data that way, but your data, your choice, as they say.

I wouldn't expect the feature to get much traction unless you'd like to code it and submit a PR, as there isn't much of a use case.

My well-stroked ego thanks you.

It is possible to get the md5 for each file, regardless of how the file is uploaded, streaming or not:


type d:\test\zork\1B.txt | C:\data\rclone\rclone rcat gdrive:zork/1B.txt -vv --checksum 
DEBUG : Google drive root 'zork': File to upload is small (1 bytes), uploading instead of streaming
DEBUG : 1B.txt: md5 = c4ca4238a0b923820dcc509a6f75849b OK
INFO  : 1B.txt: Copied (new)

type d:\test\zork\15MiB.file | C:\data\rclone\rclone rcat gdrive:zork/15MiB.file -vv --checksum 
DEBUG : 15MiB.file: Sending chunk 0 length 8388608
DEBUG : 15MiB.file: Sending chunk 8388608 length 7340032
DEBUG : 15MiB.file: md5 = d7ed9aa0b65b549204ca03b805800de4 OK
DEBUG : 15MiB.file: Size and md5 of src and dst objects identical

Right, you can do that, but why?

You can just use rclone copy on a single file and it automatically does all that.

I agree with you, 1000%, but the OP does not.

OK, I'm trying to understand what a good use case for a single-file pipe would be, and I cannot; that's why I was asking.

I'll drop it though as it's not my data backing up :slight_smile:


Ego recte sum et tu recte es ("I am right and you are right") :wink:

I always add --size with rcat; if you do that, you never get the checksum in the debug log. There must be a reason. Now I have to compare speeds in both cases.
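For reference, the byte count passed to `--size` can be read from file metadata without scanning the file's contents, e.g. with `stat`. A sketch (the rclone line is shown as a comment, since `REMOTE:` is a placeholder remote):

```shell
# --size lets rcat declare the byte count up front instead of streaming
# blind. The size comes from metadata, not from reading the data:
# stat -c%s is the GNU coreutils form (BSD/macOS would use stat -f%z).
printf 'hello' > /tmp/demo_size.bin           # 5-byte stand-in file
size=$(stat -c%s /tmp/demo_size.bin)

# With a configured remote this would be:
# cat /tmp/demo_size.bin | rclone rcat --size "$size" REMOTE:demo_size.bin
echo "$size"
```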

Thanks one more time

Given that gdrive does not require it, why must you use --size?

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.