What exactly is the MD5 checksum of a file used for?

loosus456 · November 7, 2020, 9:46pm

What is the problem you are having with rclone?

Files that are very large (over 500 GB) are taking a long time to upload to Wasabi, and that seems to be caused by the calculation of the MD5 checksum. It's literally taking longer to calculate the checksum than it does to upload the file. So, I am pondering the use of --s3-disable-checksum.

If I disable the MD5 checksum, what exactly am I losing? Does rclone exclusively use the checksum to ensure the file is not corrupted in transit? I would think TCP itself is already ensuring transit integrity, right?

So, I want to know the exact disadvantages of not using the MD5 checksum. I don't want to disable it and regret it, but I also need files uploaded faster than they currently are. I'm trying to determine if the MD5 checksums are needed for my use case.

What is your rclone version (output from `rclone version`)

v1.53.2

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Windows Server 2019 Standard, 64-bit

Which cloud storage system are you using? (eg Google Drive)

Wasabi (S3)

The command you were trying to run (eg `rclone copy /tmp remote:tmp`)

"rclone" "--config" ".\\rclone.conf" "copy" "D:\\rapidrecoveryserverarchives" "Wasabi Cloud:[BUCKETNAME]/2020-11-07_1014" "--log-level" "DEBUG" "--log-file=uploadlog.txt"

The rclone config contents with secrets removed.

[Wasabi Cloud]
type = s3
provider = Wasabi
access_key_id = SECRET
secret_access_key = SECRET
region = us-east-1
endpoint = s3.wasabisys.com

A log from the command with the `-vv` flag

The log shows a message similar to the following repeatedly until the checksum has been created.

2020/11/07 11:59:47 INFO  : 
Transferred:   	   33.473G / 990.971 GBytes, 3%, 5.441 MBytes/s, ETA 2d2h3m32s
Transferred:          826 / 827, 100%
Elapsed time:    1h45m0.2s
Transferring:
 * AABackup/8cbce0bb-73b2…02761ba51/datafile.bin:  0% /957.498G, 0/s, -

asdffdsa · November 7, 2020, 10:02pm

hello and welcome to the forum,

i use rclone to upload large veeam backups files to wasabi.
my local server is the free windows 2019 server hyper-v edition using REFS file system, which is soft-raid.
it takes rclone much longer to calculate the md5 checksum of the local file then to upload that file to wasabi.

there is only one way to make sure a file is uploaded correctly.

rclone calculates the checksum of a local file named 500GB.file
rclone uploads 500GB.file to wasabi.
wasabi calculates the md5 checksum of its version of 500GB.file
rclone compares the md5 checksum of the local file 500GB.file to the md5 checksum the corresponding 500GB.file in wasabi.

loosus456 · November 8, 2020, 6:15am

Thank you for the information.

On a modern connection using TCP, would you agree that the likelihood of an unsuccessful upload is minimal even without an MD5 checksum?

Is there a way to do rolling MD5 checksums on parts of the file rather than doing it all at once?

Animosity022 · November 8, 2020, 12:21pm

You can always turn it off and it's up to you if you find it needed or not and how important your data is as that's why the option exists.

I use a crypted remoted and no checksums happen but my data also is not critical for me. If it was critical, I'd just checksums to validate.

You need the whole file to calculate a checksum.

loosus456 · November 8, 2020, 12:34pm

Isn't the TCP checksum already ensuring integrity of the file in transit, though?

Animosity022 · November 8, 2020, 12:36pm

That's making sure a packet gets to the right spot.

If you write a file to a bad block on a disk as an example, that would cause an issue even though transport was correct.

If the remote has a bug in the OS and writes bad data, a checksum would get that.

If you have bad memory on the system and the wrong data is sent, a checksum would get that.

loosus456 · November 8, 2020, 12:52pm

Ah. I see. That is the sort of info I was looking for. Thanks!

asdffdsa · November 8, 2020, 2:47pm

is there a specific reason that you need to the rclone copy process to be faster?
there might be ways to deal with that?

loosus456 · November 8, 2020, 3:00pm

It is taking so much time to upload that it's time to do another upload before the previous upload finishes.

When I turn off MD5 checksum, the problem goes away.

Right now, I'm working around it by reducing the amount we are uploading, but that is not ideal either because it means some things are not really being backed up.

asdffdsa · November 8, 2020, 3:04pm

what kind of files are you backing up?
what is your max upload speed?

loosus456 · November 8, 2020, 3:17pm

The biggest is about 1 TB. We get an average of about 95 MBytes/sec. Once the upload starts, it gets done quickly. But it spends the majority of the time calculating the MD5 checksum before it ever starts uploading.

asdffdsa · November 8, 2020, 3:40pm

you can do this in two steps.

quick rclone copy--s3-disable-checksum the new file to wasabi.
slow rclone check that new file against the same file in wasabi.

loosus456 · November 8, 2020, 3:46pm

I thought about that, but as @Animosity022 stated, I would be risking the integrity of the files if I disabled the MD5 checksum, right?

asdffdsa · November 8, 2020, 3:51pm

rclone check would ensure the integrity of the rclone copy
if something went wrong with the rclone copy, then rclone check would catch that

loosus456 · November 8, 2020, 4:04pm

I see. So the first would be fast, and then the second would be slow. That might work!

ncw · November 9, 2020, 11:38am

I think the question I'd like to ask is why is calculating the MD5SUM so slow? It should progress roughly as fast as your disk can deliver data. On my SSD laptop rclone md5sum can do about 500MB/s.

Can you do some tests with rclone md5sum on big files and calculate how many MB/s they are doing?

If you do disable --s3-disable-checksum what you are missing is the metadata on the S3 object that has the md5sum. For small objects S3 provides this as the ETag but for large objects uploaded as chunks they don't, so rclone calculates it.

Rclone provides this at the start of the upload. If it wanted to add it at the end of the upload (which would save the delay as it could be calculated streaming) then rclone could have to COPY the object to add it which takes time and costs money.

So --s3-disable-checksum only applies to large objects that are uploaded in chunks. Each chunk is uploaded with an sha1 hash which s3 checks so it is extremely unlikely that corruption could pass undetected in a multipart upload even without --s3-disable-checksum.

What you do lose with --s3-disable-checksum is the ability to do rclone check. Without the md5sum in the metadata rclone can't find out the MD5SUM of an object so can't check it properly. This would mean that you can't detect bitrot on your local disks, or if you have bad RAM which flipped a few bits on the upload.

So if you just want to be sure that the upload was OK then you can use --s3-disable-checksum just fine. However for long term archiving and full end to end checking you want it.

asdffdsa · November 9, 2020, 3:25pm

also, consider downloads.

the only way to ensure the integrity of a download, is to compare the checksum of the downloaded file against the checksum saved in wasabi.

ncw · November 9, 2020, 5:24pm

Good point, I forgot that one.

asdffdsa · November 9, 2020, 8:38pm

on a 20GB file
my laptop, intel i5, win10.64, 1 ssd, ntfs, = 300MB/s
my home server, intel i5, win server 2019, 3 hdd, refs softraid5 = 70MB/s
refs = microsoft version of zfs .

ncw · November 11, 2020, 5:25pm

That is about the speeds I'd expect SSD vs HDD.