How do we really know an uploaded file is the same as the local file?

if a local file is uploaded in chunks then we cannot know that the remote file is exactly the same as the local file, true?

and if true, given i am very paranoid, i am scared and confused.

i found this item on the website:

MD5 sums are only uploaded with chunked files if the source has an MD5 sum. This will always be the case for a local to azure copy

in effect, the file is copied from local to remote and the md5sum of the local file becomes metadata like mtime, correct?

But you don't use Azure, do you? Large files are handled differently per backend...

And also I don't think regular transfer chunking is the same as parting (to overcome inherent backend limitations).
Rclone already does transfer chunking but I think that is basically like a "resume" that writes to the same destination. In that case the server should be able to make a hash.
I think "parting" (but also called chunking depending on where you read) means you actually save to multiple files to overcome a max-size limit. In that case the server often/always(?) can't make a hash for the entire thing.
This new backend does that I believe:
https://tip.rclone.org/chunker/

Besides, even if you can't get a hash of the entire file when it has to be split, you can still hash each part and make sure they arrive unmolested. That should equally guarantee that the full file is correct. Whether that logic is implemented in that backend I don't know, but it would make sense if it were. I don't see any big obstacle to doing that.
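Just to illustrate the principle with standard tools (this is purely a local demonstration, not what rclone actually does internally): split a file into parts, hash each part, and verify the parts on the other end. If every part checks out, the reassembled whole has to match too.

  split -b 100M bigfile bigfile.part.      # cut into 100 MiB pieces (GNU coreutils)
  md5sum bigfile.part.* > parts.md5        # record one hash per piece
  md5sum -c parts.md5                      # re-run on the receiving side to verify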

Take all of that with a grain of salt; you may have to get NCW to verify. My understanding of this topic is not complete.

Finally, don't be so paranoid. The transport layer has basic error detection already. The odds of an in-flight corruption that can't be detected except by a full hash are pretty darn low from a mathematical/statistical perspective. Hashes have many uses, but I wouldn't say they are required for stable and error-free transfers. Not unless it's truly mission-critical data.

thanks for the reply but you are being very theoretical about it and i am aware of all that you wrote.

if the cloud provider does not do a checksum on the remote file, what we upload cannot be verified to be the same as the original local file.

is there a cloud provider that does its own checksum of uploaded files and that checksum can be compared to the md5sum of the local file?

I'm pretty sure this is what normally happens for all "normal" files - i.e. the ones that do not need special handling due to size limitations. It's with those special-handling cases that hashing becomes a little trickier.

On Gdrive, which I use, the max file size is very large, so all those hashes should be generated by the server (presumably calculated in a rolling fashion as the data is received). This is generally how filesystems that include hashing metadata work - and that's basically what cloud backends use. (It's not common in end-user systems, but you could run one even on a Windows machine if you were willing to use a different filesystem.)
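If you want to see what hash metadata the remote reports for a file, rclone can show it - the remote and path names here are just placeholders for whatever you have configured:

  rclone lsjson --hash gdrive:some/folder/file.bin

As I understand it, for a hash-supporting remote like Drive the MD5 in that output is the value the provider stores, not something rclone recalculated locally.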

hopefully @ncw or someone can provide clarity on this most important topic?

You can use:

https://rclone.org/commands/rclone_check/
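For example (the paths and the remote name are just placeholders):

  rclone check /path/to/local remote:bucket/path

By default it compares the size and hash of every file on both sides and reports any that don't match.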

thanks but as i understand it, rclone check just recalculates the md5 of the local file and compares that to the metadata of the remote file.

if the md5 of the remote file is just a copy of the md5 of the local file when rclone uploaded that file to the remote then how do we know that the local file matches the remote file?

are there providers that calculate their own checksums?
and if so, can rclone compare the local md5 against the md5 as calculated by the remote provider?

thanks much,

Once the file is uploaded, the other side calculates an md5 on the file and stores it.

It compares the local md5 to the remote md5 on the provider, so if they match, it's the same file.

If you read a bit further down, you can use --download and compare the file all the way through.

If you supply the --download flag, it will download the data from both remotes and check them against each other on the fly. This can be useful for remotes that don’t support hashes or if you really want to check all the data.
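So for a full end-to-end verification you can do something like this (placeholder paths again):

  rclone check /path/to/local remote:bucket/path --download

That reads the data back from both sides and compares it directly, instead of trusting any stored hash.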

thanks, i am aware of the --download flag.

but as per this:
MD5 sums are only uploaded with chunked files if the source has an MD5 sum. This will always be the case for a local to azure copy

to me, rclone takes the md5 of the local file and turns that into metadata on the remote file.

we could imagine a bug in rclone that miscalculates the md5 of the local file and stores that as the md5 metadata on the remote file, and in that case we have no way to know the md5 of the real remote file.

If you want to play the 'imagine a bug' scenario, this becomes an open-ended question with no answer.

https://rclone.org/commands/rclone_md5sum/

You can run md5sum on the source and destination files and compare.
You can download to compare as above.

Those validate the files match if they are that important.
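For example, to see the hashes on each side yourself (placeholder paths):

  rclone md5sum /path/to/local/dir
  rclone md5sum remote:bucket/dir

The first set is computed locally by rclone; the second is whatever the remote returns for its stored objects (remotes without MD5 support won't give you anything useful here).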

i am not playing games and i am asking a valid question, which so far has not been answered.

my point is to understand how the md5 of the remote file is calculated.

is it just a copy of the local md5 turned into a metadata tag by rclone, or is the md5 actually calculated by the cloud provider independently of rclone?

if i backup important data, i just want to understand what is really going on.

Depending on the provider, the provider calculates the md5sum.

Hashes for all the providers are listed here:

https://rclone.org/overview/

Is there a specific provider you have a question about?

thanks, i have looked at that webpage before and i am confused.

on the one hand, that page states that azure blob uses md5

on the other hand from rclone website, we have MD5 sums are only uploaded with chunked files if the source has an MD5 sum. This will always be the case for a local to azure copy

so it seems to me that rclone is just copying the md5 of the local file to metadata of the remote file.
so in effect, azure is not calculating its own md5.

i just want to know what rclone is doing with md5.
i would like to know which cloud providers calculate the checksum of uploaded files themselves, without relying on rclone to do that.

thanks

Is the question, how does md5sum work from source to destination with Azure Blob Storage?

But this is a specific exception for Azure, as I noted, not the general rule. It's specifically under the "limitations" section for the azure backend. You're on Wasabi, aren't you, so is this even relevant to you?

As both Ani and I have already said - the normal way is that the hash for the transferred file is generated on the server side. You then compare it to the local side. If they match, they must be bit-identical.

EDIT: If the question is specific to Azure (which hasn't really been clear), then I can't really tell you any more than what the documentation states. The phrasing would seem to indicate that it does copy the metadata from local - presumably due to some technical limitation of Azure.

sure, i would like to know, please share.

well, wasabi has been unreliable for many months now as per

so i am looking for another option, and since i use azure for VMs, i thought i would check out its blob storage.

and thus the page https://rclone.org/overview/ does not tell me whether the server calculates the md5 or not, so we cannot know which providers follow what you call the normal way?

so is there a way to know, per provider, whether it is rclone calculating the local md5 and turning that into metadata, versus the cloud provider, on its own, calculating the md5sum?

thanks

Any other backend that does not specifically list it as a limitation... i.e. most of them.
The limitation may be due to it being a blob-type storage. Do we have any more of those in the backend list? It may be worth checking whether that is a pattern.

And of course, Azure will do this too for non-chunked files. The limitation is specific to chunking, according to the docs.

you mentioned that you use google drive.
do you know how the md5 for remote files is calculated?

rclone copying md5 of local file to remote metadata?
or
google calculated?

This. AKA server-side.
I'm not sure why you refuse to trust me on this - this is how it's normally done UNLESS there are specific limitations and workarounds needed (typically to work around some maximum file-size restriction in either the filesystem or the backend servers).

I guess you have to either trust that the backend implementers will note such limitations when they exist - or else do some research on your own and verify that the systems a given candidate provider uses can support a true max file size (i.e. without parting) above what you realistically require. If they do, then they should never need any workarounds related to this.