Hashing during Upload


#1

Hi

I’m uploading large files (1 TB+) about once a week to Backblaze B2. But every time I upload such a file, it sits at 0 Bytes/s for around 20 hours. I assume rclone is hashing the file during that time, since smaller files don’t take that long before they start uploading. (I also think I read somewhere that this is what’s going on.)

But now for my question. Can’t the byte stream that goes into the hashing function go to both the hashing function and Backblaze (maybe other providers too?) at the same time? Then, instead of reading the file twice from the hard drive, it would be read only once and the stream split in two: one for hashing and one for uploading. See the example below.

This is what I assume Rclone is doing atm:

This is what I’m suggesting it should do (if it is possible):

This way the file would only be read from disk once, and hashing and uploading would be done in parallel, which would radically improve performance for large files like this.
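The single-read idea can be sketched roughly as follows. This is a minimal Python illustration, not rclone's actual Go code: `hash_while_copying` and `sink` are hypothetical names, and the sink is an in-memory buffer standing in for the upload connection.

```python
import hashlib
import io

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for the sketch


def hash_while_copying(source, sink, chunk_size=CHUNK_SIZE):
    """Read `source` once, feeding every chunk to both the SHA-1 and the sink.

    `source` is any object with .read(n); `sink` is any object with .write(b).
    Returns (hex digest, number of bytes copied).
    """
    sha1 = hashlib.sha1()
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        sha1.update(chunk)  # branch 1: hashing
        sink.write(chunk)   # branch 2: "uploading" (here just a buffer)
        total += len(chunk)
    return sha1.hexdigest(), total


# Usage: an in-memory stream stands in for the large file on disk.
payload = b"example payload " * 1000
uploaded = io.BytesIO()
digest, size = hash_while_copying(io.BytesIO(payload), uploaded)
print(digest, size)
```

Note that with this approach the digest is only known once the last chunk has been sent.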

I don’t know, however, whether this is possible with the structure of rclone, and I have never done something like this myself. So it might be impossible, but in my head it should work. Then again, that is based only on pseudocode.

Best Regards


#2

Yes, this is what is happening.

Love your diagram!

The reason rclone doesn’t do what you suggest for b2 by default is that b2 needs the sha1sum at the start of the transfer :frowning: My contact at Backblaze knows this is a problem and assures me that the b2 team is working on it. Unfortunately, it isn’t straightforward to fix.

What I could do is make a flag for b2 similar to this one (which is for s3):

  --s3-disable-checksum                Don't store MD5 checksum with object metadata

What this would do is disable the calculation of the sha1sum for large objects in advance. This would mean that those large objects wouldn’t have a sha1sum, but they would start uploading instantly.

You can simulate this using rclone rcat b2:bucket/largefile < largefile BTW.

So if you think not having the sha1sum at b2 is an acceptable tradeoff for getting rid of the 20h wait, then please make a new issue on GitHub, link to this thread on the forum, and we’ll see what we can do!


#3

Thanks, a picture says more than 1000 words you know :wink:

I’d rather not be without the hash validation as the files are encrypted and a single bit flip would make them unusable.

I know that this might not be the best way to solve this, and that it might be better if Backblaze changed the structure of the uploads to make this possible. But can’t they just add a function to change the hash after a file upload? Then rclone could upload the file with a fake hash, like all zeros or whatever, and when the upload is complete, update the hash to the correct value.

An even safer way might be to upload the file without a hash value and set it after the upload. That covers the case where something crashes during or just before the update to the correct value, which with a fake hash would leave the file invalid when downloading it. Backblaze could then make it impossible to change the hash if the file was uploaded with one, or if it has already been set once, so no one can change the hash to invalidate files on purpose. The exception is a file that was uploaded without a hash and never had one set afterwards.

A solution to that might be for Backblaze to add a wrapper for the standard upload call that adds a flag of some sort, making it possible to set the hash after uploading. This way old code would work as usual without any new security holes, and new code would work as expected. And in case of a problem with setting the hash after the upload, the client would get an error, and the user would know that something might be wrong with the upload.
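To make the proposal concrete, here is a toy in-memory model in Python. It is purely hypothetical and is not Backblaze's real API: the class `DeferredHashStore`, the `allow_late_hash` flag and the `set_hash` call are names invented for this sketch, and the server-side check that the late hash matches the stored bytes is an extra assumption.

```python
import hashlib


class HashAlreadySet(Exception):
    """Raised when a caller tries to overwrite an existing hash."""


class DeferredHashStore:
    """Hypothetical model of 'upload now, set the hash exactly once later'."""

    def __init__(self):
        self._data = {}
        self._sha1 = {}

    def upload(self, name, data, sha1=None, allow_late_hash=False):
        # Old-style callers must still supply the hash up front.
        if sha1 is None and not allow_late_hash:
            raise ValueError("sha1 required unless allow_late_hash is set")
        if sha1 is not None and hashlib.sha1(data).hexdigest() != sha1:
            raise ValueError("sha1 does not match uploaded data")
        self._data[name] = data
        self._sha1[name] = sha1  # None means "not set yet"

    def set_hash(self, name, sha1):
        # The hash can be set once; it can never be changed afterwards.
        if self._sha1[name] is not None:
            raise HashAlreadySet(name)
        # Assumption: the server verifies the late hash against stored bytes,
        # so the client gets an error if something went wrong in transit.
        if hashlib.sha1(self._data[name]).hexdigest() != sha1:
            raise ValueError("sha1 does not match stored data")
        self._sha1[name] = sha1

    def get_hash(self, name):
        return self._sha1[name]


# Usage: upload without a hash, then set it once the digest is known.
store = DeferredHashStore()
blob = b"encrypted bytes"
store.upload("largefile", blob, allow_late_hash=True)
store.set_hash("largefile", hashlib.sha1(blob).hexdigest())
```

A second `set_hash` call on the same file raises `HashAlreadySet`, which is the "no one can change the hash on purpose" part of the idea.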

Sent from my phone, so no diagrams this time… But if you have problems following my thoughts I’ll gladly make one when I have access to my computer.


#4

Yes, that is exactly what I’ve been discussing with Backblaze. Unfortunately, once created, files are immutable, I’m led to understand, so this is actually quite a difficult change for them.


#5

Aha, so then it all depends on how they have implemented that… Should I still put up an issue on GitHub about this, as a reminder for when Backblaze fixes this?


#6

Just a quick thought… What happens if you do an online copy of the file? Like could you upload a file to backblaze with a fake/no hash and then make an online copy of the file and add the correct hash during the online copy? Does it even allow you to change the hash of the file when copying? Probably not. But just in case it does, it might be a possible workaround until backblaze are done with their part.


#7

You can if you want, though I think there might be an issue about it already!

A nice idea, but Backblaze doesn’t support server side copies or moves yet either! That means that --backup-dir doesn’t work for rclone, which is a bit of a shame.

BTW That is essentially the way you change metadata in S3 - you do a server side copy specifying new metadata.
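For illustration, a minimal Python sketch of that S3 pattern. It assumes a boto3-style S3 client whose `copy_object` call accepts `CopySource`, `Metadata` and `MetadataDirective`; the helper name `replace_metadata` and the bucket, key and metadata values are made up for the example.

```python
def replace_metadata(s3_client, bucket, key, new_metadata):
    """Rewrite an object's user metadata by copying the object onto itself.

    `s3_client` is assumed to be a boto3-style S3 client, for example
    boto3.client("s3"). No object data passes through the caller.
    """
    return s3_client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},  # same object as source
        Metadata=new_metadata,
        MetadataDirective="REPLACE",  # use new_metadata instead of copying the old
    )


# Hypothetical usage (needs boto3 and credentials, so it is left commented out):
# import boto3
# replace_metadata(boto3.client("s3"), "my-bucket", "largefile", {"sha1": "..."})
```

As far as I know a single S3 copy request is limited to objects of up to 5 GB, so for files in the 1 TB range a multipart copy would be needed instead.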