Azure Blob Storage encryption and hashes

What is the problem you are having with rclone?

I'm not sure I have fully understood how hash creation and comparison work with Azure Blob Storage, but the behavior differs between using it directly as a remote and using crypt on top of it.

Consider the following example. I have two files on the NAS:

6740264 2017-07-04 19:08 20170704-190858-GH01C12-S00.JPG*
   6175 2017-09-15 23:33 20170704-190858-GH01C12-S00.JPG.mie*

When I run the sync directly against the Azure Blob Storage remote destination like this

rclone sync /share/Backup/Pictures/test ugisscold:test -vv

hashes are calculated and the files transferred

2020/02/18 20:09:11 INFO  : Azure container test: Waiting for checks to finish
2020/02/18 20:09:11 INFO  : Azure container test: Waiting for transfers to finish
2020/02/18 20:09:12 DEBUG : 20170704-190858-GH01C12-S00.JPG.mie: MD5 = 21b9008a5ac2cf2e39f5b85d0044b4cc OK
2020/02/18 20:09:12 INFO  : 20170704-190858-GH01C12-S00.JPG.mie: Copied (new)
2020/02/18 20:09:39 DEBUG : 20170704-190858-GH01C12-S00.JPG: MD5 = ece0f44ccb04f6930b41e44c1e360534 OK
2020/02/18 20:09:39 INFO  : 20170704-190858-GH01C12-S00.JPG: Copied (new)

At this point I'm not sure whether Azure also calculates the hashes on its side, or rclone sends them as additional metadata, but when I run a check

rclone check /share/Backup/Pictures/test ugisscold:test -vv

it works as expected, checking against the hashes stored in Azure (in the ContentMD5 property)

2020/02/18 20:14:22 INFO  : Azure container test: Waiting for checks to finish
2020/02/18 20:14:23 DEBUG : 20170704-190858-GH01C12-S00.JPG: MD5 = ece0f44ccb04f6930b41e44c1e360534 OK
2020/02/18 20:14:23 DEBUG : 20170704-190858-GH01C12-S00.JPG: OK
2020/02/18 20:14:23 DEBUG : 20170704-190858-GH01C12-S00.JPG.mie: MD5 = 21b9008a5ac2cf2e39f5b85d0044b4cc OK
2020/02/18 20:14:23 DEBUG : 20170704-190858-GH01C12-S00.JPG.mie: OK
2020/02/18 20:14:23 NOTICE: Azure container test: 0 differences found
2020/02/18 20:14:23 NOTICE: Azure container test: 2 matching files

Also, this command

rclone md5sum ugisscold:test 

displays them

ece0f44ccb04f6930b41e44c1e360534  20170704-190858-GH01C12-S00.JPG
21b9008a5ac2cf2e39f5b85d0044b4cc  20170704-190858-GH01C12-S00.JPG.mie

So far, so good.

Now let's try with a crypt remote destination on top of the previous Azure Blob Storage destination.

To avoid mapping back and forth filenames, I used "filename_encryption" = "off" and "directory_name_encryption" = "false" in the configuration.

First I deleted the files and then ran the sync again, this time against the new encrypted remote

rclone sync /share/Backup/Pictures/test ugisscoldcrypt:test -vv

files have been uploaded

2020/02/18 20:28:58 INFO  : Encrypted drive 'ugisscoldcrypt:test': Waiting for checks to finish
2020/02/18 20:28:58 INFO  : Encrypted drive 'ugisscoldcrypt:test': Waiting for transfers to finish
2020/02/18 20:28:59 INFO  : 20170704-190858-GH01C12-S00.JPG.mie: Copied (new)
2020/02/18 20:29:26 INFO  : 20170704-190858-GH01C12-S00.JPG: Copied (new)

but the hash is created only for the second, smaller file, as the following check demonstrates

rclone cryptcheck /share/Backup/Pictures/test ugisscoldcrypt:test -vv

("could not check hash" and "1 hashes could not be checked")

2020/02/18 20:33:33 INFO  : Using MD5 for hash comparisons
2020/02/18 20:33:33 INFO  : Encrypted drive 'ugisscoldcrypt:test': Waiting for checks to finish
2020/02/18 20:33:34 DEBUG : 20170704-190858-GH01C12-S00.JPG: OK - could not check hash
2020/02/18 20:33:34 DEBUG : 20170704-190858-GH01C12-S00.JPG.mie: OK
2020/02/18 20:33:34 DEBUG : 20170704-190858-GH01C12-S00.JPG.mie: OK
2020/02/18 20:33:34 NOTICE: Encrypted drive 'ugisscoldcrypt:test': 0 differences found
2020/02/18 20:33:34 NOTICE: Encrypted drive 'ugisscoldcrypt:test': 1 hashes could not be checked
2020/02/18 20:33:34 NOTICE: Encrypted drive 'ugisscoldcrypt:test': 2 matching files

Also, the following command

rclone md5sum ugisscoldcrypt:test -vv

returns UNSUPPORTED for both

                     UNSUPPORTED  20170704-190858-GH01C12-S00.JPG
                     UNSUPPORTED  20170704-190858-GH01C12-S00.JPG.mie
2020/02/18 20:35:27 Failed to md5sum with 2 errors: last error was: hash type not supported

but I double checked the ContentMD5 property with Azure Storage Explorer and it's set for the second file (otherwise rclone cryptcheck would not have worked for it).
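For anyone comparing the two views: as far as I can tell, rclone prints MD5 values as hex, while the ContentMD5 property in Azure is shown base64-encoded, so the same hash looks different in each place. Here is a minimal Python sketch for converting between the two. The function names are only illustrative and are not part of rclone or any Azure tool.

```python
import base64


def hex_md5_to_base64(hex_digest: str) -> str:
    """Convert a hex MD5 (as printed by rclone) to base64."""
    raw = bytes.fromhex(hex_digest)
    if len(raw) != 16:
        raise ValueError("an MD5 digest must be exactly 16 bytes")
    return base64.b64encode(raw).decode("ascii")


def base64_md5_to_hex(b64_digest: str) -> str:
    """Convert a base64 MD5 (as shown in the ContentMD5 property) to hex."""
    raw = base64.b64decode(b64_digest)
    if len(raw) != 16:
        raise ValueError("an MD5 digest must be exactly 16 bytes")
    return raw.hex()


if __name__ == "__main__":
    # Hash of the .mie file taken from the rclone log above.
    print(hex_md5_to_base64("21b9008a5ac2cf2e39f5b85d0044b4cc"))
```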

Since the first file is bigger than the default chunk size, I decided to try using a bigger size and set it to the maximum (100MB).

So I deleted the files again and then ran the sync with the new chunk size

rclone sync /share/Backup/Pictures/test ugisscoldcrypt:test -vv --azureblob-chunk-size 100M

files have been copied exactly as before

2020/02/18 22:56:17 INFO  : Encrypted drive 'ugisscoldcrypt:test': Waiting for checks to finish
2020/02/18 22:56:17 INFO  : Encrypted drive 'ugisscoldcrypt:test': Waiting for transfers to finish
2020/02/18 22:56:17 INFO  : 20170704-190858-GH01C12-S00.JPG.mie: Copied (new)
2020/02/18 22:56:45 INFO  : 20170704-190858-GH01C12-S00.JPG: Copied (new)

however, now the ContentMD5 property is set for both and checking with

rclone cryptcheck /share/Backup/Pictures/test ugisscoldcrypt:test -vv

runs successfully ("0 differences found")

2020/02/18 23:00:10 INFO  : Using MD5 for hash comparisons
2020/02/18 23:00:10 INFO  : Encrypted drive 'ugisscoldcrypt:test': Waiting for checks to finish
2020/02/18 23:00:11 DEBUG : 20170704-190858-GH01C12-S00.JPG: OK
2020/02/18 23:00:11 DEBUG : 20170704-190858-GH01C12-S00.JPG: OK
2020/02/18 23:00:11 DEBUG : 20170704-190858-GH01C12-S00.JPG.mie: OK
2020/02/18 23:00:11 DEBUG : 20170704-190858-GH01C12-S00.JPG.mie: OK
2020/02/18 23:00:11 NOTICE: Encrypted drive 'ugisscoldcrypt:test': 0 differences found
2020/02/18 23:00:11 NOTICE: Encrypted drive 'ugisscoldcrypt:test': 2 matching files

As I said at the beginning, I may have misunderstood something, but my expectation was that using crypt on top of the Azure remote would not change how hashes are stored and checked.

Can anyone please explain this behavior?

What is your rclone version (output from rclone version)

V1.51.0

Which OS you are using and how many bits (eg Windows 7, 64 bit)

QNap NAS with BusyBox v1.24.1

Which cloud storage system are you using? (eg Google Drive)

Azure Blob Storage Account

I think there are several things going on here.

The first is that crypt does not support hashes, so you'll always get UNSUPPORTED in rclone md5sum. The storage backend may store hashes, but these will be of the encrypted data.

You can use rclone cryptcheck to check these as you've discovered.
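To illustrate why the stored hash is not the hash of your local file, and roughly how a cryptcheck-style comparison can still work, here is a small Python sketch. It rests on loud assumptions: `toy_encrypt` is a made-up stream cipher standing in for real encryption (it is NOT what crypt uses and is not secure), and `cryptcheck_like` is a hypothetical helper. The idea it shows is only this: the remote object carries a random nonce, so you re-encrypt the local file with that same nonce and compare the hash of the result against the stored one.

```python
import hashlib
import os

NONCE_SIZE = 24  # size chosen for illustration; the cipher below is a toy


def toy_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """Toy cipher: XOR with a SHA-256 based keystream. Not secure, not rclone's."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        stream.extend(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
        counter += 1
    body = bytes(p ^ k for p, k in zip(plaintext, stream))
    return nonce + body  # the nonce is stored in front of the encrypted data


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def cryptcheck_like(key: bytes, local_plaintext: bytes,
                    remote_head: bytes, remote_md5: str) -> bool:
    """Re-encrypt the local data with the remote nonce and compare hashes."""
    nonce = remote_head[:NONCE_SIZE]
    return md5_hex(toy_encrypt(key, nonce, local_plaintext)) == remote_md5


if __name__ == "__main__":
    key = b"k" * 32
    data = b"example picture bytes"
    blob = toy_encrypt(key, os.urandom(NONCE_SIZE), data)
    print("plaintext md5:", md5_hex(data))
    print("stored md5:   ", md5_hex(blob))
    print("check passes: ", cryptcheck_like(key, data, blob, md5_hex(blob)))
```

Because the nonce is random, uploading the same file twice gives two different stored hashes, which is why a plain md5sum through crypt has nothing meaningful to report.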

The other piece of the puzzle is how files get uploaded to Azure Blob. The relevant flag is

  --azureblob-upload-cutoff SizeSuffix   Cutoff for switching to chunked upload (<= 256MB). (default 256M)

Files below this size are uploaded in a single chunk and will have a checksum. Files above that size will only have an MD5 checksum if the source backend has an MD5 checksum, since otherwise rclone would have to read the entire file to calculate the checksum first.

When uploading through crypt, the file does not have a source checksum even if the source backend does, because what is needed here is the MD5 of the encrypted data, which we haven't generated yet. To generate it we'd need to read the whole file, encrypt it and md5sum it.
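The rule described above can be written down as a tiny Python sketch. This is only a restatement of the explanation, not rclone's actual code: `blob_gets_md5` is a hypothetical name, the exact behavior at the boundary is an assumption, and it deliberately ignores any effect of the chunk size, which I'm unsure about.

```python
from typing import Optional

DEFAULT_UPLOAD_CUTOFF = 256 * 1024 * 1024  # 256M, the documented default


def blob_gets_md5(size: int, source_md5: Optional[str],
                  upload_cutoff: int = DEFAULT_UPLOAD_CUTOFF) -> bool:
    """Return True if, per the rule above, the uploaded blob ends up with an MD5."""
    if size < upload_cutoff:
        # Single-part upload: the blob gets a checksum.
        return True
    # Chunked upload: an MD5 is only stored if the source can supply one.
    # Through crypt the source MD5 is unavailable, so pass None.
    return source_md5 is not None


if __name__ == "__main__":
    big = 300 * 1024 * 1024
    print(blob_gets_md5(big, "ece0f44ccb04f6930b41e44c1e360534"))  # direct upload
    print(blob_gets_md5(big, None))                                # via crypt
```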

I'm not 100% sure about the chunk-size behavior. Is one of your files exactly the size of the default chunk size?

Hopefully that explains why you are seeing what you are seeing.

Unfortunately you can't set --azureblob-upload-cutoff any bigger; 256MB is the biggest it can be.

This is a little unfortunate. Maybe crypt should generate hashes on demand for local files during uploads when asked - it would be possible to do this using very little memory, just the IO to read the file and the CPU to encrypt and hash it (this is pretty much what the local backend does when asked for an md5sum anyway). Azure Blob isn't the only backend to work like this.
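As a rough idea of what "very little memory" means here, this Python sketch hashes a file in fixed-size chunks so that only one chunk is held in memory at a time. Everything in it is an assumption for illustration: `streaming_md5` is a hypothetical helper, and the `transform` callback is just a placeholder for the encryption step. A real implementation in crypt would have to pick the nonce and produce crypt's own file format before hashing.

```python
import hashlib
from typing import Callable, Optional


def streaming_md5(path: str,
                  transform: Optional[Callable[[bytes], bytes]] = None,
                  chunk_size: int = 64 * 1024) -> str:
    """MD5 of a file, read chunk by chunk; memory use is bounded by chunk_size."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            if transform is not None:
                # Placeholder for "encrypt this chunk" before hashing it.
                chunk = transform(chunk)
            digest.update(chunk)
    return digest.hexdigest()
```

The cost is one extra full read of the file plus the CPU for encrypting and hashing, which is the trade-off mentioned above.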

Thoughts?
