Etag/md5 calculation

Hello all.

I am new to rclone and don't actually utilize the software at this time (another team member handles that aspect).

We presently utilize rclone to move data files to an s3 appliance from our fast storage. It is installed in a container and up until today was at version 1.50.2 (upgraded over time during the last several years).

The issue I am having may seem trivial, but is a task I have been assigned.

I am attempting to write a python script that calculates the etag/md5 on the file to compare to the one generated when the file was migrated to the s3.

Reading through the code on GitHub has helped me determine that it looks like it follows the s3 'standard' with a file size cutoff of 5GB. I seem to be able to calculate the etag/md5 correctly using a 16MB chunk size until around 4.7GB, and then it breaks down.
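
For reference, here is the multipart etag calculation I am trying to reproduce (a minimal sketch of the usual s3 scheme: md5 each part, then md5 the concatenated binary digests and append '-<number of parts>'; the function name and the chunk_size argument are mine, and the chunk size itself is the part I can't pin down):

import hashlib

def multipart_etag(path, chunk_size):
    """Compute an s3-style multipart ETag for a local file."""
    part_digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # md5 each uploaded part individually
            part_digests.append(hashlib.md5(chunk).digest())
    # md5 the concatenated binary part digests, then append the part count
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return "{}-{}".format(combined, len(part_digests))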

I saw this calculation - partSize = int((((size / maxUploadParts) >> 20) + 1) << 20) and attempted to implement it in my code, but am not getting correct results.

I see the chunker.go but I am a novice with go and haven't been able to interpret/understand it yet.

I've seen in the forum that people state that this chunk size is calculated dynamically, but no explanation as to how this is done (or I have missed the explanation thus far).

So, my question is how does rclone dynamically determine chunk size for files?

Any help here would be greatly appreciated as these files have originals on tape, and I'd like to calculate this to determine if there has been corruption of data on the s3 appliance before we migrate the files to a new location (we have already found instances of this happening randomly).

Thank you all in advance for your time and help.

hello and welcome to the forum,

there are many such solutions, you can search the net for them.

this one has a python script

I've already looked at and tested that code. It doesn't return the correct etag/md5.

I ran into this problem with goofys a few months ago and found in their code that they used a 'dynamic' chunk scheme: the chunk size for the first 1000 chunks was 5MB, for the next 1000 chunks it was 25MB, and for chunks 2001 through 10000 it was 125MB.
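
As a rough sketch, that goofys scheme looks something like this (the function name is mine; the tiers are just what I found in their code):

MB = 1024 * 1024

def goofys_part_size(part_number):
    """Chunk size for a given 1-based part number, per the tiers above."""
    if part_number <= 1000:
        return 5 * MB
    elif part_number <= 2000:
        return 25 * MB
    else:
        # parts 2001 through 10000
        return 125 * MB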

I am assuming that rclone is doing something similar and that is what I need to know. The generic answers out there don't work for these use cases that I have found.

Thank you for your answer and suggestion though.

rclone has commands to calculate checksums, are you not able to use that?

I am attempting to independently calculate it, as I had to when goofys was used. The files have already been migrated and I am checking that their integrity is correct before we re-migrate them to a new storage resource.

So, no I am not able to use the rclone commands to calculate it.

Or are you saying that the checksum commands can be used without performing any data movement?

Okay, so I figured out (if I am understanding the code base) that files up to a size of 48GB use the default chunk size of 5MB (5MB x 10,000 parts = 51,200MB, or roughly 48.8GB). Above that, the bitwise calculation determines the chunk size dynamically to keep the number of parts under 10000.

Now my question is about the relatively small files...

What is the file size cutoff where the file's own size stops being the chunk size? Is it 128MB?

I've found that using 4GB as a cutoff and calculating those files' etags with a 16MB chunk size seems to work, but that seems like a guess rather than what rclone is actually doing.

Any insights?

Thank you.

Rclone starts with the default chunk size as set by --s3-chunk-size.

If that makes less than 10,000 parts it goes with that chunk size.

If not then it increases the chunk size to the smallest it needs to be but rounded up to the nearest 1MiB and uses that.

This is implemented here

In that code, partSize is 5MiB by default (this can be changed).

Rclone uses the same size chunks for the whole transfer so once you've worked out the chunk size you should be good to go.

Here is that in Python:

#!/usr/bin/env python3
"""
Calculate the chunk size rclone uses for a given file size
"""

max_upload_parts = 10000

def part_size(size, chunk_size=5*1024*1024):
    size = int(size)
    part_size = chunk_size
    # Adjust part_size until the number of parts is small enough.
    if size/part_size >= max_upload_parts:
        # Calculate partition size rounded up to the nearest MiB
        part_size = ((int(size / max_upload_parts) >> 20) + 1) << 20
    return part_size

def main():
    for size in (0,1E9,5*1024*1024*10000-1,5*1024*1024*10000):
        print(size, part_size(size))

if __name__ == "__main__":
    main()
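
Running that should print something like this (sizes and part sizes in bytes; the last size is the smallest one that needs a bigger chunk):

0 5242880
1000000000.0 5242880
52428799999 5242880
52428800000 6291456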

Files below --s3-upload-cutoff will be uploaded as a single part. The ETag in this case should just be the normal md5sum (with no '-' character in it).

That is correct. The exact threshold is that files >= 5*1024*1024*10000 bytes (about 48.8GiB) will have their chunk size increased (assuming the default chunk size of 5MB).

That should be the --s3-upload-cutoff, and their md5sums should be equal to their ETags, so no fancy calculations are needed. They are very easy to tell apart.
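
For example, a rough sketch of how you might branch on that (the function and argument names are mine; local_multipart_etag would come from your chunked calculation):

def check_object(etag, local_md5, local_multipart_etag):
    """Compare a stored s3 ETag against locally computed values."""
    etag = etag.strip('"')  # ETags often come back wrapped in double quotes
    if "-" in etag:
        # Multipart upload: ETag looks like '<md5-of-part-md5s>-<number of parts>'
        return etag == local_multipart_etag
    # Single part upload: ETag is just the plain md5 of the file contents
    return etag == local_md5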

All this does depend on what --s3-chunk-size was when the upload was done...

If you are having success with 16MiB then those might have been uploaded with a different chunk size. It is possible earlier versions of rclone used a different chunk size, but not within the last 2 years.

In my opinion the ETag is almost useless as a file integrity check (unless it is a plain MD5) as s3 doesn't record anything about the chunks used for the upload.

Note also that if the ETag isn't an md5sum, rclone stores an actual md5sum, recorded at the time of upload, as metadata on the s3 object. You could potentially use that? It is stored base64 encoded under the metadata key Md5chksum.
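
For example, something along these lines (a sketch assuming boto3, which returns user metadata keys lowercased, so the key shows up as md5chksum):

import base64
import boto3

def stored_md5(bucket, key):
    """Fetch the md5sum rclone stored as object metadata, if present."""
    s3 = boto3.client("s3")
    head = s3.head_object(Bucket=bucket, Key=key)
    b64 = head.get("Metadata", {}).get("md5chksum")
    if b64 is None:
        return None
    # rclone stores the raw 16 byte md5 base64 encoded; convert it to hex
    return base64.b64decode(b64).hex()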
