Help me figure out how to verify backup accuracy and completeness on S3

What is the problem you are having with rclone?

I'm using deja dup (which uses duplicity) to create full/incremental backups on a schedule. This works well and is very convenient, but only backs up locally.

So I want to use rclone to send these backups into an S3 Glacier bucket, since that seems like a very cost-effective and robust solution for taking the backup off site.

So far so good, however: How can I confirm that the backup files that arrive in the S3 bucket are actually correct and complete? Does rclone perhaps already do this transparently? I notice that the files I sent to S3 using rclone as a test all have a value in a field called ETag. It looks like this: e22bbbef23997bb271a6209637ae59c4-46. It looks to me like it might be some kind of hash?

I also noticed that at least some of the files have a metadata tag like this:
x-amz-meta-md5chksum:RPPT/eE0pm+SLZwvXgiJGw==

This surely is an MD5 hash of the file, but when is it created and how is it used?

It doesn't help if rclone is all kinds of careful before sending the file, but it gets corrupted on the way to S3. Ideally, I guess we'd want S3 to calculate the hash, and report it back to rclone for verification against a locally created hash?

I noticed that rclone has a command called checksum. I tried to experiment with it a bit, but it wants a SUM file, which isn't too unexpected given how I expect the command to work... but I can't seem to find a way to make S3 give me a file full of checksums.

I spent hours scouring the documentation and couldn't find an explanation.

Run the command 'rclone version' and share the full output of the command.

rclone v1.62.2
- os/version: arch "rolling" (64 bit)
- os/kernel: 6.1.22-1-lts (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.20.2
- go/linking: dynamic
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

Amazon S3 Glacier Deep Archive

The command you were trying to run (eg rclone copy /tmp remote:tmp)

The command I used to get files into S3. This worked fine; I can see the files in S3.

rclone copy laptop-backup/ aws-backup:backups-8011/laptop

The rclone config contents with secrets removed.

[JohanGDrive]
type = drive
client_id = 
client_secret = 
scope = drive
token = {"..."}
team_drive = 

[aws-backup]
type = s3
provider = AWS
access_key_id = 
secret_access_key = 
region = af-south-1
location_constraint = af-south-1
acl = bucket-owner-full-control
server_side_encryption = AES256
storage_class = DEEP_ARCHIVE

A log from the command with the -vv flag

I don't see how this can be relevant; I'm trying to understand how rclone works, not chasing an unexpected error. Let me know if more info would help.

hello and welcome to the forum,
i use rclone to upload veeam backup files to wasabi and to move older backups from wasabi to aws deep glacier.

  • for verification of the entire file, rclone uses both the ETag and the MD5 stored as header x-amz-meta-md5chksum
  • for verification of each chunk transferred:
    the HTTPS protocol uses checksums

rclone relies on the official aws s3 library source code.
so at that level rclone verifies transfers in the same ways.
try -vv --dump=headers, for example:
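
rclone copy laptop-backup/ aws-backup:backups-8011/laptop -vv --dump=headers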

try rclone check laptop-backup/ aws-backup:backups-8011/laptop -vv

if you want a deeper dive into how rclone handles verification, check out this comment on a related github issue:

"This would then enable a bullet proof end to end upload check for multipart objects"

https://github.com/rclone/rclone/issues/5993#issuecomment-1041429460

Wow, lots of detail there. I couldn't possibly absorb it all, but I did try to understand the gist of it.

TL;DR: As a user, it's hard to know what rclone is doing, which makes it hard to trust it. I think a short write-up that explains what rclone actually does to make sure the files arrive intact in S3 would go a long way to help with this.

Long version

OK, so I added a new test file that simply contains the text "hello", then did the sync again, but this time added -vv and --dump=headers.

Since I put the test file in the same directory as my backup run, this resulted in several thousand lines of text on the terminal. Below are the parts that look relevant.

Here is the first thing I don't understand: this is a log of a conversation between rclone and AWS, but who is saying what?

I'll summarize what I see when I read this.

It looks like rclone is using an HTTP PUT to upload the file, and getting the MD5 hash sZRqySSS0jR8YjW00mERhA==. But then it does HTTP HEAD to get back from S3 some info about what was uploaded... it gets back a SHA256 hash (e3b0c44298fc1c149afbf4c8996fb9...) and doesn't appear to do anything with this information. A short time later, an HTTP 200 response is logged (response to the PUT or response to the HEAD request?). Here it seems that rclone is still focused on the MD5 hash... but the hash it claims for the file at this point is different to the hash it got at first. How is that OK?

Furthermore, this last part that refers to my test file claims X-Amz-Meta-Md5chksum: 4HmatBeuICOJoX/r1gwyqg== which doesn't match either of the previously mentioned MD5 checksums.

This is difficult to trust. What on earth is it doing?

2023/04/16 19:42:10 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023/04/16 19:42:10 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023/04/16 19:42:10 DEBUG : HTTP REQUEST (req 0xc000827400)
2023/04/16 19:42:10 DEBUG : PUT /laptop/test.txt HTTP/1.1
Host: backups-8011.s3.af-south-1.amazonaws.com
User-Agent: rclone/v1.62.2
Content-Length: 6
Authorization: XXXX
Content-Md5: sZRqySSS0jR8YjW00mERhA==
Content-Type: text/plain; charset=utf-8
X-Amz-Acl: bucket-owner-full-control
X-Amz-Content-Sha256: UNSIGNED-PAYLOAD
X-Amz-Date: 20230416T174210Z
X-Amz-Meta-Mtime: 1681666912.0560934
X-Amz-Server-Side-Encryption: AES256
X-Amz-Storage-Class: DEEP_ARCHIVE
Accept-Encoding: gzip

... snip ...

2023/04/16 19:42:10 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023/04/16 19:42:10 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023/04/16 19:42:10 DEBUG : HTTP REQUEST (req 0xc0010b2800)
2023/04/16 19:42:10 DEBUG : HEAD /laptop/test.txt HTTP/1.1
Host: backups-8011.s3.af-south-1.amazonaws.com
User-Agent: rclone/v1.62.2
Authorization: XXXX
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20230416T174210Z

... snip ...

2023/04/16 19:42:10 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2023/04/16 19:42:10 DEBUG : test.txt: md5 = b1946ac92492d2347c6235b4d2611184 OK
2023/04/16 19:42:10 INFO  : test.txt: Copied (new)
2023/04/16 19:42:10 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2023/04/16 19:42:10 DEBUG : HTTP RESPONSE (req 0xc000b8d000)
2023/04/16 19:42:10 DEBUG : HTTP/1.1 200 OK
Content-Length: 209746964
Accept-Ranges: bytes
Content-Type: application/pgp-encrypted
Date: Sun, 16 Apr 2023 17:42:11 GMT
Etag: "e992831ba8921f876fafde8c271257ae-41"
Last-Modified: Sat, 15 Apr 2023 13:43:46 GMT
Server: AmazonS3
X-Amz-Id-2: fCjbEeHaS6sS+FU/QpemiWtbYGdMkd1wrGHzFse9ZvubVQ0cDGBD20h7gI+Mz4A7Iqlty8o8XSI=
X-Amz-Meta-Md5chksum: 4HmatBeuICOJoX/r1gwyqg==
X-Amz-Meta-Mtime: 1681564733.1501232
X-Amz-Request-Id: C8RH6A89G33CPB0Q
X-Amz-Server-Side-Encryption: AES256
X-Amz-Storage-Class: DEEP_ARCHIVE
X-Amz-Version-Id: teOipVEYIDcLbYZVEH7y5B.29C1W2j_F

please, to keep things simple, try to pick just a single issue at a time.

with this simple example:

  • HTTP REQUEST - rclone making a GET request to aws.
    notice the request is enclosed with a series of >>> representing outbound.
  • HTTP RESPONSE - aws responds to that request with HTTP/1.1 200 OK.
    notice the response is enclosed with a series of <<< representing inbound.
rclone lsd wasabi01:zork --dump=headers
DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
DEBUG : HTTP REQUEST (req 0xc000988800)
DEBUG : GET /?delimiter=%2F&encoding-type=url&list-type=2&max-keys=1000&prefix= HTTP/1.1
Host: zork.s3.us-east-2.wasabisys.com
User-Agent: rclone/v1.62.2
Authorization: XXXX
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20230416T185343Z
Accept-Encoding: gzip

DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
DEBUG : HTTP RESPONSE (req 0xc000988800)
DEBUG : HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Sun, 16 Apr 2023 18:53:45 GMT
Server: WasabiS3/7.12.1004-2023-02-17-7ff2f5bdd9 (head5)
X-Amz-Bucket-Region: us-east-2
X-Amz-Id-2: Qx+7owstQV9poLN/iD3OLaCjI55NBL9hY3GTMHgMsJsv1ULQvkCHS6zqlcWt1a3k+IUohDHxx4+M
X-Amz-Request-Id: 895FD626FD6D175B

DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

OK cool, here's the single issue.

How does rclone verify that the files it sends to S3 arrived intact?

This is the hash of the body of the request. It's checked by the S3 library. In this case the body is empty:

$ sha256sum - </dev/null
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  -

There are actually multiple layers of verification. I'll ignore the ones in Ethernet, TCP, SSL and concentrate on the ones in the S3 API and in rclone.

S3 API

Every HTTP transaction to/from AWS has an X-Amz-Content-Sha256 or a Content-Md5 header to guard against corruption of the HTTP body. The HTTP headers themselves are protected by the signature passed in the Authorization header.
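
For example, the Content-Md5 value is just the base64 of the binary MD5 of the request body. A quick sketch with openssl, assuming test.txt is the six-byte "hello" file from the PUT request above (Content-Length: 6, i.e. "hello" plus a trailing newline):

$ openssl md5 -binary test.txt | base64
sZRqySSS0jR8YjW00mERhA==

which matches the Content-Md5 header rclone sent in the PUT.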

Rclone

This needs to be divided into single part uploads and multipart uploads as the cases are different.

Single part uploads

  • Rclone uploads single part uploads with a Content-Md5 which AWS checks.
  • Rclone then does a HEAD request to read the ETag back, which is the MD5 of the file, and checks that it matches (see the sketch below).
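
For a quick local check of this, take the six-byte test file from the log above (again assuming it contains "hello" plus a trailing newline):

$ md5sum test.txt
b1946ac92492d2347c6235b4d2611184  test.txt

That hex digest is what the single part ETag holds, and it is the same value in the DEBUG line test.txt: md5 = b1946ac92492d2347c6235b4d2611184 OK.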

Multipart uploads

Rclone splits the file into multiple parts for upload:

  • Each part is protected with both an X-Amz-Content-Sha256 and a Content-Md5

When rclone has finished the upload of all the parts it then completes the upload by sending

  • The MD5 hash of each part
  • The number of parts
  • This info is all protected with an X-Amz-Content-Sha256

AWS checks the MD5s for all the parts and, if they are good, returns OK.

Rclone then does a HEAD request and checks the ETag is what it expects (in this case it should be the MD5 sum of the concatenated MD5 sums of all the parts, with the number of parts appended on the end).
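
You can reproduce a multipart ETag locally if you know the part size. A sketch, using a hypothetical file bigfile and assuming it was uploaded with rclone's default 5 MiB chunk size (rclone raises the chunk size for very large files, so the part size must match what was actually used):

# split the file into parts the same size as the upload chunks
$ split -b 5M bigfile part_
# concatenate the binary MD5 digests of all the parts
$ for p in part_*; do openssl md5 -binary "$p"; done > md5s.bin
# MD5 of the concatenated digests, in hex, with the part count appended
$ echo "$(openssl md5 -binary md5s.bin | xxd -p)-$(ls part_* | wc -l)"

The output has the same shape as the ETags seen in this thread, e.g. e992831ba8921f876fafde8c271257ae-41.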

Conclusion

So at each stage rclone and AWS are sending and checking hashes of everything. Rclone deliberately HEADs each object after upload to check it arrived safely. (You can disable this with --s3-no-head).

If that isn't enough for you then you can use rclone check to check the hashes locally vs the remote.

And if you are feeling ultimately paranoid use rclone check --download which will download the files and check them against the local copies. (Note that this doesn't use disk to do this - it streams them in memory).
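
With the remote from this thread, that would look something like:

$ rclone check laptop-backup/ aws-backup:backups-8011/laptop
$ rclone check --download laptop-backup/ aws-backup:backups-8011/laptop

(One caveat here: --download has to read the objects back, and DEEP_ARCHIVE objects need to be restored before they can be downloaded.)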

That's fantastic, thank you!

I humbly suggest that this explanation would fit very well on rclone's Amazon S3 docs page.


thanks for replying to the OP, i was hoping you would.
better for you to summarize than me.


imho, that page is too long as it is, so i am not sure this should be added to the website.

in all my time in the forum, i believe i am the only rcloner that has done a deep dive into how rclone verifies file transfers.
and with ncw's help, we made some important improvements.

so if anyone else asks the same question you did, we can share ncw's answer.


@asdffdsa Thank you for the detailed response. I'm a newbie with a similar question; let me know if I should make a new thread.

In my case, I need to validate that the local hash of the file matches the hash in AWS for audit fun.

Debug logging shows the hash (--log-file -vv)

2023/04/20 13:06:49 DEBUG : FULL/BIGFILE.bak: MD5 = 5564c674dc772f25d00df0347b611dc7 OK
2023/04/20 13:06:49 INFO  : FULL/BIGFILE.bak: Copied (new)
2023/04/20 13:06:49 INFO  :
Transferred:        1.024G / 1.024 GBytes, 100%, 31.593 MBytes/s, ETA 0s
Transferred:          293 / 293, 100%
Elapsed time:        33.4s

I can check that md5sum locally and it matches:

# md5sum /mnt/share/FULL/BIGFILE.bak
5564c674dc772f25d00df0347b611dc7 /mnt/share/FULL/BIGFILE.bak

But when I grab details about the object using s3api, I get this:

$ AWS_PROFILE=myprofile  /usr/local/bin/aws s3api head-object --bucket MYBUCKET --key 'rc_cli/FULL/BIGFILE.bak'
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-04-20T18:06:19+00:00",
    "ContentLength": 1091436544,
    "ETag": "\"965d5d1fe51808e13901544f7bf9828d-209\"",
    "VersionId": "null",
    "ContentType": "application/octet-stream",
    "ServerSideEncryption": "aws:kms",
    "Metadata": {
        "md5chksum": "VWTGdNx3LyXQDfA0e2Edxw==",
        "mtime": "1681536489.94326"
    },
    "SSEKMSKeyId": "arn:aws:kms:us-east-1:BLAH:key/BLAH"
}

Amazon S3 says that X-Amz-Meta-Md5chksum is a base64 encoded string, but when I try to decode it I get garbage:

# echo 'VWTGdNx3LyXQDfA0e2Edxw==' | base64 -d
Ud/%{ak

I tried an online converter with ASCII and UTF-8 and got the same garbage. Does this have anything to do with me using AWS:KMS encryption on the bucket? Should that decoded string match the md5 hash in the log?

You need to re-encode it as hex:

$ echo 'VWTGdNx3LyXQDfA0e2Edxw==' | base64 -d | hd
00000000  55 64 c6 74 dc 77 2f 25  d0 0d f0 34 7b 61 1d c7  |Ud.t.w/%...4{a..|
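
Or re-encode it straight to a hex string for comparison with md5sum output, for example with xxd:

$ echo 'VWTGdNx3LyXQDfA0e2Edxw==' | base64 -d | xxd -p
5564c674dc772f25d00df0347b611dc7

which matches the MD5 = 5564c674dc772f25d00df0347b611dc7 OK line from your log.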

You can get rclone to read these md5sums with rclone md5sum, so you don't need to do this by hand. You can use rclone lsf to make a nice CSV report of them too if you want!
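
For example (the remote name here is a placeholder for whatever your S3 remote is called):

$ rclone md5sum remote:MYBUCKET/rc_cli/FULL
$ rclone lsf --csv --format "ph" --hash MD5 remote:MYBUCKET/rc_cli/FULL

The second command lists path,MD5 pairs in CSV form.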

And you can use rclone check to compare the remote hashes against the local hashes.

If you are using SSE with aws:kms, rclone will add the md5chksum metadata to all objects. However, this is metadata uploaded with the object rather than an MD5 sum that AWS itself holds. Rclone goes to every effort to ensure it is correct, so using rclone check against it is a pretty good check, though it is more likely to detect bitrot on your local drives than in S3.
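
To automate the comparison, you can combine the s3api call from above with the same hex re-encode (a sketch using that bucket and key):

$ aws s3api head-object --bucket MYBUCKET --key 'rc_cli/FULL/BIGFILE.bak' \
    --query 'Metadata.md5chksum' --output text | base64 -d | xxd -p
5564c674dc772f25d00df0347b611dc7

which should equal md5sum of the local copy.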

If you want the ultimate in confidence that everything you uploaded is OK, then you need to download it and check it. This is what

rclone check --download

does: it checks that the files are the same byte by byte, streaming them into memory and comparing them.

The penny drops. I'm assuming that's an AWS thing, maybe? I remember looking for "Content-MD5" in the AWS API docs after I read the rclone s3 hashes section, but I didn't pick up that nuance.

I'll check these options out, thanks for the suggestions!

May I humbly suggest that my question plus your answer about the hex re-encoding be added somewhere in the hashes section of the s3 docs? I know that docs page is really long already, but I was immediately confused when I first figured out how to grab the hash back and it didn't match my md5sum command.

Or at the very least, clarify this statement:

"(in the same format as is required for Content-MD5)."

by adding that the base64-decoded value needs to be re-encoded as hex. Maybe something like:

"(in the same format as is required for Content-MD5). You can use base64 -d and hexdump to check this value manually:

echo 'VWTGdNx3LyXQDfA0e2Edxw==' | base64 -d | hexdump

or use native commands to verify:

rclone check
rclone check --download"

@ncw I made a simple PR here if you agree it's helpful:
