S3 Bucket migration with metadata issues

Hi everyone,

I'm in the process of migrating data (90TB total) out of and back into a specific S3 bucket on an on-premise S3-compatible storage system (Cloudian).
The reason is that nodes will be added and I will have to change the storage policy to a more space-efficient one (RF3 to Erasure Coding). The policy can't be changed on the fly, only when creating a new bucket.

Since my backup software, which writes to S3, doesn't allow migration between buckets, I need to do it on the S3 side.
Basically the flow will be as follows:
Original S3 Bucket -> Temporary S3 Bucket -> New S3 Bucket (with original name)

I've used different tools with varying degrees of success. The aws cli syncs everything correctly, but something still seems to be wrong: my backup software can't write anything afterwards, and I haven't found out why yet, as everything from permissions to metadata looks correct.

Using rclone the migration seems to work, but I found 2 quirks, if you'd like to call them that.

  • When doing a sync between S3 buckets on the same S3 storage, metadata syncs correctly. However, on the incremental sync, setting the modtime metadata makes the rest of the metadata disappear. Using --no-update-modtime solves this and the metadata stays intact. Is this expected behaviour when not using this flag?

  • When doing a sync between 2 S3 buckets on different S3-compatible storage systems (both Cloudian), metadata doesn't get synced no matter what I try.

Last question, when doing a sync I get the following notice:
NOTICE: S3 bucket migrate-rbrk-rubrik-0: --checksum is in use but the source and destination have no hashes in common; falling back to --size-only
Is this notice for files which were uploaded as a multi-part because of size or is it for all files? I haven't found a clear answer for this yet.

Hmm... I just had a look at the code. Setting the modtime should preserve the metadata. If you could try rclone touch s3:bucket/file -vv --dump bodies and post the result that would help debug.

That is expected. If you are doing a sync between different cloud storage systems, rclone can't do server side copies.

It would be possible to fix it relatively easily though (I'm sure there is an issue somewhere about that!).

Two S3 remotes should have MD5SUM in common regardless of whether the files were uploaded as big files or not... Are they all plain S3 remotes (no crypt)? Can you post the log with -vv up to that message?

I've added a test header X-Amz-Meta-Rclone-Test-Header to a file. When I run rclone touch it reads the headers correctly and seems to include them in the PUT request, but as you can see in the second run the header has disappeared.

    mmassez@ubuntu:~$ rclone touch cloudianlab:migrate-rbrk-rubrik-0/rubrik_encryption_key_check.txt -vv --dump bodies
    2019/05/28 10:02:52 DEBUG : rclone: Version "v1.47.0" starting with parameters ["rclone" "touch" "cloudianlab:migrate-rbrk-rubrik-0/rubrik_encryption_key_check.txt" "-vv" "--dump" "bodies"]
    2019/05/28 10:02:52 DEBUG : Using config file from "/home/mmassez/.config/rclone/rclone.conf"
    2019/05/28 10:02:52 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    2019/05/28 10:02:52 DEBUG : HTTP REQUEST (req 0xc000347d00)
    2019/05/28 10:02:52 DEBUG : HEAD /migrate-rbrk-rubrik-0/rubrik_encryption_key_check.txt HTTP/1.1
    Host: s3-lab.cloudian.bigteclab.be
    User-Agent: rclone/v1.47.0
    Authorization: XXXX
    X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    X-Amz-Date: 20190528T100252Z

    2019/05/28 10:02:52 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    2019/05/28 10:02:52 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    2019/05/28 10:02:52 DEBUG : HTTP RESPONSE (req 0xc000347d00)
    2019/05/28 10:02:52 DEBUG : HTTP/1.1 200 OK
    Content-Length: 32
    Accept-Ranges: bytes
    Content-Type: application/octet-stream
    Date: Tue, 28 May 2019 10:02:52 GMT
    Etag: "76363514e3199377f307824bde7992e8"
    Last-Modified: Tue, 28 May 2019 07:28:56 GMT
    Server: CloudianS3
    X-Amz-Meta-Rclone-Test-Header: Now it's here
    X-Amz-Request-Id: aedfae28-b216-104f-8dc9-001e67b20a50

    2019/05/28 10:02:52 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    2019/05/28 10:02:52 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    2019/05/28 10:02:52 DEBUG : HTTP REQUEST (req 0xc000114400)
    2019/05/28 10:02:52 DEBUG : PUT /migrate-rbrk-rubrik-0/rubrik_encryption_key_check.txt HTTP/1.1
    Host: s3-lab.cloudian.bigteclab.be
    User-Agent: rclone/v1.47.0
    Content-Length: 0
    Authorization: XXXX
    Content-Type: application/octet-stream
    X-Amz-Acl: private
    X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    X-Amz-Copy-Source: migrate-rbrk-rubrik-0/rubrik_encryption_key_check.txt
    X-Amz-Date: 20190528T100252Z
    X-Amz-Meta-Mtime: 1559037772.321371647
    X-Amz-Meta-Rclone-Test-Header: Now it's here
    X-Amz-Metadata-Directive: REPLACE
    Accept-Encoding: gzip

    2019/05/28 10:02:52 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    2019/05/28 10:02:52 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    2019/05/28 10:02:52 DEBUG : HTTP RESPONSE (req 0xc000114400)
    2019/05/28 10:02:52 DEBUG : HTTP/1.1 200 OK
    Transfer-Encoding: chunked
    Content-Type: application/xml;charset=UTF-8
    Date: Tue, 28 May 2019 10:02:52 GMT
    Server: CloudianS3
    X-Amz-Request-Id: aedfae2a-b216-104f-8dc9-001e67b20a50

    b9
    <?xml version="1.0" encoding="UTF-8"?><CopyObjectResult><LastModified>2019-05-28T07:28:56.482Z</LastModified><ETag>&quot;76363514e3199377f307824bde7992e8&quot;</ETag></CopyObjectResult>
    0

    2019/05/28 10:02:52 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    2019/05/28 10:02:52 DEBUG : 4 go routines active
    2019/05/28 10:02:52 DEBUG : rclone: Version "v1.47.0" finishing with parameters ["rclone" "touch" "cloudianlab:migrate-rbrk-rubrik-0/rubrik_encryption_key_check.txt" "-vv" "--dump" "bodies"]

Second run of touch on that file.

    mmassez@ubuntu:~$ rclone touch cloudianlab:migrate-rbrk-rubrik-0/rubrik_encryption_key_check.txt -vv --dump bodies
    2019/05/28 10:03:37 DEBUG : rclone: Version "v1.47.0" starting with parameters ["rclone" "touch" "cloudianlab:migrate-rbrk-rubrik-0/rubrik_encryption_key_check.txt" "-vv" "--dump" "bodies"]
    2019/05/28 10:03:37 DEBUG : Using config file from "/home/mmassez/.config/rclone/rclone.conf"
    2019/05/28 10:03:37 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    2019/05/28 10:03:37 DEBUG : HTTP REQUEST (req 0xc00030bf00)
    2019/05/28 10:03:37 DEBUG : HEAD /migrate-rbrk-rubrik-0/rubrik_encryption_key_check.txt HTTP/1.1
    Host: s3-lab.cloudian.bigteclab.be
    User-Agent: rclone/v1.47.0
    Authorization: XXXX
    X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    X-Amz-Date: 20190528T100337Z

    2019/05/28 10:03:37 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    2019/05/28 10:03:37 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    2019/05/28 10:03:37 DEBUG : HTTP RESPONSE (req 0xc00030bf00)
    2019/05/28 10:03:37 DEBUG : HTTP/1.1 200 OK
    Content-Length: 32
    Accept-Ranges: bytes
    Content-Type: application/octet-stream
    Date: Tue, 28 May 2019 10:03:37 GMT
    Etag: "76363514e3199377f307824bde7992e8"
    Last-Modified: Tue, 28 May 2019 07:28:56 GMT
    Server: CloudianS3
    X-Amz-Request-Id: aedfae2c-b216-104f-8dc9-001e67b20a50

    2019/05/28 10:03:37 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    2019/05/28 10:03:37 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    2019/05/28 10:03:37 DEBUG : HTTP REQUEST (req 0xc0003a4200)
    2019/05/28 10:03:37 DEBUG : PUT /migrate-rbrk-rubrik-0/rubrik_encryption_key_check.txt HTTP/1.1
    Host: s3-lab.cloudian.bigteclab.be
    User-Agent: rclone/v1.47.0
    Content-Length: 0
    Authorization: XXXX
    Content-Type: application/octet-stream
    X-Amz-Acl: private
    X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    X-Amz-Copy-Source: migrate-rbrk-rubrik-0/rubrik_encryption_key_check.txt
    X-Amz-Date: 20190528T100337Z
    X-Amz-Meta-Mtime: 1559037817.256385028
    X-Amz-Metadata-Directive: REPLACE
    Accept-Encoding: gzip

    2019/05/28 10:03:37 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    2019/05/28 10:03:37 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    2019/05/28 10:03:37 DEBUG : HTTP RESPONSE (req 0xc0003a4200)
    2019/05/28 10:03:37 DEBUG : HTTP/1.1 200 OK
    Transfer-Encoding: chunked
    Content-Type: application/xml;charset=UTF-8
    Date: Tue, 28 May 2019 10:03:37 GMT
    Server: CloudianS3
    X-Amz-Request-Id: aedfae2e-b216-104f-8dc9-001e67b20a50

    b9
    <?xml version="1.0" encoding="UTF-8"?><CopyObjectResult><LastModified>2019-05-28T07:28:56.482Z</LastModified><ETag>&quot;76363514e3199377f307824bde7992e8&quot;</ETag></CopyObjectResult>
    0

    2019/05/28 10:03:37 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    2019/05/28 10:03:37 DEBUG : 4 go routines active
    2019/05/28 10:03:37 DEBUG : rclone: Version "v1.47.0" finishing with parameters ["rclone" "touch" "cloudianlab:migrate-rbrk-rubrik-0/rubrik_encryption_key_check.txt" "-vv" "--dump" "bodies"]
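
To make the two dumps easier to compare, here is a minimal Python sketch that lists which x-amz-meta-* headers were sent in the first PUT but are missing from the HEAD response of the second run. The function names are made up for illustration, and the header dictionaries are copied by hand from the dumps above rather than parsed from them.

```python
def user_metadata(headers):
    """Return only the x-amz-meta-* headers, with lower-cased names."""
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower().startswith("x-amz-meta-")
    }


def dropped_metadata(sent, stored):
    """Metadata keys that were sent in the request but are missing afterwards."""
    return sorted(set(user_metadata(sent)) - set(user_metadata(stored)))


# Headers copied from the dumps above: the first PUT and the second HEAD.
put_request = {
    "X-Amz-Meta-Mtime": "1559037772.321371647",
    "X-Amz-Meta-Rclone-Test-Header": "Now it's here",
    "X-Amz-Metadata-Directive": "REPLACE",
}
head_response = {
    "Content-Length": "32",
    "Etag": '"76363514e3199377f307824bde7992e8"',
    "Server": "CloudianS3",
}

print(dropped_metadata(put_request, head_response))
```

With these inputs both the test header and the mtime written by rclone show up as dropped, even though the PUT carried them with the REPLACE directive.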

That explains it. I was looking at different migration scenarios as space is limited. Having a temporary S3 target would have been nice, but metadata needs to be synced as well for that to work. I looked through the code a bit, but since I'm not a coder it takes a little more time.

They are encrypted, but each one is stored as an encrypted file whose metadata contains values like the unencrypted content length, IV and key. So they are client-side encrypted, not server-side.
I've tested with a regular ISO and performed 2 syncs to a different bucket. It seems that I get the notice only the second time when the file exists in both buckets.

As far as I understand, the ETag header should be the MD5 sum of the file?
ETag of original file: "00704962b5cd9c313fb03f312dfe104d-115"
ETag of copy: "bd43d41e01c2a46b3cb23eb9139dce4b"

Initial transfer of new file (CentOS ISO):

    mmassez@ubuntu:~$ rclone sync cloudianlab:test-rubrik-0 cloudianlab:mig-test-rubrik-0 -P --delete-during --no-update-modtime --transfers 32 -c -vv |tee synclog.txt
    2019/05/28 10:15:37 DEBUG : rclone: Version "v1.47.0" starting with parameters ["rclone" "sync" "cloudianlab:test-rubrik-0" "cloudianlab:mig-test-rubrik-0" "-P" "--delete-during" "--no-update-modtime" "--transfers" "32" "-c" "-vv"]
    2019/05/28 10:15:37 DEBUG : Using config file from "/home/mmassez/.config/rclone/rclone.conf"
    2019-05-28 10:15:37 INFO  : Waiting for deletions to finish
    2019-05-28 10:15:37 DEBUG : rubrik_cluster_lock.txt: Size and MD5 of src and dst objects identical
    2019-05-28 10:15:37 DEBUG : rubrik_cluster_lock.txt: Unchanged skipping
    2019-05-28 10:15:37 DEBUG : rubrik_encryption_key_check.txt: Size and MD5 of src and dst objects identical
    2019-05-28 10:15:37 DEBUG : rubrik_encryption_key_check.txt: Unchanged skipping
    2019-05-28 10:15:37 INFO  : S3 bucket mig-test-rubrik-0: Waiting for checks to finish
    2019-05-28 10:15:37 INFO  : S3 bucket mig-test-rubrik-0: Waiting for transfers to finish
    2019-05-28 10:15:56 INFO  : CentOS-7-x86_64-Minimal-1810.iso: Copied (server side copy)
    2019/05/28 10:15:56 DEBUG : 6 go routines active
    2019/05/28 10:15:56 DEBUG : rclone: Version "v1.47.0" finishing with parameters ["rclone" "sync" "cloudianlab:test-rubrik-0" "cloudianlab:mig-test-rubrik-0" "-P" "--delete-during" "--no-update-modtime" "--transfers" "32" "-c" "-vv"]

Resync of the 2 buckets:

    mmassez@ubuntu:~$ rclone sync cloudianlab:test-rubrik-0 cloudianlab:mig-test-rubrik-0 -P --delete-during --no-update-modtime --transfers 32 -c -vv |tee synclog.txt
    2019/05/28 10:16:04 DEBUG : rclone: Version "v1.47.0" starting with parameters ["rclone" "sync" "cloudianlab:test-rubrik-0" "cloudianlab:mig-test-rubrik-0" "-P" "--delete-during" "--no-update-modtime" "--transfers" "32" "-c" "-vv"]
    2019/05/28 10:16:04 DEBUG : Using config file from "/home/mmassez/.config/rclone/rclone.conf"
    2019-05-28 10:16:04 INFO  : Waiting for deletions to finish
    2019-05-28 10:16:05 DEBUG : rubrik_cluster_lock.txt: Size and MD5 of src and dst objects identical
    2019-05-28 10:16:05 DEBUG : rubrik_cluster_lock.txt: Unchanged skipping
    2019-05-28 10:16:05 DEBUG : rubrik_encryption_key_check.txt: Size and MD5 of src and dst objects identical
    2019-05-28 10:16:05 DEBUG : rubrik_encryption_key_check.txt: Unchanged skipping
    2019-05-28 10:16:05 INFO  : S3 bucket mig-test-rubrik-0: Waiting for checks to finish
    2019-05-28 10:16:05 NOTICE: S3 bucket mig-test-rubrik-0: --checksum is in use but the source and destination have no hashes in common; falling back to --size-only
    2019-05-28 10:16:05 DEBUG : CentOS-7-x86_64-Minimal-1810.iso: Size of src and dst objects identical
    2019-05-28 10:16:05 DEBUG : CentOS-7-x86_64-Minimal-1810.iso: Unchanged skipping
    2019-05-28 10:16:05 INFO  : S3 bucket mig-test-rubrik-0: Waiting for transfers to finish
    2019/05/28 10:16:05 DEBUG : 6 go routines active
    2019/05/28 10:16:05 DEBUG : rclone: Version "v1.47.0" finishing with parameters ["rclone" "sync" "cloudianlab:test-rubrik-0" "cloudianlab:mig-test-rubrik-0" "-P" "--delete-during" "--no-update-modtime" "--transfers" "32" "-c" "-vv"]
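
The decision visible in these two logs can be sketched roughly as follows. This is not rclone's actual implementation, only a minimal Python illustration of "compare by a common hash if there is one, otherwise fall back to size". The dictionary layout and function names are assumptions, the MD5 values are borrowed from earlier in the thread purely as sample data, and the ISO size is a made-up placeholder.

```python
def common_hash(src, dst):
    """Return the name of a hash type both objects expose, or None."""
    shared = sorted(set(src["hashes"]) & set(dst["hashes"]))
    return shared[0] if shared else None


def unchanged(src, dst):
    """Return (skip, method): whether to skip the object and how that was decided."""
    if src["size"] != dst["size"]:
        return False, "size"
    kind = common_hash(src, dst)
    if kind is None:
        # No hash in common: fall back to comparing sizes only.
        return True, "size-only"
    return src["hashes"][kind] == dst["hashes"][kind], kind


# Small object: both sides expose a plain MD5 (sample value from the thread).
small_src = {"size": 32, "hashes": {"md5": "76363514e3199377f307824bde7992e8"}}
small_dst = {"size": 32, "hashes": {"md5": "76363514e3199377f307824bde7992e8"}}

# Large object: the source exposes no usable MD5. The size is a placeholder.
iso_src = {"size": 1000, "hashes": {}}
iso_dst = {"size": 1000, "hashes": {"md5": "bd43d41e01c2a46b3cb23eb9139dce4b"}}

print(unchanged(small_src, small_dst))
print(unchanged(iso_src, iso_dst))
```

The small file is matched by MD5, while the large one is only matched by size, which mirrors the "Size and MD5" versus "Size of src and dst objects identical" lines in the resync log.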

That has surely got to be a bug in Cloudian... It didn't preserve your metadata or rclone's metadata.

It works fine if I try it against s3

    $ aws s3api put-object --bucket rclone-test1 --key test.txt --body test.txt --metadata '{"x-amz-meta-hello":"potato"}'
    {
        "ETag": "\"b1946ac92492d2347c6235b4d2611184\""
    }
    $ aws s3api head-object --bucket rclone-test1 --key test.txt
    {
        "AcceptRanges": "bytes",
        "LastModified": "Wed, 29 May 2019 08:58:06 GMT",
        "ContentLength": 6,
        "ETag": "\"b1946ac92492d2347c6235b4d2611184\"",
        "ContentType": "binary/octet-stream",
        "Metadata": {
            "x-amz-meta-hello": "potato"
        }
    }
    $ rclone touch s3:rclone-test1/test.txt
    $ aws s3api head-object --bucket rclone-test1 --key test.txt
    {
        "AcceptRanges": "bytes",
        "LastModified": "Wed, 29 May 2019 08:58:47 GMT",
        "ContentLength": 6,
        "ETag": "\"b1946ac92492d2347c6235b4d2611184\"",
        "ContentType": "binary/octet-stream",
        "Metadata": {
            "x-amz-meta-hello": "potato",
            "mtime": "1559120326.076664788"
        }
    }

One way to test the compatibility would be to run the rclone test suite against it. You'd need to install Go, download the rclone source, then cd backend/s3 and run go test -v -remote cloudianlab: for the basic integration tests. This will create and destroy a few randomly named buckets, e.g. rclone-test-boceqog0nigayij0qoyoget1.

I had a quick go at this here - let me know what you think

https://beta.rclone.org/branch/v1.47.0-089-gef16b335-fix-111-metadata-beta/ (uploaded in 15-30 mins)

I think it could be generalised a bit more, for copies from gcs -> s3 for instance. It only works for s3 -> s3 at the moment.

The original was uploaded as a multipart upload so doesn't have a regular MD5SUM (see the -115 on the end). rclone puts the MD5SUM as metadata, but I guess it wasn't uploaded by rclone, hence the message.
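
For anyone wondering where the -115 comes from, here is a small Python sketch. It assumes the scheme that is widely observed on AWS S3 for multipart uploads (the MD5 of the concatenated binary MD5s of each part, followed by a dash and the part count). That scheme is not something I have verified for Cloudian, and the function names are made up for illustration.

```python
import hashlib
import re

# A plain ETag is 32 hex characters; a multipart one adds "-<part count>".
ETAG_RE = re.compile(r'^"?([0-9a-f]{32})(?:-(\d+))?"?$')


def parse_etag(etag):
    """Split an ETag into (hex_digest, part_count); part_count is None for a plain MD5."""
    match = ETAG_RE.match(etag.strip())
    if match is None:
        raise ValueError(f"unrecognised ETag: {etag!r}")
    digest, parts = match.groups()
    return digest, (int(parts) if parts else None)


def multipart_etag(data, part_size):
    """Multipart-style ETag under the assumed scheme: MD5 over the part MD5s, then -N."""
    digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


print(parse_etag('"00704962b5cd9c313fb03f312dfe104d-115"'))
print(parse_etag('"bd43d41e01c2a46b3cb23eb9139dce4b"'))
```

Under that scheme the digest depends on the part size as well as the content, so it can't be compared directly against a plain MD5 of the whole file.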

I've tested it as well and it doesn't work on Cloudian. I've contacted one of the engineers and asked if he could check internally.

I did run the test but with mixed results, I got the following output:
https://pastebin.com/48pVpJvv

I'll have to compile it for Linux as I use an Ubuntu server for the heavy lifting.

Indeed, first upload wasn't done via rclone. I'll keep that in mind.

Yes that looks like the same problem - modtime metadata not being added.

The builder should have made a linux binary - I gave it a kick - check back in 30 mins!

I've tested the fix you added to copy metadata between s3 targets and it seems to work.
Metadata is transferred between both and from my initial testing a resync seems to preserve metadata as well.

EDIT: Metadata isn't synced for anything that is a multipart copy, whether or not I use --no-update-modtime.

EDIT2: That explains the problems with aws s3 cli as well, metadata copy doesn't work on multipart copies, even server side. I had a windows tool S3Browser which did copy the metadata although it was multipart as well.

Didn't get any feedback yet from Cloudian but will let you know as soon as I get some info.

Thanks for the help already!

Great!

I tried copying all 4 combinations to and from multipart/non-multipart and they all preserved the metadata for me.

Note that rclone won't update metadata on an object that already exists - that might be the problem.

I've tested by doing a sync of an ISO (about 800 MB) to and from AWS, always emptying the destination first to make sure it starts fresh.
Cloudian -> AWS Metadata OK ( https://pastebin.com/7iNr6zAB )
AWS -> Cloudian Metadata NOK ( https://pastebin.com/YThjXN9S )
AWS -> Cloudian (Non-multipart small file) Metadata OK ( https://pastebin.com/u4xxBWsP )

So I think this is a bug in Cloudian as well, since smaller files to and from AWS work.
For the other bug they are raising an internal ticket to check this out, I'll add this to the list as well.

Thanks a lot for the help!

Ah Ok! that makes sense :smiley:

I'm going to try to generalise the metadata stuff a bit so it works across cloud platforms

All of these are fixed in Cloudian HyperStore 7.1.5 and later.
