Current state of metadata support, metadata filtering

What is the problem you are having with rclone?

I'm trying to figure out what the current state of metadata support is, both on OpenStack Swift and on S3. I understand that Swift metadata support may be limited or non-existent (although it seems like rclone is reading some metadata as part of the --swift-no-large-objects flag), so I'm first trying to step back and see if I can get these commands working on an S3-compatible service (Cloudflare R2).

What I'm ultimately trying to figure out, if I can get this working:

  • is it possible to sync any metadata from Swift to R2 (e.g. the X-Object-Manifest header)?
  • is it possible to filter based on the presence of a specific header on Swift (e.g. whether the X-Object-Manifest header is present)? The fact that this header factors into --swift-no-large-objects is what makes me curious :thinking:

Run the command 'rclone version' and share the full output of the command.

$ rclone --version
rclone v1.63.1
- os/version: ubuntu 22.04 (64 bit)
- os/kernel: 4.4.0-1104-aws (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.20.6
- go/linking: static
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

  • OVH (OpenStack Swift)
  • Cloudflare R2 (S3-compatible)

The command you were trying to run (eg rclone copy /tmp remote:tmp)

As a basic sanity check, I'm trying to get rclone to include a file based on its Etag header. My test bucket (rclone-debugging) contains two files:

  • testfile1.txt with an Etag header value of "59bcc3ad6775562f845953cf01624225"
  • testfile2.txt with an Etag header value of "0f18fd4cf40bfb1dec646807c7fa5522"

To start:

$ rclone lsf r2-development:rclone-debugging
testfile1.txt
testfile2.txt

So far so good. Also:

$ rclone lsf r2-development:rclone-debugging -M --metadata-include "*"
testfile1.txt
testfile2.txt

Now I'm trying to match one of those known Etag headers in a few different ways. None of the following commands return any files at all. In the last command I'm simply trying to match on the existence of the string etag.

$ rclone lsf r2-development:rclone-debugging -M --metadata-include "59bcc3ad6775562f845953cf01624225"
$ rclone lsf r2-development:rclone-debugging -M --metadata-include "*59bcc3ad6775562f845953cf01624225*"
$ rclone lsf r2-development:rclone-debugging -M --metadata-include "{{(?i).+59bcc3ad6775562f845953cf01624225.+}}"
$ rclone lsf r2-development:rclone-debugging -M --metadata-filter "+ *59bcc3ad6775562f845953cf01624225*" --metadata-filter "- *"
$ rclone lsf r2-development:rclone-debugging -M --metadata-include "{{(?i).+etag.+}}"

The rclone config contents with secrets removed.

[r2-development]
type = s3
provider = Cloudflare
env_auth = true
endpoint = XXXXXXXXXXXXXXXXXX

A log from the command with the -vv flag

$ rclone lsf r2-development:rclone-debugging -M --metadata-include "*59bcc3ad6775562f845953cf01624225*" --dump filters -vv
2023/07/25 19:45:52 DEBUG : Setting --config "/app/config/rclone.conf" from environment variable RCLONE_CONFIG="/app/config/rclone.conf"
--- start filters ---
--- File filter rules ---
--- Directory filter rules ---
--- Metadata filter rules ---
+ (^|/)[^/]*59bcc3ad6775562f845953cf01624225[^/]*$
- ^.*$
--- end filters ---
2023/07/25 19:45:52 DEBUG : rclone: Version "v1.63.1" starting with parameters ["rclone" "lsf" "r2-development:rclone-debugging" "-M" "--metadata-include" "*59bcc3ad6775562f845953cf01624225*" "--dump" "filters" "-vv"]
2023/07/25 19:45:52 DEBUG : Creating backend with remote "r2-development:rclone-debugging"
2023/07/25 19:45:52 DEBUG : Using config file from "/app/config/rclone.conf"
2023/07/25 19:45:52 DEBUG : name = "r2-development", root = "rclone-debugging", opt = &s3.Options{Provider:"Cloudflare", EnvAuth:true, AccessKeyID:"", SecretAccessKey:"", Region:"", Endpoint:"XXXXXXXXXXXXXXXXX", STSEndpoint:"", LocationConstraint:"", ACL:"", BucketACL:"", RequesterPays:false, ServerSideEncryption:"", SSEKMSKeyID:"", SSECustomerAlgorithm:"", SSECustomerKey:"", SSECustomerKeyBase64:"", SSECustomerKeyMD5:"", StorageClass:"", UploadCutoff:209715200, CopyCutoff:4999341932, ChunkSize:5242880, MaxUploadParts:10000, DisableChecksum:false, SharedCredentialsFile:"", Profile:"", SessionToken:"", UploadConcurrency:4, ForcePathStyle:true, V2Auth:false, UseAccelerateEndpoint:false, LeavePartsOnError:false, ListChunk:1000, ListVersion:0, ListURLEncode:fs.Tristate{Value:false, Valid:false}, NoCheckBucket:false, NoHead:false, NoHeadObject:false, Enc:0x3000002, MemoryPoolFlushTime:60000000000, MemoryPoolUseMmap:false, DisableHTTP2:false, DownloadURL:"", DirectoryMarkers:false, UseMultipartEtag:fs.Tristate{Value:false, Valid:false}, UsePresignedRequest:false, Versions:false, VersionAt:fs.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}, Decompress:false, MightGzip:fs.Tristate{Value:false, Valid:false}, UseAcceptEncodingGzip:fs.Tristate{Value:false, Valid:false}, NoSystemMetadata:false}
2023/07/25 19:45:52 DEBUG : Resolving service "s3" region "us-east-1"
2023/07/25 19:45:52 DEBUG : testfile1.txt: Excluded
2023/07/25 19:45:53 DEBUG : testfile2.txt: Excluded

And just in case it's relevant, here's an excerpt of the output from the previous command with --dump headers appended:

2023/07/25 19:48:54 DEBUG : HTTP REQUEST (req 0xc000902300)
2023/07/25 19:48:54 DEBUG : HEAD /rclone-debugging/testfile1.txt HTTP/1.1
Host: XXXXXXXXXXXXXXXXXXXX
User-Agent: rclone/v1.63.1
Authorization: XXXX
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20230725T194854Z

2023/07/25 19:48:54 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023/07/25 19:48:54 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2023/07/25 19:48:54 DEBUG : HTTP RESPONSE (req 0xc000902300)
2023/07/25 19:48:54 DEBUG : HTTP/1.1 200 OK
Content-Length: 4
Accept-Ranges: bytes
Cf-Ray: 7ec6ec4daf565a1b-IAD
Connection: keep-alive
Content-Type: text/plain
Date: Tue, 25 Jul 2023 19:48:54 GMT
Etag: "59bcc3ad6775562f845953cf01624225"
Last-Modified: Tue, 25 Jul 2023 19:28:31 GMT
Server: cloudflare

2023/07/25 19:48:54 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2023/07/25 19:48:54 DEBUG : testfile1.txt: Excluded
2023/07/25 19:48:54 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Metadata is fully supported on S3, but Swift doesn't have metadata support yet (by that I mean support for the --metadata flag).

I'm not sure we'd put X-Object-Manifest in the metadata though; it's very Swift-specific. Why do you need it exported to S3?

The easiest way to check metadata is to use rclone lsjson -M, or, for a single file, rclone lsjson -M --stat. If you do that you'll see metadata on S3 objects but not on Swift.
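For example, to check a single one of your test objects, something like:

$ rclone lsjson -M --stat r2-development:rclone-debugging/testfile1.txt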

Hey @ncw — thanks so much for the reply.

Ah, okay, this has helped with debugging a lot. It looks like on R2 the only fields found in Metadata are btime, content-type and tier:

$ rclone lsjson -M r2-development:rclone-debugging
[
{"Path":"testfile1.txt","Name":"testfile1.txt","Size":4,"MimeType":"text/plain","ModTime":"2023-07-25T19:28:31.010000000Z","IsDir":false,"Tier":"STANDARD","Metadata":{"btime":"2023-07-25T19:28:31.01Z","content-type":"text/plain","tier":"STANDARD"}},
{"Path":"testfile2.txt","Name":"testfile2.txt","Size":4,"MimeType":"text/plain","ModTime":"2023-07-25T19:28:31.102000000Z","IsDir":false,"Tier":"STANDARD","Metadata":{"btime":"2023-07-25T19:28:31.102Z","content-type":"text/plain","tier":"STANDARD"}}
]
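
I think that also explains why my Etag filters matched nothing: etag never appears among the Metadata key=value pairs the filters are applied to. If I'm reading the filtering docs right, a filter on a field that does exist should match both files, e.g.:

$ rclone lsf r2-development:rclone-debugging -M --metadata-include "tier=STANDARD"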

In some cases we use X-Object-Manifest to implement a public-facing symlink to other files. So we may have:

- internal-file.dat ← actual data file
- public1.abc ← empty, X-Object-Manifest points to internal-file.dat
- public2.abc ← empty, X-Object-Manifest points to internal-file.dat

My hope was that if we cloned over public1.abc and public2.abc and retained their X-Object-Manifest headers (renamed to X-Amz-[...] as needed), we could handle our own symlinking behaviour on R2 with a worker. The current state of affairs is that if we rclone copy the above files from Swift to R2, we'll end up with all three files containing the contents of internal-file.dat.
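For reference, a pointer object like public1.abc can be created with a zero-byte PUT plus the manifest header ($STORAGE_URL and $TOKEN are placeholders for the account storage URL and auth token):

$ curl -X PUT "$STORAGE_URL/container1/public1.abc" \
    -H "X-Auth-Token: $TOKEN" \
    -H "X-Object-Manifest: container1/internal-file.dat" \
    --data-binary ''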

When I add the --swift-no-large-objects flag, I do see it failing to copy objects with the X-Object-Manifest header, but not in the way I would expect.

For example:

$ rclone lsl -M ovh-development:container1/dlotest/
       22 2023-07-12 21:51:59.000000000 chunk-0
       23 2023-07-12 21:51:59.000000000 chunk-1
       45 2023-07-12 21:50:51.000000000 file.txt

In this case file.txt is an empty manifest object whose X-Object-Manifest points at chunk-0 and chunk-1, so downloading it returns their concatenated contents.

If I try to copy these files to R2 with the --swift-no-large-objects flag, this is what I see:

$ rclone copy -M ovh-development:container1/dlotest/ r2-development:rclone-swift-large-objects --swift-no-large-objects -vv
2023/07/26 18:55:33 DEBUG : Setting --config "/app/config/rclone.conf" from environment variable RCLONE_CONFIG="/app/config/rclone.conf"
2023/07/26 18:55:33 DEBUG : rclone: Version "v1.63.1" starting with parameters ["rclone" "copy" "-M" "ovh-development:container1/dlotest/" "r2-development:rclone-swift-large-objects" "--swift-no-large-objects" "-vv"]
2023/07/26 18:55:33 DEBUG : Creating backend with remote "ovh-development:container1/dlotest/"
2023/07/26 18:55:33 DEBUG : Using config file from "/app/config/rclone.conf"
2023/07/26 18:55:33 DEBUG : ovh-development: detected overridden config - adding "{i25p5}" suffix to name
2023/07/26 18:55:34 DEBUG : fs cache: renaming cache item "ovh-development:container1/dlotest/" to be canonical "ovh-development{i25p5}:container1/dlotest"
2023/07/26 18:55:34 DEBUG : Creating backend with remote "r2-development:rclone-swift-large-objects"
2023/07/26 18:55:34 DEBUG : name = "r2-development", root = "rclone-swift-large-objects", opt = &s3.Options{Provider:"Cloudflare", EnvAuth:true, AccessKeyID:"", SecretAccessKey:"", Region:"", Endpoint:"https://XXXXXXXXXXXXXXXX.r2.cloudflarestorage.com/", STSEndpoint:"", LocationConstraint:"", ACL:"", BucketACL:"", RequesterPays:false, ServerSideEncryption:"", SSEKMSKeyID:"", SSECustomerAlgorithm:"", SSECustomerKey:"", SSECustomerKeyBase64:"", SSECustomerKeyMD5:"", StorageClass:"", UploadCutoff:209715200, CopyCutoff:4999341932, ChunkSize:5242880, MaxUploadParts:10000, DisableChecksum:false, SharedCredentialsFile:"", Profile:"", SessionToken:"", UploadConcurrency:4, ForcePathStyle:true, V2Auth:false, UseAccelerateEndpoint:false, LeavePartsOnError:false, ListChunk:1000, ListVersion:0, ListURLEncode:fs.Tristate{Value:false, Valid:false}, NoCheckBucket:false, NoHead:false, NoHeadObject:false, Enc:0x3000002, MemoryPoolFlushTime:60000000000, MemoryPoolUseMmap:false, DisableHTTP2:false, DownloadURL:"", DirectoryMarkers:false, UseMultipartEtag:fs.Tristate{Value:false, Valid:false}, UsePresignedRequest:false, Versions:false, VersionAt:fs.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}, Decompress:false, MightGzip:fs.Tristate{Value:false, Valid:false}, UseAcceptEncodingGzip:fs.Tristate{Value:false, Valid:false}, NoSystemMetadata:false}
2023/07/26 18:55:34 DEBUG : Resolving service "s3" region "us-east-1"
2023/07/26 18:55:34 DEBUG : chunk-0: Need to transfer - File not found at Destination
2023/07/26 18:55:34 DEBUG : chunk-1: Need to transfer - File not found at Destination
2023/07/26 18:55:34 DEBUG : file.txt: Need to transfer - File not found at Destination
2023/07/26 18:55:34 DEBUG : S3 bucket rclone-swift-large-objects: Waiting for checks to finish
2023/07/26 18:55:34 DEBUG : S3 bucket rclone-swift-large-objects: Waiting for transfers to finish
2023/07/26 18:55:34 ERROR : file.txt: Failed to copy: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 
2023/07/26 18:55:34 DEBUG : chunk-0: md5 = 5da91c22318ade8cea63b3b953aacce1 OK
2023/07/26 18:55:34 INFO  : chunk-0: Copied (new)
2023/07/26 18:55:34 DEBUG : chunk-1: md5 = f36ea43e34bb30404c3a4842f2194e2e OK
2023/07/26 18:55:34 INFO  : chunk-1: Copied (new)
2023/07/26 18:55:34 ERROR : Attempt 1/3 failed with 1 errors and: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 
2023/07/26 18:55:34 DEBUG : file.txt: Need to transfer - File not found at Destination
2023/07/26 18:55:34 DEBUG : S3 bucket rclone-swift-large-objects: Waiting for checks to finish
2023/07/26 18:55:34 DEBUG : chunk-1: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/26 18:55:34 DEBUG : chunk-1: Unchanged skipping
2023/07/26 18:55:34 DEBUG : chunk-0: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/26 18:55:34 DEBUG : chunk-0: Unchanged skipping
2023/07/26 18:55:34 DEBUG : S3 bucket rclone-swift-large-objects: Waiting for transfers to finish
2023/07/26 18:55:34 ERROR : file.txt: Failed to copy: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 
2023/07/26 18:55:34 ERROR : Attempt 2/3 failed with 1 errors and: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 
2023/07/26 18:55:34 DEBUG : file.txt: Need to transfer - File not found at Destination
2023/07/26 18:55:34 DEBUG : S3 bucket rclone-swift-large-objects: Waiting for checks to finish
2023/07/26 18:55:35 DEBUG : chunk-1: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/26 18:55:35 DEBUG : chunk-1: Unchanged skipping
2023/07/26 18:55:35 DEBUG : chunk-0: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/26 18:55:35 DEBUG : chunk-0: Unchanged skipping
2023/07/26 18:55:35 DEBUG : S3 bucket rclone-swift-large-objects: Waiting for transfers to finish
2023/07/26 18:55:35 ERROR : file.txt: Failed to copy: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 
2023/07/26 18:55:35 ERROR : Attempt 3/3 failed with 1 errors and: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 
2023/07/26 18:55:35 INFO  : 
Transferred:   	        180 B / 180 B, 100%, 0 B/s, ETA -
Errors:                 1 (retrying may help)
Checks:                 4 / 4, 100%
Transferred:            2 / 2, 100%
Elapsed time:         1.6s

2023/07/26 18:55:35 DEBUG : 23 go routines active
2023/07/26 18:55:35 Failed to copy: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 

Ultimately this does copy the chunk files and skips the file with the X-Object-Manifest header, so it's producing the outcome I'd expect, but I'm unsure whether this is the intended behaviour: does it implement skipping large objects by letting the MD5 check fail for them? It also looks like it attempts to copy the chunk files a few times (but skips the subsequent attempts, as it sees them present on the destination). Finally, rclone exits with a return code of 1 here, so if I'm watching for nonzero exit codes to indicate failure, this will trip that:

$ echo $?
1

There can be other stuff in there if it is set, including user metadata.

I see. So X-Object-Manifest, which usually points to a file's own data chunks, here actually points to the chunks for a different file.

Rclone normally reads the manifests and fetches the chunks without user interaction, which is why I was thinking it was an odd thing to put in the metadata.

It sounds like, in an ideal world, you'd want rclone not to copy the files which are effectively symlinks.

Swift is rather primitive when it comes to large files! Your chunks were in the scope of the copy, so they got copied. Rclone can't tell that they are part of a large object.

The recommendation as far as I remember was to put the chunks in a separate container and that is what rclone normally does.

From the --swift-no-large-objects docs:

If you set this option and there are static or dynamic large objects,
then this will give incorrect hashes for them. Downloads will succeed,
but other operations such as Remove and Copy will fail.

So that is why you are getting the checksum failure.

It looks like what you really want is a --swift-skip-large-objects flag which would leave them out of directory listings entirely.

I guess this could be done with the manifest metadata too.

So, if we were to put the value of the manifest into the metadata, you could then filter on it using metadata filtering. You could filter such that the non-symlink-type manifests get copied and the symlink-type ones don't (assuming you can tell them apart with a regexp).

I can't think of a mechanism which would let you copy the symlink-type objects without copying their data, but what you could do is use rclone lsjson or rclone lsf to find and list them all, and a simple script could then create the empty objects with metadata.
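If the Swift backend exposed the manifest under a metadata key such as x-object-manifest (a hypothetical key name), the finding step could be a one-liner, e.g.:

$ rclone lsjson -M ovh-development:container1 | jq -r '.[] | select(.Metadata["x-object-manifest"] != null) | .Path'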

So for this to work this would need metadata support added to the swift backend. You could have a go at this yourself, or you could take out a support contract if you are working for a company.

Thanks again for the response @ncw.

That's exactly right. I know I read the documentation on that, but at some point it slipped my mind and I convinced myself the flag was supposed to make rclone skip large objects. Sorry for the mix-up on my end.

I know that Swift listings don't include any info that disambiguates large objects, so I'm guessing this flag is aimed at cutting down on all those extra HEADs when they're not needed.

I'll absolutely take a crack at it — I'll likely pop back up here if I have any questions as I work on this.

One other thing I wanted to flag to you: it looks like when --swift-no-large-objects causes that hash failure and the resulting transfer retry, the retry covers all files, not just the DLO file with the failed hash. I've added a handful of additional files to my test path, and you can see in the logs below that every file in the source path gets attempted three times. If you'd like me to open a bug report about this, I'm happy to.

$ rclone copy -M ovh-development:container1/dlotest/ r2-development:rclone-swift-large-objects --swift-no-large-objects -vv
2023/07/27 17:32:23 DEBUG : Setting --config "/app/config/rclone.conf" from environment variable RCLONE_CONFIG="/app/config/rclone.conf"
2023/07/27 17:32:23 DEBUG : rclone: Version "v1.63.1" starting with parameters ["rclone" "copy" "-M" "ovh-development:container1/dlotest/" "r2-development:rclone-swift-large-objects" "--swift-no-large-objects" "-vv"]
2023/07/27 17:32:23 DEBUG : Creating backend with remote "ovh-development:container1/dlotest/"
2023/07/27 17:32:23 DEBUG : Using config file from "/app/config/rclone.conf"
2023/07/27 17:32:23 DEBUG : ovh-development: detected overridden config - adding "{i25p5}" suffix to name
2023/07/27 17:32:23 DEBUG : fs cache: renaming cache item "ovh-development:container1/dlotest/" to be canonical "ovh-development{i25p5}:container1/dlotest"
2023/07/27 17:32:23 DEBUG : Creating backend with remote "r2-development:rclone-swift-large-objects"
2023/07/27 17:32:23 DEBUG : name = "r2-development", root = "rclone-swift-large-objects", opt = &s3.Options{Provider:"Cloudflare", EnvAuth:true, AccessKeyID:"", SecretAccessKey:"", Region:"", Endpoint:"https://XXXXXXXXXXXX.r2.cloudflarestorage.com/", STSEndpoint:"", LocationConstraint:"", ACL:"", BucketACL:"", RequesterPays:false, ServerSideEncryption:"", SSEKMSKeyID:"", SSECustomerAlgorithm:"", SSECustomerKey:"", SSECustomerKeyBase64:"", SSECustomerKeyMD5:"", StorageClass:"", UploadCutoff:209715200, CopyCutoff:4999341932, ChunkSize:5242880, MaxUploadParts:10000, DisableChecksum:false, SharedCredentialsFile:"", Profile:"", SessionToken:"", UploadConcurrency:4, ForcePathStyle:true, V2Auth:false, UseAccelerateEndpoint:false, LeavePartsOnError:false, ListChunk:1000, ListVersion:0, ListURLEncode:fs.Tristate{Value:false, Valid:false}, NoCheckBucket:false, NoHead:false, NoHeadObject:false, Enc:0x3000002, MemoryPoolFlushTime:60000000000, MemoryPoolUseMmap:false, DisableHTTP2:false, DownloadURL:"", DirectoryMarkers:false, UseMultipartEtag:fs.Tristate{Value:false, Valid:false}, UsePresignedRequest:false, Versions:false, VersionAt:fs.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}, Decompress:false, MightGzip:fs.Tristate{Value:false, Valid:false}, UseAcceptEncodingGzip:fs.Tristate{Value:false, Valid:false}, NoSystemMetadata:false}
2023/07/27 17:32:23 DEBUG : Resolving service "s3" region "us-east-1"
2023/07/27 17:32:23 DEBUG : chunk-0: Need to transfer - File not found at Destination
2023/07/27 17:32:23 DEBUG : chunk-1: Need to transfer - File not found at Destination
2023/07/27 17:32:23 DEBUG : file.txt: Need to transfer - File not found at Destination
2023/07/27 17:32:23 DEBUG : file1.abc: Need to transfer - File not found at Destination
2023/07/27 17:32:23 DEBUG : file2.abc: Need to transfer - File not found at Destination
2023/07/27 17:32:23 DEBUG : file3.abc: Need to transfer - File not found at Destination
2023/07/27 17:32:23 DEBUG : file4.abc: Need to transfer - File not found at Destination
2023/07/27 17:32:23 DEBUG : file5.abc: Need to transfer - File not found at Destination
2023/07/27 17:32:23 DEBUG : file6.abc: Need to transfer - File not found at Destination
2023/07/27 17:32:23 DEBUG : file7.abc: Need to transfer - File not found at Destination
2023/07/27 17:32:23 DEBUG : file8.abc: Need to transfer - File not found at Destination
2023/07/27 17:32:23 DEBUG : S3 bucket rclone-swift-large-objects: Waiting for checks to finish
2023/07/27 17:32:23 DEBUG : S3 bucket rclone-swift-large-objects: Waiting for transfers to finish
2023/07/27 17:32:24 DEBUG : chunk-0: md5 = 5da91c22318ade8cea63b3b953aacce1 OK
2023/07/27 17:32:24 INFO  : chunk-0: Copied (new)
2023/07/27 17:32:24 ERROR : file.txt: Failed to copy: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 
2023/07/27 17:32:24 DEBUG : chunk-1: md5 = f36ea43e34bb30404c3a4842f2194e2e OK
2023/07/27 17:32:24 INFO  : chunk-1: Copied (new)
2023/07/27 17:32:24 DEBUG : file1.abc: md5 = c70fed2f31eccabf0424c0720daa615d OK
2023/07/27 17:32:24 INFO  : file1.abc: Copied (new)
2023/07/27 17:32:24 DEBUG : file3.abc: md5 = c70fed2f31eccabf0424c0720daa615d OK
2023/07/27 17:32:24 INFO  : file3.abc: Copied (new)
2023/07/27 17:32:24 DEBUG : file4.abc: md5 = c70fed2f31eccabf0424c0720daa615d OK
2023/07/27 17:32:24 INFO  : file4.abc: Copied (new)
2023/07/27 17:32:24 DEBUG : file2.abc: md5 = c70fed2f31eccabf0424c0720daa615d OK
2023/07/27 17:32:24 INFO  : file2.abc: Copied (new)
2023/07/27 17:32:25 DEBUG : file5.abc: md5 = c70fed2f31eccabf0424c0720daa615d OK
2023/07/27 17:32:25 INFO  : file5.abc: Copied (new)
2023/07/27 17:32:25 DEBUG : file6.abc: md5 = c70fed2f31eccabf0424c0720daa615d OK
2023/07/27 17:32:25 INFO  : file6.abc: Copied (new)
2023/07/27 17:32:25 DEBUG : file7.abc: md5 = c70fed2f31eccabf0424c0720daa615d OK
2023/07/27 17:32:25 INFO  : file7.abc: Copied (new)
2023/07/27 17:32:25 DEBUG : file8.abc: md5 = c70fed2f31eccabf0424c0720daa615d OK
2023/07/27 17:32:25 INFO  : file8.abc: Copied (new)
2023/07/27 17:32:25 ERROR : Attempt 1/3 failed with 1 errors and: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 
2023/07/27 17:32:25 DEBUG : file.txt: Need to transfer - File not found at Destination
2023/07/27 17:32:25 DEBUG : S3 bucket rclone-swift-large-objects: Waiting for checks to finish
2023/07/27 17:32:25 DEBUG : file1.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file1.abc: Unchanged skipping
2023/07/27 17:32:25 DEBUG : file3.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file3.abc: Unchanged skipping
2023/07/27 17:32:25 DEBUG : chunk-0: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : chunk-0: Unchanged skipping
2023/07/27 17:32:25 DEBUG : chunk-1: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : chunk-1: Unchanged skipping
2023/07/27 17:32:25 DEBUG : file4.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file4.abc: Unchanged skipping
2023/07/27 17:32:25 DEBUG : file6.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file6.abc: Unchanged skipping
2023/07/27 17:32:25 DEBUG : file5.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file5.abc: Unchanged skipping
2023/07/27 17:32:25 ERROR : file.txt: Failed to copy: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 
2023/07/27 17:32:25 DEBUG : file8.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file8.abc: Unchanged skipping
2023/07/27 17:32:25 DEBUG : file7.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file7.abc: Unchanged skipping
2023/07/27 17:32:25 DEBUG : file2.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file2.abc: Unchanged skipping
2023/07/27 17:32:25 DEBUG : S3 bucket rclone-swift-large-objects: Waiting for transfers to finish
2023/07/27 17:32:25 ERROR : Attempt 2/3 failed with 1 errors and: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 
2023/07/27 17:32:25 DEBUG : file.txt: Need to transfer - File not found at Destination
2023/07/27 17:32:25 DEBUG : S3 bucket rclone-swift-large-objects: Waiting for checks to finish
2023/07/27 17:32:25 DEBUG : chunk-0: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : chunk-0: Unchanged skipping
2023/07/27 17:32:25 DEBUG : file2.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file2.abc: Unchanged skipping
2023/07/27 17:32:25 DEBUG : chunk-1: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : chunk-1: Unchanged skipping
2023/07/27 17:32:25 DEBUG : file4.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file4.abc: Unchanged skipping
2023/07/27 17:32:25 DEBUG : file5.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file5.abc: Unchanged skipping
2023/07/27 17:32:25 DEBUG : file3.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file3.abc: Unchanged skipping
2023/07/27 17:32:25 DEBUG : file1.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file1.abc: Unchanged skipping
2023/07/27 17:32:25 DEBUG : file6.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:25 DEBUG : file6.abc: Unchanged skipping
2023/07/27 17:32:26 ERROR : file.txt: Failed to copy: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 
2023/07/27 17:32:26 DEBUG : file7.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:26 DEBUG : file7.abc: Unchanged skipping
2023/07/27 17:32:26 DEBUG : file8.abc: Size and modification time the same (differ by 0s, within tolerance 1ns)
2023/07/27 17:32:26 DEBUG : file8.abc: Unchanged skipping
2023/07/27 17:32:26 DEBUG : S3 bucket rclone-swift-large-objects: Waiting for transfers to finish
2023/07/27 17:32:26 ERROR : Attempt 3/3 failed with 1 errors and: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 
2023/07/27 17:32:26 INFO  : 
Transferred:   	        252 B / 252 B, 100%, 103 B/s, ETA 0s
Errors:                 1 (retrying may help)
Checks:                20 / 20, 100%
Transferred:           10 / 10, 100%
Elapsed time:         2.8s

2023/07/27 17:32:26 DEBUG : 42 go routines active
2023/07/27 17:32:26 Failed to copy: BadDigest: The MD5 checksum you specified did not match what we received.
You provided a MD5 checksum with value: 7f224b56a81fbc6213d9b9fb862ead1d
Actual MD5 was: 6b6429e59ad0fa46950cbf706c97921f
	status code: 400, request id: , host id: 

I've got another possible way of accomplishing this, but I was wondering if you could weigh in on whether it would be a performance nightmare for rclone, @ncw:

We can easily generate a list of all files that are being used as symlinks, so it seems reasonable to me that we could feed that to --filter-from as a big exclude list.

My question is whether rclone is going to be unhappy with a list of 20k+ filter rules in that file. On my test data set the 20k-rule file works fine, but that test only copies a handful of files, and I'm unsure whether checking every file against the filter list will add an untenable amount of processing overhead at scale.
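For reference, the generated exclude file would look roughly like this (illustrative paths and file name; the trailing + ** just makes the default include behaviour explicit):

# symlink-excludes.txt
- /public1.abc
- /public2.abc
+ **

$ rclone copy ovh-development:container1 r2-development:rclone-swift-large-objects --filter-from symlink-excludes.txt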

Alternately...

My other idea here is to try setting some X-Amz-* headers on my symlink files, then having rclone access these files via the S3-compatible API that OVH's Swift does support; at that point I should be able to use the metadata filters as I had originally intended. (Although this could be a long shot, depending on how well the Swift S3 API works.)
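If we go that route, I assume the remote would be configured as a generic S3 provider against OVH's S3-compatible endpoint, roughly like this (remote name and endpoint are placeholders):

[ovh-s3-development]
type = s3
provider = Other
env_auth = true
endpoint = https://s3.<region>.cloud.ovh.net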

Yes, no_large_objects is there to cut down on the HEAD requests, which really kill performance.

You'll need them to read the large object headers though.

Great :slight_smile:

Ideally this would be a full metadata implementation like the one for s3, but that is a bigger job than you actually need.

That is rclone doing a high-level retry, controlled with the --retries flag. If the sync fails, rclone will try the whole thing again; this turns out to be an effective strategy when things go wrong. There is also a --low-level-retries flag which retries individual operations.
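For example, to make a run like yours fail fast instead of re-running the whole sync (the defaults are --retries 3 and --low-level-retries 10):

$ rclone copy -M ovh-development:container1/dlotest/ r2-development:rclone-swift-large-objects --swift-no-large-objects --retries 1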

The files get stored in a Go map, which should be efficient at any scale.

Check out the --no-traverse flag too, which is useful if you are filtering out most of the source files.

That would certainly work! Though the --files-from idea sounds easier.

Awesome. Just to be sure: I'm actually looking at using --filter-from to exclude about 20k files (out of 500-800k total), rather than using --files-from to include specific files. That should hopefully still pose no problems for rclone?

Argh, my eyes, misread what you wrote!

If you use --filter-from then rclone will traverse a list of 20k regexps for every file listed. That sounds bad, but it probably won't take longer than 10ms, and because of rclone's parallel nature I doubt you'll even notice the 10ms-per-file overhead.

If you can use --files-from then it will be quicker.

Hahah, yeah, it looks like it's holding up just fine on the longer filter list. :slightly_smiling_face:

Hey @ncw — got one more for you. I'm trying to generate a list of full paths for files with a specific extension and I'm seeing quite a few more requests than I'm expecting. Here's a sample command:

$ rclone lsf ovh-development: --include "container1/**.png" --files-only --fast-list -R --absolute -v --dump headers --swift-no-large-objects

It looks like two things are happening:

  1. rclone is still listing files inside excluded containers, and then excluding files from the list without further lookups, so this isn't generating a ton of extra requests:
2023/08/01 14:25:54 DEBUG : GET /v1/AUTH_XXXXXXXXXXXXXXXXXXXX/container2?format=json&limit=1000 HTTP/1.1
  • On a similarly filtered command that is a copy, I do see it skipping other containers without looking them up:
2023/08/01 16:34:48 DEBUG : container2: Excluded
  2. More costly, though: rclone is HEADing any zero-length files, despite --swift-no-large-objects, and even if the files don't match the filter:
2023/08/01 14:25:54 DEBUG : HEAD /v1/AUTH_XXXXXXXXXXXXXXXXXXXX/container1/test2.txt HTTP/1.1

From looking at the code for the Swift backend, I'm wondering whether the culprit is newObjectWithInfo: it looks like it HEADs all zero-length (non-directory) objects regardless of whether --swift-no-large-objects is set.

I've given this patch a quick test locally and it seems to remove those extraneous HEADs for zero-length objects with that flag set (and they remain if it's unset), but I'm unsure if there are other considerations I should be worrying about.

diff --git a/backend/swift/swift.go b/backend/swift/swift.go
index 85ce87f25..6fbd740ec 100644
--- a/backend/swift/swift.go
+++ b/backend/swift/swift.go
@@ -561,7 +561,7 @@ func (f *Fs) newObjectWithInfo(ctx context.Context, remote string, info *swift.O
 	// returned as 0 bytes in the listing.  Correct this here by
 	// making sure we read the full metadata for all 0 byte files.
 	// We don't read the metadata for directory marker objects.
-	if info != nil && info.Bytes == 0 && info.ContentType != "application/directory" {
+	if info != nil && info.Bytes == 0 && info.ContentType != "application/directory" && !o.fs.opt.NoLargeObjects {
 		err := o.readMetaData(ctx) // reads info and headers, returning an error
 		if err == fs.ErrorObjectNotFound {
 			// We have a dangling large object here so just return the original metadata

This will be because of --fast-list. It lists everything first and filters afterwards.
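So dropping --fast-list should let the directory filters prune the other containers before they are listed, e.g.:

$ rclone lsf ovh-development: --include "container1/**.png" --files-only -R --absolute --swift-no-large-objects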

That looks more problematic.

I think that patch is correct. no_large_objects says in its help:

When no_large_objects is set, rclone will assume that there are no static or dynamic large objects stored. This means it can stop doing the extra HEAD calls which in turn increases performance greatly especially when doing a swift to swift transfer with --checksum set.

So I think it is a bug that it is doing the HEAD requests.

Can you send a pull request with your patch?

Ah, got it. :slightly_smiling_face:

Sure thing — here's the PR: swift: not HEADing 0-length objects when flag set.

Merged - thank you :slight_smile:

Happy I could help!

Thanks to your help here I've successfully synced over quite a large volume of data from Swift to R2; I'm now looking at how best to copy over any new files on the source daily. My initial thought was to just run a daily copy operation, but I'm now wondering whether running a check --missing-on-dst to produce a file list and then feeding that to a copy command would be faster. Could that be the case, or are they likely pretty equivalent?

Copy and check will be pretty similar.

Using --no-traverse will avoid a scan of the destination if using copy. If the number of new files is small this will be a good flag to use.

I don't think you can avoid a scan of the source though.
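If you do want to try the two-step variant, it would look something like this (note that check exits non-zero when it finds differences):

$ rclone check ovh-development:container1 r2-development:rclone-swift-large-objects --missing-on-dst missing.txt
$ rclone copy ovh-development:container1 r2-development:rclone-swift-large-objects --files-from missing.txt --no-traverse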

Thanks @ncw, that's helped me figure it out.

We're trying to troubleshoot some issues with some transfers and I'm trying to make sure I'm reading the logs correctly:

2023/08/18 17:37:24 NOTICE: XXXXXXXXXXXXXXXXXXXXX: Failed to read metadata: : 
	status code: 522, request id: , host id: 
2023/08/18 17:37:24 INFO  : XXXXXXXXXXXXXXXXXXXXX: Updated modification time in destination

[...]


2023/08/18 17:53:14 ERROR : Attempt 1/1 failed with 4 errors and: march failed with 3 error(s): first error: : 
	status code: 522, request id: , host id: 

This is a copy command from Swift (OVH) to S3 (R2) with -v --retries 1.

What I'm trying to figure out is: are these errors happening on the source or the destination? (I'm guessing destination, as the 522 status code seems to be used by Cloudflare and I don't see references to it anywhere in Swift.) Absent this inference, would there be a straightforward way for me to determine whether these are source or destination errors at this log level? The file throwing the errors exists at the same path on both source and destination.

Also, is rclone redacting the request id and host id fields in my errors on purpose? I couldn't find a flag or documentation on that. Looking at other posted logs, I'm seeing that they usually don't contain these values, but occasionally do.

Sometimes it is difficult to tell source vs destination errors. You can turn on -vv --dump headers to see the HTTP request/response headers, which will help.
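The Host: header on each request will show whether it went to OVH or R2, so something like this (grep pattern just for illustration) can narrow it down:

$ rclone copy ovh-development:container1 r2-development:rclone-swift-large-objects -vv --dump headers 2>&1 | grep -E 'HTTP (REQUEST|RESPONSE)|Host:'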

Rclone just passes on what the provider sends it. The error message normally only shows the text part of the error, so if the provider put the request id and host id in there you'll see them.
