Is there a way not to use a delimiter for listing?

What is the problem you are having with rclone?

We're trying to optimise syncing a large Swift container to S3 (16M objects, >10 TB). I'm trying to understand rclone's design but am struggling with the delimiter handling.

It seems to me that with two "flat" object storage platforms like Swift and S3, the delimiter ('/') only causes a ton of unnecessary listing queries when there is a lot of "delimiter depth" in the storage. The only option to disable the delimiter appears to be --fast-list, which seems to require storing all the listing results in memory up front. Is there a design reason rclone can't iterate through the listing without a delimiter, and without holding the entire listing in memory?
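To illustrate what I'm imagining, here is a minimal sketch in Go (purely hypothetical, not rclone code) of a flat, marker-paginated walk over a Swift container using the standard Swift listing API. The endpoint, container and token values are placeholders, and only one 1000-object page is ever held in memory:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// swiftObject mirrors the fields the Swift listing already returns per object.
type swiftObject struct {
	Name         string `json:"name"`
	Hash         string `json:"hash"`
	Bytes        int64  `json:"bytes"`
	LastModified string `json:"last_modified"`
}

// listFlat walks a container with no delimiter, one 1000-entry page at a
// time, using the last name of each page as the marker for the next.
func listFlat(endpoint, container, token string, each func(swiftObject)) error {
	marker := ""
	for {
		q := url.Values{"format": {"json"}, "limit": {"1000"}}
		if marker != "" {
			q.Set("marker", marker)
		}
		req, err := http.NewRequest("GET", endpoint+"/"+container+"?"+q.Encode(), nil)
		if err != nil {
			return err
		}
		req.Header.Set("X-Auth-Token", token)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		var page []swiftObject
		err = json.NewDecoder(resp.Body).Decode(&page)
		resp.Body.Close()
		if err != nil {
			return err
		}
		for _, o := range page {
			each(o)
		}
		if len(page) < 1000 {
			return nil // short page means we reached the end
		}
		marker = page[len(page)-1].Name
	}
}

func main() {
	// placeholder endpoint and token; a real run would take these from auth
	err := listFlat("https://storage.example.com/v1/AUTH_mytenant", "mycontainer", "XXXX",
		func(o swiftObject) { fmt.Println(o.Bytes, o.LastModified, o.Name) })
	if err != nil {
		fmt.Println("listing failed:", err)
	}
}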

What is your rclone version (output from rclone version)

rclone v1.55.1

  • os/type: linux
  • os/arch: amd64
  • go/version: go1.16.3
  • go/linking: static
  • go/tags: none

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Ubuntu 18.04.2 LTS, 64 bit

Which cloud storage system are you using? (eg Google Drive)

Swift + S3

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone sync swift:mycontainer aws:mycontainerbackup --stats 30s --dry-run --size-only -vv --dump-bodies --log-file synctest.log

The rclone config contents with secrets removed.

[aws]
type = s3
provider = AWS
env_auth = false
access_key_id = [redact]
secret_access_key = [redact]
region = [redact]
location_constraint = [redact]
acl = private

[memstore]
type = swift
env_auth = false
user = [redact]
key = [redact]
auth = https://[redact]/v2.0
tenant = mytenant

A log from the command with the -vv flag

2021/07/30 14:10:02 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2021/07/30 14:10:02 DEBUG : HTTP REQUEST (req 0xc0000fca00)
2021/07/30 14:10:02 DEBUG : GET /[redact]/mycontainer?delimiter=%2F&format=json&limit=1000 HTTP/1.1
Host: [redact]
User-Agent: rclone/v1.55.1
Transfer-Encoding: chunked
X-Auth-Token: XXXX
Accept-Encoding: gzip

0

hello and welcome to the forum,

is this a one-time sync or to be run multiple times?

tho not a direct answer to your question, i thought i would share it with you.

given that both swift and s3 use md5 checksums, you should find better performance using
https://rclone.org/docs/#c-checksum

hi @asdffdsa, thanks :slight_smile:

this is to be run on a schedule. in my testing --checksum was significantly slower :man_shrugging:

as a workaround, i was looking into using rclone ls and building my own syncing logic off the back of that. the listing query seems to return all the information rclone needs to list each object, but rclone appears to issue a separate HEAD request for every object it lists. is this intentional, and if so, what is the reason?

with --dump-bodies, the listing query returns something like:

[{"hash": "d41d8cd98f00b204e9800998ecf8427e", "last_modified": "2016-02-15T19:45:54.843920", "bytes": 0, "name": ".", "content_type": "application/json"}, {"hash": "dfa5339c050c75b6dc81aa46bbfb2673", "last_modified": "2018-02-13T11:03:29.113780", "bytes": 16944, "name": "foobar.docx", "content_type": "application/octet-stream"}, ...]

and then there's subsequent HEAD requests like:

2021/07/30 15:04:45 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2021/07/30 15:04:45 DEBUG : HTTP REQUEST (req 0xc000500800)
2021/07/30 15:04:45 DEBUG : HEAD [redact]/foobar.docx HTTP/1.1
Host: [redact]
User-Agent: rclone/v1.55.1
Transfer-Encoding: chunked
X-Auth-Token: XXXX

0

2021/07/30 15:04:45 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2021/07/30 15:04:45 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2021/07/30 15:04:45 DEBUG : HTTP RESPONSE (req 0xc000500800)
2021/07/30 15:04:45 DEBUG : HTTP/1.1 200 OK
Content-Length: 16944
Accept-Ranges: bytes
Content-Type: application/octet-stream
Date: Fri, 30 Jul 2021 14:04:45 GMT
Etag: dfa5339c050c75b6dc81aa46bbfb2673
Last-Modified: Tue, 13 Feb 2018 11:03:29 GMT
X-Timestamp: 1518519809.11378
X-Trans-Id: txeab3baca5f8749e097c26-00610406fd

with the terminal output of:

        0 2016-02-15 19:45:54.000000000 .
    16944 2018-02-13 11:03:29.000000000 foobar.docx

clearly with >16M objects we don't want a request per object when all the information for 1,000 objects comes back in a single listing response. at 1,000 objects per page that's roughly 16,000 listing calls for the whole container, versus 16M HEAD requests.
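to illustrate, here is a quick hypothetical go sketch (not rclone code) that decodes the redacted listing entry quoted above -- size, mtime and md5 all come straight out of the json, so in principle nothing here needs a HEAD:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

func main() {
	// one (redacted) entry from the listing page quoted above
	page := `[{"hash": "dfa5339c050c75b6dc81aa46bbfb2673", "last_modified": "2018-02-13T11:03:29.113780", "bytes": 16944, "name": "foobar.docx", "content_type": "application/octet-stream"}]`
	var objs []struct {
		Hash         string `json:"hash"`
		LastModified string `json:"last_modified"`
		Bytes        int64  `json:"bytes"`
		Name         string `json:"name"`
	}
	if err := json.Unmarshal([]byte(page), &objs); err != nil {
		panic(err)
	}
	for _, o := range objs {
		// swift's last_modified carries no zone suffix; it is UTC
		t, err := time.Parse("2006-01-02T15:04:05", o.LastModified)
		if err != nil {
			panic(err)
		}
		// same shape as the terminal output above, plus the md5
		fmt.Printf("%9d %s %s md5=%s\n", o.Bytes, t.Format("2006-01-02 15:04:05"), o.Name, o.Hash)
	}
}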

i think i can see why --fast-list requires the full listing by design: otherwise it can't compare storageA with storageB deterministically?

it should arguably be possible to avoid that using consistent listing markers on both storage accounts, though :thinking:.

e.g.:

  • loop
    • list 1000 from storageA
    • get the marker
    • list from storageB up until marker
    • do checking & syncing

or something similar (rough sketch below)... but i can see how this is difficult to make portable across backends, particularly in that it requires identical listing order on both sides...
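here's a rough sketch of that merge-join idea in go. it's purely hypothetical (sliceLister stands in for the paginated listers, and nothing here reflects rclone's internals), just to show that only one object per side needs to be in memory at once:

package main

import "fmt"

// obj is the minimum we need from a listing entry.
type obj struct {
	Name string
	Size int64
	MD5  string
}

// lister yields objects from a name-ordered listing; ok=false at the end.
type lister func() (o obj, ok bool)

// syncDiff merge-joins two name-ordered listings and prints what a sync
// would have to do, advancing whichever side has the smaller name.
func syncDiff(a, b lister) {
	oa, okA := a()
	ob, okB := b()
	for okA || okB {
		switch {
		case !okB || (okA && oa.Name < ob.Name): // only on A
			fmt.Println("copy to B:", oa.Name)
			oa, okA = a()
		case !okA || ob.Name < oa.Name: // only on B
			fmt.Println("delete from B:", ob.Name)
			ob, okB = b()
		default: // same name on both sides: compare size/md5
			if oa.Size != ob.Size || oa.MD5 != ob.MD5 {
				fmt.Println("update on B:", oa.Name)
			}
			oa, okA = a()
			ob, okB = b()
		}
	}
}

// sliceLister fakes a paginated listing from a sorted slice.
func sliceLister(objs []obj) lister {
	i := 0
	return func() (obj, bool) {
		if i >= len(objs) {
			return obj{}, false
		}
		i++
		return objs[i-1], true
	}
}

func main() {
	a := sliceLister([]obj{{"a.txt", 1, "x"}, {"b.txt", 2, "y"}})
	b := sliceLister([]obj{{"b.txt", 2, "y"}, {"c.txt", 3, "z"}})
	syncDiff(a, b)
	// prints:
	//   copy to B: a.txt
	//   delete from B: c.txt
}

the catch is exactly the portability problem above: this only works if both backends return names in the same byte-wise sort order.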
