Seeking a way to speed up the list objects process with S3

What is the problem you are having with rclone?

My case is to sync data between two Ceph clusters via the S3 interface. I have some buckets that contain a large number of small objects (more than 600,000 objects per bucket, where each object is 4 KB ~ 4 MB).

I understand that the sync command will first list all objects before the real sync process. The problem is that the list process costs too much time: each API call retrieves at most 1000 objects and takes around 3s, so 600,000 objects means 1800s, that is half an hour (in fact I have other buckets with billions of objects to be migrated...), and the slow list process seems to be a known flaw of Ceph.

I'm asking for help: apart from --fast-list, is there any other way, idea, or suggestion to speed this migration up? Thanks.

Anything would be appreciated!

On the other hand, we actually made some custom modifications to speed up Ceph's list_objects process. It still uses the S3 protocol, but it needs the client to pass a custom parameter in the list_objects API call. Does rclone support this kind of custom parameter? Modifying the source code is also acceptable for us, so please help us figure out where to start. A rough sketch of what I mean is below.
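To make the idea concrete, here is roughly what the custom call looks like when done directly with aws-sdk-go (the library rclone's S3 backend is built on). This is a standalone sketch rather than rclone code, and the x-custom-fast-list parameter name is only a placeholder for our real one; presumably the equivalent place to hook this into rclone would be the listing code in backend/s3/s3.go.

    package main

    import (
        "fmt"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/credentials"
        "github.com/aws/aws-sdk-go/aws/request"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    func main() {
        // Point the SDK at the Ceph RGW endpoint, using path-style addressing as rclone does for Ceph.
        sess := session.Must(session.NewSession(&aws.Config{
            Credentials:      credentials.NewStaticCredentials("XXX", "XXX", ""),
            Endpoint:         aws.String("http://127.0.0.1:8080"),
            Region:           aws.String("us-east-1"),
            S3ForcePathStyle: aws.Bool(true),
        }))
        svc := s3.New(sess)

        // Add an extra query parameter to every ListObjects request.
        // "x-custom-fast-list" is a placeholder for whatever parameter the
        // modified RGW expects. Registering it as a Build handler means it
        // runs before signing, so the parameter is covered by the signature.
        svc.Handlers.Build.PushBack(func(r *request.Request) {
            if r.Operation.Name == "ListObjects" {
                q := r.HTTPRequest.URL.Query()
                q.Set("x-custom-fast-list", "true")
                r.HTTPRequest.URL.RawQuery = q.Encode()
            }
        })

        out, err := svc.ListObjects(&s3.ListObjectsInput{
            Bucket:  aws.String("hyt_src_bucket2"),
            MaxKeys: aws.Int64(1000),
        })
        if err != nil {
            panic(err)
        }
        fmt.Println("objects in first page:", len(out.Contents))
    }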

Thanks.

What is your rclone version (output from rclone version)

rclone v1.56.0

  • os/version: bigcloud 7.6.1905 (64 bit)
  • os/kernel: 4.19.25-200.el7.bclinux.x86_64 (x86_64)
  • os/type: linux
  • os/arch: amd64
  • go/version: go1.16.5
  • go/linking: static
  • go/tags: none

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Centos 7.6 64bit

Which cloud storage system are you using? (eg Google Drive)

S3 Ceph

The command you were trying to run (eg rclone copy /tmp remote:tmp)

 rclone --config=/root/rclone.conf size src:hyt_src_bucket2

The rclone config contents with secrets removed.

[src]
type = s3
provider = Ceph
env_auth = false
access_key_id = XXX
secret_access_key = XXX
endpoint = http://127.0.0.1:8080

A log from the command with the -vv flag

2021/07/28 16:20:33 DEBUG : rclone: Version "v1.56.0" starting with parameters ["rclone" "--config=/root/rclone.conf" "size" "src:hyt_src_bucket2" "-vv"]
2021/07/28 16:20:33 DEBUG : Creating backend with remote "src:hyt_src_bucket2"
2021/07/28 16:20:33 DEBUG : Using config file from "/root/rclone.conf"

hello and welcome to the forum,

is this a one time sync or to be run on a schedule?

assuming recurring syncs, once the first sync is done, for each additional sync, how many files would need to be synced?

for the first sync, this might help
https://rclone.org/docs/#no-check-dest
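for that first full copy, it could look something like this (dst: and the destination bucket name are placeholders for your second cluster):

 rclone --config=/root/rclone.conf copy src:hyt_src_bucket2 dst:hyt_dst_bucket2 --no-check-dest -P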

one option is to write a script, as shown in this post
https://forum.rclone.org/t/efficiently-update-copy-only-very-few-files-to-google-drive/23276/7

Thanks for the links, I will definitely look into them.

For your questions: the whole migration is a one-time sync, which simply migrates data from one Ceph cluster to another. To achieve this, it may need multiple rclone sync runs, and each additional sync may have around 100,000 incremental files.

We recently implemented

  --s3-list-chunk int   Size of listing chunk (response list for each ListObject S3 request). (default 1000)

Increasing this beyond 1000 won't work on AWS S3, but it will work on CEPH - I think the CEPH limit is more like 10,000.

Note that for syncing s3->s3, using --checksum is the best mode for rclone sync or rclone copy.
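For example, something like this (the 10,000 chunk size is only a guess at what your Ceph build accepts, and dst: is a placeholder for the second cluster):

  rclone sync src:hyt_src_bucket2 dst:hyt_dst_bucket2 --checksum --s3-list-chunk 10000 -P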

Thanks, I'll give it a try.

@ncw I've made a simple test using sync, and it looks like the performance is outstanding.

I ran rclone on the same machine as cluster A and synced data to cluster B via a bonded network card (full duplex, 20Gb/s); the transfer speed reached 800MB/s!

But I don't understand why there is both recv traffic and send traffic at the same time. I thought rclone starts by listing all objects in cluster A into memory, compares them using only the metadata in cluster B, and then makes the transfer if necessary. So shouldn't there be only outgoing traffic? What am I missing? Would you mind explaining a bit here? Many thanks.

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   1  98   0   0   0|2706k 3756k|   0     0 |   0     0 |  15k   18k
  9   5  84   0   0   1|   0     0 | 756M  750M|   0     0 | 126k   87k
  9   4  85   0   0   1|   0    68k| 795M  811M|   0     0 | 112k   87k
  7   4  89   0   0   1|   0     0 | 870M  858M|   0     0 | 108k   81k
  8   4  87   0   0   1|   0     0 | 825M  828M|   0     0 | 108k   80k

Is this something to do with the checkers?

Maybe rclone needs to read it again to do an integrity check after uploading it.

I found the reason. Ignore me.

Rclone lists both the clusters, so there will be incoming data from both in that phase.

When it starts transferring, the data will be read from one and written to the other.
