Hi,
I'm in a business where we need to copy large amounts of AWS S3 data within the same region - so obviously we're using server-side copying.
We were using s3s3mirror for this (GitHub: cobbzilla/s3s3mirror - mirror one S3 bucket to another S3 bucket, or to/from the local filesystem). It's old Java code (using AWS SDK v1), pretty ugly, with some really scary bugs - it can even remove your source data in some edge cases.
So we had to patch it extensively to make it useful.
But boy, it is FAST. On a pretty large set of keys (tens of thousands, a few TB in total), it seems to be around 10 times faster (or more) than rclone.
We would prefer to use rclone, as it's more universal and I have more trust in it not deleting random keys.
So I'm wondering whether it's possible to get similar performance out of rclone.
From what I see, s3s3mirror is super aggressive in making AWS API requests. It uses 100 threads (by default) to make requests. In an example run it ended up making over 6K requests per minute in total, including over 2K copy requests per minute.
In comparison, with rclone I see about 1 copy operation per second.
Our command line is something like: rclone copy --s3-acl '' --fast-list --checksum --s3-upload-cutoff 50m --s3-chunk-size 16m --s3-upload-concurrency 8 source target
Any thoughts on this? Would it be possible for rclone to make copy requests faster? Is '--s3-upload-concurrency' also used for the threads issuing S3 copy requests?
RAM is not my problem.
I just made a simple test, copying 5K S3 keys from one bucket to another, empty one (so no comparisons are required).
With s3s3mirror and the default number of threads (100) it took 2 minutes 59 seconds, with CPU usage on my MacBook reported as 30%.
With rclone it took 52 minutes 13 seconds - over 17 times more.
Do I understand correctly that listing keys in rclone is performed in a single thread?
And to clarify - I'm not asking here for a solution to the "business" problem. That solution seems to be "just use s3s3mirror". I'm just wondering if there's anything I missed in the configuration to improve the performance of server-side copying, as obviously there's pretty large room for improvement.
OK, it seems it can be improved.
Running the copy with --checkers=100, --transfers=100, --s3-no-head and --fast-list, I was able to get down to 2 minutes 9 seconds, so almost 25 times faster - and a little faster than s3s3mirror.
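For reference, the full command was roughly the following (source and target stand in for the actual S3 remotes, and --checksum is carried over from the earlier invocation):

rclone copy source target --checkers=100 --transfers=100 --s3-no-head --fast-list --checksum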
The documentation about checkers/transfers is very conservative - should we mention that it may make sense to use significantly higher values?
You can probably add --use-server-modtime --update to save some HEAD requests.
Try it with and without --fast-list, as it can be quicker without it when you have lots of checkers, depending on your directory layout.
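As a sketch (source and target are placeholders for the real remotes, and the checker/transfer counts are the ones from your test), that would be something like:

rclone copy source target --update --use-server-modtime --checkers=100 --transfers=100

and then the same command again with --fast-list added, to see which is quicker for your layout.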
The default transfers and checkers are conservative. However, not all backends can cope with such large values - if you try that on Google Drive you'll get rate limited into next year!
A specific bit in the S3 docs saying to try higher values is a good idea.
We should probably have a bit about how to tune the performance. I usually recommend doubling transfers until you stop seeing an improvement, or until you max out your network, RAM or CPU.
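A minimal sketch of that tuning loop, assuming source and target are your configured remotes (each run needs a fresh, empty target for a fair comparison, otherwise later runs will find everything already copied):

for t in 4 8 16 32 64 128; do
  echo "testing --transfers=$t"
  time rclone copy source target --transfers=$t --checkers=$t --fast-list
done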
I have reports of rclone filling 40Gbit/s network pipes on the right machine!
Thanks for the tips! I tried without --fast-list but the time was more or less the same. Using --use-server-modtime --update instead of --checksum also didn't make a difference in my test.
I could create a PR for the docs - any suggestions on where to put this information?