Hi,
I'm in a business where we need to copy large amounts of AWS S3 data within the same region - so obviously we're using server-side copying.
We were using s3s3mirror for this (GitHub: cobbzilla/s3s3mirror - mirror one S3 bucket to another S3 bucket, or to/from the local filesystem). It's old Java code (using AWS SDK v1), pretty ugly, with some really scary bugs - it can even remove your source data in some edge cases.
So we had to patch it extensively to make it useful.
But boy, it is FAST. On a pretty large set of keys (tens of thousands, a few TB in total), it seems to be around 10 times faster (or more) than rclone.
We would prefer to use rclone, as it's more universal and I have more trust in it not deleting random keys.
So I'm wondering whether it's possible to get similar performance out of rclone.
From what I see, s3s3mirror is super aggressive in making AWS API requests. It uses 100 threads (by default) to make requests. In an example run it ended up making over 6K requests per minute in total, including over 2K copy requests per minute.
In comparison, with rclone I see about 1 copy operation per second.
Our command line is something like: rclone copy --s3-acl '' --fast-list --checksum --s3-upload-cutoff 50m --s3-chunk-size 16m --s3-upload-concurrency 8 source target
Any thoughts on this? Would it be possible for rclone to make copy requests faster? Is '--s3-upload-concurrency' also used for the threads issuing S3 copy requests?
RAM is not my problem.
I just made a simple test, copying 5K S3 keys from one bucket to another, empty one (so no comparisons are required).
With s3s3mirror and the default number of threads (100) it took 2 minutes 59 seconds, with CPU usage on my MacBook reported as 30%.
With rclone it took 52 minutes 13 seconds - over 17 times more.
Do I understand correctly that listing keys in rclone is performed in a single thread?
And to clarify - I'm not asking here for a solution to the "business" problem. That solution seems to be "just use s3s3mirror". I'm just wondering if there's anything I missed in the configuration to improve the performance of server-side copying, as obviously there's pretty large room for improvement.
OK, it seems it can be improved.
Running the copy with --checkers=100, --transfers=100, --s3-no-head and --fast-list, I was able to get down to 2 minutes 9 seconds, so almost 25 times faster - and a little faster than s3s3mirror.
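For reference, the full command was roughly the following (source and target stand in for the actual S3 remotes, and --checksum is carried over from the earlier invocation):

rclone copy source target --checkers=100 --transfers=100 --s3-no-head --fast-list --checksum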
The documentation about checkers/transfers is very conservative - should we mention that it may make sense to use significantly higher values?
You can probably add --use-server-modtime --update to save some HEAD requests.
Try it with and without --fast-list, as it can be quicker without it when you have lots of checkers, depending on your directory layout.
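As a sketch (source and target are placeholders for the real remotes, and the checker/transfer counts are the ones from your test), that would be something like:

rclone copy source target --update --use-server-modtime --checkers=100 --transfers=100

and then the same command again with --fast-list added, to see which is quicker for your layout.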
The default transfers and checkers are conservative. However, not all backends can cope with such large values - if you try that on Google Drive you'll get rate limited into next year!
A specific bit in the S3 docs saying to try higher values is a good idea.
We should probably have a bit about how to tune the performance. I usually recommend doubling transfers until you stop seeing an improvement, or until you max out your network, RAM or CPU.
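A minimal sketch of that tuning loop, assuming source and target are your configured remotes (each run needs a fresh, empty target for a fair comparison, otherwise later runs will find everything already copied):

for t in 4 8 16 32 64 128; do
  echo "testing --transfers=$t"
  time rclone copy source target --transfers=$t --checkers=$t --fast-list
done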
I have reports of rclone filling 40Gbit/s network pipes on the right machine!
Thanks for the tips! I tried without --fast-list but the time was more or less the same. Using --use-server-modtime --update instead of --checksum also didn't make a difference in my test.
I could create a PR for the docs - any suggestions on where to put this information?