Syncing a lot of small files from DO to AWS slows down over time

What is the problem you are having with rclone?

I'm trying to sync/copy a bucket with 2.2 TB of small files (images between 50 KB and 500 KB) from DigitalOcean Spaces to AWS S3.

The small number of bigger files (50 files of 3 to 5 GB) transferred fine. The smaller ones start transferring fast, but the speed decreases over time: it starts at around 50 MB/s and after 10 hours is down to about 300 KB/s. I'm running the command from an EC2 instance (t2.medium, 4 GB RAM, 2 CPUs, 30 GB disk with 22 GB used as swap) in AWS with Ubuntu 20.04.1.

I tried tweaking the command with some options but without success.

What is your rclone version (output from rclone version)

rclone v1.53.3

  • os/arch: linux/amd64
  • go version: go1.15.5

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-1029-aws x86_64)

Which cloud storage system are you using? (eg Google Drive)

DigitalOcean Spaces
AWS S3

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone --size-only --transfers 8 --checkers 48 --max-backlog 99999999 --retries 999999 --log-level INFO --log-file prodreceipts-"`date +"%Y-%m-%d-%H-%M-%S"`".log sync spaces:receipts AWSS3:receipts &

The rclone config contents with secrets removed.

[AWSS3]
type = s3
provider = AWS
env_auth = false
access_key_id =
secret_access_key =
region = us-east-1
acl = public-read

[spaces]
type = s3
provider = DigitalOcean
env_auth = false
access_key_id =
secret_access_key = 
endpoint = nyc3.digitaloceanspaces.com
acl = public-read

A log from the command with the -vv flag

2020/12/28 16:06:44 DEBUG : rclone: Version "v1.53.3" starting with parameters ["rclone" "--size-only" "--transfers" "8" "--checkers" "48" "--max-backlog" "99999999" "--retries" "999999" "--log-level" "DEBUG" "--log-file" "prodreceipts-2020-12-28-16-06-44.log" "sync" "spaces:vexpenses/prod/receipts" "AWSS3:vexpenses/prod/receipts"]
2020/12/28 16:06:44 DEBUG : Using config file from "/home/ubuntu/.config/rclone/rclone.conf"
2020/12/28 16:06:44 DEBUG : Creating backend with remote "spaces:vexpenses/prod/receipts"
2020/12/28 16:06:44 DEBUG : Creating backend with remote "AWSS3:vexpenses/prod/receipts"
2020/12/28 16:07:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:       1m0.3s

2020/12/28 16:08:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:       2m0.3s

2020/12/28 16:09:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:       3m0.3s

2020/12/28 16:10:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:       4m0.3s

2020/12/28 16:11:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:       5m0.3s

2020/12/28 16:12:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:       6m0.3s

2020/12/28 16:13:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:       7m0.3s

2020/12/28 16:14:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:       8m0.3s

2020/12/28 16:15:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:       9m0.3s

2020/12/28 16:16:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:      10m0.3s

2020/12/28 16:17:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:      11m0.3s

2020/12/28 16:18:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:      12m0.3s

2020/12/28 16:19:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:      13m0.3s

2020/12/28 16:20:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:      14m0.3s

2020/12/28 16:21:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:      15m0.3s

2020/12/28 16:22:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:      16m0.3s

2020/12/28 16:23:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:      17m0.4s

2020/12/28 16:24:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:      18m0.3s

2020/12/28 16:25:44 INFO  : 
Transferred:   	         0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:      19m0.3s

This is part of the logfile I was using to follow the process before: https://pastebin.com/qE2iCM4N

AWS monitoring of the EC2 instance running the command:
[screenshot of the instance metrics omitted]

If the small files are the problem then I'd say you are hitting a files-per-second limit.

You could try increasing --transfers more - that should help. Try 32 or 64.
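
For example, bumping just the transfers on your original command would look something like this (a sketch, not tested here; keep your other flags as they were):

rclone sync spaces:receipts AWSS3:receipts --size-only --transfers 32 --checkers 48 --log-level INFO --log-file prodreceipts.log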

You might also be hitting rate limits at DigitalOcean (see the --tpslimit sketch after the list below).

Spaces have the following request rate limits:

  • 750 requests (any operation) per IP address per second to all Spaces on an account
  • 150 PUTs, 150 DELETEs, 150 LISTs, and 240 other requests per second to any individual Space
  • 2 COPYs per 5 minutes on any individual object in a Space
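
One way to stay inside those per-Space caps is rclone's --tpslimit flag, which limits the number of HTTP transactions per second. A sketch (the value of 150 is just an illustration matching the PUT cap above; pick whatever fits your workload):

rclone sync spaces:receipts AWSS3:receipts --size-only --tpslimit 150 --transfers 32 --checkers 48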

How many files per second are you copying?

Aside: That last limit on COPY caused me to stop running the integration tests on DigitalOcean - they just didn't work any more!

If you have the memory then using --fast-list will speed up the initial listings and cost you fewer transactions at S3. Using --checksum will work as well as --size-only in terms of speed but will more reliably detect changes.

I don't think having a huge backlog is a great idea either.

Setting --retries to 999999 means that for each failure it is going to retry 999,999 times, which is going to slow things down enormously when you hit a failure.
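
Putting those suggestions together, a gentler version of your command might look roughly like this (an untested sketch; the backlog and retry values are illustrative, not recommendations from the docs):

rclone sync spaces:receipts AWSS3:receipts --checksum --fast-list --transfers 32 --checkers 48 --max-backlog 100000 --retries 3 --log-level INFO --log-file prodreceipts.log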

I tried growing the backlog in order to have all files listed, because in another folder the sync never finished. Since it identified a change each time, it seemed to begin the checking process again. (But that folder wasn't as big as this one.)

Roughly how many files/objects do you have at most in a directory?

With 99 million objects allowed in the backlog, it's going to take some time to gather all that data, and the overhead of working against it can't be fast either.

I was using the default limit. Now I'll try with --tpslimit 180

Around 8 million objects in this folder.

The default is no limiting, so this will slow things down to 180 transactions per second. That might be polite enough to fit within DO's rate limiting, though.

Looks like it's running well now.
I had to delete my EC2 instance and create a new one in order to get a new IP address, because I was receiving access denied errors from DigitalOcean. Then I used the command below, which took around 20 minutes to start transferring (building the backlog, I believe).

rclone --size-only --tpslimit 180 --fast-list --transfers 32 --checkers 48 --max-backlog 99999999 --retries 999999 --log-level INFO --log-file prodreceipts-"`date +"%Y-%m-%d-%H-%M-%S"`".log sync spaces:receipts AWSS3:receipts &

It's been running for 3 hours and seems okay.

Transferred:   	  107.479G / 1.272 TBytes, 8%, 11.514 MBytes/s, ETA 1d5h30m52s
Errors:                 3 (retrying may help)
Checks:           1060749 / 1060749, 100%
Transferred:       407541 / 7481889, 5%
Elapsed time:     3h5m0.4s

Is this transfer rate good?

That is a definite sign you got rate limited.

This appears to be transferring about 37 files per second so the ETA is probably underestimating slightly - I make it slightly more than 2 days. (The ETA is based on data volume which goes a bit wrong for lots of small files).
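
Rough back-of-the-envelope arithmetic behind that estimate, using the numbers from your stats output:

   407541 files / 11100 s (3h5m)            ≈ 37 files/s
   (7481889 - 407541) files / 37 files/s    ≈ 191000 s ≈ 2.2 days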

If DO seems happy with that rate (ie it doesn't rate limit you further) and you don't mind waiting 2 days then I'd stick with that :slight_smile:

The only thing is that the transfer speed keeps slowing down, though at a slower rate than before.
Now the average is 8.42 MB/s.

Maybe I'll restart the process if it keeps slowing down.

Transferred:   	  493.483G / 1.272 TBytes, 38%, 8.420 MBytes/s, ETA 1d3h19m25s
Errors:                 4 (retrying may help)
Checks:           1060749 / 1060749, 100%
Transferred:      2608636 / 7481888, 35%
Elapsed time:    17h6m0.4s

If you get rate limited and a single file fails, it will retry (based on your command) 999,999 times until it times out and moves on to the next, hence the slowdown once you get some rate limiting and a failure. The more files that fail, the more the retries go on and on.

But are those retries per file? I thought they applied to the entire queue.
I'll keep the default next time, though.

A retry happens when you get some kind of failure; it works at a slightly higher level than the low-level retries.

   --retries int                          Retry operations this many times if they fail (default 3)

So those 4 errors, I'd surmise, retried 999,999 times with an exponentially growing retry sleep timer:

 --retries-sleep duration               Interval between retrying operations if they fail, e.g 500ms, 60s, 5m. (0 to disable)

Which slows things down when you hit an error that does not recover.
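
For completeness, the retry-related knobs spelled out explicitly would look something like this (3 and 10 are the documented defaults for --retries and --low-level-retries; the 10s sleep is just an illustrative value, not something recommended in this thread):

rclone sync spaces:receipts AWSS3:receipts --retries 3 --retries-sleep 10s --low-level-retries 10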

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.