Slow egress on v1.70.0-beta

What is the problem you are having with rclone?

Previously, we had issues with our pods running out of memory (see this post). Since running the recommended beta, that issue is solved. However, since moving to the beta that writes listings to disk (instead of keeping them in memory), our egress has been significantly impacted. Before, it was consistently around 1 GiB/s; now we are lucky to reach 25 MiB/s. See the graphs below:

When we were still on rclone v1.69.2-beta.8581.84f11ae44.v1.69-stable:

Now, on rclone v1.70.0-beta.8730.9d55b2411:

Is this expected? I am aware this is a beta, but since the memory fix is not yet released in a stable version, I have to run this version.

Run the command 'rclone version' and share the full output of the command.

rclone v1.70.0-beta.8730.9d55b2411
- os/version: alpine 3.21.3 (64 bit)
- os/kernel: 6.8.0-60-generic (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.24.3
- go/linking: static
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

MinIO to Ceph RGW

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone sync source:"prod-bucket"/ target:"prod-bucket"/ --retries=3 --low-level-retries 10 --log-level=INFO --use-mmap --list-cutoff=1000000 --metadata --transfers=50 --checkers=8 --checksum --s3-use-multipart-etag=true --multi-thread-cutoff=256Mi --s3-chunk-size=5Mi

The rclone config contents with secrets removed.

2025/05/07 07:08:18 NOTICE: Config file "/.rclone.conf" not found - using defaults

We are using environment variables to set it, but it should basically look like this (an env-variable sketch follows the config below):

[minio]
type = s3
provider = minio
access_key_id = xxx
secret_access_key = xxx
endpoint = xxx
region = ""

[ceph]
type = s3
provider = Ceph
access_key_id = xxx
secret_access_key = xxx
endpoint = xxx
sse_customer_algorithm = xxx
sse_customer_key_base64 = xxx
sse_customer_key_md5 = xxx
region = ""

A log from the command with the -vv flag

It reports back 531.352 MiB/s now. However, I do not see that number reflected in our dashboards at all.

[2025-05-21 12:17:33 UTC] INFO: START rclone sync from https://s3.xxx.xxx.net/prod-bucket to https://objectstore.xxx.xxx/prod-bucket
[2025-05-21 12:17:33 UTC] INFO: Executing command: rclone sync source:"prod-bucket"/ target:"prod-bucket"/ --retries=3 --low-level-retries 10 --log-level=INFO --use-mmap --list-cutoff=1000000 --metadata --transfers=50 --checkers=8 --checksum --s3-use-multipart-etag=true --multi-thread-cutoff=256Mi --s3-chunk-size=5Mi
2025/05/21 12:17:33 NOTICE: Config file "/.rclone.conf" not found - using defaults
...
[setting defaults with env]
...
2025/05/21 12:18:33 INFO : 
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Checks: 0 / 0, -, Listed 938700	
Elapsed time: 1m0.0s

2025/05/21 12:18:37 NOTICE: S3 bucket prod-bucket: Switching to on disk sorting as more than 1000000 entries in one directory detected
2025/05/21 12:19:33 INFO : 
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Checks: 0 / 0, -, Listed 1829200	
Elapsed time: 2m0.0s
...
[lots of listing and ultimately transfering]
...
2025/05/28 12:12:33 INFO  : 
Transferred:   	    1.815 TiB / 1.815 TiB, 100%, 531.352 MiB/s, ETA 0s
Checks:          22758806 / 22758806, 100%, Listed 111755913
Transferred:       645494 / 645494, 100%
Elapsed time:  6d23h55m0.0s

I think something is up with the 531.352 MiB/s figure.

Assuming the total transfer is correct

1.815E12 / (6*24*3600 + 23*3600 + 55*60) / 1024 / 1024 ≈ 2.87 MiB/s, which is 23 Mbit/s, which agrees with your dashboard.
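
If you want to re-derive that number, here is the same arithmetic as a quick shell one-liner (using the rounded total of 1.815E12 bytes):

# 6d 23h 55m = 604500 s; prints ~2.86 MiB/s, i.e. the ~2.87 MiB/s above within rounding
awk 'BEGIN { secs = 6*24*3600 + 23*3600 + 55*60; printf "%.2f MiB/s\n", 1.815e12 / secs / 1024 / 1024 }'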

I think what is happening is that there are long periods with no files transferred - this causes the stats to decide nothing is being transferred, and it stops counting the transfer rate.

Anyway that is not your main issue.

You've transferred 645494 out of 22758806 files, so only about 2.8% of the files. I wonder if your previous run transferred more files?

Is the 6d23h55 run time for the sync more or less than with v1.69.2-beta.8581.84f11ae44.v1.69-stable ?

Could the files in the directory be arranged so the new ones will come at the end of the listing? Maybe they all have a timestamp in the name?

Actually, the dashboard axis is also in MiB/s. :wink:

It seems to be similar; listing takes a few days, 2-3 days IIRC. So it should have been transferring for about 4 days now - unless, of course, it runs into already-transferred objects (so they become checks).

I was thinking something similar. The previous run was based on the in-memory listing. If the listed objects are sorted in any way, the 'new' ones might indeed come later on. I will keep it running for a few more days and see what the impact is. On the other hand, I also did a run against an (empty target) test bucket with v1.70.0-beta.8730.9d55b2411 and the performance was similar.

Today is a bank holiday and tomorrow is a bridge day, but on Monday I will do a similar test (with an empty target bucket) with v1.69.2-beta.8581.84f11ae44.v1.69-stable, to rule out performance issues on the target side (which is an external storage provider).

I guess that is my concern - are we comparing like with like? The checks will take some time. Note that the new on-disk listing code will be doing a HEAD request for each object, whereas the old code wouldn't have been, since you were using --checksum.

At the moment we just serialize the name to disk in the sorting process, and to rebuild the object rclone uses internally we need to do a HEAD request. In the future I'd like to serialize the entire object to disk, which would get rid of the need for the HEAD request, but in order to get the feature out the door (and not have to add object serialization to every backend!) we decided to go with the simpler version. This may be the cause of the slowdown for you - slowing down the checking phase. You might want to increase --checkers - that will allow more of these HEAD requests to run in parallel. I'd set it to the same value as --transfers.
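
For example, a sketch of the adjusted invocation - everything kept from your command above, just with --checkers raised to match --transfers:

rclone sync source:"prod-bucket"/ target:"prod-bucket"/ --retries=3 --low-level-retries 10 --log-level=INFO --use-mmap --list-cutoff=1000000 --metadata --transfers=50 --checkers=50 --checksum --s3-use-multipart-etag=true --multi-thread-cutoff=256Mi --s3-chunk-size=5Mi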


So, after a few more days it seems to be transferring again. I will definitely keep this in mind for the other buckets, and update our documentation to note that this is expected behaviour.

The egress looks a lot better. I expected it to reach 1 GiB/s, but it seems to hover around 700-800 MiB/s at the moment. I will let it run for now; it would be great to have the complete bucket in our off-site copy.

Thanks for the feedback on this parameter. I will adjust it for the next run! Here's hoping that run will be check-only. :wink:

I hereby want to thank you and all the other rclone contributors for all the hard work! :slightly_smiling_face:

Thanks for the update @average-goat and let me know if increasing --checkers doesn't speed it up.
