Rclone sync: very high memory usage

What is the problem you are having with rclone?

We are running rclone to sync data from our OpenShift MinIO cluster to an external S3 Ceph RGW. The source bucket has ~91 million objects. We run rclone via cron and use Kueue to queue the jobs so that we stay within request rate limits. We are noticing that rclone uses an enormous amount of memory, which keeps steadily increasing and does not seem to be released back to the OS. We have set the resource limit to 200Gi, but it tries to use even more. At some point we simply stopped the job to prevent the node it runs on from crashing.

❯ kubectl top pods
NAME                                                      CPU(cores)   MEMORY(bytes)   
prod-bucket-2505011141-gp9pk                              6930m        204204Mi 

Just to note, most of the objects are stored in the root of the bucket. I am mentioning this because I read this.

Is moving the objects into subdirectories really the only way to prevent the excessive memory usage, or has there been any development in this area since then?
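
For a rough sense of scale, here is a back-of-the-envelope estimate; the ~1 KiB of in-memory metadata per listed object is only my assumption, not a measured figure:

# 91 million objects in one "directory", assuming ~1 KiB held in memory per listed object
echo "$(( 91000000 * 1024 / 1024 / 1024 / 1024 )) GiB"   # prints "86 GiB"

Even under that assumption the listing alone would be close to 90 GiB, so with --metadata and allocator overhead on top it seems plausible (though that is speculation on my part) that usage grows past our 200Gi limit.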

Run the command 'rclone version' and share the full output of the command.

rclone v1.69.2-beta.8581.84f11ae44.v1.69-stable
- os/version: alpine 3.21.3 (64 bit)
- os/kernel: 5.14.0-427.62.1.el9_4.x86_64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.24.2
- go/linking: static
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

Source: MinIO
Target: S3 Ceph RGW

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone sync   source:"prod-bucket"/   target:"prod-bucket"/   --retries=3   --low-level-retries 10   --log-level=INFO   --fast-list   --metadata   --transfers=50   --checkers=8   --checksum   --s3-use-multipart-etag=true   --multi-thread-cutoff=256Mi   --s3-chunk-size=5Mi

The rclone config contents with secrets removed.

2025/05/07 07:08:18 NOTICE: Config file "/.rclone.conf" not found - using defaults

We are using environment variables to set the config, but it should basically look like this:

[minio]
type = s3
provider = minio
access_key_id = xxx
secret_access_key = xxx
endpoint = xxx
region = ""

[ceph]
type = s3
provider = Ceph
access_key_id = xxx
secret_access_key = xxx
endpoint = xxx
sse_customer_algorithm = xxx
sse_customer_key_base64 = xxx
sse_customer_key_md5 = xxx
region = ""
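
For completeness, the environment-variable form follows rclone's RCLONE_CONFIG_<REMOTE>_<OPTION> convention. The remote names below (SOURCE/TARGET, matching the remote names used in the sync command) and the xxx values are placeholders rather than our real settings:

export RCLONE_CONFIG_SOURCE_TYPE=s3
export RCLONE_CONFIG_SOURCE_PROVIDER=Minio
export RCLONE_CONFIG_SOURCE_ACCESS_KEY_ID=xxx
export RCLONE_CONFIG_SOURCE_SECRET_ACCESS_KEY=xxx
export RCLONE_CONFIG_SOURCE_ENDPOINT=xxx

export RCLONE_CONFIG_TARGET_TYPE=s3
export RCLONE_CONFIG_TARGET_PROVIDER=Ceph
export RCLONE_CONFIG_TARGET_ACCESS_KEY_ID=xxx
export RCLONE_CONFIG_TARGET_SECRET_ACCESS_KEY=xxx
export RCLONE_CONFIG_TARGET_ENDPOINT=xxx
export RCLONE_CONFIG_TARGET_SSE_CUSTOMER_ALGORITHM=xxx
export RCLONE_CONFIG_TARGET_SSE_CUSTOMER_KEY_BASE64=xxx
export RCLONE_CONFIG_TARGET_SSE_CUSTOMER_KEY_MD5=xxx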

A log from the command with the -vv flag

A full listing will take several days, so this might not be very useful at the moment:

[2025-05-07 14:57:00 UTC] INFO: START rclone sync from https://s3.xxx.xxx.net/prod-bucket to https://objectstore.xxx.xxx/prod-bucket
[2025-05-07 14:57:00 UTC] INFO: Executing command: rclone sync   source:"prod-bucket"/   target:"prod-bucket"/   --retries=3   --low-level-retries 10   --log-level=DEBUG   --use-mmap   --metadata   --transfers=50   --checkers=8   --checksum   --s3-use-multipart-etag=true   --multi-thread-cutoff=256Mi   --s3-chunk-size=5Mi
2025/05/07 14:57:00 DEBUG : Configuration directory could not be created and will not be used: mkdir /config: permission denied
2025/05/07 14:57:00 DEBUG : rclone: Version "v1.69.2" starting with parameters ["rclone" "sync" "source:prod-bucket/" "target:prod-bucket/" "--retries=3" "--low-level-retries" "10" "--log-level=DEBUG" "--use-mmap" "--metadata" "--transfers=50" "--checkers=8" "--checksum" "--s3-use-multipart-etag=true" "--multi-thread-cutoff=256Mi" "--s3-chunk-size=5Mi"]
2025/05/07 14:57:00 DEBUG : Creating backend with remote "source:prod-bucket/"
2025/05/07 14:57:00 NOTICE: Config file "/.rclone.conf" not found - using defaults
...
[setting defaults with env]
...
2025/05/07 14:57:00 DEBUG : fs cache: renaming cache item "target:prod-bucket/" to be canonical "target{DaWQt}:prod-bucket"
2025/05/07 14:58:00 INFO  : 
Transferred:   	          0 B / 0 B, -, 0 B/s, ETA -
Elapsed time:       1m0.0s

2025/05/07 14:59:00 INFO  : 
Transferred:   	          0 B / 0 B, -, 0 B/s, ETA -
Elapsed time:       2m0.0s

It is a known problem, addressed in the latest beta (v1.70).

Here are more details:

It has already been merged into the main branch, so feel free to download the latest beta and try it.
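
For example, a couple of ways to pick up the beta, depending on how your job image is built (just a sketch, adapt to your setup):

# In a Kubernetes/cron job, pointing the container at the rolling beta image is usually simplest:
#   image: rclone/rclone:beta
# On a plain host install, rclone can update itself to the latest beta build:
rclone selfupdate --beta
rclone version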


Thanks, that seemed to work! Memory usage is much more stable now (tested with a test bucket of ~300K objects, though; I set --list-cutoff to 100K and the listing was written to disk instead of being kept in memory). Any indication of when this feature will make it out of beta into the 1.70 release?
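
For reference, the test command was roughly the following sketch (the test bucket name is a placeholder, and I have not re-checked whether --fast-list is still worth keeping alongside the new flag, so it is left out here):

rclone sync source:"test-bucket"/ target:"test-bucket"/ --retries=3 --low-level-retries 10 --log-level=INFO --metadata --transfers=50 --checkers=8 --checksum --s3-use-multipart-etag=true --multi-thread-cutoff=256Mi --s3-chunk-size=5Mi --list-cutoff=100000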

You can see such indications here:

It is already overdue, so it should be soon.

