Huge memory usage when copying between S3-like services

What is the problem you are having with rclone?

I'm trying to copy 450k files from one S3-like service to another, but rclone's memory consumption is getting it killed.
After about 10min it is already consuming over 30GB of RAM, but has only checked about 60k files. My VPS has around 60GB of RAM.

I saw the response above and also tried it. It goes much slower, but with constant memory usage for about 1h; then memory starts to rise until the process gets killed. It seems that memory starts to rise when copying from a folder with about 40k files.

I found many posts in this forum complaining about this problem when dealing with many millions of files. That could indeed become a problem for my use case in the future, but is it expected to be a problem even for half a million files?

Maybe I'm missing something, but it should be possible to copy files with constant memory usage, no?

Run the command 'rclone version' and share the full output of the command.

rclone v1.60.0

  • os/version: arch (64 bit)
  • os/kernel: 5.18.12-arch1-1 (x86_64)
  • os/type: linux
  • os/arch: amd64
  • go/version: go1.19.2
  • go/linking: dynamic
  • go/tags: none

Which cloud storage system are you using? (eg Google Drive)

Contabo (Ceph) to IDrive E2.

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone copy -P contabo: e2:  -vv --log-file=/tmp/rclone.log

or

rclone copy -P --checkers=1 --transfers 1 contabo: e2:  -vv --log-file=/tmp/rclone.log

The rclone config contents with secrets removed.

[contabo]
type = s3
provider = Ceph
access_key_id = ...
secret_access_key = ...
endpoint = ...

[e2]
type = s3
provider = IDrive
access_key_id = ...
secret_access_key = ...
acl = private
bucket_acl = private
endpoint = ...

A log from the command with the -vv flag

Many lines ending with:
Size and modification time the same (differ by 0s, within tolerance 1ns)
and
Unchanged skipping
so it seems all ok here.

hello and welcome to the forum,

that is a small number of files, it should not be using so much memory.

not sure of the exact reason for the memory usage, but if the source and dest are both s3, you should consider using --checksum.
If the source and destination are both S3, this is the recommended flag to use for maximum efficiency.

have you tested --fast-list?

here i sync 1,000,000 files in 33 seconds.
https://forum.rclone.org/t/fastest-way-to-check-for-changes-in-2-5-million-files/25957/11
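for example, something along these lines might be worth a try (just a sketch, not tested against your remotes - adjust the flags to taste):

rclone copy contabo: e2: --checksum --fast-list -P -vv --log-file=/tmp/rclone.log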

Hi! Thanks! :smile:

I have now tested with --checksum, --fast-list and --use-mmap (found this last one in the docs just now), and a few combinations of them - still blowing up memory.

I'm curious, is there a reason for rclone to use more than constant memory when copying files?

It does use a consistent amount per object, so more objects == more memory. It's written up much better by ncw in a few of those links you've already shared, so I won't try to rehash it.

is this a one-time copy or something to be run on a schedule? there are tweaks we can try.

might test using --dry-run, that will simulate a copy without using memory for copying files.

and in the debug log, no errors, warnings, retries, etc.?

This sounds strange to me, that is more than 500KB of RAM per file. I would expect 2KB of RAM per file in the worst case. How did you determine these two numbers?

And to understand your data and possible ways to reduce memory usage:

What is the highest number of objects (files and folders) in a folder?
(How many lines would you see if you did an rclone lsf of the folder)

How long is the typical path and filename?

@asdffdsa

is this a one-time copy or something to be run on a schedule? there are tweaks we can try.

I started with a copy, but I plan to do a sync periodically.

might test using --dry-run, that will simulate a copy without using memory for copying files.

Memory usage keeps rising quite fast. At one point it gets stuck at Checks: 371615 / 371615, 100%, while memory keeps rising until :boom: .

and in the debug log, no errors, warnings, retries, etc.?

Nope...


@Ole

This sounds strange to me, that is more than 500KB of RAM per file. I would expect 2KB of RAM per file in the worst case. How did you determine these two numbers?

I checked rclone's RAM usage with Glances and top.
And rclone displayed the checked-files count in the terminal, because of -P.

What is the highest number of objects (files and folders) in a folder?

207k (checked using rclone ncdu)

How long is the typical path and filename?

Around 60 chars.


@Animosity022

It does use a consistent amount per object, so more objects == more memory. It's written up much better by ncw in a few of those links you've already shared, so I won't try to rehash it.

Sorry, I couldn't find the explanation. :cry:
I guess it does this to reduce remote requests, especially when doing a sync. But even then, it seems to me it could be done with minimal and constant memory usage. I guess I'll try to implement it in Python with Boto3 later; I'm already using that lib in my app anyway...

for that sync, to reduce the amount of time and api calls, might try something like
--max-age=24h

edit: --max-age has some potential concerns to know about.
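for example, a scheduled run could look something like this (just a sketch, assuming a daily schedule - read the filtering docs before relying on it):

rclone copy contabo: e2: --checksum --max-age=24h -P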


Can you share a screenshot of top taken shortly before the crash, when rclone is started exactly like this (without a log file):

rclone copy --dry-run -P --checkers=1 contabo: e2:

I guess it will crash with --dry-run too, otherwise please remove it (if you feel safe about it)

Sounds like you also counted objects in the subfolders of that folder.

I would like an estimate of the highest number you can find with a command like this:

rclone lsf yourS3:some/folder | wc -l

where yourS3: is either contabo: or e2: and some/folder is a folder, a bucket or the top level.

Is your folder structure very deep? How many levels?

This will be a bit tricky, because with 1 checker it takes over 1h to crash, but I'll try it later. I can say it uses lots of VIRT and RES; I didn't check SHR.

Nope, I do have all those files in a single folder. Is that bad? :sweat_smile:

rclone lsf contabo:bucket001/aaaa/bb/ccccccc | wc -l
207940

I replaced the names but kept the character lengths. Each file there is a number followed by .pdf: 100000.pdf, 100001.pdf...

Got it:

kswapd0 activity should be related, because it was only active near the crash. And, as you can see, that server has no swap.

rclone copy --dry-run -P --checkers=1 contabo: e2:
Transferred:              0 B / 0 B, -, 0 B/s, ETA -
Checks:             43448 / 43448, 100%
Elapsed time:     55m16.0s
zsh: killed

Perfect, thanks!

This doesn't match my expectation and I (currently) cannot explain why this happens.

Here is a quick calculation to let you understand why I am puzzled:

Let's say rclone is checking the folder with 208,000 pdf files, then it will need to collect an entire listing from both remotes before comparing them. Let's say that is 430,000 objects in total, to make calculations easy. (and perhaps only half of this if you haven't transferred anything yet)

There is nothing else being checked concurrently (because checkers=1) and there are no transfers ongoing (because of --dry-run). So the majority of the 43G of RAM used is being used by (something related to) the 430,000 objects.

That is, rclone uses 100KB of RAM per object, where we usually have a rule of thumb of 1-2KB.

I will now do some research and small scale tests to understand this better, and possibly call for more expertise.

One of the things I will investigate is --memprofile, but I haven't tried it before - feel free to try it.
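If you want to try it, here is roughly what I have in mind (a sketch only, I haven't verified this exact invocation; I also believe the profile is only written when rclone exits normally, so it may not capture much if the process is OOM-killed):

rclone copy --dry-run -P --checkers=1 contabo: e2: --memprofile /tmp/rclone-mem.prof

The resulting file can then be inspected with go tool pprof.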


This looks like a bug of some kind to me

The most useful thing you could do would be to run with --rc and follow these instructions to create a memory profile.

This will tell us what rclone is using the memory for.

A rule of thumb is 1k per object, so even loading 450k objects into memory, I would hope that wouldn't take more than 500M of memory.

A memory profile will tell us more.
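The gist of those instructions is roughly this (a sketch, assuming the default rc address and no rc authentication configured):

rclone copy --dry-run -P --checkers=1 contabo: e2: --rc
# then, in another terminal while rclone is running:
go tool pprof -text http://localhost:5572/debug/pprof/heap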

This one was made without checkers=1. I'm not sure why it's "only" using 5GB. I'm running one with checkers=1 now, and I'll try to compare the value shown in top with the one in the profile.

File: rclone
Build ID: 7d0be04dd2a65379b62ff8b8ff946b0d410aa99a
Type: inuse_space
Time: Oct 25, 2022 at 7:19am (-03)
Showing nodes accounting for 5271.18MB, 99.37% of 5304.66MB total
Dropped 143 nodes (cum <= 26.52MB)
      flat  flat%   sum%        cum   cum%
 3653.03MB 68.86% 68.86%  4938.16MB 93.09%  github.com/aws/aws-sdk-go/private/protocol/xml/xmlutil.XMLToStruct
 1099.55MB 20.73% 89.59%  1099.55MB 20.73%  github.com/aws/aws-sdk-go/private/protocol/xml/xmlutil.(*XMLNode).findNamespaces (inline)
  218.53MB  4.12% 93.71%   218.53MB  4.12%  github.com/rclone/rclone/backend/s3.(*Fs).newObjectWithInfo
     139MB  2.62% 96.33%   140.56MB  2.65%  encoding/xml.(*Decoder).name
  101.51MB  1.91% 98.25%   101.51MB  1.91%  path.Join
      27MB  0.51% 98.76%       27MB  0.51%  encoding/xml.CharData.Copy (inline)
   15.06MB  0.28% 99.04%   233.59MB  4.40%  github.com/rclone/rclone/backend/s3.(*Fs).listDir.func1
    9.50MB  0.18% 99.22%   158.58MB  2.99%  encoding/xml.(*Decoder).Token
       8MB  0.15% 99.37%   149.08MB  2.81%  encoding/xml.(*Decoder).rawToken
         0     0% 99.37%   140.56MB  2.65%  encoding/xml.(*Decoder).nsname
         0     0% 99.37%  4942.63MB 93.18%  github.com/aws/aws-sdk-go/aws/request.(*HandlerList).Run
         0     0% 99.37%  4942.11MB 93.17%  github.com/aws/aws-sdk-go/aws/request.(*Request).Send
         0     0% 99.37%  4941.61MB 93.16%  github.com/aws/aws-sdk-go/aws/request.(*Request).sendRequest
         0     0% 99.37%  4942.13MB 93.17%  github.com/aws/aws-sdk-go/private/protocol/restxml.Unmarshal
         0     0% 99.37%  4941.62MB 93.16%  github.com/aws/aws-sdk-go/private/protocol/xml/xmlutil.UnmarshalXML
         0     0% 99.37%  3708.46MB 69.91%  github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsV2WithContext
         0     0% 99.37%  1233.65MB 23.26%  github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsWithContext
         0     0% 99.37%  5279.20MB 99.52%  github.com/rclone/rclone/backend/s3.(*Fs).List
         0     0% 99.37%   218.53MB  4.12%  github.com/rclone/rclone/backend/s3.(*Fs).itemToDirEntry
         0     0% 99.37%  5279.20MB 99.52%  github.com/rclone/rclone/backend/s3.(*Fs).list
         0     0% 99.37%  4942.11MB 93.17%  github.com/rclone/rclone/backend/s3.(*Fs).list.func1
         0     0% 99.37%  5279.20MB 99.52%  github.com/rclone/rclone/backend/s3.(*Fs).listDir
         0     0% 99.37%  1233.65MB 23.26%  github.com/rclone/rclone/backend/s3.(*v1List).List
         0     0% 99.37%  3708.46MB 69.91%  github.com/rclone/rclone/backend/s3.(*v2List).List
         0     0% 99.37%  4942.11MB 93.17%  github.com/rclone/rclone/fs.pacerInvoker
         0     0% 99.37%  5279.20MB 99.52%  github.com/rclone/rclone/fs/list.DirSorted
         0     0% 99.37%  5279.20MB 99.52%  github.com/rclone/rclone/fs/march.(*March).makeListDir.func1
         0     0% 99.37%  1305.67MB 24.61%  github.com/rclone/rclone/fs/march.(*March).processJob.func1
         0     0% 99.37%  3973.53MB 74.91%  github.com/rclone/rclone/fs/march.(*March).processJob.func2
         0     0% 99.37%  4942.11MB 93.17%  github.com/rclone/rclone/lib/pacer.(*Pacer).Call
         0     0% 99.37%  4942.11MB 93.17%  github.com/rclone/rclone/lib/pacer.(*Pacer).call

I'm not an expert, but this looks sufficient - I suspect an issue in the AWS SDK used by rclone (or in the way rclone uses it).

Rough observation: rclone itself is using 300MB (as expected) and the AWS SDK is using an additional 4.6GB...

I guess this can be seen at a smaller scale too, now that we know what to look for.


Great! :smile:

Now one with checkers=1, from when top said rclone was using about 20GB. No idea why the profile says 10GB...

File: rclone
Build ID: 7d0be04dd2a65379b62ff8b8ff946b0d410aa99a
Type: inuse_space
Time: Oct 25, 2022 at 9:04am (-03)
Showing nodes accounting for 9894.58MB, 99.25% of 9969.22MB total
Dropped 164 nodes (cum <= 49.85MB)
      flat  flat%   sum%        cum   cum%
 6931.02MB 69.52% 69.52%  9300.21MB 93.29%  github.com/aws/aws-sdk-go/private/protocol/xml/xmlutil.XMLToStruct
 2040.09MB 20.46% 89.99%  2040.09MB 20.46%  github.com/aws/aws-sdk-go/private/protocol/xml/xmlutil.(*XMLNode).findNamespaces (inline)
  399.05MB  4.00% 93.99%   399.05MB  4.00%  github.com/rclone/rclone/backend/s3.(*Fs).newObjectWithInfo
     247MB  2.48% 96.47%   248.54MB  2.49%  encoding/xml.(*Decoder).name
  205.02MB  2.06% 98.53%   205.02MB  2.06%  path.Join
   42.89MB  0.43% 98.96%   441.95MB  4.43%  github.com/rclone/rclone/backend/s3.(*Fs).listDir.func1
   19.50MB   0.2% 99.15%   270.09MB  2.71%  encoding/xml.(*Decoder).rawToken
      10MB   0.1% 99.25%   280.09MB  2.81%  encoding/xml.(*Decoder).Token
         0     0% 99.25%   248.54MB  2.49%  encoding/xml.(*Decoder).nsname
         0     0% 99.25%  9306.68MB 93.35%  github.com/aws/aws-sdk-go/aws/request.(*HandlerList).Run
         0     0% 99.25%  9306.68MB 93.35%  github.com/aws/aws-sdk-go/aws/request.(*Request).Send
         0     0% 99.25%  9305.67MB 93.34%  github.com/aws/aws-sdk-go/aws/request.(*Request).sendRequest
         0     0% 99.25%  9305.67MB 93.34%  github.com/aws/aws-sdk-go/private/protocol/restxml.Unmarshal
         0     0% 99.25%  9305.67MB 93.34%  github.com/aws/aws-sdk-go/private/protocol/xml/xmlutil.UnmarshalXML
         0     0% 99.25%  8992.14MB 90.20%  github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsV2WithContext
         0     0% 99.25%   313.54MB  3.15%  github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsWithContext
         0     0% 99.25%  9953.14MB 99.84%  github.com/rclone/rclone/backend/s3.(*Fs).List
         0     0% 99.25%   399.05MB  4.00%  github.com/rclone/rclone/backend/s3.(*Fs).itemToDirEntry
         0     0% 99.25%  9953.64MB 99.84%  github.com/rclone/rclone/backend/s3.(*Fs).list
         0     0% 99.25%  9305.67MB 93.34%  github.com/rclone/rclone/backend/s3.(*Fs).list.func1
         0     0% 99.25%  9953.14MB 99.84%  github.com/rclone/rclone/backend/s3.(*Fs).listDir
         0     0% 99.25%   313.54MB  3.15%  github.com/rclone/rclone/backend/s3.(*v1List).List
         0     0% 99.25%  8992.14MB 90.20%  github.com/rclone/rclone/backend/s3.(*v2List).List
         0     0% 99.25%  9306.68MB 93.35%  github.com/rclone/rclone/fs.pacerInvoker
         0     0% 99.25%  9953.14MB 99.84%  github.com/rclone/rclone/fs/list.DirSorted
         0     0% 99.25%  9953.14MB 99.84%  github.com/rclone/rclone/fs/march.(*March).makeListDir.func1
         0     0% 99.25%   335.06MB  3.36%  github.com/rclone/rclone/fs/march.(*March).processJob.func1
         0     0% 99.25%  9618.08MB 96.48%  github.com/rclone/rclone/fs/march.(*March).processJob.func2
         0     0% 99.25%  9306.68MB 93.35%  github.com/rclone/rclone/lib/pacer.(*Pacer).Call
         0     0% 99.25%  9306.68MB 93.35%  github.com/rclone/rclone/lib/pacer.(*Pacer).call

That is very useful, thanks.

The difference between the top value and the amount of RAM Go thinks it is using is normal; it is down to memory fragmentation and memory which hasn't been released back to the OS yet.

This is direct from the AWS SDK.

I note you are using Ceph so I guess this could be some sort of compatibility issue?

Could you generate an svg from the memory profile and attach it? That has a lot more info in it.
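Something like this should produce it (a sketch, assuming the rc server is still running on the default port; the -svg output needs graphviz installed):

go tool pprof -svg http://localhost:5572/debug/pprof/heap > rclone-heap.svg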

Did you try any previous versions of rclone? It might be worth trying some older versions to see if they have the same problem - this will tell us whether it is a problem with a specific version of the SDK.

I looked through recent bugs in the SDK and I couldn't see any with memory issues.

You can also do this to see how much memory each object takes on average. Point it at a subdirectory that you know will finish.

$ rclone test memory s3:rclone
2022/10/25 15:41:53 NOTICE: 62 objects took 175248 bytes, 2826.6 bytes/object
2022/10/25 15:41:53 NOTICE: System memory changed from 35239176 to 35239176 bytes a change of 0 bytes
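In your case, pointing it at the big folder from the earlier lsf example, e.g. rclone test memory contabo:bucket001/aaaa/bb/ccccccc, should give a per-object figure for the problem data (I haven't run this against your setup, of course).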

Sorry, not as easy as I thought.

I tried with 1000 objects on AWS/Windows without seeing any trace of this, so I suggest you try the above ideas from Nick.

Hi! Sorry for the delay! I'm still waiting for confirmation from their support, but it seems IDrive E2 has a bug...
When listing files from that folder with over 80k files, it keeps listing the same files forever. That's why rclone size doesn't work either. The same problem happens with boto3. So I really think it's a problem in their API implementation, not in rclone.


Ah, interesting.

You may be able to work around this with

--s3-list-version

Or maybe --s3-list-chunk
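For example, something like this might dodge the listing loop (a sketch only, not tested against IDrive):

rclone copy contabo: e2: --s3-list-version 1 -P
# or, keeping v2 listing but with smaller pages:
rclone copy contabo: e2: --s3-list-chunk 500 -P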