Unable to migrate huge S3 bucket (500 million objects / around 100 TB)

What is the problem you are having with rclone?

I'm currently trying to migrate a quite huge bucket (around 500 million objects and 100 TB) to another bucket on another remote.

I tried a few combinations of options, but it fails with memory errors (rclone uses too much memory).

Lately, I've been trying the following command, but nothing happens:

rclone copy remote_src:bucket_source remote_tgt:bucket_target  --transfers=160 --checkers=16 --max-backlog=100000 --use-mmap --log-file test.log --size-only --retries 3 --checksum -vv

Once I start the command, I can see that rclone is using a lot of memory but not doing any copies:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
16243 cloud-u+  20   0   41.3g  39.9g  14064 S   0.7 63.6 510:49.10 rclone

Am I missing some specific parameters that could improve things, or at least help the copy along?

Run the command 'rclone version' and share the full output of the command.

Here is my rclone version

$rclone version
rclone v1.57.0
- os/version: centos 7.8.2003 (64 bit)
- os/kernel: 3.10.0-1127.8.2.el7.x86_64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.17.2
- go/linking: static
- go/tags: none

Which cloud storage system are you using? (eg Google Drive)

I'm using on-premises S3 (Scality) for both source and target.

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone copy remote_src:bucket_source remote_tgt:bucket_target  --transfers=160 --checkers=16 --max-backlog=100000 --use-mmap  --size-only --retries 3 --checksum -vv

The rclone config contents with secrets removed.

[remote_src]
type = s3
provider = Other
access_key_id = *************
secret_access_key = *************
endpoint = XXXX

[remote_tgt]
type = s3
provider = Other
access_key_id = *************
secret_access_key = *************
endpoint = XXXX

A log from the command with the -vv flag

The logs are not really verbose. Since launching the command, I've only received the startup lines and nothing since.

2022/04/27 07:58:08 DEBUG : rclone: Version "v1.52.2" starting with parameters ["/home/cloud-user/rclone-v1.52.2-linux-amd64/rclone" "copy" "remote_src:bucket_source" "remote_tgt:bucket_target" "--transfers=160" "--checkers=16" "--max-backlog=100000" "--use-mmap" "--size-only" "--retries" "3" "--checksum" "-vv" ]
2022/04/27 07:58:08 DEBUG : Using config file from "/home/cloud-user/.config/rclone/rclone.conf"
2022/04/27 07:59:08 INFO  :
Transferred:             0 / 0 Bytes, -, 0 Bytes/s, ETA -
Elapsed time:         0.0s

That's a really old version of rclone.

Can you try with the current version?

Install (rclone.org)

Sorry, my bad, I have two versions installed.

I put the wrong output. I'm using version 1.57

Here is the correct output:

2022/04/27 11:41:06 DEBUG : rclone: Version "v1.57.0" starting with parameters ["/home/cloud-user/rclone-v1.57.0-linux-amd64/rclone" "copy" "remote_src:bucket_source" "remote_tgt:bucket_target" "--transfers=160" "--checkers=16" "--max-backlog=100000" "--use-mmap" "--size-only" "--retries" "3" "--checksum" "-vv"]
2022/04/27 11:41:06 DEBUG : Creating backend with remote "remote_src:bucket_source"
2022/04/27 11:41:06 DEBUG : Using config file from "/home/cloud-user/.config/rclone/rclone.conf"
2022/04/27 11:41:06 DEBUG : Creating backend with remote "remote_tgt:bucket_target"
2022/04/27 11:42:06 INFO  :
Transferred:              0 B / 0 B, -, 0 B/s, ETA -
Elapsed time:       1m0.0s

What do the directories look like? Is it one big directory with 500m objects?

Why did you adjust the backlog to 100,000?

Yes, it's one big directory with 500 million objects.

I set the backlog to 100,000 because I thought it would reduce my memory consumption.

The default is 10,000, so you made it 10 times the default.

There have been a few posts about this before, as 500m objects in a single directory is going to be painful.

Folder with millions of files - Help and Support - rclone forum

I see, thank you for the reply.

I launched with the --dump bodies option and I can see the HTTP requests.

I don't really know how I will be able to migrate it then...

As I don't really care about the number of HTTP requests, is there a way to tell rclone to copy each file as soon as it sees it, even if it's slow?

Regards

You are definitely outside my area, as I don't use S3, and that's a massive number of files in a single spot.

Rclone does very well, but at some point the scale breaks and I'm not sure where. Based on other posts I've read, I feel it was around 'millions of items in a single directory', and you have 500 million in a single directory. I think it'll work, but it will take quite a bit of resources, and I'm not sure how long.

hi,

another way to reduce memory usage is to reduce --transfers.
https://rclone.org/s3/#multipart-uploads
"Multipart uploads will use --transfers * --s3-upload-concurrency * --s3-chunk-size extra memory"

and you might want to use multiple runs of rclone:

  1. get a list of source files using rclone lsf --files-only -R source > file.lst (lsf prints bare paths, which is what --files-from needs)
  2. split file.lst into multiple files, file01.lst, file02.lst, ... filexx.lst (see the split sketch below)
  3. rclone copy source dest --files-from=file01.lst
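
For step 2, GNU split can do the chunking; a rough sketch with placeholder names (the chunk size of 1,000,000 lines is just an example, and the generated names will look like file_00000.lst rather than file01.lst):

split -l 1000000 -d -a 5 --additional-suffix=.lst file.lst file_

Each resulting chunk then goes into its own rclone copy --files-from run, as in step 3.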

After checking more deeply, I think it's actually even worse.

In fact, the bucket contains almost 500 million folders, each containing one file. I thought they were files directly, but no.

I saw the solution of extracting the listing to a file, but I don't find it really efficient. If there is no other solution, I might use it.

Regards

Rclone will use something like 1k of RAM per object or folder in a directory, so those 500 million directories will take something like 500 GB of RAM...

What you can do is do a sync using a file list, so something like

rclone lsf --files-only -R remote_src:bucket_source > source-files

This should complete without using too much RAM.

Then you can use this to do the copy. Note that --no-traverse prevents the directory listing and --no-check-dest prevents rclone from checking whether the objects already exist in the destination, which will speed things up.

rclone copy --files-from source-files --no-traverse --no-check-dest remote_src:bucket_source remote_tgt:bucket_target

I think you'll need to split source-files up into chunks (say 100,000 lines each) so that this step doesn't use too much memory either, and run the copy once per chunk.
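
As a rough sketch (assuming GNU split; the chunk_ prefix is just a placeholder):

# split the listing into 100,000-line chunks: chunk_00000, chunk_00001, ...
split -l 100000 -d -a 5 source-files chunk_

# copy each chunk in turn
for f in chunk_*; do
    rclone copy --files-from "$f" --no-traverse --no-check-dest remote_src:bucket_source remote_tgt:bucket_target
done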

I'd like to make rclone deal with this more sensibly and I have ideas for doing that - maybe your company would like to sponsor me to add those features?


Hello,

Thank you for this reply. I think I will try the solution you mentioned.

For the sponsor part, unfortunately, I can't speak for my company. I will check if something can be done or not.

Great - let me know if you need more help

Thanks

I had to migrate a large number of objects (~250TB, ~1.8 M objects) from an S3 bucket to an S3 bucket in a different AWS account...

I used the AWS CLI (aws s3 sync ...) with max_concurrent_requests = 1000 and max_queue_size = 100000 to do the initial copy because it is very fast (~1.5 Gb/s vs ~15 Mb/s for rclone). It took ~3-4 days.
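
For reference, the AWS CLI side was roughly this (a sketch from memory; the bucket names are placeholders, and cross-account access still needs the usual bucket policy / credentials setup on top):

aws configure set default.s3.max_concurrent_requests 1000
aws configure set default.s3.max_queue_size 100000
aws s3 sync s3://source-bucket s3://target-bucket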

I'm sure I could speed up rclone a lot by careful tuning of parameters (concurrent transfers etc.), but, given the 100x speedup it was easier to just use the AWS tool to get the job done.

I'm now using rclone to sync the modification times (the AWS CLI doesn't do that) and update the last few objects, but it's really, really slow; it's taking about a week.
I emphasise:

  • I'm an rclone noob and I've not done any tuning other than --multi-thread-streams 10, so I'm sure rclone can go much faster; I just haven't bothered to try.
  • moving huge numbers of objects and/or huge amounts of data is always going to be very slow no matter how you do it.
    The best answer, unfortunately, is "Don't do it".
    Your company should restructure its data so it doesn't have to do these kinds of things.

One more warning: you are going to be doing at least 2 API requests for each object.
Have you considered how much money more than 1 billion AWS S3 API requests is going to cost?


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.