OOM with big buckets

glapouge · September 16, 2020, 6:01am

What is the problem you are having with rclone?

OOM after 2 hours of sync at 300 MiB/s

What is your rclone version (output from `rclone version` )

rclone 1.53.1

Which OS you are using and how many bits (eg Windows 7, 64 bit)

CentOS 7, 64 bit

Which cloud storage system are you using? (eg Google Drive)

Scality S3 to Scality S3

The command you were trying to run (eg `rclone copy /tmp remote:tmp` )

rclone copy PPROD1:bucket01 PPROD2:tfffffrrrffffffestggg999 --no-check-certificate=true --log-file /root/rclonetest.log --checkers 300 --transfers 300 -vv --progress --max-backlog 100000

The rclone config contents with secrets removed.

[PPROD1]
type = s3
provider = Other
env_auth = false
access_key_id = *****
secret_access_key = ******
endpoint = ****
acl = bucket-owner-full-control

[PPROD2]
type = s3
provider = Other
env_auth = false
access_key_id = ****
secret_access_key = ****
endpoint = *****
acl = bucket-owner-full-control


A high number of transfers and checkers was the only setting that gave us high transfer bandwidth.
The OOM occurs after 2 hours of transfer at 250 MiB/s.
The bucket holds 700 million objects of 512 KiB each.

How can we avoid the OOM and keep this transfer rate?

Animosity022 · September 16, 2020, 12:15pm

You have a huge amount of transfers and checkers.

How much memory does the system you are using have? You can try --use-mmap and see if that helps with memory, but if you have a huge number of objects, there is an open issue with directory caching that would need to be solved first.

ncw · September 16, 2020, 1:30pm

I guess you have millions of items in a single "directory" - is that the case?

Rclone loads them all into memory to sync them which is probably the cause of your problem.

I have a cunning plan to fix this but I haven't implemented it yet!

glapouge · September 16, 2020, 3:54pm

Yes, that is the case: we have 700 million objects in one folder.
Do you have a timeline for a fix?
Is it possible to commission the change through a commercial offer?

ncw · September 16, 2020, 10:40pm

What do the keys look like? Are they random? How long are they? Could you post some examples?

I have an idea on how to solve this but it would require the keys to be evenly distributed by prefix.

I could potentially do a commercial contract to implement this for you.

glapouge · September 17, 2020, 6:08am

Hello Nick,
I'm not sure I understand what the keys are, but here is an extract of the log file from a transfer of a test bucket with 1 million objects:

2020/09/17 08:03:13 INFO : blocks/0000afe4/48a3304237720d61000000000000065900000000-esrtay: Copied (new)
2020/09/17 08:03:13 DEBUG : blocks/0000b050/d28beb37fb543a7b000000000000065a00000000-md8wxt: MD5 = 386865483615bf8368be7a8f17bfb2d9 OK
2020/09/17 08:03:13 INFO : blocks/0000b050/d28beb37fb543a7b000000000000065a00000000-md8wxt: Copied (new)
2020/09/17 08:03:13 DEBUG : blocks/0000b092/bf761bb3537d3d51000000000000065700000000-i6o2tp: MD5 = 386865483615bf8368be7a8f17bfb2d9 OK
2020/09/17 08:03:13 INFO : blocks/0000b092/bf761bb3537d3d51000000000000065700000000-i6o2tp: Copied (new)
2020/09/17 08:03:13 DEBUG : blocks/0000962d/8ae178bb2672dd8c000000000000061e00000000-g8iyg: MD5 = 386865483615bf8368be7a8f17bfb2d9 OK
2020/09/17 08:03:13 INFO : blocks/0000962d/8ae178bb2672dd8c000000000000061e00000000-g8iyg: Copied (new)
2020/09/17 08:03:13 DEBUG : blocks/0000b098/bf761bb3518e3d94000000000000065700000000-erha3b: MD5 = 386865483615bf8368be7a8f17bfb2d9 OK
2020/09/17 08:03:13 INFO : blocks/0000b098/bf761bb3518e3d94000000000000065700000000-erha3b: Copied (new)
2020/09/17 08:03:13 DEBUG : blocks/0000b0a1/bf761bb3511d99d3000000000000065700000000-dkbbsa: MD5 = 386865483615bf8368be7a8f17bfb2d9 OK
2020/09/17 08:03:13 INFO : blocks/0000b0a1/bf761bb3511d99d3000000000000065700000000-dkbbsa: Copied (new)
2020/09/17 08:03:13 DEBUG : blocks/000099ca/8ae178bb26702b81000000000000061e00000000-g4pyg: MD5 = 386865483615bf8368be7a8f17bfb2d9 OK
2020/09/17 08:03:13 INFO : blocks/000099ca/8ae178bb26702b81000000000000061e00000000-g4pyg: Copied (new)
2020/09/17 08:03:13 DEBUG : blocks/0000b0b3/1efa7ea3d6bbb003000000000000062500000000-5v3m7j: MD5 = 386865483615bf8368be7a8f17bfb2d9 OK
2020/09/17 08:03:13 INFO : blocks/0000b0b3/1efa7ea3d6bbb003000000000000062500000000-5v3m7j: Copied (new)
2020/09/17 08:03:13 DEBUG : blocks/0000b062/d28beb37fb5b6928000000000000065a00000000-mdj0bj: MD5 = 386865483615bf8368be7a8f17bfb2d9 OK
2020/09/17 08:03:13 INFO : blocks/0000b062/d28beb37fb5b6928000000000000065a00000000-mdj0bj: Copied (new)

The job has been running for 8 hours; the rclone process now uses 97 GB of RAM and keeps growing.

The command I use:

rclone copy PPROD1:bucket01 PPROD2:trddddeeeeeeeedddddeal128 --no-check-certificate=true --log-file /root/rclonetest.log --checkers 128 --transfers 128 -vv --progress --use-mmap --stats-one-line --max-backlog 100000
7.102T / 7.150 TBytes, 99%, 165.187 MBytes/s, ETA 5m2s (xfr#15234427/15334616)
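
The log extract above already hints at how the keys are laid out. A minimal sketch, assuming the log format shown in this thread (lines ending in `: Copied (new)`) and keys without spaces, that counts copied keys per directory prefix. The function name `keys_per_dir` is hypothetical, and it only sees objects that appear in the log, not the whole bucket:

```shell
# keys_per_dir: read an rclone log on stdin and print "directory<TAB>count"
# for every key that was logged as "Copied (new)".
keys_per_dir() {
    # 1. keep only the key from lines like "... INFO : <key>: Copied (new)"
    # 2. drop the last path component to get the directory prefix
    # 3. count identical prefixes and print them as "prefix<TAB>count"
    sed -n 's/^.* INFO *: \(.*\): Copied (new)$/\1/p' |
        sed 's|/[^/]*$||' |
        sort | uniq -c |
        awk '{ print $2 "\t" $1 }'
}
```

It could be run as `keys_per_dir < /root/rclonetest.log`, using the log file path from the command above.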

glapouge · September 17, 2020, 8:29am

Is it possible to use a local disk drive instead of RAM?

ncw · September 17, 2020, 10:06am

Thanks

The key in S3 terminology is this bit blocks/0000afe4/48a3304237720d61000000000000065900000000-esrtay

How many files are in a typical directory? That is the limiting factor for rclone memory usage, not the total number of files. You can do something like this

rclone size PPROD1:bucket01/blocks/0000afe4

To find out.
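
To check every directory rather than a single one, the same command can be looped over the output of `rclone lsf --dirs-only`. This is only a sketch: the function name `objects_per_dir` is hypothetical, it assumes the `--json` output of `rclone size` contains a `"count"` field, and each `rclone size` call lists a whole directory, so it can be slow on large ones.

```shell
# objects_per_dir BASE: print "directory<TAB>object count" for each
# directory directly under BASE, e.g. PPROD1:bucket01/blocks
objects_per_dir() {
    local base="$1" dir count
    # lsf --dirs-only prints one directory per line with a trailing slash
    rclone lsf --dirs-only "$base" | while IFS= read -r dir; do
        dir="${dir%/}"
        # pull the number out of {"count":N,"bytes":M}; stdin is redirected
        # so the inner rclone cannot consume the directory list
        count=$(rclone size --json "$base/$dir" < /dev/null |
            sed -n 's/.*"count": *\([0-9][0-9]*\).*/\1/p')
        printf '%s\t%s\n' "$dir" "$count"
    done
}
```

For example: `objects_per_dir PPROD1:bucket01/blocks`.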

If there aren't too many, then I think your problem is that you are iterating too many large directories at once. So reduce --checkers from 128 to, say, --checkers 8. This will iterate 8 directories at a time instead of 128 and use 16 times less memory. I don't think it will slow things down.

glapouge · September 17, 2020, 12:07pm

This is the size of one folder.

[root@MIG01-001 ~]# rclone size PPROD:bucket01/blocks/0000afe4 --no-check-certificate=true -vv
2020/09/17 13:51:37 DEBUG : rclone: Version "v1.53.1" starting with parameters ["rclone" "size" "OBOS-FST-PPROD:fskpp1-bucket01/blocks/0000afe4" "--no-check-certificate=true" "-vv"]
2020/09/17 13:51:37 DEBUG : Using config file from "/root/.config/rclone/rclone.conf"
2020/09/17 13:51:37 DEBUG : Creating backend with remote "OBOS-FST-PPROD:fskpp1-bucket01/blocks/0000afe4"
Total objects: 1532114
Total size: 748.103 GBytes (803268984832 Bytes)
2020/09/17 14:02:21 DEBUG : 4 go routines active

ncw · September 17, 2020, 12:59pm

That is now many objects in the bucket. You need enough memory to hold that many rclone objects in memory at once which is probably approx 1.5GB. If you use --checkers 1 then rclone will only hold one directory in RAM at once and I think you should be able to sync OK.
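
The arithmetic behind that estimate can be sketched as follows. The figure of about 1000 bytes per object is an assumption inferred from "1.5 million objects is approx 1.5GB", and the model (one directory held in RAM per checker) is the one described in this thread; real usage will differ, so treat the results as orders of magnitude only. The function name is hypothetical.

```shell
# Assumed bytes of RAM per listed object (inferred from the thread, not measured).
BYTES_PER_OBJECT=1000

# estimate_listing_mb OBJECTS_PER_DIR CHECKERS
# Rough peak listing memory in MB: one directory in RAM per checker.
estimate_listing_mb() {
    local objects_per_dir="$1" checkers="$2"
    echo $(( objects_per_dir * checkers * BYTES_PER_OBJECT / 1000000 ))
}

estimate_listing_mb 1532114 128   # prints 196110 (about 196 GB)
estimate_listing_mb 1532114 8     # prints 12256  (about 12 GB)
estimate_listing_mb 1532114 1     # prints 1532   (about 1.5 GB)
```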

glapouge · September 17, 2020, 1:31pm

If I understand correctly, the number of checkers is the number of directories held in RAM at once?
So if I set 4 checkers, I have 4 folders in RAM at the same time?

glapouge · September 17, 2020, 1:41pm

Another question: my bucket hierarchy is
Bucket -- Folder 1
----------- Folder 2 - subFolder 1

Is it better to run 2 jobs, each like rclone copy location:bucket/folder2/ to location, instead of a single rclone copy location:bucket to location?

ncw · September 17, 2020, 2:16pm

That is correct.

I would just use 1 job - rclone will recurse through the directories and copy everything.
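
Putting the advice in this thread together, a single low-memory job might look like the sketch below. The wrapper name `copy_low_memory` is hypothetical, `--transfers 128` is simply carried over from the earlier command and is untuned, and whether this keeps the original transfer rate is not established in the thread.

```shell
# copy_low_memory SRC DST: hypothetical wrapper around the command used in this thread.
#   --checkers 1    list one directory at a time to limit memory
#   --transfers 128 carried over from the earlier command (an assumption, not tuned)
copy_low_memory() {
    local src="$1" dst="$2"
    rclone copy "$src" "$dst" \
        --checkers 1 \
        --transfers 128 \
        --use-mmap \
        --no-check-certificate=true \
        --log-file /root/rclonetest.log \
        -v --progress
}
```

It would be called with the source bucket and the destination bucket, for example `copy_low_memory PPROD1:bucket01 PPROD2:destination-bucket`, where the destination name is a placeholder.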

system · November 17, 2020, 10:16am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.