Migrate 150TB from one S3 to another S3

Hello,

I need to migrate from one on-prem S3 to another on-prem S3.

Rclone will be installed on a machine with CentOS 7.

The migration should synchronize the buckets; both object storages support server-side copy.

I need to migrate about 150TB, and I would like to understand whether rclone is effective even for such a large amount of data.

The idea is to replicate the structure of the source bucket exactly in the destination bucket, including empty prefixes (folders).

The buckets can contain both large and small files.

I wrote this command; in your opinion, is it complete? Do you have suggestions, such as custom or tuning parameters?

rclone sync source:/source-bucket destination:/destination-bucket -P -v --log-file /var/log/rclone/rclone-1.log --create-empty-src-dirs --s3-chunk-size 20M --s3-copy-cutoff 250M -s3-upload-concurrency 64

What are the hardware resources needed to run with the given parameters?

I have a 10-gigabit connection; how can I get an estimate of the time needed?

Thank you in advance

there are no special resources needed.

do a test run and then you will know

This is not what I'm looking for; you didn't answer any of my questions.

you asked "how do I have an estimate of time needed" and i think i answered that.
rclone calculates checksums for every file copied.
so your performance is much more then just a theoretical max network speed.
what is your cpu, how many cores, how many threads can it handle. how much ram do you have?
is the server dedicated just to rclone, are there other users,

you have a lot of data to move.
are the s3 files in use? are you able to saturate your connection without interfering?

i was suggesting what i do.
i often install rclone for different customers with different systems.
so i do about 10 test runs, tweaking the parameters and then i know for sure.

i would tweak --transfers and --checkers and add --progress to see the bandwidth used.
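
for example, a starting point might look like this - the --transfers 32 and --checkers 64 values here are just guesses to adjust during your test runs, the rest of the flags are taken from your command:

rclone sync source:/source-bucket destination:/destination-bucket -P -v --log-file /var/log/rclone/rclone-1.log --create-empty-src-dirs --transfers 32 --checkers 64 --s3-chunk-size 20M --s3-copy-cutoff 250M --s3-upload-concurrency 64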

Thanks for the --transfers and --checkers suggestion; progress is already added with -P.
Both infrastructures are clusters with a load balancer in front, so the load balancer will probably be my bottleneck.

Have you ever tried a data sync with buckets this big?

I would like to avoid any possible program limitation, like array size, number of objects, etc.

thank you

-s3-upload-concurrency should be --s3-upload-concurrency

many users have copied larger data sets than yours.
i have done a couple of 30+TB transfers.

rclone's limits depend on the flags you use and on the cpu and ram.
perhaps your computer cannot handle your settings, or perhaps it can handle much more.

you can get an estimate by reading this
https://rclone.org/s3/#multipart-uploads
" Multipart uploads will use --transfers * --s3-upload-concurrency * --s3-chunk-size extra memory. Single part uploads to not use extra memory."

It shouldn't be a problem. I've had reports of bioinformatics teams moving petabytes of data in a similar way with rclone.

One thing to ask is what is the size distribution of the files and how many are there?

Do you have directories with millions of files in them (without subdirectories)? This will cause rclone to use lots of memory.

Hello,

I got some more specifications.
Case 1) I have 19 buckets that in total contain 90TB and 500 million objects.
Case 2) I have 2 buckets that in total contain 50TB and 500 million objects.

Case 1)
Each bucket should contain about 26,315,789 objects.
The average object size should be 193.27 KB.

Case 2)
Each bucket should contain 250,000,000 objects.
The average object size should be 107.37 KB.

Given those specs, how would you change the command?

Actually, I don't know if all the files are directly in a single bucket or if they are organized in folders and subfolders.

How much RAM should I provide, in your opinion?

Thank you in advance.

OK, so lots of small objects...

I'd recommend using the --checksum flag - this will save transactions for an S3 to S3 copy.
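
For example, a minimal sketch with that flag added to your command (keeping your bucket names and log path, leaving out the tuning flags for brevity):

rclone sync source:/source-bucket destination:/destination-bucket --checksum --create-empty-src-dirs -P -v --log-file /var/log/rclone/rclone-1.log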

What the keys in the buckets look like is important. If the keys have / in them, simulating a directory hierarchy, rclone can copy them by loading one "directory" at a time.

However, if there is no directory structure, rclone will have to load them all into RAM at once. For 250M objects that will take lots of RAM! So much RAM that it might actually make copying with rclone impossible...

Assuming all the files aren't in one directory, rclone will hardly use any RAM - it will be using mostly network with a bit of CPU. A 4GB VM would be plenty, I'd say.

Will you be repeating the copy, so trying to keep the source and destination in sync?

Hello,

->However, if there is no directory structure, rclone will have to load them all into RAM at once. For 250M objects that will take lots of RAM! So much RAM that it might actually make copying with rclone impossible...

Any idea of possible RAM usage? 128GB, 256GB?

->Will you be repeating the copy, so trying to keep the source and destination in sync?

There are some applications working on the source bucket; we need to migrate the data and then start the applications on the destination bucket. After that we will remove the old infrastructure.
So yes, we will probably run the sync more than once, to keep everything in sync.

thank you

Between 250 and 500 bytes per object I'd guess, so for 250M objects that's roughly 64-128GB.

If you are doing repeated syncing then it will use more memory as it has to hold the source objects and the destination objects in memory to compare them, so twice what I said above.

However I think that you can probably use --no-traverse to stop rclone caching the destination objects.
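
A minimal sketch of what a repeated pass could look like with both of those flags (untested against your setup - worth checking that --no-traverse actually reduces memory use with sync before relying on it):

rclone sync source:/source-bucket destination:/destination-bucket --checksum --no-traverse --create-empty-src-dirs -P -v --log-file /var/log/rclone/rclone-1.log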
