We host an internal Docker registry across 3 data centers; each DC's registry nodes connect to that DC's Ceph S3 storage.
We found that DC B and DC C are missing thousands of layers, so we want to copy them from DC A to B and C. The number of files and the total size are very large: close to 20 TB, or 1,180,813 files.
The reason I chose rclone over s3cmd is that rclone supports bucket-to-bucket copies between sites, whereas s3cmd doesn't unless you download first and upload again.
Questions:
1. We don't want to copy every file from source to destination, only the missing ones. Which is the better option, copy or sync?
2. In either case, what are the best flags to optimize the speed and reliability of the copy/sync, e.g. -v --log-file rclone.log --checkers=16 --transfers=16?
3. Can you share the right command with the right arguments to speed up the operation for large data sets?
Top tip first: on an S3 remote, reading the modification time takes an extra transaction, so using --checksum or --size-only will speed up a sync. I'd recommend one of those.
--fast-list may improve performance.
Setting --checkers and --transfers higher will use more network bandwidth and memory, and at some point it will become counterproductive. The defaults of 4 & 4 are quite conservative; I regularly use 64 & 64.
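Putting those flags together, a sketch of the sync (the remote names `dc-a:`/`dc-b:` and the bucket name `registry` are placeholders for illustration, not taken from the original):

```shell
# Hypothetical remote and bucket names (dc-a:, dc-b:, registry) -- adjust to your config.
# --size-only skips the extra per-object transaction for mod times; --fast-list reduces
# LIST calls at the cost of holding the listing in memory; 64/64 per the tip above.
rclone sync dc-a:registry dc-b:registry \
  --size-only \
  --fast-list \
  --checkers 64 \
  --transfers 64 \
  -v --log-file rclone.log
```

Since only missing files need transferring, `rclone copy` with the same flags would work just as well: both skip files that already match at the destination, and `copy` never deletes anything there.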
If you've got lots of really big files and you don't care about keeping the md5sum, use:

--s3-disable-checksum   Don't store MD5 checksum with object metadata
Increasing the chunk size will help with big files, at the cost of memory:

--s3-chunk-size SizeSuffix   Chunk size to use for uploading (default 5M)

Increase the upload concurrency if you have a small number of big files:

--s3-upload-concurrency int   Concurrency for multipart uploads (default 2)
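Combining the big-file flags with the earlier ones, an invocation might look like this (the remote/bucket names and the 64M / 8 values are illustrative assumptions to tune for your setup, not recommendations from the original):

```shell
# Hypothetical remotes and bucket; tune chunk size and concurrency to your memory budget.
# Each transfer can buffer roughly --s3-chunk-size * --s3-upload-concurrency of memory.
rclone sync dc-a:registry dc-c:registry \
  --size-only \
  --fast-list \
  --checkers 64 \
  --transfers 64 \
  --s3-disable-checksum \
  --s3-chunk-size 64M \
  --s3-upload-concurrency 8 \
  -v --log-file rclone.log
```

Note that --s3-disable-checksum only skips storing the MD5 in object metadata for multipart uploads; it doesn't affect the integrity checks rclone performs during the transfer itself.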