First and foremost, thanks for the excellent tool; it's fantastic!
We have an in-house data centre serving all our applications and data services (everything on Kubernetes), and we rely on kafka-connect-s3 to export our Kafka cluster's topics to a Minio object store as a cold backup for Kafka (similar to https://jobs.zalando.com/en/tech/blog/backing-up-kafka-zookeeper/). The Kafka cluster is pretty big and some of the topics have 10M+ records, all of which get exported to our Minio cluster. That setup works fine without any issues.
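For context, the export side is just a standard kafka-connect-s3 sink pointed at Minio through its S3-compatible endpoint. The sketch below is only illustrative; the connector name, topics, bucket, endpoint and flush size are placeholders rather than our actual config:

    # Register an S3 sink connector via the Kafka Connect REST API;
    # store.url points the connector at the Minio S3-compatible endpoint.
    curl -X POST http://kafka-connect:8083/connectors \
      -H 'Content-Type: application/json' \
      -d '{
        "name": "minio-backup-sink",
        "config": {
          "connector.class": "io.confluent.connect.s3.S3SinkConnector",
          "topics": "orders,payments",
          "s3.bucket.name": "kafka-backup",
          "s3.region": "us-east-1",
          "store.url": "http://minio.backup.svc:9000",
          "storage.class": "io.confluent.connect.s3.storage.S3Storage",
          "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
          "flush.size": "1000"
        }
      }'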
I'm now trying to mirror the Minio buckets to another Minio cluster running in a different data centre. I initially attempted this with minio mirror, which didn't work and spiked the system load, so I replaced it with rclone sync. This setup works for all buckets except the ones containing millions of records; those don't sync a single file even after a few hours of running.
I am using the options below for all the buckets, which works well for the small ones, and I have tried plenty of rclone command combinations to make the big ones (with around 10 million records) work, but no success yet.
Are the 10 million files in one directory or are they in a folder structure? If they are in a folder structure about how many files per folder are there?
Rclone works pretty well for big syncs, but its weak point is millions of files in a single directory. The sync currently works by loading the whole directory listing into memory, which for 10 million objects will take quite a while and use lots of memory!
Here are some general tips for s3/minio (a combined example command follows them):
Use --size-only or --checksum instead of the default mod time check. The default reads metadata from each object, which doesn't scale well. Since you are copying minio -> minio, both sides will have checksums, so I'd recommend --checksum.
Using --fast-list can give a big speed-up; however, it requires enough memory to hold the metadata for all the files.
How big are the files you are copying? You might want to raise --s3-upload-cutoff so that they are all copied in a single transaction. This will ensure they have an md5sum and is likely to be more efficient for medium size files (say < 1 GB)
If you are uploading large files, then tweaking --s3-upload-concurrency and --s3-chunk-size can make a difference, at the cost of using more memory.
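Putting those together, something like this would be a reasonable starting point (the remote and bucket names are just placeholders):

    # Compare by MD5 checksum rather than size+modtime (avoids a metadata
    # read per object), build the listing with --fast-list (needs enough
    # RAM to hold ~10M entries), and raise --s3-upload-cutoff so files
    # below that size are single-part uploads and keep a real md5sum.
    rclone sync src-minio:big-bucket dst-minio:big-bucket \
      --checksum \
      --fast-list \
      --s3-upload-cutoff 1G \
      --progress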
Do existing files get updated? Or is it write once, then delete? There is a workflow which will work for that...
Many thanks for the detailed response! I will play around with the options you suggested (--checksum, --s3-upload-cutoff).
Are the 10 million files in one directory or are they in a folder structure? If they are in a folder structure about how many files per folder are there?
They are in a single Minio bucket (i.e., one directory).
How big are the files you are copying? You might want to raise --s3-upload-cutoff so that they are all copied in a single transaction. This will ensure they have an md5sum and is likely to be more efficient for medium size files (say < 1 GB)
If you are uploading large files, then tweaking --s3-upload-concurrency and --s3-chunk-size can make a difference, at the cost of using more memory.
The files are pretty small, ranging from 100 bytes to 1 MB max.
Do existing files get updated? Or is it write once, then delete? There is a workflow which will work for that...
These are Kafka messages, so they are written once and never get updated or modified. Are there any specific options that are well suited to this scenario? Thank you!