Folder with millions of files

Hi everyone,

I’ve been trying to sync a bucket from S3 to Wasabi with no success. There are some folders with a few hundred files (images, mostly) which get synced successfully, but one folder in particular, with about 6 million images, does not sync a single file even after running for hours.

I feel I must be doing something wrong or not using all the recommended parameters for that kind of situation. Can anyone please give me a hint or point me to the right direction?

I have tried a lot of commands - from the classic “rclone sync s3:bucket wasabi:bucket” to a more desperate and I-have-no-idea-what-I-am-doing “rclone sync s3:bucket wasabi:bucket --verbose --cache-chunk-no-memory --progress --fast-list --checksum” and a good deal of combinations/possibilities in between.

Thank you for your time. I’m really sorry if this isn’t working because of something I’m doing wrong.

6 million files in one directory? That is a lot! It should work, though, provided your computer has enough memory.

However, rclone fetches directory listings in 1,000-file chunks (an S3 API limitation) and has to fetch the entire directory listing before it starts the sync, so it will take ~6,000 HTTP transactions before that directory even finishes listing. rclone doesn’t list a directory in parallel (S3 doesn’t support that, alas), so you have to wait for those 6,000 transactions to happen one after another.
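
To put a very rough number on that (assuming, purely as a guess, ~200 ms per list call): 6,000 calls × 0.2 s ≈ 20 minutes of listing before the sync can even begin, and longer if each call is slower.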

You want to use --checksum or --size-only to avoid metadata reads on the individual objects.

I’m not sure --fast-list will really help here.

Setting --transfers higher will help once you get to transferring files - try --transfers 64.
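
Putting those suggestions together, something along these lines (just a sketch, using the remote and bucket names from your post):

rclone sync s3:bucket wasabi:bucket --checksum --transfers 64 --progress -v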

I think doing this but waiting longer is probably your best bet.


Where are you doing this transfer from? You want to get the latency of each HTTP list transaction as low as possible - maybe try running it from EC2?


One thing you could try is listing the bucket to a file. This will likely take a very long time, but you will see output immediately.

rclone lsf --files-only s3:bucket > file-list

Then use that file as the input to a copy:

rclone copy --checksum --files-from file-list s3:bucket wasabi:bucket

You will need to use the latest beta for this; it will then skip listing the directory and only transfer the files listed. It will still do a metadata read for each file, but that happens as part of the transfer.

That will take the same time or longer than the first approach, but does have the advantage that it is easy to restart.
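
For example (just one possible way, using standard tools - the 100,000-line chunk size is an arbitrary choice):

split -l 100000 file-list chunk-

Then run the rclone copy --files-from command once per chunk file, so if anything fails you only need to repeat the chunk that was in progress.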


Are you planning on doing this sync regularly? Or is it a one off?

Yeah, I’m inclined to write some code to move things around and update the database. If all those files were spread across a few hundred thousand folders, would it be easier and less resource-hungry to migrate/back up using your tool? (BTW, thank you very much for your work! Really useful code!)

I tried to wait, but since the network usage graph showed that no calls were being made, I assumed waiting longer would be useless. If it were an issue with insufficient memory, the command would always crash, right? (That happened before, so I upgraded the EC2 instance and added some swap.)

Yeah, I’m using EC2 for this.

I like this approach, but you said I should see some output immediately, right? I’ve been running it for more than an hour now and not a single line has been written to the file. The system is idle, as if nothing is happening.

This is a one-off.

Yes that would be quicker!

Hmm, no network calls is suspicious…

Hmm…

Can you try adding -vv --dump bodies to that command, i.e. rclone lsf --files-only s3:bucket -vv --dump bodies - it will print the HTTP transactions.

I wonder if it is S3 taking absolutely ages to list the directory… You’ll see immediately whether it is or not.

You could try adding --fast-list to the above. That won’t output files instantly, but you may (or may not!) see lots of HTTP transactions.
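
In other words, something like this (the same command with --fast-list added):

rclone lsf --files-only --fast-list s3:bucket -vv --dump bodies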