I'm trying to copy more than a billion small files from one S3-compliant object storage to another. The total size of the files is around 150 TB. My idea is to copy the files one by one, because I must be able to follow the whole process.
I split the object listing of the source bucket into text files, each containing the attributes (name, size, modification date) of 1 million objects; I have 1194 of these files. I wrote a script in PowerShell (6.2.0) which reads such a file and starts each rclone copy process in a separate thread. In sequential mode I reach 2 objects/second; in parallel mode I reach at most 20-38 objects/second.
S3compliantStorage1: Cloudian
S3compliantStorage2: Pure Storage FlashBlade
The machine has a separate 10 Gbit/s connection to the storages, which are 2 hops away.
I made some tests:
| Object count | Parallel sessions | Throttling (Gbit/s) | Speed (objects/sec) |
|---|---|---|---|
| 300 | 90 | 1.074 | 28.5 |
| 10000 | 600 | 1.074 | 24 |
| 4000 | 600 | 5.369 | 33.29 |
| 4000 | 600 | 5.369 | 30.06 |
| 1000 | 600 | 0 | 23.51 |
| 1000 | 600 | 0 | 35.91 |
It seems that the number of parallel sessions has no effect on the speed (objects/second).
The speed calculation is simple: I take the time elapsed between starting to create the threads and the last thread finishing, and divide the object count by that time.
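As a rough illustration of that calculation (not the actual PowerShell script), the shell sketch below divides the object count by the elapsed time. The function name `objects_per_second` and the sample timestamps are made up for the example.

```shell
# objects_per_second COUNT START_EPOCH END_EPOCH
# Prints COUNT divided by the elapsed seconds, with two decimals.
# START_EPOCH: when thread creation began; END_EPOCH: when the last thread finished.
objects_per_second() {
    LC_ALL=C awk -v n="$1" -v start="$2" -v end="$3" \
        'BEGIN { printf "%.2f\n", n / (end - start) }'
}

# Hypothetical run: 1000 objects, started at t=0 s, last thread done at t=40 s
objects_per_second 1000 0 40   # prints 25.00
```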
Interestingly, if I download a big file from a different bucket on the same source storage to a RAM disk on the Windows machine, I reach 1.2 Gbit/s throughput. If I upload this object from the RAM disk to the target object storage, I reach 2.6 Gbit/s. If I copy directly from S3compliantStorage1 to S3compliantStorage2, I reach only 300 Mbit/s.
Do you have any idea how I can speed up the copy procedure? I would like to reach at least 300 objects/sec. The objects are 1-2 KB in size.
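To put those rates in perspective, here is a back-of-the-envelope estimate of the total copy time. It assumes roughly 1.194 billion objects (1194 lists of about 1 million entries each; the last list may be shorter) and a constant rate. The function name `days_to_copy` is made up for the example.

```shell
# days_to_copy TOTAL_OBJECTS OBJECTS_PER_SECOND
# Prints the estimated duration in days (86400 seconds per day), one decimal.
days_to_copy() {
    LC_ALL=C awk -v n="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", n / rate / 86400 }'
}

days_to_copy 1194000000 38    # best measured rate: prints 363.7
days_to_copy 1194000000 300   # target rate:        prints 46.1
```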
I guess you will see a speed improvement if you use --no-traverse, especially as the number of folders and files in the target grows. More info here: https://rclone.org/docs/#no-traverse
You may get even better results with --no-check-dest, but it comes with a higher risk, so it is only to be used if significantly better than --no-traverse and the downsides are acceptable, so please test and read the documentation carefully here: https://rclone.org/docs/#no-check-dest
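A minimal sketch of how these flags could be combined with one of your list files, so that a single rclone process handles a whole chunk instead of one process per object. It assumes the list has been reduced to one object path per line (your lists also carry size and date, which --files-from does not expect); the bucket names, the transfer counts and the `copy_chunk` helper are placeholders, not a tested recommendation.

```shell
# copy_chunk LIST_FILE
# Copies every path listed in LIST_FILE in one rclone run.
# RCLONE can be set to point at a specific rclone binary.
copy_chunk() {
    "${RCLONE:-rclone}" copy \
        S3compliantStorage1:sourcebucket S3compliantStorage2:targetbucket \
        --files-from "$1" \
        --no-traverse \
        --transfers 64 \
        --checkers 64
}

# Example (hypothetical file name):
#   copy_chunk names-0001.txt
# Swap --no-traverse for --no-check-dest only after reading the docs linked above.
```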
If the above isn't enough, then we can probably reduce the startup overhead by starting an rclone daemon with rclone rcd and then sending asynchronous copy commands using rclone rc sync/copy. I will defer the details until we know if it is needed.
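Just to give an idea of the shape of that approach (the details can still wait), here is a rough sketch. It assumes the daemon runs on the same machine with the default remote control address and without authentication; the `submit_copy` helper and the bucket names are placeholders.

```shell
# Start the daemon once, in a separate terminal or as a background service:
#   rclone rcd --rc-no-auth

# submit_copy SRC_FS DST_FS
# Queues an asynchronous copy on the running daemon and returns immediately
# with a job id, so no new rclone process has to do the transfer itself.
submit_copy() {
    "${RCLONE:-rclone}" rc sync/copy srcFs="$1" dstFs="$2" _async=true
}

# Example (hypothetical buckets):
#   submit_copy S3compliantStorage1:sourcebucket S3compliantStorage2:targetbucket
# The returned job id can then be polled with: rclone rc job/status jobid=<id>
```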
Is the copy from S3compliantStorage1 to S3compliantStorage2 comparable, that is the same big file?
What objects-per-second speeds do you get if you transfer a lot of small files in each of the three tests?
(Tip: You can use --ignore-times to make the commands repeatable)
I tried your suggestions, but I did not achieve any increase in speed.
"Is the copy from S3compliantStorage1 to S3compliantStorage2 comparable, that is the same big file?"
Yes, I downloaded the same file and tried to upload it to another storage.
I would like to run a small, tightly controlled test that can be reproduced with additional debug logging if needed.
First, I would like you to prepare a small test sandbox like this:
# Create an empty test folder with a copy of the big test file in S3compliantStorage1
rclone mkdir S3compliantStorage1:testbucket/testfolder1
rclone copyto S3compliantStorage1:yourBucket/yourFolder/yourBigFile S3compliantStorage1:testbucket/testfolder1/bigtestfile
# Create empty test folders on the RAM Disk (I call it R:) and S3compliantStorage2
rclone mkdir R:\testfolder1
rclone mkdir S3compliantStorage2:testbucket/testfolder1
Note: You may need to add --no-check-certificate to some of the commands.
Next, I would like to see the full output from executing these three commands (including the commands):
Because of a deadline I had to look for another solution. I moved to Linux (I installed the Windows Subsystem for Linux on that Windows machine) and it delivers the expected speed of 250-500 objects/second.
I think the cause may be some limit on the number of parallel TCP connections at the Windows level.