Lots of Small Files

A Bit Of Background

I’ve been using Dropbox happily for several years, but in the last year or so the performance has made it less and less usable. I am a developer and I store all my projects within my Dropbox, which means I have a LOT of small files that must be synced (git files, small source files, etc.). Once Dropbox exceeds roughly 300,000 files it takes so long to index on boot that it is essentially unusable. Dropbox also doesn’t really provide a filtering mechanism, so things like log files are continually uploaded and indexed throughout the day even though I don’t really care about them.

Until now my workaround has been to zip old projects to reduce the file count, but I’m finding that to be too much of a headache, so I am looking for a new tool. rclone looks great as I like the encryption and the ability to choose my cloud. I don’t really care about real-time syncing, so losing that isn’t a huge deal. I am happy to have a background job that runs X times per day and syncs my local system to a cloud.

My Questions

  1. Does rclone have reasonable performance with hundreds of thousands of small files? I don’t have a huge amount of actual storage (my current Dropbox is 32GB and that could probably be pruned some), but I will likely always have lots of small files.
  2. Is there any specific cloud storage option that would perform best with lots of small files?
  3. What rclone options would you recommend to ensure a fast backup of lots of small files?

rclone lets you set the number of checkers and the number of transfer threads to accommodate lots of large files or lots of small files. For lots of small files, I think you’ll find it does reasonably well, especially if you turn up the number of checkers.
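For example, something like this (the remote name and path here are just placeholders, and the numbers are only a starting point to tune):

rclone copy ~/projects remote:projects --checkers 16 --transfers 8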

Some things worth considering:

  1. Do you need encrypted volumes? Currently encrypted volumes (crypt) won’t store an MD5 sum on Amazon Drive, leaving rclone only able to compare based on size, which will likely cause you problems. There are ways to work around this, and the developer is considering other workarounds like calculating the MD5 or SHA on the fly. If you will not be using crypt then this won’t be much of an issue.

Amazon, for example, doesn’t store modification times like we’d want. So a sync comparison will only be by file size (see #1) or by checksum, which I think you would need considering the types of files you are storing (XML, for example).

Since you have a small amount of data, even a checksum comparison shouldn’t take that long: ACD has the checksums pre-calculated, so all rclone needs to do is calculate the checksums on the local side. 32GB is pretty small.

  2. I only have experience with Amazon, so I will defer on this. I’m very happy with it, though. I can saturate my 150 Megabit ISP with it.

  3. You probably will want to run with --checksum --checkers X --transfers Y, adjusting X and Y per testing (see the example after this list). I personally use the defaults, but I have a mix of very large files and small files (160,000 total in 8TB). You’d probably do well by adding checkers so many threads can check small files quickly, in an attempt to use as much bandwidth as possible.
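As a concrete sketch of that (again, the remote name is a placeholder and the X/Y values are only examples to adjust by testing):

rclone sync ~/Dropbox remote:backup --checksum --checkers 16 --transfers 4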


Thanks for the feedback.

I like the idea of encryption but I think for now I’ll leave it alone to avoid complication. I didn’t have it under Dropbox and in general nothing is very sensitive. I’m liking the idea of being able to tweak these parameters to meet my needs.

Once I get the full dataset uploaded to S3 I’ll post the results on what works best for keeping the system in sync.

It will take a while to get the initial data up as I only have a 5 Mbps upload connection and I don’t want to saturate it, since I also need to work on this connection (so I am currently limiting it to 100 KBytes/sec). Once I go to bed I will restart it without the bandwidth restriction, which should allow me around 600-700 KBytes/sec.
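For reference, the throttling is just rclone’s --bwlimit option (the remote name below is a placeholder for however my S3 remote ends up being named), and newer versions apparently also accept a schedule so the limit can lift automatically overnight:

rclone copy ~/Dropbox s3remote:backup --bwlimit 100k

rclone copy ~/Dropbox s3remote:backup --bwlimit "08:00,100k 23:00,off"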

Dear all~

I am rclone copying 2TB of data to G Suite, and quite a lot of it is small files (the files are encrypted with cppcryptfs).

For large files the transfer can reach 24 MB/s (about 200 Mbit/s, although I am on a 500 Mbit/s LAN), but when it is small files, it is just … bytes.

I am quite new to rclone. Could anyone kindly point me to any change of parameters that may help?

Currently I just use

rclone copy google1 google2

Thanks. From the post above, I am going to read more on checkers/transfers.

Google Drive seems to limit uploads to about 2 files per second in my experience.

You can try increasing --transfers, which is what I’d normally suggest (for any other provider), but I don’t think it will make much difference with Google.

It may be worth making your own credentials - they will be faster than rclone’s probably. You can then apply for more quota.
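If you do want to try it (the remote names below are just the ones from your command, and the numbers are only a guess to experiment with), it would look something like:

rclone copy google1 google2 --transfers 8 --checkers 16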

Does rclone have a limit on the number of files (in a folder) during the sync operation? Does the memory requirement increase with the number of files during a sync operation, or is it done in batches?

I believe it is done in batches unless you use the --fast-list option which will then list directories more efficiently at the expense of more memory.

So is it OK that there is no limit on the number of files in the default case? Is the memory recycled across batches, with the batches processed serially?

What are the suggested options for good throughput when we have a very large number of very small files, for example 10 million files with an average size of 20 KB? Is there any limit to the number of transfer threads? Does increasing the number of transfer threads increase the memory requirement linearly?

It’s a balance. You will only get about 2-3 transfers per second, so lots of little files will be slower no matter what you do. There isn’t much you can do there, really. You can increase the checkers a little to make sure the checkers are running ahead of the transfers and keeping the transfers ‘busy’ within the 2-3 per second limit. I’d personally experiment with the default of 4 transfers and increase checkers to 6-8. I’d also use the newest beta as it has some default limits in place to slow things down a little and help with Google rate limiting.

If you have lots of little files in each directory, then --fast-list will probably help at the expense of memory.
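Putting that together for a tree of lots of small files like the one described above (the local path and remote name are just placeholders):

rclone copy /data remote:backup --transfers 4 --checkers 8 --fast-list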

If I abort the copy and restart, will it pick up from where it left off? I will increase the number of checkers as you suggested.

Yes. It’ll just recompare the source to the destination and start fixing the sync again.

So I should use sync instead of copy?

I can’t tell you that. sync will make the destination look like the source (including deleting remote files that don’t exist in the source). copy will only copy and replace files but not delete. That question depends on your use-case.
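To make the difference concrete (the path and remote name are placeholders):

rclone copy /local/projects remote:projects   # adds and updates files on the remote, never deletes there
rclone sync /local/projects remote:projects   # also deletes remote files that no longer exist locally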

Great. I will continue with copy. This is a first upload to Google buckets. I was worried that copy may not resume from where it left off, but based on your response it seems that it will, so I will continue with that. I will let you know how increasing the number of checkers helps the overall throughput.

As per the documentation (https://rclone.org/commands/rclone_copy/), the default number of checkers is 8.

I remembered it being checkers=4 and transfers=4. Defaults might have increased at some point or maybe I was just in error.

Is there a way to check the number of checker threads during the copy operation (perhaps as part of the verbose output)? If not, I will explicitly specify them as you suggested.

It’s the default, so there’s no need. Increasing them will only help keep the transfer queue full. You can increase them further, but you may start hitting Google rate limiting, which is counterproductive.
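If you want to be explicit anyway, and watch progress while the copy runs (the path, remote name, and stats interval below are just placeholders), something like this works:

rclone copy /data remote:backup --checkers 8 --transfers 4 -v --stats 1m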

Thanks for the info. I have a few parameters to play with.