Optimizing copying a large number of small files to Swift


#1

Hello,

I have a use case where we need to upload a large number of small files from a local file system to a Swift remote, and I am trying to optimize the file transfer. These files are new and will not already exist at the destination, so I would like to disable as much checking of the remote as possible. Is there any way I can disable listing, modtime and checksum checking and just do a dumb copy? Given the number of requests needed for the files alone, I want to minimize all other requests to the object store.


#3

Try this beta with the --no-traverse flag

https://beta.rclone.org/branch/v1.45-003-g872b5e7f-no-traverse-beta/

That will just do a dumb copy as you put it!
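
As a rough illustration of what such an invocation might look like, here is a minimal Python sketch that assembles an rclone copy command with --no-traverse. The source path, the remote name `swift:container/prefix`, the helper name `build_copy_command`, and the transfer count are all hypothetical placeholders, not values from this thread.

```python
import shlex


def build_copy_command(source, dest, no_traverse=True, transfers=None):
    """Assemble an `rclone copy` command line as an argument list.

    Hypothetical helper for illustration only; it does not run rclone.
    """
    cmd = ["rclone", "copy", source, dest]
    if no_traverse:
        # Skip listing the destination before copying.
        cmd.append("--no-traverse")
    if transfers is not None:
        # Number of file transfers to run in parallel.
        cmd += ["--transfers", str(transfers)]
    return cmd


# Placeholder paths: adjust to your own source directory and remote.
cmd = build_copy_command("/data/small-files", "swift:container/prefix", transfers=16)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if the paths contain spaces.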


#4

Thanks Nick! I’ll try that out.


#5

@ncw I’ve been testing with the beta version, and while I can see that the GETs on subdirectories in containers stop with --no-traverse, I’m still seeing a HEAD after each PUT. Is there any way to stop it from doing that? I know you want to verify that the size, mtime, and checksum are correct, but on a PUT in Swift the md5 is returned in the 2xx response. Is there any way to use that instead?


#6

That is a good thought… I’ve adjusted the code to do that for single part uploads so it will use the hash in the response instead of doing another HEAD request.

Have a go with this and tell me what you think!

https://beta.rclone.org/branch/v1.45-012-g6e000d26-no-traverse-beta/ (uploaded in 15-30 mins)
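
The idea of verifying an upload from the PUT response can be sketched as follows. This is not rclone's actual Go code, only a simplified Python model. It assumes the MD5 of a single-part (non-segmented) object comes back in the `ETag` response header, and the function name `upload_verified` is hypothetical.

```python
import hashlib


def upload_verified(local_bytes, put_response_headers):
    """Return True if the MD5 in the PUT response matches the local data.

    Assumption: the store returns the object's MD5 in the ETag header
    for a single-part upload, so no follow-up HEAD is needed.
    """
    local_md5 = hashlib.md5(local_bytes).hexdigest()
    # HTTP header names are case-insensitive, so normalise the keys.
    headers = {k.lower(): v for k, v in put_response_headers.items()}
    # Some servers wrap the ETag in double quotes; strip them.
    remote_md5 = headers.get("etag", "").strip('"').lower()
    return remote_md5 == local_md5


data = b"example object body"
good_headers = {"ETag": hashlib.md5(data).hexdigest()}
print(upload_verified(data, good_headers))
```

For segmented (multi-part) uploads the ETag is not a plain MD5 of the whole object, which is consistent with the change above applying only to single part uploads.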


#7

Thanks @ncw I’ll check this out as well.

One other question along these lines: what exactly is going on with the “checkers” after a sync? Is it going back to the destination again to check and make sure it matches the source? Or is it just checking locally to see if anything changed while it was syncing?


#8

@ncw - I confirmed that the new code does only one PUT per file and doesn’t do a subsequent HEAD. This is a good improvement for dealing with very many small files.

I’m still not sure about what the checkers do. Is there a description of the overall algorithm rclone uses during sync? Generalized for any backend?


#9

Great! The code is now in the latest beta and will be released in v1.46.

No, there isn’t an overall description of how the sync works… There are descriptions of how the matching works. I should really write one!

What --checkers does is control how many parallel directory listings are running. It is also used as a general measure of concurrency.


#10

Thanks @ncw! Any idea when 1.46 will be available?

Also can you point me at the documentation describing the matching? I’m trying to understand what all is checked and what might trigger extra unnecessary steps when trying to upload millions of files from local file to Swift.


#11

Start of Feb is the plan.

If you check out the docs for --size-only and --checksum you will find some info about how each file is matched.

You want to use one of those to avoid modification time reads when doing sync or copy without --no-traverse.
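
To make the difference between the matching modes concrete, here is a deliberately simplified Python model of the decision. It is an assumption-laden sketch, not rclone's real logic: rclone also applies a modification-time tolerance and handles other cases, and the names `FileInfo` and `needs_transfer` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    size: int
    modtime: Optional[float] = None  # seconds since epoch
    md5: Optional[str] = None


def needs_transfer(src, dst, mode="default"):
    """Decide whether src must be copied over dst (simplified model).

    mode: "default" (size + modtime), "size-only", or "checksum".
    """
    if dst is None:
        return True  # nothing at the destination yet
    if src.size != dst.size:
        return True  # sizes differ in every mode
    if mode == "size-only":
        return False  # same size is enough; modtime is never read
    if mode == "checksum":
        return src.md5 != dst.md5  # compare hashes; modtime is never read
    # Default: sizes match, so compare modification times.
    return src.modtime != dst.modtime


src = FileInfo(size=10, modtime=2.0, md5="aa")
dst = FileInfo(size=10, modtime=1.0, md5="aa")
for mode in ("default", "size-only", "checksum"):
    print(mode, needs_transfer(src, dst, mode))
```

In this model only the default mode touches `modtime`, which is why the other two modes avoid the extra modification time reads on the remote.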