How exactly does --files-from provide performance improvements?

I am using rclone v1.48.

Currently, I have a separate utility which generates a list of files to be transferred to the remote server, and I pass this FFL (Flat File List) to Rclone during the copy operation with the --files-from flag.
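For reference, the invocation looks roughly like this (the remote name and paths below are placeholders, not my real setup):

```
# files.txt is the FFL: one path per line, relative to the source root
rclone copy /data/source myremote:backup --files-from files.txt
```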

From the Rclone documentation, Rclone creates a Filter object from the files passed with the --files-from flag, and when using the copy command it uses this Filter object to filter files during the source walk.

My first question is: if I am passing a list of files to Rclone, why does Rclone perform the OS walk at all? Shouldn't Rclone use this FFL to directly locate the files on the source and start the transfer? Also, is it possible to skip the OS walk if an FFL is present? (The format of the FFL can be changed according to the requirement.)

The changelog for v1.45 says that Rclone does not scan directories if the --files-from flag is used, but that is not what I observe: a walk is performed, and the objects are loaded into the Source DirTree after filtering.

======

Rclone creates a DirTree for both Source and Destination. Say I have a source containing 5 million objects (both files and folders). All these objects are added to the DirTree and loaded into memory. I copied all of them to the Destination using the copy command. After a week, 100K objects changed. I ran the sync command and Rclone built a Source DirTree of 5 million objects and a Destination DirTree of 5 million objects. And if the --files-from flag is used, the files listed in the FFL are also loaded into memory as Filters, which is yet more overhead.
Overall, this results in massive memory usage during copy and sync commands when the source is huge or contains a large number of objects.

My second question is: is there any way to reduce memory consumption by working from files on disk instead of loading everything into memory? Or is there some other, more optimal way to reduce memory consumption?

Even though you know which files to transfer (the filtering part is done), rclone still needs to check the status of those files to know how they compare against the files on the remote - comparing size and mod-time, for example, to determine whether it should update them or not.

Yes, it won't scan the whole directory - but it still needs to scan the specific files it will be working on, because it needs those attributes to make choices, and those attributes could be in flux right up until the moment the files are accessed.

I know there exist optional flags for this:
--ignore-size
--ignore-times
--ignore-checksum
-I (capital I, as in "Ice cream"; I think this means ignore all attributes and just copy blind, so this is probably the best one for you to use)
https://rclone.org/docs/#i-ignore-times

If all of these are activated then rclone should have no reason to query any files - so hopefully, if the code is smart enough here, it will just skip the entire process and start copying immediately.
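Something like this, I'd imagine (the remote name and paths are just placeholders):

```
# Transfers every file in the FFL unconditionally:
# no size, mod-time, or checksum comparisons against the destination
rclone copy /data/source myremote:backup --files-from files.txt \
    --ignore-size --ignore-checksum -I
```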

I'm not an expert when it comes to mega-scale filtering operations. Memory is usually a non-issue for me when filtering, so there might be better ways to do this - but this is the first thing that comes to mind. I hope I'm not mis-teaching you - but you can at least consider these ideas worth trying and see how the memory usage looks.

I suspect there are options not even related to filtering or listing where we can save a lot.

There are many ways to tweak memory usage in general. I don't actually have any experience of how much memory is needed to filter THAT many files - but if you can share your remote configuration (please redact all sensitive info) and the sync command you use, then I will try to list all the memory-consumption-related options I think would be relevant (that I know of).

Also - it helps if you can tell me what your memory constraints are, so I have some rough idea of what we are trying to work within.

If you want rclone to omit the scan then use --no-traverse. That was added in v1.47.0:

Make --files-from traverse the destination unless --no-traverse is set (Nick Craig-Wood)
this fixes --files-from with Google drive and excessive API use in general.

If you want to use less memory then don't use --fast-list.

Using --files-from and --no-traverse should be effective too.
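For example (with placeholder paths and remote name):

```
# --no-traverse: check each file in files.txt against the destination directly
# instead of listing the destination; no --fast-list, so listings are
# streamed rather than held in memory all at once
rclone copy /data/source myremote:backup --files-from files.txt --no-traverse
```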


@ncw

If you want rclone to omit the scan then use --no-traverse. That was added in v1.47.0

The --no-traverse flag only omits the Destination scan and instead checks each of the files listed in the FFL against the Destination one by one. What I meant was to omit the Source scan when the --files-from flag is set.

Assume the Destination is empty and I have to transfer a Source containing more than 1 million files. Can Rclone copy all these files to the Destination, during the copy and sync commands and without setting the --no-traverse flag, without performing a scan on the Source? Since the Destination is empty, even if a Destination scan is performed it won't take any time.

@thestigma

Using the -I flag will only skip the filters, but the Source scan will still be performed. My objective is to eliminate the Source scan when using an FFL to pass the list of files to be transferred.

I don't think you will skip the filters, but you will skip the comparisons - maybe that is what you meant. Maybe it will still technically scan, but just not check anything from the metadata; that I can't really say.
I don't rightly know if it is possible to eliminate the source scan entirely. NCW or someone else with more knowledge on that will have to chime in. It seems like it should be possible, but I don't know if there is a flag that specifically does this.

But if the main problem here is a massive memory load, perhaps a workaround would be to break the sync up into multiple filtered segments. For example, run a sync that handles files starting with A-H and let it finish before tackling the next section. That should make rclone's in-memory transfer list much smaller. You could break it up into as many as 32 segments (base32) and use a for-loop to iterate through them to keep the script manageable (see the sketch below). How much this helps memory consumption during the scan depends on a lot of specific memory-handling details inside rclone, so I can't say for sure whether this would be a good solution - but it's something I would consider doing a quick test for.
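A rough sketch of what I mean (remote name and paths are placeholders; the base32 alphabet is A-Z plus 2-7, and you'd need to extend the list if your file names can start with other characters):

```
#!/bin/bash
# One sync pass per leading character of the file name, so each pass
# only holds a much smaller transfer list in memory.
for c in {A..Z} 2 3 4 5 6 7; do
    # "${c}*" matches files whose name starts with that character
    # (in any directory); --ignore-case picks up lower-case names too
    rclone sync /data/source myremote:backup --include "${c}*" --ignore-case
done
```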

There may also be many other ways to save a lot of memory aside from what is directly related to filtering, but I would need to see the config and the command flags you use to comment on this.

The way rclone currently works is that it needs the info about a source object before it can transfer it. That info can come from a listing or from a HEAD request. So currently you can either list the bucket (without --no-traverse) or do a HEAD request on each file (with --no-traverse).

Rclone needs to read something about the source objects so it will either have to list the bucket or HEAD the objects...

However, rclone will start transferring things immediately with sync, so you don't have to wait for the source scan to complete.
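If you want to see which of those rclone ends up doing for your setup, a verbose dry run is a cheap way to check (placeholder paths again):

```
# -vv logs each check/transfer decision; --dry-run makes no changes
rclone sync /data/source myremote:backup --files-from files.txt \
    --no-traverse -vv --dry-run
```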

It would be helpful if you could say what your source and destination backends are and how the 10M files are distributed (eg all in 1 directory, or spread out like a normal file system).

It might be that --files-from could be optimized better for your use case.

This is a bug which is now fixed in the latest beta - apologies for taking such a long time to track this down!

--files-from was scanning the entire directory structure if --no-traverse wasn't used.

So I think this should work a lot better now.
