Efficiently list a few directories among many with rclone filtering

Thanks so much for this awesome tool. It made my life easier once again today!

I apologize in advance for posting to SO and only linking from here, but I'm used to the richer Markdown syntax supported there. Is there any policy against posting to SO rather than here? My question is:

hello and welcome to the forum,

fwiw, rclone supports markdown so i did a simple copy/paste for you. looks ok to me....


Running the following produces a quick listing of a directory with e.g. 10 objects:

$ rclone ls s3:bucket/dir1/
...

In an attempt to perform some other operation (besides listing) on the objects in this and one or two other directories, I'm using a filter-file.txt (for documentation, see Rclone Filtering):

+ /dir1/**
+ /dir2/**
- **

And then running:

rclone --dump filters -vv ls s3:bucket --filter-from filter-file.txt
...

Because /bucket/ has many other directories in it, however, this takes much longer.

It seems like the directory filter section would have some answers, but my assumption at the moment is that it can't help in this situation. My assumption: because the /bucket/ directory is a subdirectory of dir1, there will always be a directory filter to include /bucket/, and everything in it will have to be matched against the first two + filters. Is that correct? Is there a workaround anyone is aware of to get faster behavior like the original ls above (besides calling rclone twice)?

i could be confused, but dir1 is a subdir of /bucket/

--filter-from combines includes and excludes.
whereas --include-from is just includes.

Ah, great that Markdown is supported here; I just assumed based on a few posts that it wasn't. Yes, the quoted line should have "subdirectory" changed to "parent directory" or similar.

If I use --include-from with the following include-filter.txt example:

/dir1/**
/dir2/**

Isn't that effectively the same as the original example with --filter-from? From the docs:

If there is an --include or --include-from flag specified, rclone implies a - ** rule which it adds to the bottom of the internal rule list. Specifying a + rule with a --filter... flag does not imply that rule.

I just tried it and it's still "slow" in the sense of needing to go through all the other directories.

I'd also tried to use the following filter-only-include.txt as an argument to --filter-from, though I'm not sure offhand why it's so slow:

+ /767cfdd8-da56-4cf1-8f34-780935f833e8/**
+ /58cd5fdd-48c1-4e90-a74a-41865cb8892d/**

ok.

well, i am not an expert with filters but now we have your original post and some more details.
hopefully another forum member will stop by soon to comment......

Rclone should only list the top level of /bucket/, then recursively list /bucket/dir1/ and /bucket/dir2/.

How many entries does /bucket/ have? For it to be taking significantly longer, I'm guessing millions?

The rclone filters don't have that level of optimisation, so rclone can't avoid listing /bucket/, even though it is completely obvious to you and me that it doesn't need to.

I think you will have to run two rclone ls commands, one for each directory, if you wish to avoid listing the root.
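
Something along these lines (just a sketch, re-using the directory names from your example):

$ rclone ls s3:bucket/dir1/
$ rclone ls s3:bucket/dir2/

or, if there are more than a couple of directories, a small shell loop:

for dir in dir1 dir2; do
    rclone ls "s3:bucket/$dir/"
done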

OK, thank you! Yes, /bucket/ has millions of directories, each with perhaps 10-15 files.

I'd run into a similar situation with different data perhaps two months ago, so I asked about it this time. It's really not the end of the world to call rclone a few times or even many times in my current situation (calls to rclone backend to deglacierize). Two months ago I was trying to rclone copy TBs of data, so I wanted to get it down to one command to make it happen in 2-3 days rather than several times longer.
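
(For context, the deglacierize step is roughly one call like this per directory; the path and lifetime here are just placeholders:)

$ rclone backend restore s3:bucket/dir1/ -o priority=Standard -o lifetime=7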

I wonder if in the earlier situation it would have been better to call rclone ls many times first, to build a list for --files-from:

If the --no-traverse and --files-from flags are used together an rclone command does not traverse the remote. Instead it addresses each path/file named in the file individually. For each path/file name, that requires typically 1 API call. This can be efficient for a short --files-from list and a remote containing many files.
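
Something like this, maybe (an untested sketch: rclone lsf is used here since it prints bare paths, the destination path is made up, and the sed prefix is needed because --files-from paths are relative to the source root s3:bucket):

# build a list of object paths, relative to s3:bucket
$ rclone lsf -R --files-only s3:bucket/dir1/ | sed 's|^|dir1/|'  > files-from.txt
$ rclone lsf -R --files-only s3:bucket/dir2/ | sed 's|^|dir2/|' >> files-from.txt

# then operate on just those objects without traversing the bucket
$ rclone copy --no-traverse --files-from files-from.txt s3:bucket /local/dest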

I'll have to think about it more if I come back to the situation. Anyways, thanks again!

Yes, you want to avoid listing that at all costs!

Yes, --no-traverse and --files-from are very efficient on S3. They work less well on Google Drive though, so it does vary.
