Rclone regex in filter causes spurious directory catch-all filter

What is the problem you are having with rclone?

Adding any filter that includes a regex causes rclone to add a '+ ^.*$' rule to the top of its directory filter rules, so lsf and similar commands list a lot of directories I don't want to be looking at.

Run the command 'rclone version' and share the full output of the command.

mvernon@ms-be1071:~$ ./rclone version
rclone v1.59.0-beta.6081.bffe76dbf.mcv21_swift_bodge
- os/version: debian 11.3 (64 bit)
- os/kernel: 5.10.0-14-amd64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.18.1
- go/linking: dynamic
- go/tags: none

[The "swift bodge" refers to a change discussed in a previous forum thread, "Swift sync --checksum calls HEAD on every object so is very slow".]

Which cloud storage system are you using? (eg Google Drive)

Swift

The command you were trying to run (eg rclone copy /tmp remote:tmp)

./rclone --dump filters --filter-from rclone_filter_testing lsf --dirs-only eqiad: >testing_list_with_regex 2>/dev/null

The rclone config contents with secrets removed.

[eqiad]
type = swift
env_auth = false
user = REDACTED
key = SECRET
auth = http://ms-fe.svc.eqiad.wmnet/auth/v1.0

A log from the command with the -vv flag

Here's my filter file:

# Select the contents of a directory based on a regex
+ /wikipedia-commons-local-{{(public|deleted)\.[0-9a-z]{2}}}/**
# Just a regular glob
+ /global-*/**
# Bin everything else
- **

If I run my lsf, I get:

--- start filters ---
--- File filter rules ---
+ ^wikipedia-commons-local-((public|deleted)\.[0-9a-z]{2})/.*$
+ ^global-[^/]*/.*$
- (^|/).*$
--- Directory filter rules ---
+ ^.*$
+ ^wikipedia-commons-local-((public|deleted)\.[0-9a-z]{2})/.*$
+ ^global-[^/]*/.*/$
+ ^global-[^/]*/$
+ ^global-[^/]*/.*$
- (^|/).*$
--- end filters ---

...and the directory listing contains every directory in eqiad, because of the + ^.*$ rule at the top of the directory filter rules.

If I comment out the second line of the filter file (i.e. the one containing a regex), I get:

--- start filters ---
--- File filter rules ---
+ ^global-[^/]*/.*$
- (^|/).*$
--- Directory filter rules ---
+ ^global-[^/]*/.*/$
+ ^global-[^/]*/$
+ ^global-[^/]*/.*$
- (^|/).*$
--- end filters ---

and the output I want (i.e. just the directories starting global-).

As far as I can tell, adding any filter containing a regex causes rclone to add the + ^.*$ directory rule. Am I Doing It Wrong, or is this a bug?

It is because rclone doesn't analyse the regexp, so it can't tell which directories /wikipedia-commons-local-{{(public|deleted)\.[0-9a-z]{2}}}/** can and can't match files in. It gives up and includes all directories, just in case any of them contain files which match that regexp.
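To see the effect of that extra rule, here is a minimal Python sketch that replays the two sets of directory filter rules from the --dump filters output above against a few directory names. It is not rclone's code: it assumes rules are tried in order with the first match winning, and that directory names are tested with a trailing slash. The sample directory names (in particular other-container/) are made up for illustration.

```python
import re

# Directory filter rules copied from the two --dump filters outputs above.
# Each entry is (is_include_rule, pattern).
RULES_WITH_REGEX = [
    (True, r"^.*$"),
    (True, r"^wikipedia-commons-local-((public|deleted)\.[0-9a-z]{2})/.*$"),
    (True, r"^global-[^/]*/.*/$"),
    (True, r"^global-[^/]*/$"),
    (True, r"^global-[^/]*/.*$"),
    (False, r"(^|/).*$"),
]
RULES_GLOB_ONLY = [
    (True, r"^global-[^/]*/.*/$"),
    (True, r"^global-[^/]*/$"),
    (True, r"^global-[^/]*/.*$"),
    (False, r"(^|/).*$"),
]


def included(path, rules):
    """Assumed semantics: the first rule whose pattern matches decides."""
    for is_include, pattern in rules:
        if re.search(pattern, path):
            return is_include
    return True  # assumption: a path that matches no rule is included


# Hypothetical top-level directories, written with a trailing slash
# (assumption about how directory names are presented to the rules).
SAMPLE_DIRS = [
    "wikipedia-commons-local-public.aa/",
    "global-data/",
    "other-container/",
]

with_regex = [d for d in SAMPLE_DIRS if included(d, RULES_WITH_REGEX)]
glob_only = [d for d in SAMPLE_DIRS if included(d, RULES_GLOB_ONLY)]

print("with regex rule:", with_regex)
print("glob only:      ", glob_only)
```

Under those assumptions, the leading + ^.*$ rule lets every sample directory through, while the glob-only rule set keeps just global-data/.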

You could try rephrasing this as a glob:

/wikipedia-commons-local-{public,deleted}.[0-9a-z][0-9a-z]/**

I think that will work.
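If you want some reassurance before swapping a regex filter for a glob, a rough check like the following Python sketch may help. It compares the file rule rclone dumped for the regex filter with a hand-written regex for the glob. That second pattern is my own translation and an assumption about what rclone would generate (a literal dot, and {public,deleted} treated as an alternation); the real thing can be confirmed with --dump filters. The sample paths are invented.

```python
import re

# File rule as dumped by rclone for the original regex filter.
REGEX_RULE = r"^wikipedia-commons-local-((public|deleted)\.[0-9a-z]{2})/.*$"

# Hand-written translation of the suggested glob
#   /wikipedia-commons-local-{public,deleted}.[0-9a-z][0-9a-z]/**
# This is an assumption, not output taken from rclone.
GLOB_RULE = r"^wikipedia-commons-local-(public|deleted)\.[0-9a-z][0-9a-z]/.*$"

# Invented sample paths: some should match, some should not.
SAMPLES = [
    "wikipedia-commons-local-public.aa/x.jpg",
    "wikipedia-commons-local-deleted.z9/a/b.png",
    "wikipedia-commons-local-public.aaa/x.jpg",   # three characters after the dot
    "wikipedia-commons-local-thumb.aa/x.jpg",     # neither public nor deleted
    "wikipedia-commons-local-publicXaa/x.jpg",    # no literal dot
    "global-data/x.jpg",
]

# Collect any path on which the two patterns disagree.
mismatches = [
    p for p in SAMPLES
    if bool(re.search(REGEX_RULE, p)) != bool(re.search(GLOB_RULE, p))
]

for p in SAMPLES:
    print(p, bool(re.search(REGEX_RULE, p)), bool(re.search(GLOB_RULE, p)))
print("mismatches:", mismatches)
```

An empty mismatch list only shows the two patterns agree on these samples, so it is worth adding names from the real listing before relying on it.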


Ah, OK, yes, that makes sense. I'll have to see if all of my regex filters are amenable to re-writing as globs.
Would you take a doc update that notes this drawback of using regex for directory filtering?

A doc fix would be grand - thank you

Here's an attempt at a doc update - docs: note use of regexp filtering prevents directory optimisation by mcv21 · Pull Request #6221 · rclone/rclone · GitHub

[sorry for the lengthy round-trip time]

Looks very nice, thank you - I've merged it now :)

