Question about rule order in --filter-from

Hello

I am horribly confused about the logic of rule-order in --filter-from files, despite having read the relevant rclone documentation. Perhaps someone could help me out.

The documentation says:

Each file is matched in order [what does that mean?], starting from the top [of the rules in the file?], against the rule in the list until it finds a match. The file is then included or excluded according to the rule type.

How does rclone proceed as it moves through the filter file (that is, the file containing the filter-from rules)? Does it start with a set of files to be included, where that set contains every file in the folder at issue (together with its subfolders)? If it does, then each rule in the filter file will expand or restrict that set. Except, hold on, it can't expand the set, because the set is already maximally large - unless, of course, one of the rules has already performed a restriction...

You see how confused I am. Add in the fact that rclone follows, but not exactly, rsync's own - seemingly underdocumented - syntax, and the whole thing becomes a horror.

Here's what I am trying to do. For some directory - call it dir - which is specified in the rclone command (as against in the filter file), I wish to:

(1) include all top-level items (files and directories) that start with . except (1a) some particular ones, let's say .bad1 and .bad2;
(2) include, recursively, everything matched by rule 1;
(3) exclude all else.

Another way to think of it might be 'top down': it will stop once it hits a match.

So you want to start with the most specific thing you want to match first.

I think something like:

+ **/.bad/**
+ **/.bad2/**
- **

So in that example it would match only directories named exactly '.bad' or '.bad2'.

If you can give a complete example, I'm sure we can figure out the syntax :slight_smile:

Thanks. (Note that in the example I gave I wish to exclude those two items with 'bad' in their name. That's why I chose those names.) However, I still do not understand. I think the trick is to find a description of the mechanics that is all of pithy, clear and accurate (though it need not be comprehensive).

Here is the code and filter file I am working with. I am running the backup at the moment but I am unsure whether it works. (Unsure, because I have ended up with an item in the backup that should not be there; yet another rclone job may have run - automatically - and put it there, even though my system is meant to be set up to run only one job at a time.)

Here is some bash that is part of a process of passing arguments to rclone.

sync_dots)
# Syncs, only, hidden (i.e. 'dot') files and directories
RL_ARG_PATH_SOURCE="$RL_PATH_JOB_SOURCE_DOTS"
RL_ARG_OPERATION='sync'
RL_ARG_OPT_VARS=(\
    	"--filter-from $RL_LSTS/dots_filter-from" \
    	'--max-depth=7' \
    	'--max-size 20M' \
)

Here is the filter file (simplified):

# EXCLUDE

- ./bash_history
- ./cups
- ./dbus
- ./gvfs/**
- ./keychain
- ./m2/repository
- ./Private
- ./recoll/**
- ./subversion
- ./Trash
- ./Trash

# INCLUDES (dot files and folders, all at the top level, and their contents)

+ /.*/**

# EXCLUDE ALL ELSE

- *

# EOF

Use the same filter with rclone ls that will tell you whether it is working or not without running a backup.

Comments on your filter from file

  • if you want to exclude a directory and its contents you need dir/**
  • if you want stuff to only happen at the root, then prefix with /
  • your rules starting with ./Trash will not be working - rclone will be looking for files called Trash in a directory called . (which it will never find). You either want Trash/** (to exclude a directory called Trash and its contents everywhere) or /Trash/** (to exclude a directory called Trash and its contents at the root only).
  • you don't appear to have included any files at the root as your includes don't include them

Thank you. However, I still lack an illuminating, pithy explanation of the matching logic (though what you wrote was some help).

Two specific matters remain unclear to me.

(1) I take it that directories are matched only if the relevant rule ends /** - yes?

(2) '[Y]ou don't appear to have included any files at the root as your includes don't include them.' Please state the correct rule to match everything within the backup's path that is such that its full path starts with a . . (For that is the rule I want and I tire of guessing, even if I can test with the ls command that you have.) Thank you.

It seems that few people understand rsync's logic or indeed rclone's. However, the following from man rsync seems a tolerably clear account.

FILTER RULES
       The  filter  rules allow for flexible selection of which files to transfer (include) and which
       files to skip (exclude).  The rules either directly specify include/exclude patterns  or  they
       specify a way to acquire more include/exclude patterns (e.g. to read them from a file).

       As  the  list  of  files/directories to transfer is built, rsync checks each name to be trans‐
       ferred against the list of include/exclude patterns in turn, and the first matching pattern is
       acted on:  if it is an exclude pattern, then that file is skipped; if it is an include pattern
       then that filename is not skipped; if no matching pattern is found, then the filename  is  not
       skipped.

       Rsync  builds  an ordered list of filter rules as specified on the command-line.  Filter rules
       have the following syntax:

              RULE [PATTERN_OR_FILENAME]
              RULE,MODIFIERS [PATTERN_OR_FILENAME]

       You have your choice of using either short or long RULE names, as described below.  If you use
       a  short-named  rule, the ’,’ separating the RULE from the MODIFIERS is optional.  The PATTERN
       or FILENAME that follows (when present) must come after either a single space or an underscore
       (_).  Here are the available rule prefixes:

              exclude, - specifies an exclude pattern.
              include, + specifies an include pattern.

Here is how I understand that. The default is to include all items (files and directories, the latter including their files, and recursively). The filter rules modify that set: exclude rules reduce it; include rules increase it (when it can be increased, i.e. is not at its maximally large stage). The rules operate upon the set in the order in which they are given. Imagine that the filter file: starts with an exclude rule, call that rule X; proceeds immediately to an include rule, call it Y; and next had, as its last rule, an exclude rule, call it Z. In that case, rsync (and rclone) will modify the initial, maximal set by applying X to it, then by applying Y to it, and then by applying Z to it.

As to the syntax - all the stuff with slashes and asterisks - the rsync manual proceeds to describe that (though rclone's syntax is slightly different)?

Exclude rule.
Include rule.
Exclude rule.

It's pretty straightforward, as the logic is just top down: if you match something too broad at the top, it won't continue down the rule set, as the first match stops the chain.

The goal would be to match or not match depending on what you need to happen and go at that way.

Firewall rules are like this a lot:

https://www.memset.com/docs/server-security/firewalling/how-firewall-works-and-rule-ordering/

as an example.

The challenge is that there is so much power: you have to figure out what your order of operations is, and folks can help you make sure the logic is sound.

Thank you. However, you need to appreciate the following.

Rclone's documentation says, in effect: 'rclone works like rsync, only with these differences; or rather that's how include and exclude flags/files work; and here's how the filter rules work.'

Yet, consider the following.

  • Few people understand rsync's logic and indeed it is not that easy to grasp that logic properly (for instance, I do not think the explanations given by people other than me in this thread are good).
  • How rclone modifies that logic (or at least syntax) is described tersely (e.g. 'Rclone always does a wildcard match so \ must always escape a \ ')
  • Rclone's two (or more?) stage explanation (that's includes-and-excludes, now here is filtering) adds a further level of complexity.

I do not mean to be abrasive. Rather, I mean to convince you that unless the documentation is improved then - at least for people who do not already understand rsync well - the result is liable to be utter confusion and much wasted time.

EDIT: and I still don't understand whether every mention - in an rclone filter rule - of a directory (as against a file) needs some slashes and whatnot after it.

You can submit a pull request to update the documentation if you have suggestions on how to make it better. Alternatively, type it out and I can do a pull request.

Thanks. I was considering doing that.

However, I have now come to think that actually I do not understand the syntax or even the logic. So (1) Is my account above - the one starting, 'Here is how I understand that' correct, or not? Also, and again: (2) how does the /** symbol work? Thanks.

EDIT: I see myself now that my account was incorrect, because, as you say: 'if you match something too broad at the top, it won't continue down the rule set as the first match stops the chain.' So, am I right in thinking the following? Everything starts off included, but then . . Actually, I am unsure about 'everything starts off included' and I am unsure about the 'then'. Please give a clear, fairly non-technical description of the algorithm. That's all I am asking (and I have some background in computers and in logic so I don't think the problem is me being dim).

There is a nice example of the top down matching here:

https://rclone.org/filtering/#filter-from-read-filtering-patterns-from-a-file

The goal with any top down matching approach is that the rules will be evaluated starting from the top and processing down. If there is a match, it will not continue.

So the example on the page:

Working

# a sample filter rule file
- secret*.jpg
+ *.jpg
+ *.png
+ file2.avi
- /dir/Trash/**
+ /dir/**
# exclude everything else
- *

If you change the order around, it has a great impact:

Non Working

# exclude everything else
- *
# a sample filter rule file
- secret*.jpg
+ *.jpg
+ *.png
+ file2.avi
- /dir/Trash/**
+ /dir/**

That would exclude everything as that first match is too broad so any rule after that wouldn't matter.
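
To see the effect of ordering concretely, here is a minimal Python sketch (not rclone's actual code). It assumes first-match-wins with include as the fallback, and it uses the regular expressions that rclone reports for these rules via --dump filters further down this thread. The function name `decide` and the sample paths are made up for illustration.

```python
import re

# Regex forms of the sample rules, as reported by `--dump filters` in this thread.
WORKING = [
    ("-", r"(^|/)secret[^/]*\.jpg$"),  # - secret*.jpg
    ("+", r"(^|/)[^/]*\.jpg$"),        # + *.jpg
    ("+", r"(^|/)[^/]*\.png$"),        # + *.png
    ("+", r"(^|/)file2\.avi$"),        # + file2.avi
    ("-", r"^dir/Trash/.*$"),          # - /dir/Trash/**
    ("+", r"^dir/.*$"),                # + /dir/**
    ("-", r"(^|/)[^/]*$"),             # - *
]
# The same rules with "- *" moved to the top.
NON_WORKING = [WORKING[-1]] + WORKING[:-1]

def decide(path, rules):
    """Hypothetical helper: the first rule whose regex matches decides."""
    for kind, regex in rules:
        if re.search(regex, path):
            return kind == "+"
    return True  # assumed fallback when no rule matches: include

paths = ["photo.jpg", "secret1.jpg", "dir/notes.txt", "dir/Trash/old.txt", "other.txt"]
for p in paths:
    print(p, decide(p, WORKING), decide(p, NON_WORKING))
```

With the working order, photo.jpg and dir/notes.txt come out included and the other three paths excluded. With `- *` moved to the top, every path is excluded, because that rule matches first each time.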

I saw that documentation already. Together with what you have written just now, it may help me get my code to work. So, thank you. Yet, consider the following.

  1. I still do not understand how /** works.
  2. You have neither affirmed nor denied whether the set of files to be operated upon starts as all files in the destination path.
  3. The phrase 'it will not continue' in your text 'If there is a match, it will not continue' is opaque. What does it mean?

Use ** to match anything, including slashes ( / ).

Regardless of a filter, it operates on the files in the destination path that you provide. Filtering provides a way to limit the actions in the destination path.

In my non working example, the first rule listed is the:

- *

So if that rule matches, it does not continue / it stops / it does not go any further as it found a match so any rule after that does not matter since it already matched.

So an example:

[felix@gemini ~]$ rclone ls /home/felix --filter-from filtertest --dump filters
--- start filters ---
--- File filter rules ---
- (^|/)[^/]*$
- (^|/)secret[^/]*\.jpg$
+ (^|/)[^/]*\.jpg$
+ (^|/)[^/]*\.png$
+ (^|/)file2\.avi$
- ^dir/Trash/.*$
+ ^dir/.*$
--- Directory filter rules ---
- ^.*$
+ ^.*$
- ^dir/Trash/.*$
+ ^dir/.*/$
+ ^dir/$
+ ^dir/.*$
--- end filters ---
[felix@gemini ~]$ rclone ls /home/felix --filter-from filtertest

and if I fix the filters and say only give me my beta directory:

[felix@gemini ~]$ rclone ls /home/felix --filter-from filtertest --dump filters
--- start filters ---
--- File filter rules ---
+ (^|/)beta/.*$
- (^|/)[^/]*$
--- Directory filter rules ---
+ (^|/)beta/.*/$
+ (^|/)beta/$
+ (^|/)beta/.*$
- ^.*$
--- end filters ---
 33775680 beta/rclone

On /**: yes, I had seen that documentation. Does every line ending with a folder need that suffix? More precisely: if I wish to include some folder (be it a specific folder or a 'wildcarded' range of folders), or if I wish to exclude some folder (with the same qualifications), do I need to append /** to it?

'You have neither affirmed nor denied whether the set of files to be operated upon starts as all files in the destination path.' Sorry, but you still have not done that, so far as I can tell. You write: 'Regardless of a filter, it operates on the files in the destination path that you provide.' The 'it' here is the filter, right? But does that answer my question? I don't see how. Here is why. Suppose I supply an empty filter-from file. If passed such a file and a destination path, will rclone include everything in the destination path, or exclude it? I take it - from the rsync manual - that the answer is: include it.

Perhaps the fundamental problem in all this is as follows. The rsync manual describes rsync's logic in terms of expanding and constricting an initially maximal set of files. That idea is fairly clear. However, rclone's documentation - and what you've written - presents filtering in a different way, namely, in terms of the stopping after matches. I find that latter idea less clear than the idea in rsync's manual. Also, it is unhelpful to describe rclone's logic in a different manner to the way rsync's manual describes rsync's logic; for, rclone is (in a certain way) based on rsync.

Finally: I've little idea what (^|/)[^/]*\.jpg$ means, though I suppose that rclone's documentation will tell me.

Yes initially all files are passed to the filter.

The way it works is this

  • for every file under consideration
    • for each rule in order starting from the top
      • match the full path of the file against the rule
      • if it matches and it was an include rule, include it and finish
      • if it matches and it was an exclude rule, exclude it and finish
    • if you get to the end, include the file
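
The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than rclone's implementation: `is_included` is a made-up name, and the two regexes are the ones shown in the --dump filters output for the beta example earlier in the thread.

```python
import re

def is_included(path, rules):
    """First-match-wins: walk the rules top to bottom for this one file."""
    for kind, regex in rules:
        if re.search(regex, path):   # match the full path against the rule
            return kind == "+"       # include rule keeps it, exclude rule skips it
    return True                      # reached the end with no match: include

beta_rules = [
    ("+", r"(^|/)beta/.*$"),   # include everything under a directory called beta
    ("-", r"(^|/)[^/]*$"),     # exclude everything else
]

print(is_included("beta/rclone", beta_rules))
print(is_included("alpha/file.txt", beta_rules))
print(is_included("alpha/file.txt", []))  # empty rule list
```

In this sketch an empty rule list includes every file, which follows directly from the last step of the list.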

Note that this says nothing about directories. Rclone does not filter on directories, only on file paths. Hence to include a directory you need /path/to/dir/**, which matches all files under /path/to/dir.

There is an issue to make /path/to/dir/ the same as /path/to/dir/** which we'll do eventually but this is merely a shorthand syntax. If you write /path/to/dir rclone will think you are talking about a file.
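
As a rough illustration of why the trailing /** matters, the sketch below compares two regexes in Python. The first is the form rclone reports for /dir/** in the --dump filters output earlier in this thread. The second is only an assumption about what /dir without the suffix would compile to (a single path named dir), so treat it as hypothetical and confirm with --dump filters.

```python
import re

with_suffix = re.compile(r"^dir/.*$")  # reported earlier in the thread for "/dir/**"
without_suffix = re.compile(r"^dir$")  # ASSUMED form for "/dir" (not taken from the thread)

for path in ["dir/a.txt", "dir/sub/b.txt", "dir"]:
    print(path, bool(with_suffix.search(path)), bool(without_suffix.search(path)))
```

Under that assumption, only the pattern with the suffix matches files inside dir, while the bare pattern matches nothing except a path that is exactly dir.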

I wrote the filter documentation, but I've had lots of contributions to make it clearer and I'd welcome yours too :smiley:

That is a regular expression - the one that is used for the match. The file globs are converted into regular expressions which if you know them will tell you exactly what matches and if you don't they look like someone bashed the keyboard with their head :wink:
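
For readers who do not know regular expressions, here is a short Python sketch that takes apart the regex (^|/)[^/]*\.jpg$ from the dump above, which corresponds to the glob *.jpg. Python's re module is used only for illustration (rclone itself is written in Go), and the sample paths are invented.

```python
import re

# (^|/)   start of the path, or a slash: the match begins at a path segment
# [^/]*   any run of characters that are not slashes: stays inside one segment
# \.jpg   a literal ".jpg"
# $       end of the path
JPG = re.compile(r"(^|/)[^/]*\.jpg$")

for path in ["a.jpg", "dir/sub/a.jpg", "a.jpg.txt", "a.jpeg", "jpg"]:
    print(path, bool(JPG.search(path)))
```

The first two paths match, at any depth, because the last path segment ends in .jpg. The other three do not match.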

Thank you. So, the logic/algorithm is as follows?

Let filter-file be the filter file (the name of which is passed to rclone).
Let rules be the set of rules in filter-file.
Let rule be a (any) single rule within rules.
Let destination be the destination path passed to rclone.
Let item be some (any) file or directory. EDIT: and if the item is a directory then it should be expressed in any rule as item/**.
Let backup be the set of files to be backed-up.

Set backup to contain each item in destination.
For each item within destination
(
    For each rule, moving serially through rules
    (
        If rule matches the full path of item then
        (
            if rule is an include rule then add item to backup (unless it is in there already).
            if rule is an exclude rule then remove item from backup (if it was in there).
            Proceed to the next item.
        )
    )
)

I find it easier to use a real example rather than speaking in generalities.

It doesn't matter if the rule is include or exclude. In your statement, if a rule matches, it stops processing for that file (you are using the word item).

Each file is evaluated based on the rules/filters you have defined. The first match stops the processing.

@Animosity022: OK, but I'd like @ncw to confirm (or deny) that I've got the algorithm right. If he can confirm it, then we're done and I'll leave you both alone. :slight_smile:

I don't think your explanation is quite right though as there isn't a concept of a backup set.

You have a file.

You apply the set of rules to the file.

If you want to include things into a backup, you'd create a set of includes and exclude everything else:

+ **/include.txt
+ **/keep.txt
- **

So each file is evaluated against that ruleset. It goes top down until it finds a match and stops.

I believe you are wrong. Here's why.

  1. My explanation did contain a concept of a backup set.

  2. @ncw did respond in the affirmative to the following sentence of mine: 'You have neither affirmed nor denied whether the set of files to be operated upon starts as all files in the destination path.' For he said (admittedly with some unclarity): 'Yes initially all files are passed to the filter'.

I am sorry to be increasingly direct but I feel entitled to decent answers to my reasonable questions. Also, and frankly: I did not in fact in this instance ask you.