I’m trying to set up rclone as a simple backup mechanism. So far it’s working very well.
However, sometimes I’m surprised at how long rclone takes, even when there is nothing to do.
For example, I have a directory called src with about 13,000 files in it. The first rclone copy (to encrypted S3 storage) was of course very slow. As expected, subsequent backups were much faster (as very little has changed).
What confuses me is that subsequent runs of rclone copy take the same amount of time (about 10 minutes) regardless of what filters I give. For example:
rclone copy src Backup:src --max-age 10d
Should be nearly instant, as (at the moment) there are absolutely no files which have changed in the last 10 days. [For comparison’s sake, the equivalent find src -type f -mtime -10d takes less than a second to complete.]
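For reference, here is roughly how I timed the two (a sketch; --dry-run is added so nothing is actually transferred, and -v prints the stats):

time rclone copy src Backup:src --max-age 10d --dry-run -v
time find src -type f -mtime -10d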
Any ideas about how to make incremental backups more efficient?
I tried --fast-list and indeed the memory use went up by about 10%. Unfortunately, the copy was actually slower.
--checksum and --size-only each sped up the copy significantly, by about 40% (--checksum was slightly better). Using them both (can I do that?) was spectacular: a 70% speedup!
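Concretely, these are the variations I timed, all against the same source and remote (a sketch of my test runs):

rclone copy src Backup:src --max-age 10d --fast-list
rclone copy src Backup:src --max-age 10d --checksum
rclone copy src Backup:src --max-age 10d --size-only
rclone copy src Backup:src --max-age 10d --checksum --size-only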
But I remain confused. These checks are comparing files locally with those on the remote, right? In this (admittedly contrived) test case, there are absolutely no files qualifying to be transferred (--max-age 10d). Yet the verbose output reports Checks: 0.
You win some and you lose some! Note that --fast-list will cost you fewer transactions at S3 and so less money.
I'd expect --checksum to be slightly slower than --size-only. I'm not sure what putting both flags in does - it should probably return an error, as the combination doesn't make sense!
If it says Checks: 0 then it didn't check any files! What is your command line?
It is the --max-age 10d which means that it is ignoring any files that are older than 10 days. It appears from the log that all the files are older than 10 days, so rclone is ignoring everything, hence the 0 checks.
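You can double-check what the filter matches by listing the source with the same flag (this should print nothing in your case):

rclone ls src --max-age 10d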
It is listing all the files on the local and the remote. It is the remote listing which takes the time.
I expect you are going to ask me why the remote listing is necessary... Well that is to do with the way rclone works internally - it individually compares directories of files and it doesn't know that there aren't any files to compare until it has listed the directories and excluded the files.
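You can verify that by timing the remote listing on its own. rclone size has to list every object to add up the totals, so its runtime is roughly the cost of the listing:

time rclone size Backup:src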
Indeed, you anticipated my question. I think this is not what most users would expect. In the case of sync, perhaps a full listing of the remote is expected, because the remote needs to be adjusted to match the source exactly (e.g., perhaps deleting some files). But copy doesn't seem like it should need to examine the remote beyond what the filters on the source dictate.
I would suggest some kind of documentation about this. Perhaps in the "Filtering" section.
Yes that is exactly right. Unfortunately the filter --max-age doesn't allow rclone to know in advance that it doesn't need to look in a given directory. rclone traverses the source and destination file systems simultaneously, so it needs to know the filtering in advance in order to short-circuit directories.
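For contrast, a path-based filter can be short-circuited, because rclone can tell from the directory name alone that nothing inside it can match. A sketch (archive/ is a made-up directory name, and note that this directory optimisation doesn't apply to bucket-based remotes like S3):

rclone copy src Backup:src --exclude "archive/**"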
Rclone used to have a --no-traverse flag which is what you want here really. However I changed the syncing routines to make them use much less memory and I had to take that flag out.
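For the record, in versions which still had it, the usage looked like this (a sketch):

rclone copy src Backup:src --max-age 10d --no-traverse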