Incremental backups and efficiency


#1

Hey there.

I’m trying to set up rclone as a simple backup mechanism. So far it’s working very well.

However, sometimes I’m surprised at how long rclone takes, even when there is nothing to do.

For example, I have a directory called src with about 13,000 files in it. The first rclone copy (to encrypted S3 storage) was of course very slow. As expected, subsequent backups were much faster (as very little has changed).

What confuses me is that subsequent runs of rclone copy take the same amount of time (about 10 minutes) regardless of what filters I give. For example:

rclone copy src Backup:src --max-age 10d

Should be nearly instant, as (at the moment) there are absolutely no files which have changed in the last 10 days. [For comparison’s sake, the equivalent find src -type f -mtime -10d takes less than a second to complete.]

Any ideas about how to make incremental backups more efficient?

Thanks!


#2

Since you are copying to S3, try the --fast-list flag if you have enough memory.

Also, reading the modification time for each file on S3 requires an extra transaction, so use either --checksum or --size-only.

That should speed it up!
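Putting those suggestions together with the command from the original post, the invocation might look like this (a sketch; --size-only is shown, but --checksum works the same way, comparing hashes instead of sizes):

```shell
# List with fewer S3 transactions, and compare by size so rclone
# doesn't need a per-file request to read modification times.
rclone copy src Backup:src --max-age 10d --fast-list --size-only
```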


#3

Thanks for the ideas, @ncw

I tried --fast-list and indeed the memory use went up by about 10%. Unfortunately, the copy was actually slower. :open_mouth:

--checksum and --size-only each sped up the copy significantly—about 40% (--checksum was slightly better). Using them both (can I do that?) was spectacular, a 70% speedup!

But I remain confused. These checks are comparing files locally with those on the remote, right? In this (admittedly contrived) test case, there are absolutely no files qualifying to be transferred (--max-age 10d). The verbose output says:

Transferred:      0 Bytes (0 Bytes/s)
Errors:                 0
Checks:                 0
Transferred:            0

What exactly is it checking, and why?

Thanks.


#4

You win some and you lose some! Note that --fast-list will use fewer transactions at S3, so it will cost you less money.

I’d expect --checksum to be slightly slower than --size-only. I’m not sure what passing both flags does - it should probably return an error, as it doesn’t make sense!

If it says Checks: 0 then it didn’t check any files! What is your command line?


#5

Yes, hence my confusion!

My command line is: rclone copy -v src Backup:src --max-age 10d
Backup is an encrypted S3 remote.

For comparison, this command: rclone ls src --max-age 10d takes about 1 second to complete. (And outputs nothing, as expected.)

Thanks!


#6

Can you show me a full log made with -vv please?


#7

Post here? It’s over 26,000 lines. Maybe I can make a much smaller test case for you? Or just post one example of each kind of line in the output?

K.


#8

You can email it to me nick@craig-wood.com (put a link to the forum post in the email) if you want.

That would be perfect if you have time.


#9

Ok, I’ve emailed a small test case output.


#10

Thanks for the log.

I see what is happening…

It is the --max-age 10d, which means that it is ignoring any files older than 10 days. It appears from the log that all the files are older than 10 days, so rclone is ignoring everything, hence the 0 checks.

Does that make sense?


#11

Yes, that makes sense. Indeed, the program is functioning as designed. My question is about efficiency.

Since there are 0 transfers and 0 checks, why does it take so long for this command to complete? It must be doing something it’s not reporting.


#12

It is listing all the files on the local and the remote. It is the remote listing which takes the time.

I expect you are going to ask me why the remote listing is necessary… Well, that is to do with the way rclone works internally - it compares directories of files one by one, and it doesn’t know that there aren’t any files to compare until it has listed the directories and applied the filters.

You could do this instead, using the latest beta:

rclone lsf --max-age 10d src > files-to-copy
rclone copy --files-from files-to-copy src Backup:src

And that will probably be quicker in the case that you aren’t copying many files.
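Wrapped in a small script, the two steps might look like this (a sketch; the Backup:src remote and 10-day window are from this thread, and the empty-list check is my addition, to skip the copy entirely when nothing has changed):

```shell
#!/bin/sh
set -e

# Step 1: list only recently changed source files (fast, local-only).
rclone lsf --max-age 10d src > files-to-copy

# Step 2: copy just those files, skipping the copy (and its slow
# remote listing) when the list is empty.
if [ -s files-to-copy ]; then
    rclone copy --files-from files-to-copy src Backup:src
fi
```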


#13

Indeed, you anticipated my question. I think this is not what most users would expect. In the case of sync, perhaps a full listing of the remote is expected, because the remote needs to be adjusted to match the source exactly (e.g., perhaps deleting some files). But copy doesn’t seem like it should need to examine the remote beyond what the filters on the source dictate.

I would suggest some kind of documentation about this. Perhaps in the “Filtering” section.

Yes, that does what I want.

Thanks much.


#14

Yes, that is exactly right. Unfortunately the --max-age filter doesn’t let rclone know in advance that it doesn’t need to look in a given directory. rclone traverses the source and destination file systems simultaneously, so it needs to know the filtering in advance in order to short-circuit directories.

Rclone used to have a --no-traverse flag, which is really what you want here. However, I changed the syncing routines to make them use much less memory, and I had to take that flag out.