Incremental backups and efficiency (continued)

Continuing the discussion from Incremental backups and efficiency:

Summary of that previous thread: Using a filter in order to achieve an incremental backup (e.g., rclone copy src Backup:src --max-age 10d) was unexpectedly slow, even when the filter ended up matching no files at all.

Various options were suggested, with various results. The conclusion was to run a first step of identifying the files locally and then feed that list into the copy, like so:

rclone lsf -R --files-only --max-age 10d src > files-to-copy
rclone copy --files-from files-to-copy src Backup:src

Then yesterday this comment came in from @Ciantic:

I think the --no-traverse was put back, since it's in the docs. So I think the command is just copy with no traverse flag.

So... my reply (which I hope @Ciantic sees):

Thanks much for the idea!

Yes, it looks like that flag returned as of v1.46, and it is promising. It was restored in response to the thread "Using --files-from with Drive hammers the API".

There's a lot in that thread so I'm not sure what applies and what doesn't. (It looks like the problem there was that the --files-from flag was unexpectedly slow, where in my case it was the solution to the slowdown!)

So...

  • Can anyone say if copy --no-traverse is appropriate for my situation?
  • Are there any minuses to using this flag?
  • [Bonus question] Is this the default behavior for copy? If not, why not?

Thanks.

There is a problem with this strategy: if you delete from the source a file or folder that is older than 10 days, the deletion will not be reflected in the destination.

There is also the problem of files or folders that were merely moved, since moving does not change the file timestamp.

This, combined with the use of "copying" (rather than "syncing"), can cause the backup to become "bloated" over time with multiple old files.

I've been through this in the past.

The solution, if you want to use this strategy, is to perform a "full sync" from time to time.

Yes, I concur.

To write a proper incremental backup, the best choice would be to utilize some kind of filesystem log which would explicitly list all the changes since a given time. But for a quick-and-dirty backup, a frequent copy combined with a periodic sync is sufficient.
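To make the "frequent copy plus periodic sync" idea concrete, here is a minimal sketch, not a tested production script. It assumes the same src and Backup:src as above. The function names and the FULL_SYNC_DAY variable are made up for illustration, and the choice of Monday for the full sync is arbitrary.

```shell
# Quick incremental pass: copy only files modified in the last 10 days.
# copy never deletes anything on the destination.
# (--no-traverse is the flag being discussed in this thread.)
quick_copy() {
    rclone copy --max-age 10d --no-traverse "$1" "$2"
}

# Periodic full pass: make the destination match the source, which also
# removes files that were deleted or moved on the source.
full_sync() {
    rclone sync "$1" "$2"
}

# Full sync on one weekday (date +%u: 1 = Monday ... 7 = Sunday),
# quick copy on every other day. FULL_SYNC_DAY is a hypothetical knob.
backup() {
    if [ "$(date +%u)" -eq "${FULL_SYNC_DAY:-1}" ]; then
        full_sync "$1" "$2"
    else
        quick_copy "$1" "$2"
    fi
}

# Example (e.g. from a daily cron job): backup src Backup:src
```

Since sync deletes files on the destination, it is probably worth trying the full pass with --dry-run first.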

Thanks.

The goal would be to use the right tool for the job. It seems you are trying to use rclone like a hammer when you really want a screwdriver.

If you want incremental backups, there are other options that work much better.

https://www.duplicati.com/ is free, works with Google Drive (GD) out of the box, supports encryption, and is a dedicated backup tool.

https://restic.net/ is another option. It's nice, but I find it harder to use out of the box because it is a little complex at first.

Completely agree.

I'm a very satisfied user of duplicacy.com

Yeah, you have a good point. I use rclone because I like it and it was really easy to set up and maintain.

I'll look into these alternatives and give them a spin.

Thanks.

Since nobody has yet answered these questions, I decided to run a test myself locally. My tests showed that the --no-traverse flag did what I wanted (and was not the default behavior), and the performance was excellent. So thanks again, @Ciantic, for bringing this to my attention.
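For anyone who wants to try the same thing, the single-step form looks roughly like this. It is only a sketch: the wrapper function name is made up, and src, Backup:src and the 10d window are the example values from the summary at the top.

```shell
# Single-step incremental copy: --max-age selects recently modified files,
# and --no-traverse tells rclone not to list the whole destination first.
incremental_copy() {
    # $1 = source, $2 = destination, $3 = maximum file age (default 10d)
    rclone copy --max-age "${3:-10d}" --no-traverse "$1" "$2"
}

# Example: incremental_copy src Backup:src 10d
```

Adding --dry-run to the rclone call is a safe way to see what would be copied before running it for real.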

As to the remaining questions, well I lack the expertise to address them. I cannot think of any minus to using this flag in the case of copy, but I guess time will tell! Given that it's not the default, I am guessing there is some subtle drawback. I did find a comment from 2016 https://forum.rclone.org/t/what-are-disadvantages-of-using-no-traverse-with-large-nr-of-files/562:

But deciding "more or less identical" is kind of hard to automate.

Thanks,

K.

--no-traverse is a pro/con sort of setting.

In some cases it will lead to slower checking of files, while in others it will be much faster.
I could give an explanation here based on my (probably flawed) understanding but I know NCW has directly answered this question (and probably more than once). I would suggest you search for his answer(s) to understand how it really works rather than take the info second-hand.

I think the TL;DR is that --no-traverse is faster when you are uploading just a few files to a large folder, while it is potentially detrimental to performance when you have to work on many or most of the files inside a folder. --no-traverse basically skips listing everything in a folder and only checks the one file it needs right now. This is faster for that one operation, but if you end up checking a lot of individual files this way inside the same folder, it would have been more efficient for rclone to just list the folder once at the start.
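One way to act on that trade-off is to count the candidate files locally first and only add --no-traverse when there are few of them. This is a rough sketch under my own assumptions, not anything rclone does itself: the function names and the threshold of 100 are invented, and the total file count is only a crude proxy, since what really matters is how many files land in each destination folder.

```shell
# Count the files modified within the window (recursive, files only).
# tr strips the padding some wc implementations add around the number.
count_recent() {
    # $1 = source, $2 = maximum file age
    rclone lsf -R --files-only --max-age "$2" "$1" | wc -l | tr -d ' '
}

# Few candidates: skip the destination listing with --no-traverse.
# Many candidates: let rclone list the destination as usual.
copy_adaptive() {
    # $1 = source, $2 = destination, $3 = maximum file age,
    # $4 = threshold (hypothetical default of 100 files)
    n=$(count_recent "$1" "$3")
    if [ "$n" -le "${4:-100}" ]; then
        rclone copy --max-age "$3" --no-traverse "$1" "$2"
    else
        rclone copy --max-age "$3" "$1" "$2"
    fi
}

# Example: copy_adaptive src Backup:src 10d
```

A sensible threshold would have to be found by measuring on your own remote.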