Categorization of --backup-dir backups as "Reverse Incremental"

Many people, myself included, have referred to the following style of backup as "forever forward incremental":

rclone sync source: dest:current --backup-dir dest:backups/<date>

I've been thinking about this* for a while, and I think it is an incorrect classification to call it "forward incremental". I believe it should be classified as "reverse incremental".

Let me explain.

Let's ignore the "forever" part. It just muddies the waters.

As I understand it, an incremental backup looks like:

Run 0:  Full-Backup0
Run 1:  Full-Backup0 + diffs1
Run 2:  Full-Backup0 + diffs1 + diffs2
...
Run N:  Full-Backup0 + diffs1 + diffs2 + ... + diffsN

You need to take the initial Full-Backup0 and play the chain of diffs forward to get to any arbitrary state.
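To make the distinction concrete, here is a toy sketch in Python (purely illustrative; the dict-based structures are made up and are not anything rclone produces):

```python
# Toy model of restoring from a *forward* incremental chain.
# "full_backup" and "diffs" are hypothetical in-memory stand-ins.

def restore_forward(full_backup, diffs, target_run):
    """Start from Full-Backup0 and replay diffs 1..target_run forward."""
    state = dict(full_backup)           # Run 0: Full-Backup0
    for diff in diffs[:target_run]:     # play the chain forward
        for path, content in diff.get("changed", {}).items():
            state[path] = content
        for path in diff.get("deleted", []):
            state.pop(path, None)
    return state

# Example: the state as of Run 2.
full0 = {"a.txt": "v0", "b.txt": "v0"}
diffs = [
    {"changed": {"a.txt": "v1"}},                        # diffs1
    {"changed": {"c.txt": "v2"}, "deleted": ["b.txt"]},  # diffs2
]
print(restore_forward(full0, diffs, target_run=2))
# -> {'a.txt': 'v1', 'c.txt': 'v2'}
```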

Reverse incremental, as I understand it, is as follows:

Run 0: Full-Backup0
Run 1:              Full-Backup1
                    Mods0
Run 2:                            Full-Backup2
                    Mods0         Mods1
...
Run N:                                              Full-BackupN
                    Mods0         Mods1         ... ModsN-1

This is also what the aforementioned rclone command produces. At any given point in time, the full backup (dest:current) is the most up to date, and the mods needed to get there (dest:backups/<date>) are ModsN-1. To get to an arbitrary state, you start at Full-BackupN and replay in reverse until you reach the desired state.
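And the reverse replay, with the same kind of made-up structures standing in for dest:current and the dated backup directories:

```python
# Toy model of restoring from a *reverse* incremental chain.
# "latest_full" plays the role of dest:current; mods[k] plays the role of the
# backup dir written when going from run k to run k+1 (old versions of files
# that were overwritten or deleted). "added" is extra bookkeeping: plain
# --backup-dir leaves no trace of files that were newly added on a run.

def restore_reverse(latest_full, mods, target_run):
    """Start from Full-BackupN and undo mods back until target_run."""
    state = dict(latest_full)                  # Full-BackupN
    for mod in reversed(mods[target_run:]):    # replay in reverse
        for path, content in mod.get("replaced", {}).items():
            state[path] = content              # put the old version back
        for path in mod.get("added", []):
            state.pop(path, None)              # file did not exist yet
    return state

# Example: latest state is Run 2; rewind to Run 1.
latest = {"file1": "v2", "file3": "new"}
mods = [
    {"replaced": {"file1": "v0"}},                                     # Mods0
    {"replaced": {"file1": "v1", "file2": "v0"}, "added": ["file3"]},  # Mods1
]
print(restore_reverse(latest, mods, target_run=1))
# -> {'file1': 'v1', 'file2': 'v0'}
```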

What do you think? Am I totally missing something? Do I have it all wrong?

It is of near-zero consequence, but I propose we refer to this style as reverse incremental (insofar as we can act collectively).


*: I am working on a tool that mimics this by wrapping rclone, but it also saves the listing of the dest:current directory to speed things up. Stay tuned.

Interesting discussion.

I think it is a bit misleading to say that --backup-dir implements an incremental backup, because it may be impossible to reconstruct a true historic snapshot.

This comes from the lack of information about when a new file was added to the backup (unless you also save INFO-level logs in the backup location).

As an example, let’s say you have the following backup:

current/file1
current/file3
backups/2022-12-01/file1
backups/2022-12-01/file2

Now try to establish the historic state before the backup on 2022-12-01. Does it consist of these files

backups/2022-12-01/file1
backups/2022-12-01/file2

or these files

backups/2022-12-01/file1
backups/2022-12-01/file2
current/file3

Again, to be pedantic, since this is the intent of the post, I think it’s reverse incremental.

But your point stands.

The approach lets you easily recover files from the past and from the latest full. Building up a full snapshot is harder, if it's even possible.

With that said, the tool I developed (the “*” in my post) does three things: (1) it saves the full file listing each time, including hashes if desired; (2) it stores the file info of modified/deleted files, along with whether each was a modification or a deletion; and (3) it always uploads the log.

So to build up a snapshot, you theoretically know the files and know how to find them. But this is theory. The intent is really to match rclone’s results, and the tool offers no way to do it (I may make a demo just for fun, though).
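Roughly, the hash-based ("easy") reconstruction I have in mind looks something like this (just a sketch; the listing format and helper names are made up, not the tool's actual interface):

```python
# Sketch of rebuilding a point-in-time snapshot from a saved listing.
# HYPOTHETICAL: the listing format and names below are illustrative only,
# not the tool's actual on-disk format or API.

def rebuild_snapshot(listing, current_files, backup_files):
    """
    listing:       {path: hash} as recorded at the desired run
    current_files: {hash: location} for everything in dest:current
    backup_files:  {hash: location} for everything under dest:backups/
    Returns {path: location} telling you where to fetch each file from.
    """
    plan = {}
    for path, digest in listing.items():
        if digest in current_files:            # unchanged since that run
            plan[path] = current_files[digest]
        elif digest in backup_files:           # later modified/deleted, so the
            plan[path] = backup_files[digest]  # old copy was moved aside
        else:
            raise FileNotFoundError(f"no copy of {path} ({digest}) found")
    return plan
```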

With all of that said, even if not a “true reverse incremental”, I still believe that it’s closer to reverse than forward. I do, however, welcome being convinced otherwise.

Sounds right, and I fully agree; with these additions you seem to have full reverse increments of the history.

I think it may be a good (and fun) idea to build a historic recovery and verification tool as part of your initial development.

That is really the only way to test/prove that you have correctly saved all the bits and pieces needed to do a historic recovery.

I was just thinking that. Where things can go wrong is when a backup run fails.

The tool I am building speeds things up by saving the file list from the last run, but even there, it needs to handle restarting a failed run that has left the state in disarray (though theoretically without information loss). The approach I take is that if a run fails, the tool lists the destination again (a la a regular rclone sync) to ensure the proper state.
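For what it's worth, that fallback amounts to something like this (a sketch; the cache and marker files are made up, though rclone lsjson is a real command):

```python
# Sketch of the "re-list the destination after a failure" fallback.
# The cache/marker file layout is hypothetical; `rclone lsjson` is real.
import json
import subprocess
from pathlib import Path

CACHE = Path("last_listing.json")        # listing saved by the previous run
FAILED_MARKER = Path("run_in_progress")  # left behind if the last run died

def destination_listing(dest="dest:current"):
    if FAILED_MARKER.exists() or not CACHE.exists():
        # Don't trust the cached listing: ask the destination for the truth,
        # the same way a plain `rclone sync` would have to.
        out = subprocess.run(
            ["rclone", "lsjson", "-R", dest],
            check=True, capture_output=True, text=True,
        ).stdout
        return json.loads(out)
    return json.loads(CACHE.read_text())
```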

That could break the ability to fully recover.

However, if full snapshot recovery is of interest, this is not the tool. Restic, Kopia, etc. are the right approaches.

Just FYI, I released the tool (forum post, github) and took your suggestion. I created a proof-of-concept for two different approaches (easy: using known hashes, hard: tracking diffs).

IT WORKS... mostly. I did end up adding some additional artifacts to the backup (saving the computed diffs), which made it easier. The problem is, it's fragile. If a run is interrupted anywhere between the desired point and the end, then the information/artifacts no longer represent the truth. To address this, I (a) wrote them first so they will be there, and (b) added a prefix that is later removed so the situation is clear.
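The prefix trick is basically the usual write-tentatively-then-finalize pattern, roughly like this (made-up names, not the tool's actual code):

```python
# Sketch of writing a run artifact defensively:
# (a) write it before the transfers start, under a prefix that marks it as
# tentative, and (b) strip the prefix only once the run finishes cleanly.
# An interrupted run therefore leaves an obviously tentative artifact behind.
from pathlib import Path

PENDING_PREFIX = "PENDING_"

def write_artifact(dirpath: Path, name: str, data: str) -> Path:
    tentative = dirpath / (PENDING_PREFIX + name)
    tentative.write_text(data)    # exists even if the run is interrupted
    return tentative

def finalize_artifact(tentative: Path) -> Path:
    final = tentative.with_name(tentative.name.removeprefix(PENDING_PREFIX))
    tentative.rename(final)       # only reached after a successful run
    return final
```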

The end result is that it is doable in ideal conditions and, with some work, can be backed out in less-than-ideal ones.

However, my conclusion was that, while doable, if point-in-time restore is your primary use case, this is not the right tool! (use restic or Kopia or the like).

It did also confirm that this can be thought of as "reverse incremental". You do not need anything prior to the desired restore point!

Thanks for the feedback!

Looks good, and I really like your proof-of-concepts in the well-explained Jupyter notebooks :tada:

I fully agree with your thoughts on robustness and always picking the right tool for the use case. Things can get really ugly after a number of failed syncs to a remote with non-atomic file updates (e.g. SFTP).

Happy Holidays!