FR: Versioned backend via suffix

The problem I am trying to solve is that not all backends support versioning, and rclone is hard to use with immutable storage. This proposal is backend agnostic, generally approachable, and never deletes or moves files.

A "versioning" backend could look as follows:

  • Every file that gets transferred gets a suffix added, like <filename>.<timestamp>, where <timestamp> is something like the epoch nanosecond time (maybe encoded for compactness?). See the sketch after this list.
  • When a new version of a file is written, it just gets the new timestamp.
  • When a file is deleted, an empty file named <filename>.<timestamp>.d is written.
    • If the file is written again, the .d remains but a newer version is added.
  • When a file is moved, it gets a delete marker (.d) and a new file is created. Server-side copy can still be done, but server-side move must fall back as if the remote didn't support it.
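To make the scheme concrete, here is a minimal Go sketch of the naming round-trip. The names versionName and parseVersion are hypothetical, not existing rclone APIs, and real code would need to guard against ambiguous names (e.g. a file that already happens to end in .<digits>).

```go
package versioner

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// versionName builds the stored name: <filename>.<timestamp>[.d].
func versionName(name string, t time.Time, deleted bool) string {
	s := fmt.Sprintf("%s.%d", name, t.UnixNano())
	if deleted {
		s += ".d"
	}
	return s
}

// parseVersion splits a stored name back into its base name,
// timestamp, and delete-marker flag.
func parseVersion(stored string) (name string, ts int64, deleted bool, err error) {
	rest := stored
	if strings.HasSuffix(rest, ".d") {
		deleted = true
		rest = strings.TrimSuffix(rest, ".d")
	}
	i := strings.LastIndex(rest, ".")
	if i < 0 {
		return "", 0, false, fmt.Errorf("no version suffix in %q", stored)
	}
	ts, err = strconv.ParseInt(rest[i+1:], 10, 64)
	if err != nil {
		return "", 0, false, fmt.Errorf("bad timestamp in %q: %w", stored, err)
	}
	return rest[:i], ts, deleted, nil
}
```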

With all of that, when files are listed, entries are grouped by <filename> and only the latest timestamp is returned. If the latest version is a .d marker, the file is omitted from the listing.
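Continuing the sketch above, the latest-wins listing rule could look something like this:

```go
// resolveListing groups stored names by base filename, keeps only the
// newest timestamp for each, and drops files whose newest version is
// a delete marker. It reuses the hypothetical parseVersion from the
// earlier sketch.
func resolveListing(stored []string) []string {
	type latest struct {
		ts      int64
		deleted bool
	}
	newest := make(map[string]latest)
	for _, s := range stored {
		name, ts, del, err := parseVersion(s)
		if err != nil {
			continue // not a versioned name; skipped in this sketch
		}
		if cur, ok := newest[name]; !ok || ts > cur.ts {
			newest[name] = latest{ts, del}
		}
	}
	var names []string
	for name, l := range newest {
		if !l.deleted {
			names = append(names, name)
		}
	}
	return names
}
```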

In addition to versioning every file, this also means you can use immutable buckets without any loss of functionality (and avoid things like Wasabi's 90 day policy).

I am not sure how hard this would be to implement. It doesn't seem too difficult and if I ever get a chance to get proficient enough at Go, I may give it a shot. What do others think of this?

Additional Note: You'd probably want a backend feature to purge where you specify a date and it deletes all versions older than that date as long as a newer one also exists. And/or maybe one to delete all but the last N versions.

I am well aware of the --suffix flag, which renames the to-be-overwritten file to have the specified suffix. But what I am proposing goes way beyond that.

You'd still want to support --suffix and --backup-dir, but you'd note in the docs that these are made redundant by the versioning backend itself.

Just to add, you could even have a flag that shows the remote as of some date, and lets you sync either way against that view.
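A hypothetical --version-at style flag would only need a small twist on the listing sketch above: ignore versions newer than the cutoff before picking the latest.

```go
// resolveListingAsOf lists the remote as it looked at the cutoff
// time: versions newer than the cutoff are filtered out, then the
// normal latest-wins / delete-marker rule applies.
func resolveListingAsOf(stored []string, cutoff time.Time) []string {
	var eligible []string
	for _, s := range stored {
		if _, ts, _, err := parseVersion(s); err == nil && ts <= cutoff.UnixNano() {
			eligible = append(eligible, s)
		}
	}
	return resolveListing(eligible)
}
```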

That is a nice idea.

It's pretty much the way S3/B2 versions work, except the timestamp/ID is stored as metadata on the object.

I'd probably use the naming scheme S3/B2 use for versions, for compatibility.

This would need extending for deleted files (B2/S3 call these delete markers).

Probably the only fly in the ointment is that opening a file will require a directory listing, which is potentially slow.

> I'd probably use the naming scheme S3/B2 use for versions, for compatibility.

Makes sense, though the file names get really long. An 8 or 9 byte base32- or base64-encoded integer is much more compact, at the cost of human readability.
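For illustration, a sketch of that compaction, assuming the nanosecond timestamp is packed as an 8-byte big-endian integer. One nicety: the extended-hex base32 alphabet keeps lexicographic order matching numeric order, while standard base32 and base64 do not.

```go
package main

import (
	"encoding/base32"
	"encoding/base64"
	"encoding/binary"
	"fmt"
	"time"
)

func main() {
	// Pack the nanosecond timestamp big-endian so that byte order
	// matches numeric order.
	var b [8]byte
	binary.BigEndian.PutUint64(b[:], uint64(time.Now().UnixNano()))

	// Extended-hex base32 (RFC 4648) sorts the same as the integer:
	// 13 characters instead of ~19 decimal digits.
	b32 := base32.HexEncoding.WithPadding(base32.NoPadding).EncodeToString(b[:])

	// Base64url is shorter still (11 characters) but does not
	// preserve sort order.
	b64 := base64.RawURLEncoding.EncodeToString(b[:])

	fmt.Println(b32, b64)
}
```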

> Probably the only fly in the ointment is that opening a file will require a directory listing, which is potentially slow.

Hmm. Interesting point, but I have to wonder: how often is a file read without already listing the directory? I guess some of this depends on where this abstraction sits inside the rclone code, but the cases I can think of are:

  • copy(to), move(to) and sync (when just one file is listed).
  • Opening on a mount/serve

In regular sync, the listings happen anyway. In serve, I thought listings were cached (but again, it depends on where in the abstraction this sits). I guess you'd have to document this and maybe also make things like --no-check-dest and --no-traverse do nothing.

Either way, it is more complicated than I originally figured. I don't imagine having time to learn enough to do it any time soon, but if someone else is interested, I'd love to help and test. Otherwise, I'll put it on my todo list (after, you know, learning Go; alas, programming of nearly all sorts is just a hobby and not work, so I am limited in time).

I like the idea, but think everything that makes directory listings could become significantly slower, that is: sync, copy, move, mount, serve, ...

Worst case example: A folder with 1000 files can typically be listed in 1 API call. If each file exists in 20 versions, then it will require 20 API calls, each returning the typical upper limit of 1000 items - that is 20 times slower.

Best case example: A folder with only 50 files in 20 versions can still be listed in a single API call.

So it is important to have a good and well-integrated purge algorithm, probably using a combination of days and number of versions like OneDrive, Google Drive etc.

The answer is: not very often. This is slow on a lot of backends, but it is used in places (e.g. when using a --files-from list) and when operating on a single file.

The directory listings will become longer by the number of backup files, which may be very large, so that could be a downside.

This would definitely need a purge algorithm - it could be implemented as a backend command with some parameters like "keep the last 3 backups" and/or "delete any backups older than 1 month".

Just as there is rclonebrowser, there is a need for rclonebackup.
Not sure of the need for Go, as Python would be fine and I could make some contributions.

restic?

I am no stranger to making Python tools that wrap rclone (lfsrclone, PyFiSync, syncrclone, rirb [reverse incremental rclone backup]), the last of which is a backup tool.

The problem with this kind of backend in Python is that you can’t batch up transfers since the names will always change.

Every transfer has to be its own rclone copyto src:file.ext dst:file.ext.<date>, which means that rclone has to do a lot of redundant work. Though you could speed it up with some flags since, by definition, the destination won't yet exist.
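For example, per changed file, something like the following, using the flags mentioned earlier to skip destination checks that are pointless when the versioned name is guaranteed new (the command shape is illustrative; <date> stays a placeholder):

```
rclone copyto --no-traverse --no-check-dest src:file.ext dst:file.ext.<date>
```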

As for restic, that is certainly a valid option and has some real benefits over an rclone-native versioning backend. But versions are for more than just backups. And whole files, rather than blocks with a database, have some real advantages (and disadvantages). Also, you could version any rclone remote, as opposed to pushing a local backup.

Also, can restic now work on immutable storage backends? I seem to recall that being a work in progress, but I may be incorrect.

Obviously this is true but we should remember that it would still only keep changed versions. So if you have 100,000 files but only a few are modified, we’re not creating 100,000 files every time.

I think if this were to be implemented (and I recognize it’s a major “if”) the right approach is to make it clear in the docs that this comes with a performance hit related to the number of versions.

And of course, as we both mentioned, a prune command (where we need to be careful to not delete an older file if it is also the latest).
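A hedged sketch of that prune rule, continuing the hypothetical types from the earlier sketches. Whether the age rule and the count rule combine with AND or OR is a policy choice; here either condition marks a version for deletion, but the newest version is always kept. A fuller version would also handle files whose newest entry is an old delete marker.

```go
// version is one stored revision of a file (hypothetical type).
type version struct {
	ts      int64 // epoch nanoseconds from the suffix
	deleted bool  // true for a .d delete marker
}

// pruneVersions returns the timestamps of versions that are safe to
// delete under a "keep the last N" / "delete older than cutoff"
// policy. versions must be sorted newest first; the newest entry is
// never deleted, even if it is older than the cutoff.
func pruneVersions(versions []version, keep int, cutoff int64) []int64 {
	var doomed []int64
	for i, v := range versions {
		if i == 0 {
			continue // always keep the latest version of the file
		}
		if i >= keep || v.ts < cutoff {
			doomed = append(doomed, v.ts)
		}
	}
	return doomed
}
```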
