Feature Idea: Chunk-Based Dedupe Backend

Rclone Proposal

Preface

Let me preface by saying that I think the following is a really cool idea. Unfortunately, I do not know Go (yet...) so I can't work on it myself. I am interested in learning if I ever get the time, so maybe I could in the future...

Idea

The recently released chunker backend got me thinking about another related backend I think would be great: a deduplicating backend.

This is heavily inspired by restic and its content-based chunking (there is a blog post about it, and the code is BSD licensed).

The basic idea is as follows. At the top level, there would be two directories with the following structure:

blobs/
    00
    01
    ...
    FF
files/
    file1
    dir1/
        file2
    ...
    fileXYZ

Each fileXYZ will contain an ordered list of the blobs that are concatenated to reconstruct the file.
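
For illustration, the fileXYZ reference could be as simple as one chunk hash per line, in order. The format and hashes below are just an assumption on my part:

    # hypothetical contents of files/fileXYZ -- one SHA-256 chunk hash per line, in order
    3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b
    b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c
    7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730

The first blob would then live at blobs/3a/3a7bd3e2..., sharded into the 00-FF directories by the first byte of its hash.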

When a file is uploaded, it will be chunked (see below for details and open issues), then any new blobs are uploaded and a fileXYZ file with the details is created.
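
Here is a rough Go sketch of that write path, assuming SHA-256 as the content hash; haveBlob and putBlob are hypothetical stand-ins for the underlying remote, and the chunking is assumed to have already happened:

    import (
        "crypto/sha256"
        "fmt"
    )

    // uploadChunks sketches the write path: hash each chunk, upload only the
    // blobs the store does not already have, and return the ordered hash list
    // that becomes the fileXYZ reference.
    func uploadChunks(chunks [][]byte, haveBlob func(hash string) bool, putBlob func(hash string, data []byte) error) ([]string, error) {
        hashes := make([]string, 0, len(chunks))
        for _, c := range chunks {
            h := fmt.Sprintf("%x", sha256.Sum256(c))
            if !haveBlob(h) { // the actual deduplication: skip blobs we already have
                if err := putBlob(h, c); err != nil {
                    return nil, err
                }
            }
            hashes = append(hashes, h)
        }
        return hashes, nil
    }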

Additional Details and Issues

I am starting to think this through. There are still open questions, but here is my initial take.

Operations

Write: chunk the file > upload the new blobs > upload the text reference

Moves, copies, etc. are the equivalent operation on the text reference alone

Deletes remove the text reference but leave the blobs (for now)

Read: download the text reference, download the blobs, assemble the file
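
A minimal sketch of the read path in Go, assuming a hypothetical fetchBlob helper that downloads a blob by hash:

    import "io"

    // assemble reconstructs a file from its text reference by streaming the
    // blobs in order; fetchBlob stands in for a backend download by hash.
    func assemble(w io.Writer, hashes []string, fetchBlob func(hash string) (io.ReadCloser, error)) error {
        for _, h := range hashes {
            rc, err := fetchBlob(h)
            if err != nil {
                return err
            }
            _, err = io.Copy(w, rc) // append this chunk to the output
            rc.Close()
            if err != nil {
                return err
            }
        }
        return nil
    }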

Metadata

This could be a backend that doesn't support mtime, size, and the like, or it could encode metadata. The first idea would be to include it in the fileXYZ text, but then the reference has to be read to be parsed. This isn't too bad since they are small text files, but it may mean a lot of transactions. Alternatively, the actual filenames could encode that metadata (JSON?), though this may run into filename length limits.
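
For example, if the metadata went into the fileXYZ text, the reference could become a small JSON document along these lines (the field names are just my assumption, not a settled format):

    import "time"

    // manifest is a hypothetical fileXYZ layout with the metadata embedded
    // alongside the ordered chunk list.
    type manifest struct {
        Size  int64     `json:"size"`  // total file size in bytes
        MTime time.Time `json:"mtime"` // modification time
        Blobs []string  `json:"blobs"` // ordered chunk hashes (hex SHA-256)
    }

Even so, listing a directory with full metadata would still cost one read per file, which is the transaction concern above.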

Cleanup

Purging involves downloading everything under files/ (small text files, so not too bad) plus a listing of all blobs, then deleting any blob that is no longer referenced. The downside is that you really need to do it at the top level only, since you need to know every referenced blob.

Alternatively, a new command could be introduced to do this.
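
Either way, the cleanup pass is essentially mark-and-sweep. A minimal sketch of the decision, taking the already-downloaded references and the blob listing as inputs:

    // sweepUnreferenced returns the blobs that no reference mentions, i.e.
    // the deletion candidates. manifests holds the ordered hash list from
    // each files/ entry; blobs is the full listing of stored blob hashes.
    func sweepUnreferenced(manifests [][]string, blobs []string) []string {
        referenced := make(map[string]bool)
        for _, m := range manifests { // mark: record every referenced hash
            for _, h := range m {
                referenced[h] = true
            }
        }
        var unreferenced []string
        for _, h := range blobs { // sweep: anything unmarked can go
            if !referenced[h] {
                unreferenced = append(unreferenced, h)
            }
        }
        return unreferenced
    }

Note this only works if nothing is adding new references while the sweep runs, which ties into the simultaneous access question below.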

An open issue is simultaneous access. I am not sure whether this needs to be addressed at all.

Alternative file store

An alternative file store would be to have a single file (JSON, SQLite, another key-value store?) hold all of the file references, but then you need an exclusive session. I am putting this out there, but I do not think it is the right answer.

Cost of Chunking

I think rclone should follow its current process for deciding whether or not to compute the hash of a local file. It really depends on the situation.

Chunk Size

I think there should be an (advanced?) option to set the minimum and maximum chunk sizes, and the average size should be settable as well (though I think it can only take certain powers of two). If the chunk size settings change, deduplication will likely not work against chunks made with the previous sizes, but there would be no risk of data loss.
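
To give a feel for how the three settings interact, here is a toy content-defined chunker using a buzhash-style rolling hash. This is not restic's Rabin fingerprinting, and the window size and hash table are purely illustrative:

    import (
        "math/bits"
        "math/rand"
    )

    // table maps each byte value to a fixed random value; it must stay
    // stable across runs or previously cut chunks will not match.
    var table [256]uint64

    func init() {
        rnd := rand.New(rand.NewSource(1)) // fixed seed keeps the table stable
        for i := range table {
            table[i] = rnd.Uint64()
        }
    }

    // cutPoints splits data with a rolling hash over a sliding window. A cut
    // fires when the low avgBits bits of the hash are zero, giving an average
    // chunk size of roughly 1<<avgBits bytes, clamped to [minSize, maxSize].
    func cutPoints(data []byte, minSize, maxSize, window int, avgBits uint) []int {
        mask := uint64(1)<<avgBits - 1
        var cuts []int
        start := 0
        var h uint64
        for i := 0; i < len(data); i++ {
            h = bits.RotateLeft64(h, 1) ^ table[data[i]]
            if i-start+1 > window {
                // drop the byte that just slid out of the window
                h ^= bits.RotateLeft64(table[data[i-window]], window)
            }
            if size := i - start + 1; size >= minSize && (h&mask == 0 || size >= maxSize) {
                cuts = append(cuts, i+1) // chunk boundary after byte i
                start, h = i+1, 0
            }
        }
        if start < len(data) {
            cuts = append(cuts, len(data)) // final partial chunk
        }
        return cuts
    }

Changing minSize, maxSize, or avgBits moves where the cuts land, which is why blobs made under the old settings would stop matching, exactly as noted above.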

Thoughts?

I think I covered most of the issues. Like I said in the preface, I cannot write this at the moment, but if nobody does it first, I would very much like to see this backend, so I will try to do it myself if/when I finally learn Go.


Interesting idea... Sort of like restic with its content chunking.

It sounds like a lot of work to implement.

I note that if you used chunker + mailru then you would get this for free, as mailru does content-based deduplication: you offer up a hash and mailru says, "OK, I've got that file already, don't bother uploading".

Currently people use --backup-dir with rclone sync to do a poor man's version of this.
