Let me preface by saying that I think the following is a really cool idea. Unfortunately, I do not know Go (yet...), so I can't work on it myself. I am interested in learning if I ever get the time, so maybe I could in the future...
The recently released chunker backend got me thinking about another related backend that I think would be great: a deduplicating backend.
The idea would be basically as follows. At the top level, there would be two directories with the following structure:

```
blobs/
    00
    01
    ...
    FF
files/
    file1
    dir/
        file2
        ...
    ...
    fileXYZ
```
fileXYZ will contain a list of blobs, in order, that are concatenated into the file.
When a file is uploaded, it will be chunked (see below for details and issues), then the new blobs are uploaded, and a fileXYZ file with the details is created.
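To make this concrete, a fileXYZ text reference might look something like this (the format and hashes are purely illustrative; the idea is just an ordered list of blob identifiers):

```
a3f2c19e...   <- hash of chunk 1
07de84b1...   <- hash of chunk 2
91bc4e22...   <- hash of chunk 3
```

Each hash would locate its blob under the blobs/ fan-out shown above, e.g. a blob whose hash starts with a3 would live under blobs/a3/.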
Additional Details and Issues
I am still starting to think this through. There are open questions, but here is my initial take:
- Write: chunk > upload blobs > upload the file text reference
- Moves, copies, etc. are the equivalent operations on the text reference
- Deletes remove the text reference but leave the blobs (for now)
- Read: download the text reference, download the blobs, assemble the file
This could be a backend that doesn't support mtime, size, and the like, or it could encode that metadata. The first idea would be to include it in the fileXYZ text, but then the file has to be read to be parsed. This isn't too bad since they are small text files, but it may mean a lot of transactions. Alternatively, the metadata could be encoded (JSON?) into the actual filenames, though this may run into filename length limits.
Purging involves downloading all of files/ (small text files, so not too bad) and listing all blobs, then deleting any unreferenced blobs. The downside is that it really has to be done at the top level only, since you need to know all referenced blobs. Alternatively, a new command could be introduced to do this.
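The purge is essentially a mark-and-sweep. A minimal sketch, assuming manifests are newline-separated hash lists and blobs live in a map keyed by hash:

```go
package main

import (
	"fmt"
	"strings"
)

// purge deletes blobs not referenced by any manifest. It must see ALL
// manifests, which is why it only works at the top level: a partial
// view could delete blobs another file still needs.
func purge(manifests map[string]string, blobs map[string][]byte) (deleted []string) {
	referenced := map[string]bool{}
	for _, m := range manifests {
		for _, h := range strings.Split(m, "\n") {
			referenced[h] = true // mark
		}
	}
	for h := range blobs {
		if !referenced[h] { // sweep
			delete(blobs, h)
			deleted = append(deleted, h)
		}
	}
	return deleted
}

func main() {
	blobs := map[string][]byte{"aa": {1}, "bb": {2}, "cc": {3}}
	manifests := map[string]string{"fileA": "aa\nbb"}
	fmt.Println(purge(manifests, blobs)) // only "cc" is unreferenced
}
```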
An open issue is simultaneous access. I am not sure we need to address it anyway.
Alternative file store
An alternative file store would be a single file (JSON, SQLite, another key-value store?) holding all of the file references, but then you need an exclusive session. I am putting this out there, but I do not think it is the right answer.
Cost of Chunking
I think rclone should follow its current process for deciding whether or not to compute the hash of a local file. It really depends.
I think there should be an (advanced?) option to set the minimum and maximum chunk sizes. The average size should also be settable (though I think it can only take certain powers of two). If the chunk size changes, deduplication will likely not work against the previously-sized chunks, but there would be no risk of data loss.
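For illustration only, here is a toy content-defined chunker in the gear-hash style (the function, its parameters, and the hash are my own invention, not rclone code), showing how the min/max bounds and a power-of-two average interact — the average is controlled by how many low bits of the rolling hash must be zero, which is why only certain base-2 values make sense:

```go
package main

import "fmt"

// chunkBoundaries returns cut points for content-defined chunking:
// a gear-style rolling hash picks boundaries where the low avgBits
// bits are zero (average chunk ~2^avgBits bytes), clamped between
// minSize and maxSize.
func chunkBoundaries(data []byte, minSize, maxSize int, avgBits uint) []int {
	// Deterministic "gear" table of pseudo-random values per byte value.
	var gear [256]uint64
	seed := uint64(0x9E3779B97F4A7C15)
	for i := range gear {
		seed ^= seed << 13
		seed ^= seed >> 7
		seed ^= seed << 17
		gear[i] = seed
	}
	mask := uint64(1)<<avgBits - 1

	var cuts []int
	start := 0
	var h uint64
	for i, b := range data {
		h = h<<1 + gear[b]
		if i-start+1 < minSize {
			continue // never cut below the minimum chunk size
		}
		if h&mask == 0 || i-start+1 >= maxSize {
			cuts = append(cuts, i+1)
			start = i + 1
			h = 0
		}
	}
	if start < len(data) {
		cuts = append(cuts, len(data)) // final partial chunk
	}
	return cuts
}

func main() {
	data := make([]byte, 1<<16)
	for i := range data {
		data[i] = byte(i * 31)
	}
	cuts := chunkBoundaries(data, 512, 8192, 11) // min 512 B, max 8 KiB, avg ~2 KiB
	fmt.Println("chunks:", len(cuts))
}
```

Because boundaries depend only on content within the window, an insertion early in a file shifts only nearby chunks, which is what makes the deduplication survive edits. Changing minSize/maxSize/avgBits moves all the boundaries, which is exactly the "won't dedup against previously-sized chunks" caveat above.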
I think I covered most of the issues. As I said in the preface, I cannot write this at the moment, but if/when I finally learn Go, and if nobody does it first, I would very much like to see this backend, so I will try to write it myself.