Can we add erasure coding to rclone?

I would like to know whether it is feasible to add erasure coding to rclone. What I am thinking of is a new remote type called erasure-code. For this type we would define, for example, 3 of 5, meaning each data block written to the remote is encoded so that it is striped over 5 clouds, but only 3 of those clouds are needed to recreate (read) the original data. This adds reliability as well as "encryption", since the original data can only be reconstructed if 3 of the 5 clouds "cooperate".

"All" that needs to be added is code to: (in this example) encode the data block to be written into 5 pieces and then send these pieces to the 5 clouds (instead of just one). On reading we send requests to all 5 clouds and use the first 3 answers we get to recreate the data.

A little overhead, but we gain a lot of reliability and "encryption".


This is certainly technically feasible as a new backend.

We have the union backend, which can do something like RAID1 mirroring, i.e. copy the same data to multiple cloud storage providers.

RAID1 is technically an erasure code, but it is not very space efficient and does not provide any "encryption". An erasure-code remote would stripe a block of data across N remotes (or local paths, for that matter) such that only M (< N) of the N remotes need to be accessible. When M > 1, each piece stored on a single remote is effectively "encrypted", because at least M-1 other remotes are needed to restore the original content.

I would love to see an implementation of an erasure-code type of remote. My guess is the code will be similar to the union remote that already exists, and probably simpler, because no policies are needed like with the union backend.

Another advantage is that a read from an erasure-code remote can be much faster than from a single remote.

I have a soft spot for erasure codes (earlier in my career I used to design forward error correction codes for satellite communication).

I think an erasure backend is perfectly feasible. Maybe as part of the union backend, but maybe not.

There is even a nice Go library for erasure codes written by Klaus Post (who is also an rclone maintainer).

Do you want to have a go at this?

I would love to have a go (pun intended) at this, but I do not have the skills to work on a Go project. Back in 2006 I was the first person to use the then newly announced Amazon S3 to create an infinitely large disk device (that never breaks); see https://forums.aws.amazon.com/thread.jspa?threadID=10271. So basically a precursor to "rclone mount". All of that was written in a combination of C and Python. Separately I worked on an erasure code project, but I never put the two together. I might be able to help with the design and maybe I can do some beta testing. Have you asked Klaus Post whether he is interested in undertaking this?

I would suggest not making this part of the union remote, since it serves a different purpose. The union remote is about merging different cloud storages into one bigger one. The erasure-code remote (M out of N, where M < N and M > 1) would do the following:

  1. Keep all of your data readable even if some clouds are unavailable (the other day both OneDrive and Google Drive had major outages, fortunately not at the same time).

  2. Keep your data safe from snooping by any one cloud provider. Rumors are that OneDrive is easily hackable; with the data erasure-coded, an attacker would need to compromise more than one cloud to be able to read it (M > 1).

  3. Increase download speed, first because the data is striped, and second because we only need the first M "replies" to a download request, so we effectively use the fastest-responding clouds (see the read-path sketch just after this list).
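
To illustrate point 3, here is a minimal sketch of the read side, again assuming a 3-of-5 layout and the github.com/klauspost/reedsolomon library; how shards are fetched and where the original block size is stored are assumptions of the sketch:

```go
package erasure

import (
	"bytes"

	"github.com/klauspost/reedsolomon"
)

// readBlock reconstructs one logical block from whichever shards arrived first.
// shards must have length 5 (3 data + 2 parity); entries for remotes that were
// slow or unavailable are passed as nil. originalSize is the block size assumed
// to be stored in the remote's metadata.
func readBlock(shards [][]byte, originalSize int) ([]byte, error) {
	enc, err := reedsolomon.New(3, 2)
	if err != nil {
		return nil, err
	}

	// Rebuild any missing shards; this succeeds as long as at least 3 of the 5
	// shards are present.
	if err := enc.Reconstruct(shards); err != nil {
		return nil, err
	}

	// Concatenate the data shards back into the original block, dropping the
	// zero padding added on write.
	var buf bytes.Buffer
	if err := enc.Join(&buf, shards, originalSize); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```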

Let's keep designing this; I would love to hear from others whether this would be of interest to them.

Can you open a new issue on Github about this?

If you can write up the design ideas so far, then I can ask Klaus if he would like to work on such a thing.

Yes I agree - an erasure backend should be separate to the union backend.

Ideally it should cover the RAID 1 mirror case too, so 1 data + N replicas.
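i
To make that concrete, here is a purely hypothetical rclone.conf sketch; the backend name, every parameter name, and the space-separated upstreams style (borrowed from the union backend) are all invented for illustration:

```
# Hypothetical 3-of-5 erasure-code remote: any 3 of the 5 upstreams
# are enough to read the data back.
[ec]
type = erasure-code
upstreams = s3:bucket gdrive:dir onedrive:dir dropbox:dir b2:bucket
data_shards = 3
parity_shards = 2

# The RAID1 mirror case as a degenerate code: 1 data + 2 parity shards,
# so any single upstream can serve the whole file.
[mirror]
type = erasure-code
upstreams = s3:bucket gdrive:dir onedrive:dir
data_shards = 1
parity_shards = 2
```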

GitHub issue entered at: new erasure remote: type to increase availability, download speed and security · Issue #5267 · rclone/rclone
