Erasure coding: planning a Reed–Solomon virtual remote

I’m working on a Reed–Solomon–based erasure‑coding virtual remote for rclone and would like feedback on the design.

Motivation

Erasure coding splits data into k data shards and m parity shards; any k shards are enough to reconstruct the file, so you can lose up to m shards. Many storage systems (e.g. Backblaze B2, Ceph EC pools, Swift EC) use Reed–Solomon internally with layouts like 17+3 or 10+4 to get high durability with moderate overhead. In Go, github.com/klauspost/reedsolomon provides a fast, production‑grade implementation with systematic RS and streaming support over io.Reader/io.Writer.

There is already a PR for a simple RAID3‑style virtual remote: fixed 2+1 layout (two data shards plus one XOR parity shard) across three remotes, tolerating a single failure. It defines a self‑describing per‑shard header/footer so shards can be reconstructed even if the rclone config is lost, and it stores canonical hashes and last modification time inside each shard. This means the encoding does not depend on which backend features individual remotes support (hashes, mtimes, metadata), because that information is preserved in the shard payload itself.

Idea: sr Reed–Solomon remote

The plan is to generalize RAID3 into an sr virtual remote using Reed–Solomon:

  • For each file:
    • Split into k data shards.
    • Compute m parity shards, k + m total.
    • Store each shard on a different underlying remote/object.
    • Read from any k available shards to reconstruct.

Configuration would follow the union style (one backend type, multiple instances with different params and upstreams):

[sr-10-4]
type          = sr
data_shards   = 10
parity_shards = 4
upstreams     = remote1:bucket1 remote2:bucket2 ... remote14:bucket14
# placement_policy = spread | roundrobin | pinned (TBD)
# stripe_size      = 4M (TBD)

Different sr remotes can choose different (k, m) layouts (e.g. 5+2, 8+3, 10+4).

Metadata and streaming

I plan to reuse and extend the RAID3 per‑shard metadata block so shards are self‑describing:

Core fields in the sr shard header/footer:

  • schemetype (rs-ec), schemeversion
  • objectid
  • dataparts (k), parityparts (m), partindex, stripesize
  • objectsize, padding
  • objecthashalgo / objecthash (canonical logical hash)
  • parthashalgo / parthash (per‑shard integrity)
  • objectmtime

Encoding would be streaming:

  • Choose a stripe size.
  • Loop: read stripe data into k buffers, call the RS encoder to generate m parity buffers, then write all k + m shard blocks to upstreams in parallel.
  • At EOF, finalize headers/footers with objectsize and padding.

The github.com/klauspost/reedsolomon API already supports this style.

Looking for feedback

Comments, design suggestions, and pointers to prior art are very welcome.


I like this generalized raid3 idea.

It's going to need some care working with remotes that are down, but context cancellations are useful here.

Some backend commands to repair missing shards would also be useful.


As I’m working on a Reed–Solomon (RS) based virtual backend for rclone as a generalization of the current raid3 backend, I’d like to discuss two design questions up front:

  1. Should an RS backend use a pool of remotes per instance that is larger than k + m?
  2. How should we handle remotes that are temporarily or permanently down?

For context, the existing raid3 virtual backend has a fixed layout:

  • k=2 data shards
  • m=1 parity shard
  • l=k+m=3 remotes

RAID3 was originally designed to tolerate one failing disk in a set of three. If one remote fails or one shard is corrupt, raid3 can still fully read the data, but it refuses writes as long as a remote is down. That’s how RAID3 behaved on disks, and that’s how the current rclone raid3 virtual backend behaves.

With Reed–Solomon, large storage systems usually work differently. They pick a layout of k data shards and m parity shards and then distribute those k + m shards across a pool of storage nodes or data centers, usually with l > k + m. The idea is:

  • Layout: choose k and m to tolerate up to x ≤ m failed shards/remotes.
  • Pool of remotes: have l possible targets, where l ≥ k + m.
  • Assignment: for each file, place its k + m shards on distinct remotes chosen from that pool.

If we only had l = k + m and one remote was down, we’d have to choose between:

  • refusing to write (like raid3 does today),
  • writing a known incomplete set with only k + m − 1 shards, or
  • silently changing the layout for that file (e.g. from k + m to k + (m − 1) or (k − 1) + m).

All of these have unpleasant consequences for correctness and recovery.

In contrast, many storage systems use l > k + m so that per file they can always place a full k + m shard set, skipping any remotes that are currently unhealthy. As long as at least k + m remotes in the pool are up, they don’t have to change the per‑file layout.

Proposal

For an rclone RS virtual backend, I propose:

  • Each RS backend instance has a fixed layout (k,m).
  • Each instance is configured with a pool of remotes of size l, with the expectation that l > k + m.
  • For each file, the backend always writes a full set of k + m shards, choosing distinct remotes from the pool that are currently healthy.
  • If there are fewer than k + m healthy remotes at write time, the backend refuses the write instead of degrading the layout.

This means reads can still tolerate up to m missing shards per file, while writes require at least k + m available remotes from the configured pool.
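A minimal sketch of the write-side pool selection (the function and parameter names are hypothetical, and picking healthy remotes in plain pool order is the simplest possible policy):

```go
package main

import "fmt"

// pickUpstreams chooses k+m distinct healthy remotes from the configured pool
// of l upstreams, in pool order, and refuses the write when fewer than k+m
// remotes are currently healthy.
func pickUpstreams(pool []string, healthy func(string) bool, k, m int) ([]string, error) {
	need := k + m
	picked := make([]string, 0, need)
	for _, r := range pool {
		if healthy(r) {
			picked = append(picked, r)
			if len(picked) == need {
				return picked, nil
			}
		}
	}
	return nil, fmt.Errorf("only %d of the required %d remotes are healthy", len(picked), need)
}

func main() {
	pool := []string{"r1:", "r2:", "r3:", "r4:", "r5:", "r6:"}
	down := map[string]bool{"r3:": true} // one unhealthy remote is simply skipped
	targets, err := pickUpstreams(pool, func(r string) bool { return !down[r] }, 3, 2)
	fmt.Println(targets, err) // [r1: r2: r4: r5: r6:] <nil>
}
```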

The details of:

  • how to map shard index → remote, and
  • how to name the shard objects

can be discussed in follow‑up posts.

I’d appreciate feedback on whether the “pool larger than k + m” assumption matches the rclone community’s expectations of how an RS backend should behave.

For the next step of the Reed–Solomon (RS) virtual backend design, I’d like to discuss two related questions:

  1. How does the backend discover which shards belong to the same logical file?
  2. How should shards be named on the underlying remotes?

File footer

As with the current raid3 backend, I propose to store a file footer in each shard that contains all information needed for reconstruction. This footer would also include the layout (k, m) so that the shard is self‑describing even if the backend config changes or is lost. The reason to use a footer instead of a header is to support streaming: key properties such as hashes, the final size and padding are only known once encoding finishes.

Placement of shards

For a first implementation, shard placement could be simple: shard index 0 goes to the first remote listed in the RS backend configuration, index 1 to the second remote, and so on. More advanced placement schemes could be added later, but a direct index→remote mapping is easy to reason about and debug.
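That direct mapping is essentially a one-liner; a hedged sketch (names are illustrative):

```go
package main

import "fmt"

// remoteFor implements the direct mapping described above: shard index i is
// stored on the i-th upstream from the backend configuration.
func remoteFor(upstreams []string, shardIndex int) (string, error) {
	if shardIndex < 0 || shardIndex >= len(upstreams) {
		return "", fmt.Errorf("shard index %d out of range for %d upstreams", shardIndex, len(upstreams))
	}
	return upstreams[shardIndex], nil
}

func main() {
	r, _ := remoteFor([]string{"remote1:bucket1", "remote2:bucket2", "remote3:bucket3"}, 2)
	fmt.Println(r) // remote3:bucket3
}
```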

When all remotes are available, the backend would:

  • read the footer on each shard found under the given file name,
  • verify that it has the expected set of indices for that file, and
  • confirm that all shards agree on hashes and timestamps stored in the footer.
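The footer agreement check from the last step might be sketched like this (shardFooter is a hypothetical subset of the proposed footer fields, not the actual struct):

```go
package main

import "fmt"

// shardFooter is a hypothetical subset of the proposed footer, just enough to
// illustrate the consistency check across shards of one logical file.
type shardFooter struct {
	ObjectHash  [32]byte
	ObjectMtime int64
	PartIndex   int
}

// consistent reports whether all discovered footers agree on the canonical
// object hash and modification time, i.e. whether they belong to the same
// upload of the same logical file.
func consistent(footers []shardFooter) bool {
	if len(footers) == 0 {
		return false
	}
	for _, f := range footers[1:] {
		if f.ObjectHash != footers[0].ObjectHash || f.ObjectMtime != footers[0].ObjectMtime {
			return false
		}
	}
	return true
}

func main() {
	a := shardFooter{ObjectMtime: 1700000000, PartIndex: 0}
	b := shardFooter{ObjectMtime: 1700000000, PartIndex: 1}
	stale := shardFooter{ObjectMtime: 1600000000, PartIndex: 2} // from an older upload
	fmt.Println(consistent([]shardFooter{a, b}), consistent([]shardFooter{a, b, stale})) // true false
}
```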

Failing remotes / corrupt shards

If one or more remotes are down, the backend will miss some data shards and must read parity shards (indices ≥ k) instead. The same applies if footers disagree, e.g. hashes or creation timestamps differ, which might happen if:

  • an older set of shards with the same file name still exists on some remotes, or
  • a remote was previously available during an earlier upload but is now missing or stale.

In those cases, the backend would treat inconsistent or missing shards as erasures and reconstruct from any valid subset of k shards.

File names of shards

As with the rclone raid3 backend, I propose to use the same object name on each remote for all shards of a given logical file. Since we must read the footer anyway to reconstruct, and the footer carries layout and identity information, there is no strict need to encode shard index or layout into the object name itself. Keeping names identical also simplifies operations like listing and rm.


Dear rclone developers, do these design choices look reasonable for a first RS backend implementation?

Again, I’d be happy to get comments on this.

Proposal: shared EC (erasure coding) footer package for RAID3 and future RS backend

RAID3 adds a small EC footer to each particle file. The current footer is:

// EC footer constants (94-byte footer at tail of each particle)
const (
    FooterMagic   = "RCLONE/EC" // 9 bytes
    FooterVersion = 1
    FooterSize    = 94
)

// Layout: Magic 9, Version 1, ContentLength 8, MD5 16, SHA256 32,
// Mtime 8, Compression 4, NumBlocks 4, Algorithm 4,
// DataShards 1, ParityShards 1, CurrentShard 1, Reserved 4.
type Footer struct {
    ContentLength int64
    MD5           [16]byte
    SHA256        [32]byte
    Mtime         int64
    Compression   [4]byte
    NumBlocks     uint32
    Algorithm     [4]byte
    DataShards    uint8
    ParityShards  uint8
    CurrentShard  uint8
    Reserved      [4]byte
}

RAID3 uses this with Algorithm = "R3" and fixed DataShards = 2, ParityShards = 1, plus MD5/SHA256 and mtime so we don’t depend on backend hash/mtime support. The footer is appended inside each particle object (no sidecar), so each shard is self‑describing and reconstructable even if the rclone config is lost.
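To make the byte layout concrete, here is a hedged sketch of footer marshalling. The constants and struct are repeated from above so the example compiles on its own; big-endian encoding is an assumption, and since the listed fields sum to 93 bytes, this sketch zero-pads the tail to reach FooterSize = 94:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

const (
	FooterMagic   = "RCLONE/EC" // 9 bytes
	FooterVersion = 1
	FooterSize    = 94
)

type Footer struct {
	ContentLength int64
	MD5           [16]byte
	SHA256        [32]byte
	Mtime         int64
	Compression   [4]byte
	NumBlocks     uint32
	Algorithm     [4]byte
	DataShards    uint8
	ParityShards  uint8
	CurrentShard  uint8
	Reserved      [4]byte
}

// marshalFooter writes magic and version followed by the struct fields in
// declaration order (big-endian assumed here), zero-padded to FooterSize.
func marshalFooter(f Footer) ([]byte, error) {
	buf := bytes.NewBuffer(make([]byte, 0, FooterSize))
	buf.WriteString(FooterMagic)
	buf.WriteByte(FooterVersion)
	if err := binary.Write(buf, binary.BigEndian, &f); err != nil {
		return nil, err
	}
	for buf.Len() < FooterSize {
		buf.WriteByte(0) // pad so every particle footer is exactly FooterSize bytes
	}
	return buf.Bytes(), nil
}

// readFooter validates magic and version at the tail of a particle and
// decodes the Footer fields.
func readFooter(particle []byte) (Footer, error) {
	var f Footer
	if len(particle) < FooterSize {
		return f, fmt.Errorf("particle shorter than footer (%d bytes)", len(particle))
	}
	tail := particle[len(particle)-FooterSize:]
	if string(tail[:len(FooterMagic)]) != FooterMagic || tail[len(FooterMagic)] != FooterVersion {
		return f, fmt.Errorf("bad footer magic or version")
	}
	err := binary.Read(bytes.NewReader(tail[len(FooterMagic)+1:]), binary.BigEndian, &f)
	return f, err
}

func main() {
	in := Footer{ContentLength: 1234, Algorithm: [4]byte{'R', 'S'}, DataShards: 10, ParityShards: 4, CurrentShard: 7}
	raw, _ := marshalFooter(in)
	out, _ := readFooter(append([]byte("particle payload..."), raw...))
	fmt.Println(len(raw) == FooterSize, out == in) // true true
}
```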

For the planned Reed–Solomon backend (type = "rs", using github.com/klauspost/reedsolomon), I’d like to reuse exactly the same footer format but with Algorithm = "RS" and variable DataShards = k, ParityShards = m.

Structurally, I’m considering something like:

backend/
  raid3/   // type = "raid3"
    raid3.go
  rs/      // type = "rs"
    rs.go
lib/
  ec/
    footer.go  // FooterMagic/FooterVersion/FooterSize + Footer struct

Both backends would import lib/ec and share the EC footer; RS would add the actual Reed–Solomon logic on top.

rclone’s usual pattern is “one backend per directory under backend/ plus shared helper packages”, so the backend/raid3, backend/rs and shared lib/ec layout is meant to follow that convention. We mainly want to check whether there’s any reason this assumption is wrong or if maintainers would prefer the shared EC footer to live somewhere else.

Any objections to using a single RCLONE/EC footer format across multiple EC‑style backends, distinguished by Algorithm (e.g. "R3", "RS") and FooterVersion?

I’ve implemented a virtual Reed–Solomon backend for rclone: https://github.com/rclone/rclone/pull/9301. I kept the initial design deliberately small and focused, and I’d really appreciate any early feedback or review.

Just for documentation: after some more research I found that some storage systems use a quorum-style approach for failed or unavailable nodes. They typically generate k + m shards, but only require any k + x shards to be writable at a given time. This is the model I followed for the new Reed–Solomon virtual remote.

I don't have any suggestions right now, so I just wanted to thank you for your work.
I'm busy with another project right now, but I hope to use your backend in the future.
Is data recovery already supported?

Thanks for your encouraging message.

Yes, the RS remote is intended for backup and restore scenarios once it makes its way into an rclone release.
It is designed so that each file can be reconstructed as long as at least k of the k + m shards are still readable.
Each shard carries a footer with layout information and hashes, so even if one or more remotes disappear you can still read from the remaining ones and recover the file.

There is also a heal command which can be applied to a single file or to an entire namespace.
The current implementation looks for shard sets with missing or invalid shards where at least k shards are still valid, and then reconstructs the full k + m set. It can also be used to heal data after replacing up to m underlying remotes, as long as at least k shards per file remain readable.
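The heal eligibility decision described above can be sketched as a small classification over per-shard validity (all names here are illustrative, not taken from the implementation):

```go
package main

import "fmt"

type healState int

const (
	intact     healState = iota // all k+m shards valid, nothing to do
	repairable                  // some shards missing/invalid, but at least k valid
	lost                        // fewer than k valid shards, cannot reconstruct
)

// classify mirrors the heal decision: given per-shard validity for one file's
// k+m shard set, decide whether the full set can be reconstructed.
func classify(valid []bool, k int) healState {
	n := 0
	for _, v := range valid {
		if v {
			n++
		}
	}
	switch {
	case n == len(valid):
		return intact
	case n >= k:
		return repairable
	default:
		return lost
	}
}

func main() {
	// 10+4 layout with three shards lost after replacing a remote.
	valid := make([]bool, 14)
	for i := range valid {
		valid[i] = i >= 3
	}
	fmt.Println(classify(valid, 10) == repairable) // true
}
```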

Right now there is no threshold like “accept (k + m) - 1 valid shards but heal files that fall below this,” but this is something that could be added if there are good use cases.

I’d be very interested to hear expectations or wishes for the heal command once the pull request is under review, so we can adjust the design based on real‑world workflows.