I’ve started work on a new virtual remote for rclone called raid3. It distributes data across 3 different remotes, so if one part is corrupted or a remote is unavailable/compromised, the data is still accessible.
While raid3 is simple and fast, I’d like to generalize the idea, as suggested by core rclone experts, using Reed–Solomon erasure coding for a more flexible “k‑of‑n” scheme. There was a related discussion back in 2019 about this: “Creating PAR2 files for damage recovery?”
Given that:
- there is an excellent, high‑performance Reed–Solomon implementation in Go (github.com/klauspost/reedsolomon), and
- rclone has a clean, extensible backend/virtual‑remote design,

a Reed–Solomon‑based virtual remote for distributed storage looks very feasible.
How Reed–Solomon works
For a single rclone “file”:
- Fragmentation: Split the file into k data shards.
- Expansion: Use Reed–Solomon to compute m parity shards, so the total number of shards is n = k + m.
- Distribution: Store each of the n shards on a different underlying remote.
- Reconstruction: Any k out of the n shards are sufficient to reconstruct the original file.
This would generalize raid3 into configurable, fault‑tolerant distributed storage: up to m shards can be lost or corrupted without losing the file.
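To make the flow concrete, here is a minimal sketch of the encode/decode cycle on top of github.com/klauspost/reedsolomon; the k/m choice and the simulated shard losses are purely illustrative, not a proposed rclone API:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	const k, m = 4, 2 // 4 data shards + 2 parity shards, n = 6 (illustrative choice)

	enc, err := reedsolomon.New(k, m)
	if err != nil {
		log.Fatal(err)
	}

	data := []byte("the original rclone object contents")

	// Fragmentation: Split pads the data and cuts it into k equal shards.
	shards, err := enc.Split(data)
	if err != nil {
		log.Fatal(err)
	}

	// Expansion: compute the m parity shards in place (shards[k:]).
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing any m shards (here one data shard and one parity shard).
	shards[1] = nil
	shards[5] = nil

	// Reconstruction: any k surviving shards are enough.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}

	// Join needs the original length to strip the padding again.
	var buf bytes.Buffer
	if err := enc.Join(&buf, shards, len(data)); err != nil {
		log.Fatal(err)
	}
	fmt.Println(buf.String())
}
```

Note that Join needs the original length to strip the padding, which is exactly why padding info has to be part of the per‑file metadata discussed below.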
Metadata / format questions
To make this robust and self‑contained, each file’s shards must carry enough information to be reconstructable even if the rclone config is lost or changed.
Core Reed–Solomon per‑file metadata (needed for decoding):
- k – number of data shards
- m – number of parity shards
- padding info (how many bytes of padding in the last shard)
- potentially also algorithm options
Core rclone‑level metadata to preserve per file:
- mtime (original modification time)
- hashes (e.g. the original file’s hash), if available/needed
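Put together, the per‑file metadata could be a small record like the following; all field and JSON names are assumptions for discussion, not a proposed on‑disk format:

```go
// Illustrative per-file metadata; all names are assumptions
// for discussion, not a proposed on-disk format.
type FileMeta struct {
	DataShards   int    `json:"k"`              // k: number of data shards
	ParityShards int    `json:"m"`              // m: number of parity shards
	Size         int64  `json:"size"`           // original size; last-shard padding follows from it
	Algorithm    string `json:"algo,omitempty"` // RS options, if not the default
	MTime        string `json:"mtime"`          // original modification time (e.g. RFC 3339)
	Hash         string `json:"hash,omitempty"` // original hash, if available
}
```

Storing the original size instead of an explicit padding count is equivalent: padding = k * shardSize - size.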
Config vs on‑disk format
The virtual Reed–Solomon remote config defines:
- which underlying remotes are used,
- the default for m and maybe other tuning parameters.
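For illustration, such a config might look something like this; the backend name, the option names, and the space‑separated remotes list (in the style of the union backend) are all hypothetical:

```
[rs]
type = reedsolomon
remotes = s3:bucket gdrive:shards sftp:path local:/mnt/disk b2:bucket onedrive:shards
data_shards = 4
parity_shards = 2
```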
However, if the config is lost, the stored shards themselves should still be self‑describing enough for reconstruction.
This leads to a central design question:
- Should the metadata be embedded in each shard (header/footer inside the object)?
- Or should we use a sidecar object per file for metadata (with some recovery plan if the sidecar is lost)?
Embedded metadata gives per‑shard self‑containment and atomicity on object stores; sidecar metadata keeps shards “clean” but introduces extra failure modes.
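To make the embedded variant concrete, here is a sketch of what a self‑describing shard header could look like; the magic value, version byte, and layout are assumptions, not a proposal:

```go
import (
	"encoding/binary"
	"encoding/json"
	"io"
)

// writeShardHeader prefixes a shard's payload with a self-describing
// header: magic, format version, shard index, and a length-prefixed
// JSON FileMeta (the struct sketched above). Purely illustrative.
func writeShardHeader(w io.Writer, shardIndex uint32, meta FileMeta) error {
	metaJSON, err := json.Marshal(meta)
	if err != nil {
		return err
	}
	if _, err := w.Write([]byte("RCRS")); err != nil { // magic (assumed)
		return err
	}
	var hdr [9]byte
	hdr[0] = 1                                                  // format version
	binary.BigEndian.PutUint32(hdr[1:5], shardIndex)            // position of this shard (0..n-1)
	binary.BigEndian.PutUint32(hdr[5:9], uint32(len(metaJSON))) // metadata length
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err = w.Write(metaJSON) // shard payload follows after this
	return err
}
```

Carrying the shard index in the header lets reconstruction reorder shards fetched from arbitrary remotes, and any reader that recognizes the magic can rebuild the file without the rclone config.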
Looking for feedback
Comments, design suggestions, or pointers to prior art are very welcome.