Solving the Failing Remote Problem — New Virtual Backend: cRaid3 (Request for Comments)

Intro (topic summary)

We’ve built a new virtual backend for rclone called cRaid3, combining three remotes into one fault‑tolerant storage system. It’s an early implementation, and we’d love your feedback, tests, and design input!


Dear rclone community,

Hard disks fail. That’s why we have RAID — multiple drives working together so that when one goes down, your data stays safe and accessible.
The same principle applies to cloud storage: an account can be compromised, a provider can disappear, or access can suddenly be blocked for an entire geographic region or for whole organizations such as NGOs and companies. When that happens, both current and historical data may be at risk.

To address this, we built cloud RAID3, or cRaid3, a new virtual backend for rclone that combines three remotes into one fault-tolerant storage system.


How it works

Imagine you have storage providers in the US, New Zealand, and France.
You bundle them into a single virtual remote called safestorage and use it like any other remote:

$ rclone ls safestorage:
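
For illustration only, a safestorage entry in rclone.conf might look roughly like the sketch below. The backend type and option names are hypothetical placeholders, not necessarily what the pre-MVP uses; the setup helper mentioned further down generates the real configuration.

    # illustrative sketch only; the option names below are hypothetical
    [safestorage]
    type = raid3
    even = us-s3:mybucket
    odd = nz-s3:mybucket
    parity = fr-s3:mybucket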

If the New Zealand provider fails, all your data remains fully accessible for reading.
safestorage reports which backend is missing, and rebuilding uses only the data stored on the two working systems.
You can then set up a new provider in Australia, update your rclone.conf, and rebuild:

$ rclone backend rebuild safestorage:

That’s it: safestorage is ready to store data again, and your data is once more fault-tolerant.


Technical details

RAID3 splits data at the byte level across three backends:

  • Even‑indexed bytes → even remote
  • Odd‑indexed bytes → odd remote
  • XOR parity of each byte pair → parity remote

If one backend fails, the missing data is reconstructed from the other two:

  • Missing even → computed from odd XOR parity
  • Missing odd → computed from even XOR parity
  • Missing parity → recalculated from even XOR odd

This layout tolerates the loss of any single backend with only ~50% storage overhead.
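
To make the layout concrete, here is a minimal, self-contained Go sketch of the striping and single-stream recovery described above. It is illustrative only, not the backend's actual code: real objects are streamed and buffered rather than held in memory, and the odd-length handling discussed later in this thread is omitted.

    package main

    import "fmt"

    // splitRaid3 distributes bytes across the three parts: even-indexed
    // bytes, odd-indexed bytes, and the XOR parity of each byte pair.
    func splitRaid3(data []byte) (even, odd, parity []byte) {
        for i := 0; i < len(data); i += 2 {
            var o byte // a trailing unpaired byte XORs against zero
            if i+1 < len(data) {
                o = data[i+1]
                odd = append(odd, o)
            }
            even = append(even, data[i])
            parity = append(parity, data[i]^o)
        }
        return even, odd, parity
    }

    // recoverEven rebuilds a lost even part from the other two,
    // using even = odd XOR parity.
    func recoverEven(odd, parity []byte) []byte {
        even := make([]byte, len(parity))
        for i := range parity {
            var o byte
            if i < len(odd) {
                o = odd[i]
            }
            even[i] = parity[i] ^ o
        }
        return even
    }

    func main() {
        even, odd, parity := splitRaid3([]byte("hello rclone"))
        fmt.Printf("even=%q odd=%q\n", even, odd)
        fmt.Printf("rebuilt even=%q\n", recoverEven(odd, parity))
    }

Recovering a lost odd or parity part works the same way with the roles swapped.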


Demo available

Integration test scripts and a setup helper are included in backend/raid3/test and backend/raid3.

Make sure to run go build at the root of the forked rclone repository before testing.
If you have MinIO running in Docker, the provided config also includes a minioraid3 remote.


💬 Request for feedback

This is a pre‑MVP — currently slow — but functional and ready for experimentation.
We’d appreciate feedback from the community, especially on design questions such as:

  • What should rclone size return — original data size or total across all parts?
  • How should rclone md5sum behave — should we store the original file’s checksum explicitly?
  • Could the chunker or crypt virtual remote wrap the cRaid3 remote?

Or simple questions like: Should we call it cRaid3 or just raid3? The current pre-MVP is just called raid3.

The pre-MVP is available for download and testing here: GitHub - Breschling/rclone.


Why RAID3?

RAID3 is fast (the parity math is a single XOR per byte pair), simple, deterministic, and state-light.
In traditional disk arrays the dedicated parity disk was a bottleneck, but in cloud storage each remote is an independent provider, so that limitation does not apply. This makes RAID3 a natural starting point for reliable, multi-provider redundancy.


Future directions: more flexibility and encryption?

As we refine raid3, we hope to identify what’s needed for stable, high‑performance distributed backends in rclone.
If the community finds this approach useful, we plan to explore more advanced (but probably more demanding) options such as Erasure Coding and Threshold Encryption (see the 2021 forum topic “Can we add erasure coding to rclone?” between @hvrietsc (Hans) and @ncw (Nick)).


Comments are very welcome.


hello,

that sounds like a very, very complex contraption for a simple copy tool such as rclone.
imo, i would find it difficult to ever trust valuable backup files to that.

fwiw, try to re-use existing rclone wrapped remotes such as chunker and union.

Hello,

Thank you for the feedback and for taking the time to look at the proposal.

The union and chunker backends were important design references for this idea.

  • Union: presents multiple upstream remotes as a single backend, giving a unified namespace over several providers.

  • Chunker: splits individual files into fixed-size chunks to bypass per-file size limits on a single backend.

Raid3 follows both patterns but adds cross-remote redundancy and recovery:

  • Like union, raid3 exposes several remotes as one logical backend, so existing rclone workflows can treat it as a single remote.

  • Like chunker, it splits each object into parts, but these parts are then distributed across multiple independent upstreams instead of stored on just one.

Raid3 adds:

  1. Deterministic striping: each file is split into data and parity stripes and written across three remotes, so any one remote can fail without losing readability.

  2. Disaster recovery: when one remote is lost or corrupted, the backend can rebuild the missing data by reading the remaining stripes and parity, and then repopulate a replacement remote.

  3. Minimal state: layout is derived from the object path and fixed parameters (number of remotes, stripe size), so no central metadata service or external index is required. This keeps the design simple and robust for long-term use.

The post above gives some implementation details; the goal is to keep the failure model and rebuild process predictable while still fitting naturally into rclone's existing backend model.
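
As a rough illustration of what "minimal state" means here, the part locations can be derived from the logical object path alone. The Go sketch below uses a hypothetical naming scheme (the even and odd parts keep the original name on their remotes, and the parity name carries the -el/-ol length suffix discussed further down in this thread); the actual pre-MVP layout may differ.

    package main

    import "fmt"

    // partNames derives the three upstream object names from the logical
    // path and nothing else, so no central index or metadata service is
    // needed. The naming scheme here is hypothetical.
    func partNames(path string, lengthIsOdd bool) (even, odd, parity string) {
        suffix := ".parity-el" // original length is even
        if lengthIsOdd {
            suffix = ".parity-ol" // original length is odd
        }
        return path, path, path + suffix
    }

    func main() {
        e, o, p := partNames("photos/2024/img001.jpg", true)
        fmt.Println(e, "|", o, "|", p)
    }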


it sounds great. good luck with it!

The idea is neat, but I'm not sure byte level is a good approach. Your example uses different geographical areas, so latency differences between providers come into play and your remote will be as slow as your slowest provider.

There's also the nature of cloud providers, where the slow part is the API response time to fetch a file: the transfer itself will be fast, but initiating it will be slow. If we need to do that at the byte level it would be crazy slow and most likely rate limited by multiple providers.

Also, there's a lot of discrepancy in hash support across providers, so I'm not sure how you plan to implement that and keep track of data corruption.

I think it would be a better bet for providers to implement that in their products. A lot of S3-compatible providers already offer this.

RAID performance is very dependent on using identical hardware; it can easily tank when one element is slower than the rest. That mismatch is in the nature of cloud storage, so I would love to hear how you plan to address it. I get that the focus of the remote is not speed but resilience, but if the performance impact is too big, it might become impractical.

Dear Jose,

Thank you for the detailed feedback. Let me address some of these points.

Parallelism and latency
All three remotes are opened and written concurrently, so you do not pay a 3× sequential cost; you pay max(remote latency) plus per‑remote overhead.

Each data remote only sees 50% of the bytes, which helps to offset that overhead.

Uploading the parts in parallel is similar in spirit to S3 multipart uploads, which increase throughput by transferring multiple parts concurrently.

The goal is not to beat a single remote on raw speed, but to stay operational under one‑remote failure.
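
As a toy illustration of that claim, the sketch below only models timing with sleeps; it is not the backend's upload path, and the latency numbers are made up.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        // Stand-ins for the per-part upload latencies of three remotes.
        latencies := []time.Duration{
            120 * time.Millisecond, // e.g. US
            300 * time.Millisecond, // e.g. New Zealand (slowest)
            180 * time.Millisecond, // e.g. France
        }
        start := time.Now()
        var wg sync.WaitGroup
        for _, d := range latencies {
            wg.Add(1)
            go func(d time.Duration) {
                defer wg.Done()
                time.Sleep(d) // stands in for uploading one part
            }(d)
        }
        wg.Wait()
        // Elapsed time is roughly max(latencies) ≈ 300ms, not the 600ms sum.
        fmt.Println("elapsed:", time.Since(start).Round(10*time.Millisecond))
    }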

“Byte‑level” and streaming
“Byte‑level” here is a layout property only: index‑even bytes → remote A, index‑odd bytes → remote B, XOR(even, odd) → parity remote.

I/O is still done as buffered streams; rclone aggregates data and flushes in chunks, so there is no per‑byte remote initialization.

Why byte interleaving?
Example: a 999-byte object splits into:

  • even: 500 bytes
  • odd: 499 bytes
  • parity: 500 bytes

If the odd remote dies, reconstruction must know the original logical length of the object to stop at the right byte; with larger “chunk” blocks you would need extra metadata (block tables, padding rules, etc.) to make this unambiguous.

A byte-interleaved layout, plus a 1-bit length flag encoded as a file extension on the parity object name:

  • *.parity-el → original length is even
  • *.parity-ol → original length is odd

is sufficient to reconstruct any single missing stream from the other two, with a very simple format and recovery procedure.
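
A small Go sketch of this recovery rule (illustrative only: the real backend works on streams, and the flag would be read from the parity object's name rather than passed as a bool):

    package main

    import (
        "bytes"
        "fmt"
    )

    // rebuildOdd reconstructs a lost odd stream from the even and parity
    // parts. lengthIsOdd mirrors the *.parity-ol / *.parity-el naming: when
    // the original length is odd, the last byte pair has no odd byte, so
    // the odd part is one byte shorter than the even part.
    func rebuildOdd(even, parity []byte, lengthIsOdd bool) []byte {
        n := len(even)
        if lengthIsOdd {
            n-- // 999-byte object: 500 even bytes, only 499 odd bytes
        }
        odd := make([]byte, n)
        for i := 0; i < n; i++ {
            odd[i] = even[i] ^ parity[i] // parity = even XOR odd
        }
        return odd
    }

    func main() {
        // Build a 999-byte object, stripe it, then pretend the odd remote died.
        orig := bytes.Repeat([]byte{0xAB, 0xCD}, 499)
        orig = append(orig, 0xEF) // 999 bytes in total
        var even, odd, parity []byte
        for i := 0; i < len(orig); i += 2 {
            var o byte
            if i+1 < len(orig) {
                o = orig[i+1]
                odd = append(odd, o)
            }
            even = append(even, orig[i])
            parity = append(parity, orig[i]^o)
        }
        rebuilt := rebuildOdd(even, parity, len(orig)%2 == 1)
        fmt.Println("odd stream recovered:", bytes.Equal(rebuilt, odd))
    }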
