Block-Level Deduplication + Compression Remote

Hello,

I've been a fan of rclone for many years, and I'm also a fan of Kopia.

Kopia is a CLI backup tool (also written in Go) that performs block-level deduplication and compression on the contents. Data is managed as snapshots, which is quite a different architecture compared to rclone.

I've always wanted an rclone remote that can deduplicate data to maximize my storage usage.

Recently I took inspiration from how kopia splits contents to make my own rclone remote called "dedup."

The dedup backend is a new rclone overlay backend that wraps another remote and provides block-level deduplication and compression. It uses content-defined chunking (rolling hash) to split files into variable-size chunks, hashes each chunk with keyed BLAKE2b-128, compresses with zstd, and stores only unique chunks on the underlying remote. A JSON manifest file per logical file records the chunk list needed to reconstruct it.

This is similar to Kopia's content-defined chunking, but without snapshot management; it's a transparent overlay. Files appear normally, and deduplication happens automatically behind the scenes.
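For anyone who wants to see the pipeline in miniature, here is a rough Python sketch of the idea. It is not the code of the dedup backend: the gear-style rolling hash, the tiny chunk-size limits, the manifest fields, and the function names (`split_chunks`, `store_file`, `load_file`) are all assumptions made for illustration, a plain dict stands in for the wrapped remote, and zlib stands in for zstd so that only the standard library is needed.

```python
import hashlib
import json
import zlib

# Hypothetical gear table: 256 fixed pseudo-random 32-bit values, one per byte value.
GEAR = [
    int.from_bytes(hashlib.blake2b(bytes([i]), digest_size=4).digest(), "big")
    for i in range(256)
]


def split_chunks(data, min_size=64, mask=0xFF, max_size=1024):
    """Split data into variable-size chunks at content-defined boundaries.

    A boundary is declared where the low bits of a gear-style rolling hash
    are all zero, so boundaries follow the content, not fixed byte offsets.
    The sizes are deliberately tiny for the demo.
    """
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def store_file(name, data, key, chunk_store, manifests):
    """Store only unseen chunks and record a JSON manifest for the file."""
    chunk_ids = []
    for chunk in split_chunks(data):
        # Keyed BLAKE2b with a 16-byte (128-bit) digest identifies the chunk.
        chunk_id = hashlib.blake2b(chunk, digest_size=16, key=key).hexdigest()
        if chunk_id not in chunk_store:
            # zlib is a stand-in here; the post describes zstd.
            chunk_store[chunk_id] = zlib.compress(chunk, 9)
        chunk_ids.append(chunk_id)
    manifests[name] = json.dumps({"size": len(data), "chunks": chunk_ids})


def load_file(name, chunk_store, manifests):
    """Rebuild a file by concatenating its chunks in manifest order."""
    manifest = json.loads(manifests[name])
    return b"".join(zlib.decompress(chunk_store[cid]) for cid in manifest["chunks"])


if __name__ == "__main__":
    key = b"demo-key"
    chunk_store, manifests = {}, {}
    base = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(400))
    store_file("a.bin", base, key, chunk_store, manifests)
    store_file("b.bin", b"new header" + base, key, chunk_store, manifests)
    assert load_file("b.bin", chunk_store, manifests) == b"new header" + base
    print(len(chunk_store), "unique chunks stored for 2 files")
```

Because boundaries depend on the bytes themselves, inserting data near the start of a file tends to change only the nearby chunks, which is why this style of chunking deduplicates better than fixed-size blocks.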

Results

Original Data Set

Total size: 917.384 MiB

Dedup remote (no compression)

Total size: 690.422 MiB

Dedup remote + zstd 22

Total size: 426.619 MiB

For comparison, this is the current rclone compress remote with maximum zstd compression:

Total size: 671.503 MiB
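
To put the four totals on a common scale, here is a small Python calculation of the space saved relative to the original data set. The sizes are the ones reported above; the labels and the `percent_saved` helper are just names chosen for this example.

```python
ORIGINAL_MIB = 917.384

# Stored sizes in MiB, as reported above.
stored_mib = {
    "dedup (no compression)": 690.422,
    "dedup + zstd 22": 426.619,
    "compress remote (max zstd)": 671.503,
}


def percent_saved(original, stored):
    """Return the percentage of space saved relative to the original size."""
    return 100.0 * (1.0 - stored / original)


for label, size in stored_mib.items():
    saved = percent_saved(ORIGINAL_MIB, size)
    print(f"{label}: {size / ORIGINAL_MIB:.3f} of original, {saved:.1f}% saved")
```

That works out to savings of about 24.7%, 53.5%, and 26.8% respectively.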

As a test, I also wrapped a compress remote with my dedup remote, but I ran into a large number of file copy errors that I haven't diagnosed, so I don't have valid results for that configuration.

I rclone copy'd the data back to a new directory and ran a diff on the original directory against the newly downloaded one and they are identical.
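
A round-trip check like this can also be scripted. The sketch below is one way to do it in Python, not the command that was actually run: it hashes every file under two directory trees and compares the results. The helper names `tree_digests` and `trees_identical` are made up for this example, and empty directories are ignored.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def trees_identical(original, restored):
    """True if both trees hold the same relative file paths with the same contents."""
    return tree_digests(original) == tree_digests(restored)
```

Calling `trees_identical("original_dir", "downloaded_dir")` (both paths are placeholders) returns `True` only if every file exists in both trees with identical bytes.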

I have admittedly done very little testing and consider it to be highly experimental / proof-of-concept. I'm unsure about data-loss risk with the dedup remote, but I thought it was a cool result and wanted to share.

Kind regards,
Matt

EDIT: Here's a screenshot of the same results as above.

EDIT2: Here's a (boring) video of me interacting with the dedup remote a little.
https://asciinema.org/a/tv01yDe8ppJVBXBo


hi, that sounds like it might have potential, keep working on it.

how would your solution be different from
rclone_serve_restic
or
restic · Using rclone as a restic Backend


Great question! Kopia can also use rclone as a backend, so your snapshots can sync to any target that rclone can interface with.

The defining difference between restic/borg/kopia/backuppc/duplicati/duplicacy and my dedup remote is snapshotting.
All those other solutions take a snapshot of a dataset and store it remotely.
If I delete some files on the local data set to free up space, the next snapshot does not include those deleted files.

Kopia has a way of mounting snapshots and you can even browse multiple snapshots simultaneously but it's a bit more cumbersome than interacting with rclone.

The benefit of the dedup remote is that I can rclone move files without the "layering" that the snapshot management features of the other tools introduce. Everything I upload is visible in one place, essentially.

EDIT: I actually asked on the restic forums in 2020 about a feature like this: Flat storage (Snapshotless repository) - Getting Help - restic forum
The best solution offered there works only on a local filesystem and doesn't help with deduplicating remote storage.

ok, good luck with it
