Creating PAR2 files for damage recovery?

My general pipeline is tar -> zstd -> split -> gpg and I’m backing up to Backblaze B2. Given that they have Reed-Solomon redundancy and damage repair already running on their end and rclone verifies checksums upon upload, is it worth it for me to construct par2 files on my end and upload them as an additional safety measure? It feels redundant and probably is too much, but I’d like to get some second and third opinions on this.
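Concretely, the pipeline is roughly this (file names, key ID, and bucket name are placeholders):

    # create archive, compress, split into chunks, encrypt each chunk, upload
    tar -cf - /data | zstd > backup.tar.zst
    split -b 1G backup.tar.zst backup.part.
    for f in backup.part.*; do gpg --encrypt --recipient KEYID --output "$f.gpg" "$f"; done
    rclone copy . b2:my-bucket --include "backup.part.*.gpg"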

I was thinking something like 50% redundancy, which would mean storing 150% of my current data. Given that I have almost 1TB, I'd be paying an extra $2.50 a month just to store the redundancy files (granted, this is a very rough estimate, since the amount I'm storing will definitely go down due to compression, but think of this as an upper limit).
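Back-of-the-envelope, assuming B2's roughly $0.005/GB/month storage rate: 1000 GB × 0.5 = 500 GB of extra par2 data, and 500 GB × $0.005/GB/month ≈ $2.50/month.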

The other thing that makes this probably useless is that I'm running it on the split+encrypted files. A more useful approach might be running par2 on the tar file before passing it to zstd (or after zstd but before split and gpg). Would that make more sense, given that the only thing currently covered by any integrity check is the split+gpg-encrypted files? It would definitely increase encryption time and disk space used unless I inserted par2 between split and gpg, but if it's a useful integrity measure, it might be worth the extra runtime and storage costs. This would basically only protect against RAM corruption during compression+encryption, though, since the only files ever written to disk are the compressed, split, encrypted chunks. Thoughts?
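If I did run par2 on the compressed archive before split+gpg, it would look something like this (the redundancy level here is just illustrative):

    # create recovery data for the compressed archive, before split/gpg
    par2 create -r10 backup.tar.zst.par2 backup.tar.zst
    # later, after reassembling and decrypting back to backup.tar.zst:
    par2 verify backup.tar.zst.par2
    par2 repair backup.tar.zst.par2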

What are you protecting yourself against?

If you use 50% redundancy then you can lose 50 out of 150 chunks and still have all your data. I think if B2 is going to lose your data, it will be either:

  1. an object or two lost or
  2. all your data lost
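To make the 50-out-of-150 figure concrete (block counts are just illustrative):

    # 100 source blocks + 50% redundancy = 50 recovery blocks, 150 blocks total
    par2 create -b100 -r50 backup.tar.zst
    # any 50 of those 150 blocks can be lost or damaged and par2 can still repair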

Here is what Backblaze have to say about durability

They quote 99.999999999% durability, which means you can expect to keep that percentage of your files per year. So if you had 100 billion files, you could expect to lose about one a year.

So 50% redundancy is way too much to protect against that.

However, there is option 2: B2 loses all your data. Let's say the datacenter burns down. Is that more or less likely? I don't know…

So I’d either go for much less redundancy, or use a second provider and store a complete copy if you want to reduce the probability of losing data further.
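The second-copy option can be as simple as syncing the same backup set to another remote with rclone (remote names and paths are placeholders):

    # push the same local backup set to two providers
    rclone sync /path/to/backups b2:my-bucket
    rclone sync /path/to/backups other-provider:my-bucket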

The way I see it, I’m trying to protect against a couple of separate things:

  1. File corruption server-side after I transfer it. This should be adequately handled by their Reed-Solomon redundancy and parity files.
  2. Corruption of the initial archive during creation - I suppose I could just do a test decompress locally before uploading.

So I guess either way, par2 files don't make much sense. Locally decompressing (or at least testing decompression) would be a good way to ensure my archive is at least reasonable, but it requires decrypting tons of GPG-encrypted files using my poor Yubikey… testing incremental updates is probably more reasonable.
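For reference, the kind of round-trip test I have in mind, reusing the placeholder names from the sketch above (each gpg call needs the Yubikey, and nothing is written back to disk):

    # decrypt each chunk in order, reassemble the stream, and test-decompress it
    for f in backup.part.*.gpg; do gpg --decrypt "$f"; done | zstd -dc | tar -tf - > /dev/null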

Makes sense. B2 is already my 3rd backup - I have my files on my laptop and a local hard drive I back up to fairly often. What I'm thinking, though, is that if it's easier for me to just push up incremental changes every morning or whatever, then I can basically reconstruct all my files through a combination of a local restore (from the last time I fully backed up my laptop locally) and pulling the incremental updates I've made since then from B2.
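If I go that route, GNU tar's incremental mode is one way to produce the daily archives (paths and names are placeholders); the result would then go through the same zstd/split/gpg steps:

    # only files changed since the snapshot file was last updated end up in the archive
    tar --listed-incremental=backup.snar -cf - /data | zstd > incr-$(date +%F).tar.zst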

Huh, thanks for the sanity check :slight_smile:

Note that you can use rclone check to compare local and remote checksums. And if you are feeling really paranoid, rclone check --download.
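For example (paths and remote names are placeholders):

    # compare checksums between a local directory and the bucket
    rclone check /path/to/backups b2:my-bucket
    # paranoid mode: download the remote files and compare the actual data
    rclone check /path/to/backups b2:my-bucket --download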

That is hard to guard against, certainly.

I’m a big fan of MD5SUM files alongside the data as a last resort!
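Something like this, with the chunk names from the earlier sketch standing in for real files:

    # write checksums next to the data before upload...
    md5sum backup.part.*.gpg > MD5SUMS
    # ...and verify later, wherever the files end up
    md5sum -c MD5SUMS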


Right. I should probably just integrate that into my workflow before I delete the gpg files on my end :smiley: Definitely not running --download (this is B2 - I'd pay for the download lol). It's slightly more complicated because a bunch of the encrypted files I uploaded have already been deleted locally, so I'd have to either re-generate them (and thus re-upload them, since the checksums would change) or just trust that the checksums matched upon initial upload - which should be fine. :man_shrugging:
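Going forward, keeping a checksum list before deleting the local copies would let me compare against the remote later without downloading anything, since B2 keeps SHA-1 hashes that rclone can list (names are placeholders):

    # record hashes before deleting the local .gpg files...
    sha1sum backup.part.*.gpg > SHA1SUMS
    # ...then later list the remote-side hashes and compare
    rclone sha1sum b2:my-bucket > remote-sha1sums
    diff <(sort SHA1SUMS) <(sort remote-sha1sums)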

For what it’s worth, I decided to stick to creating detached signatures for each file to (hopefully) detect any tampering. I’ll leave damage recovery up to them, since I suspect any redundancy I create on my end will only end up adding a couple more GBs and not add anything on top of what they’re already doing.
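i.e. roughly this, with placeholder chunk names:

    # sign each encrypted chunk...
    for f in backup.part.*.gpg; do gpg --detach-sign "$f"; done
    # ...and verify after download
    gpg --verify backup.part.aa.gpg.sig backup.part.aa.gpg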

For me, backups are all about conquering my paranoia with statistics :wink:

