Data integrity when doing backups to B2 or S3

I am doing backups to Backblaze B2, using rclone to encrypt them client-side. I am worried about data integrity. I haven't been able to figure out a good way of checking whether data has been corrupted, either in my B2 backups or on my local storage. These are the questions I have:

  • How can I identify files that have been corrupted or damaged on the B2 side, so I can re-upload them?
  • How can I identify files that have been corrupted or damaged in my local backups, so I can download a fresh copy from B2?
  • Is this something I should worry about, or should I just assume that B2 takes care of protecting the data and fixing any issues (e.g. they have something in place to protect data against bit rot)?
  • Would S3 be better at guaranteeing data integrity?

Any suggestions?

Have you read rclone check?

The check command performs an MD5 or SHA1 check together with a modified-time and file-size comparison. See https://rclone.org/overview/ for which hashes Backblaze and S3 support. It does NOT modify the source or the destination.
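
For example, a rough sketch (the remote name b2backup: and the paths here are placeholders for your own setup):

    # Compare local files against the B2 bucket using size plus whatever
    # hash the backend supports (SHA1 for B2); nothing is modified.
    rclone check /path/to/local/backups b2backup:my-bucket/backups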

Secondly, to perform automated syncing, you can use rclone sync together with a combination of the --size-only, --checksum, --ignore-checksum, --ignore-existing, --ignore-size, --ignore-times, --update and --no-update-modtime flags. You will need to read the forum, the GitHub issues and the docs, and run some experiments to understand the exact behavior of these flags.
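
As an illustration only (remote name and paths are made up), a checksum-based sync you can rehearse safely first:

    # Dry run first to see what would be copied or deleted.
    rclone sync --checksum --dry-run /path/to/local/backups b2backup:my-bucket/backups
    # Run it for real once the output looks right.
    rclone sync --checksum /path/to/local/backups b2backup:my-bucket/backups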

As for the topic of bit rot: by "guarantee" you are really speaking about monetary and commercial insurance, which for S3 is covered in https://aws.amazon.com/s3/sla/ . Backblaze has an equivalent SLA. However, unless you have large amounts of data (10 TB, 50 TB, 1 PB, etc.) with mission-critical applications, or you have to provide SLAs to your own customers, you probably do not need to build policies and systems that are synchronized with the S3 SLA. Even a very serious server admin would only rely on the SLA to a certain degree and then build other forms of redundancy (e.g. replication to a different bucket in the same region, a different bucket in a different region, a duplicate in glacial storage, etc.). For personal photos, small business applications and the like, you can just double or triple up your backups as you have done now (S3, Backblaze, GDrive, local NAS). It would be far too complicated to mirror your setup to the intricate details of what the standard SLA really means, and it really depends on what kind of loss you suffer and what recourse you want.

S3 and Google "may" be better, considering they have larger datacenters and more engineers serving enterprise-class customers who demand custom SLAs, so their infrastructure quality will spill over to normal/standard users. But this still doesn't mean you should agonize over who is better: redundancy and holistic thinking are your best line of defence. You probably have a higher chance of user-caused accidents than machine-level issues (e.g. an accidental rm -rf, which has happened to the best of us).

Hope this helps. You will have to ask more specific questions about rclone sync after you have experimented with the flags and options on a small set of files. You may need to write some Python or bash scripts or cron jobs to aid you - rclone alone may not be enough.
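
For instance, a minimal bash sketch of the kind of wrapper you might run from cron; the remote name, paths, log location and mail notification are all assumptions to adapt:

    #!/usr/bin/env bash
    # Nightly integrity check: compare local backups against the B2 remote
    # and keep a log so any differences are easy to spot later.
    set -euo pipefail
    LOG="/var/log/rclone-check-$(date +%F).log"
    if ! rclone check /path/to/local/backups b2backup:my-bucket/backups >"$LOG" 2>&1; then
        # Notify however you prefer; mail is just an example.
        echo "rclone check found differences, see $LOG" | mail -s "backup check failed" you@example.com
    fi
    # Example crontab entry (runs at 03:00 every night):
    # 0 3 * * * /usr/local/bin/check-backups.sh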

Assuming you are using crypt, rclone cryptcheck will warn you about differences. It will take a bit of detective work to figure out which copy is corrupted, the local or the remote, but if you can't download the remote then it is likely the corrupted one, as crypt has a very strong authenticator on every 64 KiB block.

If you care about your data you need to worry about it - that is what rclone check and rclone cryptcheck are for.
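
A hedged example, assuming your crypted remote is called secret: and wraps your B2 bucket:

    # Compares the plaintext local files against the encrypted objects
    # without downloading whole files, and reports any mismatches.
    rclone cryptcheck /path/to/local/backups secret:backups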

Better still make 2 copies with different cloud providers!

S3 is much bigger than Backblaze. Backblaze have a much simpler architecture. I wouldn't like to guess!

Thanks for your response! I am currently using crypt, and I have some crypt-related follow-up questions:

  • When using crypt, should I only use rclone cryptcheck, or should I also use rclone check? If both, what's the difference between them?
  • Does rclone cryptcheck download the object to perform the checks, or is it all local?
  • Are there still issues with B2 and large files when using crypt? I saw a GitHub issue about it but I'm not sure whether it has been fixed.

Thanks!

check, when used on a crypted remote, won't be able to verify the checksums. It will check file presence and size, and it will do that cheaply.

cryptcheck will check the checksum, but it has to download the start of each file (the small 32 byte header, which contains the nonce) and then encrypt and checksum the local file to compare against the remote's stored hash.

Yes, if you send large files to B2 via crypt then they won't have a checksum, so they can't be checked with cryptcheck. This is unfortunate, but it needs an API fix - this is the issue: https://github.com/ncw/rclone/issues/1767
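
Depending on your rclone version, one workaround sketch is rclone check with the --download flag, which compares actual contents rather than relying on stored checksums (the remote name secret: is assumed, as above):

    # Downloads and decrypts each object and compares it byte-for-byte with
    # the local copy - slow and bandwidth-hungry, but it works even for
    # large crypted files that have no SHA1 stored on B2.
    rclone check --download /path/to/local/backups secret:backups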

If you don’t mind me asking, I am a bit confused about the checksums in B2. I was under the impression that they computed the SHA1 checksum on their side to verify it matches the one provided when uploading the file. If this is the case, shouldn’t all files have checksums? Is rclone doing this or is it something else? Also, why do smaller files get checksums and larger files don’t? Is there some threshold in the code?

Thanks again!

Absolutely true for small files.

However, for large files above the upload cutoff (shown below), each chunk has a SHA1 which is checked, but it is up to the client to supply the SHA1 for the whole file.

--b2-upload-cutoff int              Cutoff for switching to chunked upload (default 190.735M)

However rclone needs to know the SHA1 in advance to send it, which it does for non-crypted files, but not for crypted files (as explained in the issue).
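
As an illustration only (paths and remote name are placeholders, and whether this is worth doing depends on your file sizes and bandwidth), raising the cutoff makes more files go up in a single part, which is the path where B2 ends up storing a SHA1 even for crypted data:

    # Hypothetical example: a 1 GiB cutoff means files up to that size are
    # uploaded in one part and keep a SHA1 on B2, at the cost of larger
    # non-chunked uploads.
    rclone copy --b2-upload-cutoff 1G /path/to/local/backups secret:backups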

I hope that is a bit clearer.