Corrupt files on S3, possibly because larger than 5 GB

barkofdelight · September 19, 2018, 3:08pm

I'm storing encrypted backups on S3 provided by Dreamhost. Using rclone v1.39 from OS X.

Due to a change at their end, my data had to be moved to a new cluster. There were three files which could not be moved. Notably, these are the three largest single files I have, and the only ones larger than 5 GB.

Here's what the last message from Dreamhost support said:

Our Cloud Engineers believe that those files are corrupted, most likely
from when they were first uploaded.

I've asked them for more detail on what they mean by "corrupted", but it might take some time (Dreamhost's support isn't the fastest), so I thought I'd post here to see if this might be a known rclone limitation or usage error.

Thanks.

ncw · September 20, 2018, 2:18pm

Dreamhost use Ceph to store files. This issue might be Ceph bug #15886, "Multipart Object Corruption" (rgw).

When did you upload the files?

rclone uses the S3 SDK provided by Amazon to upload files (and has done since early 2015), and I haven't had any other reports of this. So I suspect this is something to do with Ceph or rclone's interaction with it.

As a sanity check, I just tried uploading a 10G file to Ceph and downloading it again, and all was well!

barkofdelight · September 20, 2018, 4:41pm

The files were uploaded in March of 2018.

This is beginning to look more like a problem with their migration tool than a problem with the upload. Downloading the files w/ rclone worked perfectly. (Don't know why I didn't try that first! Guess I trusted their claim that the files were corrupted.)
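The round-trip check described above (download the object again and compare it with the original) can be sketched in Python as below. This is only an illustration: the function names are made up for the example, and it assumes you still have the unencrypted original and compare it with a copy downloaded and decrypted through the same rclone remote, since the encrypted blobs themselves will differ from one upload to the next.

```python
import hashlib


def file_md5(path, block_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it block by block.

    Reading in blocks keeps memory use flat even for multi-gigabyte files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def same_content(original_path, downloaded_path):
    """Return True when both files hash to the same MD5 (hypothetical helper)."""
    return file_md5(original_path) == file_md5(downloaded_path)
```

Calling `same_content()` with the path of the local original and the path of the re-downloaded copy returns `True` only when the two hashes agree.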

Meanwhile, the support team asked me for the md5 sums of those files (which was kinda tricky to compute, as they're encrypted with a random nonce). I supplied those, and this was their response:

Hmm. Curious. Those md5sums don't match the md5sums currently in the
US-West 1 cluster version of the bucket, but they do match what we get
when we download from the US-West 1 cluster to a Linux server. Since
these objects are greater than 5 GB, a multi-part upload would have been
required, so maybe the md5sums aren't going to match and we can just
force the migration anyways, but we're still working with our
DreamObjects engineers to verify that fact.

I don't know what to make of "maybe the md5sums aren't going to match".

Thanks.

ncw · September 21, 2018, 9:21am

Oh, that is good news.

S3 only keeps md5sums of single-part uploads, so there isn't really a canonical way of asking S3 for the md5sum of a large file. You can ask for the ETag, which for a multipart upload is an md5sum of the md5sums of the parts, but its value depends on how big the chunks were when the file was uploaded.

That might be what they are talking about.
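To make the "md5sum of md5sums" idea concrete, here is a minimal Python sketch of how a multipart ETag is commonly observed to be built: the binary MD5 digests of the parts are concatenated, hashed again, and the part count is appended after a dash. The function name `multipart_etag` is made up for this example. It assumes every part except the last has the same size, and that the store (Ceph's gateway here) follows the same convention as Amazon S3, which I have not verified for Dreamhost.

```python
import hashlib
import io


def multipart_etag(stream, chunk_size):
    """Return an S3-style multipart ETag for the bytes in `stream`.

    Assumes every part except the last is exactly `chunk_size` bytes,
    i.e. the same part size the uploading client used.
    """
    part_digests = []
    while True:
        part = stream.read(chunk_size)
        if not part:
            break
        # Keep the raw 16-byte digest of each part, not the hex string.
        part_digests.append(hashlib.md5(part).digest())
    # MD5 over the concatenated part digests, then "-<number of parts>".
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


# The same 25 bytes give different ETags for different part sizes.
data = b"a" * 10 + b"b" * 10 + b"c" * 5
etag_10 = multipart_etag(io.BytesIO(data), 10)  # 3 parts
etag_8 = multipart_etag(io.BytesIO(data), 8)    # 4 parts
print(etag_10 != etag_8)
```

This is why a plain md5sum of the whole file will not match the ETag of a multipart object, and why the ETag can only be reproduced if you know the part size used for the upload.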

barkofdelight · September 26, 2018, 4:58pm

Update from Dreamhost:

Our DreamObjects engineers have finally been able to finish their
investigation into why those three objects were not migrated. But first a
brief discussion of terms. DreamObjects is built on Ceph which utilizes a
RADOS Gateway as the in/out access point for all data. Due to a various
assortment of RADOS Gateway bugs, the initial upload of those three
objects did not actually complete successfully, but the aforementioned
bugs combined in a unique way that caused Ceph to incorrectly catalog the
objects in the bucket index. Unfortunately due to those bugs, we are not
able to recover those objects.

So, it looks like rclone is off the hook.

ncw · September 27, 2018, 9:57am

Thanks for the update. Dreamhost have been Ceph pioneers over the years, so I guess an incident like this is possible. Ceph is now super reliable and powers a lot of object storage systems (e.g. DigitalOcean's), so fingers crossed this is ancient history.

barkofdelight · October 9, 2018, 5:45pm

Well, unfortunately, the same files are failing on the new cluster.

The support team's not-very-encouraging email:

I'm getting inconsistent results when verifying those new uploads.

"inconsistent results"?

Not confidence-inspiring.

ncw · October 10, 2018, 12:54pm

No, it isn't. I hope they resolve it.

barkofdelight · October 22, 2018, 12:17am

At this point I am out of words to describe my frustration.

Here's the latest from Dreamhost Support:

Unfortunately, at this exact moment I'm not sure how we can
definitively verify that those uploaded files are not corrupt. I mean, it
should be noted that that message from last month was from a time when it
was not clearly understood by myself that multipart objects literally
could not have a matching md5sum due to the way in which the objects were
being assembled/uploaded. But that still doesn't answer your question.

And with my limited knowledge only being slightly increased to understand
the whole md5 multipart ETag situation, the best I can do for you right
now is try a Google search for "verify s3 upload" and forward your
question to our DreamObjects engineers for review. Both of which I've
done now (although the Google search is preliminarily only showing me
articles and forum discussions about calculating the md5sum before
uploading and attaching it to the upload as metadata).

I don't know what to do next. Aside from threatening to close my account and move my data to a service that could actually tell me if my files are corrupt (a threat I have already made), I don't know how to get this escalated.

Sigh... Posting on the Dreamhost forum next...