Corrupt files on S3, possibly because larger than 5 GB


#1

I’m storing encrypted backups on S3 provided by Dreamhost. Using rclone v1.39 from OS X.

Due to a change at their end, my data had to be moved to a new cluster. There were three files which could not be moved. Notably, these are the three largest single files I have, and the only ones larger than 5 GB.

Here’s what the last message from Dreamhost support said:

Our Cloud Engineers believe that those files are corrupted, most likely
from when they were first uploaded.

I’ve asked them for more detail on what they mean by “corrupted”, but it might take some time (Dreamhost’s support isn’t the fastest), so I thought I’d post here to see if this might be a known rclone limitation or usage error.

Thanks.


#2

Dreamhost use Ceph to store files. This issue might be: https://tracker.ceph.com/issues/15886

When did you upload the files?

rclone uses the S3 SDK provided by Amazon to upload files (and has done since early 2015) and I haven’t had any other reports of this. So I suspect this is something to do with Ceph or rclone’s interaction with it.

As a sanity check, I just tried uploading a 10 GB file to Ceph and downloading it again, and all was well!


#3

The files were uploaded in March of 2018.

This is beginning to look more like a problem with their migration tool than a problem with the upload. Downloading the files with rclone worked perfectly. (Don’t know why I didn’t try that first! Guess I trusted their claim that the files were corrupted.)

Meanwhile, the support team asked me for the md5 sums of those files (which was kinda tricky to compute, as they’re encrypted with a random nonce). I supplied those, and this was their response:

Hmm. Curious. Those md5sums don’t match the md5sums currently in the
US-West 1 cluster version of the bucket, but they do match what we get
when we download from the US-West 1 cluster to a Linux server. Since
these objects are greater than 5 GB, a multi-part upload would have been
required, so maybe the md5sums aren’t going to match and we can just
force the migration anyways, but we’re still working with our
DreamObjects engineers to verify that fact.

I don’t know what to make of “maybe the md5sums aren’t going to match”.

Thanks.


#4

Oh, that is good news :smiley:

S3 only keeps md5sums for single-part uploads, so there isn’t really a canonical way of asking S3 for the md5sum of a large file. You can ask for the ETag, which is an md5sum of the md5sums of the parts, but its value depends on how big the chunks were when the file was uploaded.

That might be what they are talking about.
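
To illustrate why the two numbers won’t match, here is a rough sketch of how a multipart ETag is commonly computed. It is based on the widely observed behaviour of S3-compatible stores rather than on anything guaranteed, and I’m assuming Ceph’s gateway does the same. The function name `multipart_etag` and the tiny chunk sizes are made up for the example.

```python
import hashlib
import io


def multipart_etag(stream, chunk_size):
    """Compute an S3-style multipart ETag for a binary stream.

    Assumption: the store hashes each part with MD5, then hashes the
    concatenation of those binary digests and appends "-<number of parts>".
    """
    part_digests = []
    while True:
        # For ordinary buffered files, read() returns chunk_size bytes
        # until the final (possibly shorter) part.
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        part_digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


if __name__ == "__main__":
    data = b"a" * 10 + b"b" * 10  # stand-in for a large file

    # Same bytes, different chunk sizes: the ETags differ from each other
    # and from the plain md5sum of the whole file.
    print("plain md5:    ", hashlib.md5(data).hexdigest())
    print("8-byte parts: ", multipart_etag(io.BytesIO(data), 8))
    print("16-byte parts:", multipart_etag(io.BytesIO(data), 16))
```

So to reproduce the ETag the server holds, you would need to know the chunk size that was used for the original upload; a plain md5sum of the downloaded file will never equal it.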


#5

Update from Dreamhost:

Our DreamObjects engineers have finally been able to finish their
investigation into why those three objects were not migrated. But first a
brief discussion of terms. DreamObjects is built on Ceph which utilizes a
RADOS Gateway as the in/out access point for all data. Due to a various
assortment of RADOS Gateway bugs, the initial upload of those three
objects did not actually complete successfully, but the aforementioned
bugs combined in a unique way that caused Ceph to incorrectly catalog the
objects in the bucket index. Unfortunately due to those bugs, we are not
able to recover those objects.

So, it looks like rclone is off the hook.


#6

Thanks for the update. Dreamhost have been Ceph pioneers over the years, so I guess an incident like this is possible. Ceph is now super reliable and powers a lot of object storage systems (e.g. DigitalOcean’s), so fingers crossed this is ancient history.


#7

Well, unfortunately, the same files are failing on the new cluster.

The support team’s not-very-encouraging email:

I’m getting inconsistent results when verifying those new uploads.

“inconsistent results”?

Not confidence-inspiring.


#8

No :frowning: I hope they resolve it.