A number of "corrupted on transfer: sizes differ" errors over the past few weeks during sync from Swift (Rackspace) cloud to local

Thanks for your detailed reply. I will research the link on eventual consistency problems you provided and see if it leads me to a support ticket with Rackspace.

Perhaps the logs or my updates here haven't been enough to spell this out, but the problem actually persists frequently enough that two situations occur:

  1. at times, even after 6 retries, the transfer fails, and I can see the file sizes shown in the log messages switch back and forth between two different numbers that flip-flop between server and local
  2. at times, even when the transfer is reported as successful, downstream file-content monitoring tools suggest the local file didn't actually get updated (the content inside the file is older than the expected threshold of minutes since the last local update)

Ok, I was able to get a support ticket opened to research this problem from the Swift cluster side of things (Rackspace), and they were able to see there was a problem writing to one of the backend nodes in the cluster.

From responses:

"Cloud Files stores your files in triplicate. When reviewing the logs for the most recent upload for the file in question I did see that when it attempted to store the file to one of the three nodes, it had troubles due to lack of disk space."

"I believe the disk space issue would have been on one of the backend storage nodes. That one particular node may not have been reporting its health correctly leading to the problem and why it was missed. It would have affected other writes to that node. The backend storage related issues would not have presented itself as any meaningful message from the public API point of view."

Based on this, I suppose what happens is: when rclone makes API calls to check the server-side info, and again when it actually downloads the file, it's a spin of the wheel as to which of the 3 nodes responds. That would explain not only the wrong meta info, but also the cases where the transfer apparently succeeded (because the download and the meta info came from the same node, the one with the old file) yet then failed the content check downstream, because it was indeed an old file from the problematic node where the backend write had failed.

I am curious whether anyone here knows of anything similar or related that is inspectable or detectable through the Swift API, to determine that something like this is happening or is about to happen (e.g. a per-node file size check), or whether errors are being thrown, or anything else we might do differently in the future to catch this kind of problem ahead of time.
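For what it's worth, one crude client-side check I can imagine (this is purely my own sketch, not an official Swift feature; `probe_object` and the sample sizes/ETags are made up) is to issue the same metadata request several times and see whether the answers disagree, since each request may be served by a different replica:

```python
import collections
import itertools

def probe_object(head_fn, attempts=10):
    """Issue the same metadata request several times and count the
    distinct (size, etag) answers observed. More than one distinct
    answer suggests the replicas disagree (e.g. a stale node)."""
    seen = collections.Counter()
    for _ in range(attempts):
        seen[head_fn()] += 1
    return seen

# In real use, head_fn would do an authenticated HEAD against the
# object URL and return (int(resp.headers["Content-Length"]),
# resp.headers["Etag"]). Here we simulate a cluster where one of
# three replicas holds a stale copy of the object.
replicas = itertools.cycle([(1024, "new"), (1024, "new"), (512, "old")])
result = probe_object(lambda: next(replicas), attempts=9)
print(sorted(result.items()))
```

Of course, real replica selection isn't a neat round-robin, so a clean probe run doesn't prove the replicas agree; it can only raise a red flag when they visibly don't.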

It is a little frustrating that this was not determinable directly through the interfaces (API, web, etc.) provided by the cloud storage provider, but only indirectly through rclone. Then again, that is also a point in favor of rclone and its (default) behavior of not ignoring that something is wrong in its checks. I wonder if there is a way to somehow include a list of possible causes to look for when things like this happen.

Since it is after hours now, we will see over tomorrow and the following days whether the issue is resolved by Rackspace's fix of this backend problem.

That makes sense. It would be luck which node you got the file from.

The Swift cluster should detect this scenario and fix the data and/or drop the node from the cluster.

No, there is nothing in the Swift API that lets you know about the underlying storage architecture. That is deliberate on the part of the API: the underlying storage architecture often changes as nodes are added, data is migrated, etc., and it should make no difference to the user.

I'll take that as a win :wink:

Hopefully rackspace will give that server a poke and all will be well.

The problem appears to be resolved now (no more potential-corruption flags for this file) since Rackspace updated the server node, increasing the disk space and, I presume, fixing the health-monitoring system that failed to alert them to the full disk.

@ncw yes indeed - definitely a win for rclone

To summarize for future readers: rclone refused to transfer the file because the remote endpoint was giving it mixed information about the file compared with the downloaded copy. This happened whenever the problem node (one of the 3 nodes where the file object is stored) was either the node the file was downloaded from or the node the server-side info was retrieved from, but not both. By not ignoring those mismatches, we were eventually able to uncover the underlying problem on the Swift (Rackspace) side.

That either of these two scenarios (remote inquiry or download hitting the problem node) could occur on each retry explains why the file size reported in the logs was mysteriously flip-flopping back and forth between remote and local across retries from time to time.

And in the scenario where the rclone transfer succeeded but our downstream content inspection failed, both the server-side file object info check and the download request must have involved the problem node: the file that node was reporting on and serving was indeed old, since the backend write to that node had failed.

The overall situation also included the scenario where both the remote info request and the download were served by the 2 nodes not experiencing the problem, so successful transfers of the correct file were mixed into the overall results as well, making it even more confusing to understand at first.
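A little thought experiment makes the mix of outcomes above concrete (my own illustration; the node labels and sizes are invented). If the metadata check (HEAD) and the download (GET) each independently land on one of 3 replicas, one of which is stale, then enumerating all 9 combinations shows why we saw clean transfers, retried size mismatches, and the occasional silent stale transfer all at once:

```python
# Three replica nodes; node 2 holds a stale copy after the failed write.
nodes = {0: ("new", 1024), 1: ("new", 1024), 2: ("old", 512)}

outcomes = {"ok": 0, "size_mismatch": 0, "silent_stale": 0}
for head_node in nodes:
    for get_node in nodes:
        head_ver, head_size = nodes[head_node]
        get_ver, get_size = nodes[get_node]
        if head_size != get_size:
            outcomes["size_mismatch"] += 1  # rclone flags and retries
        elif get_ver == "old":
            outcomes["silent_stale"] += 1   # passes checks, stale content
        else:
            outcomes["ok"] += 1

print(outcomes)
```

With these assumptions, 4 of 9 combinations transfer correctly, 4 trip rclone's size check, and 1 (HEAD and GET both hitting the stale node) slips past rclone and only fails the downstream content inspection, which matches the pattern we observed.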

Thanks for the help here @ncw in peeling away the various layers of the problem. I was determined not to just "ignore" the checks but to get to the crux of the problem, especially since we had a repeatable, obvious problem to work with.


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.