Is S3 with crypt and large files safe?

I was about to use S3 with crypt but I read this:

Note that files uploaded both with multipart upload and through crypt remotes do not have MD5 sums.
(from https://rclone.org/s3/#multipart-uploads)

From what I understood, without crypt, files sent with multipart upload do have an MD5. I don't really understand why files sent through crypt to S3 with multipart upload (mandatory for files bigger than 5 GB) do not have a valid MD5.

Are there any risks in hosting large files through crypt on S3 even though these files do not have an MD5, such as a higher chance of corrupted files?

How do local comparisons work without an MD5 on those files, for example when using the sync command?

Thank you for your help

The reason is that when you use the crypt remote, files are encrypted as they are uploaded. And this is crucial, because of another feature of (good) encryption: encrypting the same data twice will yield different results.

To get the MD5 of the encrypted data, you need the full encrypted file. That would require caching the whole encrypted file first (which can get... expensive), because encrypting the file once to calculate the MD5 and then encrypting it again in transit would produce different ciphertext, and therefore different MD5 checksums.

Without the crypt part of the equation, everything's fine because the file already exists in its final form, so no caching is necessary. Just calculate the MD5, upload, and compare.
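To make the "encrypting twice gives different results" point concrete, here is a minimal sketch in Python. The cipher below is a toy built from SHA-256 and a random nonce purely for illustration; it is an assumption of this example, not rclone's actual encryption, and `toy_encrypt` and the key are hypothetical names.

```python
import hashlib
import os


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy nonce-based stream cipher, for illustration only (not rclone's scheme)."""
    nonce = os.urandom(24)  # a fresh random nonce on every call
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(plaintext):
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        keystream.extend(block)
        counter += 1
    # zip() stops at the end of the plaintext, so extra keystream is ignored.
    body = bytes(p ^ k for p, k in zip(plaintext, keystream))
    return nonce + body  # the nonce is stored in front of the ciphertext


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


key = b"hypothetical-key"
data = b"the same file contents" * 1000

first = toy_encrypt(key, data)
second = toy_encrypt(key, data)

# The plaintext MD5 is reproducible; the ciphertext MD5 is not.
plain_md5_stable = md5_hex(data) == md5_hex(data)
cipher_md5_differs = md5_hex(first) != md5_hex(second)
print(plain_md5_stable, cipher_md5_differs)
```

Because each run picks a new nonce, the only way to know the MD5 of what actually lands on the remote is to hash the exact bytes that were sent.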

Personally, I would get nervous if I uploaded files and couldn't validate them. However, that is not the case here! You can use the cryptcheck command to verify uploads after they finish. It will take a while, but it does what you want: it verifies that the uploads completed properly.
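As a rough sketch, a cryptcheck run could be scripted like this. The local path and the remote name `secret:backup` are placeholders, and the helper names are made up for this example. I'm also assuming that rclone exits with a non-zero status when it finds differences. Note that, per the rclone docs, cryptcheck relies on the underlying remote exposing some kind of checksum, so objects stored without one may not be verifiable this way.

```python
import subprocess
from typing import List


def cryptcheck_command(local_dir: str, crypt_remote: str) -> List[str]:
    """Build an `rclone cryptcheck` invocation: plain source first, crypt remote second."""
    return ["rclone", "cryptcheck", local_dir, crypt_remote]


def run_cryptcheck(local_dir: str, crypt_remote: str) -> bool:
    """Run the check; True means rclone exited with status 0.

    Assumption: a non-zero exit status signals differences or errors.
    """
    result = subprocess.run(cryptcheck_command(local_dir, crypt_remote))
    return result.returncode == 0


# Hypothetical paths: replace with your own source directory and crypt remote.
cmd = cryptcheck_command("/path/to/files", "secret:backup")
print(" ".join(cmd))
```

Calling `run_cryptcheck("/path/to/files", "secret:backup")` would then actually execute the check, provided rclone is installed and the remote is configured.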

Thank you very much for your explanations!

Why does this "problem" only occur with multipart upload?

If I understand correctly, when the upload is not multipart, rclone encrypts the file only once (so the result is held in cache or memory), sends it, and then compares the MD5s after the upload is complete. It does not do this for multipart uploads because those files are expected to be larger. Is that right?

So it's not directly related to the file being sent in multipart, but rather to its expected large size, which could cause caching problems?

In other words, sending a 1 GB file without multipart upload could be validated without any problem, but sending the same file with multipart upload means rclone cannot validate it?

Sorry for the inconvenience, I'm a complete newbie :frowning:

Good to know that there is a way to validate file uploads, thank you :slight_smile:

I'm actually not quite sure why it specifies both multipart and crypt here, unless it means that there will be no MD5 checksum in the crypt case and in the multipart case (@ncw?).

For multipart, I know there seems to be a similar issue with B2. I think in most of the multipart or "large file" cases, if the source supports checksums, the large file will be uploaded with a checksum. Otherwise, it won't. Since crypt doesn't support checksums, large files on S3 will not upload with checksums.

@ncw I'm kind of spit-balling here, and some confirmation would be nice :slight_smile:

For a multipart upload, rclone needs to know the MD5SUM in advance of the upload in order to add it to the metadata. Crypt can't supply MD5SUMs in advance (as it would have to read the whole file to get it). So multipart uploads with crypt don't have MD5SUMs.
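Here is a small sketch of why the MD5 isn't available in advance when the data is streamed. It only simulates splitting a stream into parts and hashing them; nothing is uploaded, and `upload_in_parts` and the tiny part size are hypothetical choices for this example.

```python
import hashlib
import io
from typing import BinaryIO, Iterator, List, Tuple


def read_parts(stream: BinaryIO, part_size: int) -> Iterator[bytes]:
    """Yield successive parts of at most part_size bytes until the stream ends."""
    while True:
        part = stream.read(part_size)
        if not part:
            return
        yield part


def upload_in_parts(stream: BinaryIO, part_size: int) -> Tuple[List[str], str]:
    """Pretend to upload part by part; return the running digests and the final MD5."""
    hasher = hashlib.md5()
    running = []
    for part in read_parts(stream, part_size):
        hasher.update(part)
        running.append(hasher.hexdigest())  # MD5 of the data seen so far
    return running, hasher.hexdigest()


data = bytes(range(256)) * 64  # 16 KiB of sample data
running, final_md5 = upload_in_parts(io.BytesIO(data), part_size=4096)

# Only the digest after the last part matches the whole-file MD5.
print(len(running), running[0] == final_md5, running[-1] == final_md5)
```

The whole-file MD5 only exists once the last part has been read, which is too late if the value has to go into the metadata when the upload starts. With a plain local file rclone can hash it beforehand; with crypt, the encrypted bytes only exist as they are streamed.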

I hope that makes sense!

