S3 Deep Archive - general questions


Not using the template on this one since it isn’t a real problem yet.

I was wondering if anyone had experience using S3 Glacier Deep Archive.

I have about 800 GB of data, roughly 190,000 files in nested folders, basically a family picture archive that I want to back up solely for disaster recovery. The data isn’t really changing and is hardly ever touched; it only gets added to over time. I currently back it up locally in various forms and with rclone to an offsite server (SFTP). However, since the data is basically never touched, I am thinking about moving the offsite storage to Glacier Deep Archive to save some cost.

Now I am wondering:

  • would it make sense to tarball the folders before uploading instead of uploading all the individual files?
  • if I tarball, should I use rclone to encrypt the tarballs (10-50 GB each), or encrypt them with some other tool (OpenSSL, PGP, etc.) and use rclone purely for the upload?
  • are there any specific settings on the archive side that anyone would recommend?

I have no experience with Deep Archive, and just from reading it should be more efficient to have fewer, larger files, which leads me to believe that .tar.gz files would be better, but I am unsure at this point.
I want to avoid going through all the trouble of changing my backup destination just to find out that at the end I am not really saving anything.

The issue with the SFTP storage I currently have is that I am paying for about 5 TB of storage but only really using about 1.5 TB in total between the data described above and computer backups. The next step down that is available to me is 1 TB, which isn’t enough, so I am trying to offload the static data that is basically never touched.

Hope I am making sense. Would be great if anyone that has relevant experience to this use case can share some of their experiences with me.

tarballing: definitely yes - substantial difference in your bill, as AWS charges for transactions

rclone crypt vs another encryption tool: does not really matter - whatever you prefer

What you are talking about (cold storage archive/backup) is actually supported quite well by a program called rustic (a restic implementation in Rust) - have a look at the GitHub discussion "Any example using glacier?" (rustic-rs/rustic, Discussion #692).

It is purely CLI stuff, but if this is not an issue then it is a nice fit for this task. Otherwise the rclone and tarball approach is perfectly valid too :)

i have been using aws s3 glacier deep archive for many years.
you need to limit the number of files.

100% correct.
i use 7z, which supports encryption.

fwiw, s3 provides encryption with a key that you supply (SSE-C).

sounds like hetzner storagebox?

if the filenames are unique then there are ways to reduce the number of api calls and the need to list/traverse the bucket

imho, restic and rustic are a terrible solution for deep glacier.

Thank you for both of your replies!

Really, I had no idea. I will have to look into that. What tool are you using: s3cmd, boto …?

Yes, exactly, storagebox. 1TB is too small and 5TB is too big. It is not a bad deal but I am really paying way more than what I need.

The rest, to be honest, I think I have to read again for it to make more sense, but I am guessing the zip or tar approach is what makes the most sense for me at the moment.

once you have that, you can find ways to use it :wink:

i rent a cheap hetzner cloud vm, run emby on it.
the media files are in a rclone crypt stored in storagebox.
in addition, i also use storagebox as a cheap secondary repository for backup copy jobs from veeam backup and replication.

rclone supports SSE-C, no need for s3cmd/boto

rclone and restic cannot create a session token.
so i use boto3 to create a session token and feed that to rclone and restic.
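
A minimal sketch of the session token step described above, assuming boto3 is installed and an MFA device is attached to the IAM user. The function names and the remote name `aws` are hypothetical, not taken from the poster's script; the environment variable names follow rclone's `RCLONE_CONFIG_<REMOTE>_<OPTION>` convention for defining a remote without a config file.

```python
def get_session_credentials(mfa_serial, mfa_code, duration_seconds=3600):
    """Ask STS for temporary credentials (needs boto3 and long-term keys)."""
    import boto3  # imported lazily so session_env() works without boto3

    sts = boto3.client("sts")
    response = sts.get_session_token(
        DurationSeconds=duration_seconds,
        SerialNumber=mfa_serial,  # ARN of the MFA device
        TokenCode=mfa_code,       # current 6-digit code from the device
    )
    return response["Credentials"]


def session_env(credentials, remote="aws"):
    """Map STS credentials to env vars that define an rclone S3 remote."""
    prefix = f"RCLONE_CONFIG_{remote.upper()}_"
    return {
        prefix + "TYPE": "s3",
        prefix + "PROVIDER": "AWS",
        prefix + "ACCESS_KEY_ID": credentials["AccessKeyId"],
        prefix + "SECRET_ACCESS_KEY": credentials["SecretAccessKey"],
        prefix + "SESSION_TOKEN": credentials["SessionToken"],
    }
```

The resulting dictionary can be merged into `os.environ` and passed as `env=` to `subprocess.run(["rclone", "lsd", "aws:"], ...)`, so the temporary credentials never touch the rclone config file.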

Makes sense to me. I was more looking what I need to use for the client side encryption in s3. If I understand that correctly I wouldn’t need rclone or any other non s3 tool then but could just upload my unencrypted tarballs and use s3 client side encryption if I figure out how to use that, did I get that right?

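that is right, with one nuance: SSE-C is server-side encryption with a key that you supply, so the file is sent over TLS and s3 encrypts it on arrival.
if you use rclone to upload a file using SSE-C,
then you can download it using any s3 tool, such as s3cmd, boto, etc., as long as you supply the same key.
for a gui, i have used filezilla and s3browser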

bonus, with SSE-C. the key is not stored anywhere at aws, idrive, wasabi.

  1. can be as simple as rclone copy c:\path\to\files aws:bucket --s3-sse-customer-algorithm=AES256 --s3-sse-customer-key=12345678901234567890123456789012
    passing the key on the commandline is least secure.

  2. more secure, can put the key in the config file
    if someone can access the rclone config file, now they have the key. must encrypt the rclone config file.

[aws]
type = s3
provider = AWS
access_key_id = xxx
secret_access_key = xxx
region = us-east-1
storage_class = STANDARD
sse_customer_algorithm = AES256
sse_customer_key = 12345678901234567890123456789012
  3. most secure, using environment variables.
    since you mentioned boto, i assume that you use python.
    i use subprocess.run to run rclone.exe, and use environment variables for the
  • sse-c key
  • rclone encryption password

tho, with s3 remotes, i create them on the fly, do not use rclone config file.
and that makes it easy to use environment variables for the temporary session token, which can be used with rclone, restic, etc...
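
A rough sketch of the environment variable approach, assuming rclone is on the PATH and a remote named `aws` exists. The helper names are hypothetical; `RCLONE_S3_SSE_CUSTOMER_ALGORITHM` and `RCLONE_S3_SSE_CUSTOMER_KEY` are the environment equivalents of the two `--s3-sse-customer-*` flags, following rclone's usual flag-to-variable naming.

```python
import os
import subprocess


def build_rclone_call(source, dest, sse_key, rclone="rclone"):
    """Return (command, environment) with the SSE-C key kept off the command line."""
    env = dict(os.environ)  # copy, so the parent process environment is untouched
    env["RCLONE_S3_SSE_CUSTOMER_ALGORITHM"] = "AES256"
    env["RCLONE_S3_SSE_CUSTOMER_KEY"] = sse_key
    return [rclone, "copy", source, dest], env


def upload(source, dest, sse_key):
    """Run rclone copy; raises CalledProcessError if rclone exits non-zero."""
    cmd, env = build_rclone_call(source, dest, sse_key)
    return subprocess.run(cmd, env=env, check=True)
```

Because the key only lives in the child process environment, it does not show up in the process list or in shell history the way a command-line flag would.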

Perfect, that helped a lot to understand it.
The SSE-C key can be anything? Any length? It’s basically a password not like an SHA key or something, correct?

Regarding the 3 methods, that all makes sense. I am by no means a python programmer but I can take code and adjust it and write simpler code. More of a hobby. I mentioned boto because I have done some things with s3 (Scaleway) before and I used boto there for certain things.
Environment variables are definitely something I can do :slight_smile: .

Thank you!

That definitely feels a little more advanced than what I might be able to do, but that’s ok. I am fine with a pretty static remote and upload as long as it is encrypted.

So tarball data, then create remote and then use s3 encryption to upload unencrypted tarballs that way I can use any other standard s3 tool potentially even with gui later. Got it. Or at least I hope I got that right :slight_smile:

yes, you got that right

if you need something specific, i could share it from my python backup script.

take a read of https://min.io/docs/minio/container/administration/server-side-encryption/server-side-encryption-sse-c.html

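i am not 100% sure, but for AES256 the key has to be exactly 32 bytes, e.g. 32 alphanumeric ascii characters.
you pass it to rclone using --s3-sse-customer-key='12345678901234567890123456789012'

base64 does not make the key more secure, it is only an encoding, but it lets you use a key made of 32 random bytes that are not typeable characters:

echo -n '12345678901234567890123456789012' | base64

and pass that to rclone using --s3-sse-customer-key-base64.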
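
A small sketch for generating a key instead of typing one, assuming a 256-bit (32-byte) key for AES256 and that the base64 form is handed to rclone through its `--s3-sse-customer-key-base64` option. The function names are hypothetical.

```python
import base64
import secrets


def new_sse_c_key():
    """Return 32 random bytes, base64-encoded as an ASCII string."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")


def is_valid_sse_c_key(encoded):
    """True if the string is valid base64 that decodes to exactly 32 bytes."""
    try:
        return len(base64.b64decode(encoded, validate=True)) == 32
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
```

Store the generated key somewhere safe outside AWS, since objects uploaded with SSE-C cannot be read back without it.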

Perfect, thank you so much! This was all really good information!

Last question: anything specific on the bucket or archive side that I need to pay attention to or configure? I have pretty much left everything at default and not applied really any policies. Just set it to private (or non-public).

I might come back to this.

Thank you again so much for your help and the great explanations!

  1. i have a user policy named mfa.user that is assigned to every user.
    it requires the user to use MFA; to access buckets, the user needs to create temporary session tokens.
{"Version": "2012-10-17",
 "Statement": [ {"Effect": "Deny", "Action": "s3:*", "Resource": "*",
 "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "false"}}}]}
  2. i use a bucket policy to lock a bucket to that mfa user.
{ "Version": "2012-10-17",
	"Statement": [{"Effect": "Allow",
	"Principal": {"AWS": "arn:aws:iam::203752558759:user/en.keepass"},
	"Action":   [ "s3:ListBucket","s3:GetObject","s3:PutObject" ],
	"Resource": ["arn:aws:s3:::en.keepass/*","arn:aws:s3:::en.keepass"]}]}
  3. enable lifecycle policies for buckets.
  4. have rclone use sse-c encryption, as i discussed above

for the archive format, it does not matter too much. i use .7z encrypted with a password.
since the source is photos, there is no point in compressing. i would use compression level 0 (store), e.g. -mx=0 in 7z.
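
For the tar route, a minimal sketch of the same "store only" idea using Python's standard library: one uncompressed `.tar` per top-level folder. The function name and folder layout are hypothetical, and encryption (7z, gpg, or rclone crypt) would still be a separate step.

```python
import tarfile
from pathlib import Path


def tar_folder(folder, out_dir):
    """Pack one folder into <out_dir>/<folder name>.tar without compression."""
    folder = Path(folder)
    out_path = Path(out_dir) / f"{folder.name}.tar"
    # Mode "w" writes a plain tar; "w:gz" would gzip, which gains little on JPEGs.
    with tarfile.open(out_path, "w") as archive:
        archive.add(folder, arcname=folder.name)
    return out_path
```

Calling `tar_folder` once per year or per event folder keeps each archive small enough that a single restore does not have to pull the whole 800 GB.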

fwiw, 800GiB and 190,000 files is not a lot of data.
might find it easier to get 1TiB from idrive or wasabi,
and then not bother with aws deep glacier, not bother using archives.

That is true, too. Still debating with myself about it.

The trouble is to learn all the policy stuff in s3 and set it up correctly. Not doing it enough or hardly ever. Then you forget again…

yeah, policies are complex,

if all you need is sse-c encryption, then no need for policies, etc.
just need two flags, as shown in the example above.

At this point that is almost all I need.
The only other thing is I would like for the AccessKey of my backup user (IAM) to only be able to access that one specific bucket that will contain the backups. Need to figure out, how to do that again.
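
One way to sketch that restriction is an IAM user policy that only names the backup bucket. The bucket name and function name below are hypothetical, and the action list mirrors the three actions used in the bucket policy earlier in the thread; more may be needed in practice (for example for deletes or restores).

```python
import json


def single_bucket_policy(bucket):
    """Build an IAM policy document that allows access to one bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # listing applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {   # reads and writes apply to the objects inside it
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


def policy_json(bucket):
    """Render the policy as JSON text for pasting into the IAM console."""
    return json.dumps(single_bucket_policy(bucket), indent=2)
```

Since IAM denies by default, a user with only this policy attached has no access to any other bucket in the account.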

Other than that, I totally appreciate all the helpful explanations.

Lastly are archive sizes of 50-100gb too big?

To bring the conversation back to rclone, do I need to tell rclone to use multipart-upload somehow?

I have set everything else up, got a bucket, configured it all in rclone, solved my policy topic at least in the basics, tar.gz archived my files and encrypted them with gpg so really all I need is to copy them to s3 but need to use multipart upload since single archive files are big.
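
As far as I know, rclone switches to multipart upload on its own for files above `--s3-upload-cutoff`, with `--s3-chunk-size` and `--s3-upload-concurrency` as tuning knobs. The arithmetic below is only a sketch of why archive size matters: S3 allows at most 10,000 parts per upload and parts of at least 5 MiB, so the smallest workable chunk size grows with the file. The function name is hypothetical.

```python
import math

MAX_PARTS = 10_000          # S3 limit on parts per multipart upload
MIN_PART = 5 * 1024 ** 2    # S3 minimum part size (5 MiB), except the last part


def min_chunk_size(file_size_bytes):
    """Smallest part size (bytes) that fits the file into MAX_PARTS parts."""
    return max(MIN_PART, math.ceil(file_size_bytes / MAX_PARTS))
```

For a 100 GiB archive this comes out at a little over 10 MiB per part, so archives in the 50-100 GB range are well within what multipart upload can handle.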

Ok, further research, answered my own question:

Thank you very much again for all your help @asdffdsa !!!

nice that you got it all figured out.

All thanks to you! @asdffdsa

The speed is incredible I have to say! For archiving this really beats my other storage.

I've been using rclone to back up petabytes of data to Glacier Flexible Retrieval. At 800GB and 190,000 files, your average file size is 4.2MB. Storing any file in Glacier adds a per-object metadata charge: as far as I know it is 8KB billed at S3 Standard pricing plus 32KB billed at the Glacier rate.

190000 * 32KB is roughly 6GB of extra data at the Glacier rate, plus roughly 1.5GB in S3 Standard, on top of the 800GB in S3 Glacier.
It's going to cost you at most very roughly $1.50/year for the per-file overhead. The 800GB in Deep Glacier is going to cost you roughly $10.00/year.
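
The arithmetic above can be sketched as follows. The two prices are assumptions (approximate us-east-1 list prices per GB-month; check the current AWS price list), and the per-object overhead is left as a parameter because it is the uncertain part of the estimate. Request and retrieval charges are not included.

```python
# Assumed prices in USD per GB-month; not authoritative.
STANDARD_USD_PER_GB_MONTH = 0.023
DEEP_ARCHIVE_USD_PER_GB_MONTH = 0.00099


def yearly_cost(data_gb, n_files, standard_overhead_kb, archive_overhead_kb):
    """Estimate yearly storage cost: the data itself plus per-object overhead."""
    standard_gb = n_files * standard_overhead_kb / 1_000_000
    archive_gb = n_files * archive_overhead_kb / 1_000_000
    overhead = 12 * (standard_gb * STANDARD_USD_PER_GB_MONTH
                     + archive_gb * DEEP_ARCHIVE_USD_PER_GB_MONTH)
    data = 12 * data_gb * DEEP_ARCHIVE_USD_PER_GB_MONTH
    return {"data": data, "overhead": overhead}
```

With these assumed prices, `yearly_cost(800, 190_000, 32, 0)` gives about $9.50 for the data and about $1.68 for overhead, while `yearly_cost(800, 190_000, 8, 32)` gives about $0.49 for overhead. Either way the per-file overhead is small next to the data itself.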

==> I wouldn't tarball anything for $1.50/year. Just do a straight sync.
If you wanted to get fancy, you could upload to S3 standard and have AWS lifecycle objects that are only >32KB to Glacier. But your total cost is so small, seems hardly worth the effort to optimize that.
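
A hedged sketch of that lifecycle idea using boto3: the rule transitions only objects larger than a size threshold to Deep Archive. The rule ID, function names, and the one-day delay are hypothetical choices; `ObjectSizeGreaterThan` is the lifecycle filter field for a minimum object size in bytes.

```python
def deep_archive_rule(min_size_bytes=32 * 1024, days=1):
    """Build a lifecycle configuration that moves larger objects to Deep Archive."""
    return {
        "Rules": [
            {
                "ID": "to-deep-archive",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"ObjectSizeGreaterThan": min_size_bytes},
                "Transitions": [{"Days": days, "StorageClass": "DEEP_ARCHIVE"}],
            }
        ]
    }


def apply_rule(bucket):
    """Attach the rule to a bucket (needs boto3 and suitable credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=deep_archive_rule()
    )
```

Note that `put_bucket_lifecycle_configuration` replaces the bucket's whole lifecycle configuration, so any existing rules would need to be included in the same call.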

==> Encryption. AWS will encrypt your objects at rest, but if you don't want anyone at AWS to be able to read your file contents, just set up a crypt remote using that feature of rclone. It will encrypt the file before uploading. Just don't lose your passphrase.

Just know that if you want something out of Glacier, you have to restore it first (for Deep Archive, typically within 12 hours for a standard retrieval and up to 48 hours for bulk). Only then can you read the contents of the object.
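
A minimal sketch of requesting such a restore with boto3. The function names and the 7-day default are hypothetical; the tier names are the two that Deep Archive accepts as far as I know, and the `Restore` header format parsed below is the one S3 returns from a HEAD request on the object.

```python
def restore_request(days=7, tier="Bulk"):
    """Build the RestoreRequest body; 'days' is how long the restored copy stays readable."""
    if tier not in ("Standard", "Bulk"):
        raise ValueError("Deep Archive restores use the Standard or Bulk tier")
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}


def request_restore(bucket, key, days=7, tier="Bulk"):
    """Start the restore job (needs boto3 and suitable credentials)."""
    import boto3  # imported lazily so the helpers work without boto3

    boto3.client("s3").restore_object(
        Bucket=bucket, Key=key, RestoreRequest=restore_request(days, tier)
    )


def restore_finished(restore_header):
    """Interpret the 'Restore' value from head_object; empty means no restore requested."""
    return 'ongoing-request="false"' in (restore_header or "")
```

The restored copy is temporary: after the requested number of days the object goes back to being archive-only, so downloads should be scheduled within that window.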
