Can anyone explain gzip to me, please? Does gzip solve my problem of backups, full of deep file path structures and tiny program files, being lazily thrown at a cloud storage solution as-is? And does gzip work with Amazon S3?

I first read about gzip here:

But there is no mention of how to force gzip to occur, merely a flag allowing it to occur. Does this mean most S3 backends do not support gzip?

Why do I care? Again, I've been investigating how S3 Intelligent-Tiering works, hoping it solves my problem. One issue I've noticed is that objects smaller than 128 KB are treated as objects sized at 128 KB.

This is bad for me. While some of my backups are proper disk image backups of my OS, I have a lot of other, much lazier backups of, say, the state of Python on my machine on some random day: many thousands of files just uploaded loose, because Google Drive was spoiling me by never forcing me to make proper backups :frowning:

A gzip might solve this issue. I assume the g in gzip likely stands for Google. I read up on Google Cloud Storage, but from what I can see in "Overview of Storage Intelligence | Google Cloud Documentation", it offers automated lifecycle procedures but nothing at all like Amazon S3's Intelligent-Tiering, where archived data can have hot metadata…

So, is it possible to force Amazon S3 to use gzip? And is it possible to peek inside a gzip, say for a list of the files inside it, without downloading or otherwise fully accessing it? In fact, I swear I read somewhere that reading the file list of an archive can be a single HEAD API request, rather than the potentially 1,000-10,000 requests it would take to walk the file/folder structure. Not to mention that S3 Intelligent-Tiering not only treats files smaller than 128 KB as 128 KB, but keeps them hot and refuses to let them into cold storage.

In summary: does gzip solve my problem of backups full of deep file path structures and tiny program files being lazily thrown at a cloud storage solution?

I am guessing no? If so, what do I do? What's the correct way to use S3 objects to back up all my data? I use proper full disk images for my OS, but I keep my OS partition small. I have large media files, large program files, and tiny program files. I also care almost MORE about the file/folder path structure and the date/size/name metadata than I do about the data itself.

It feels like this makes S3 a poor fit for me. But only cold storage fits my needs: I have millions of files and 100 TB that I do not expect to EVER want to access AT ALL. BUT I do need access to the metadata to avoid adding duplicates to the collection. Making a disk image of every single folder to preserve the file path structure sounds insane, and it also sounds like it would make accessing the metadata impossible. Gzip sounds like the best of both worlds? But I really need to find a way to have hot metadata and cold data, which again is what S3 Intelligent-Tiering offers. It's just that I am not sure my data is in a format that can work with S3 Intelligent-Tiering at all.

TL;DR: Does gzip solve all my inept fumbling problems? Does gzip work with Amazon S3 Intelligent-Tiering? I assume not. In which case, what can anyone recommend?

I originally dreamed up my problem 3 years ago, back when I had only heard of Google Cloud Storage and, I believe, before rclone supported gzip:

https://forum.rclone.org/t/i-think-rclone-needs-some-sort-of-inverted-chunker-style-overlay/40550/7

I am wondering, if gzip will not work for my purposes at all, maybe restic will? The issue is, even if restic can group small files into bigger chunks to better fit S3 storage… what happens to the metadata of those small files with restic?

Then there is rclone's own Compress backend, but it is still marked as experimental. It has been experimental for a couple of years as far as I know, so I am not sure it is an ideal fix for me either, nor am I sure how rclone's compress would achieve my goals for S3. I could try to compress every folder, BUT rclone compress appears to be about compressing individual files, NOT about fake-compressing folders to generate metadata files for those folders, so that a folder's metadata can be labeled hot while the folder's contents are labeled cold.

Compress does allow zero gzip compression, which I assume is what I want and good for speed: "0 — turns off compression."
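For concreteness, a Compress remote wrapping an S3 remote would look something like this in rclone.conf. The remote name, bucket, and key names here are my guesses from memory of the docs, not verified; `rclone config` will generate the real entry:

```
[compressed]
type = compress
remote = s3remote:my-backup-bucket
mode = gzip
level = 0
```

With level 0, the data passes through in the gzip wrapper without actually being compressed, which matches the "turns off compression" behavior quoted above.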

But again, what… well, I'll just ask it.

Compress would become TOTALLY FLAWLESS and AMAZING if I could tell it to compress every folder, place the compressed folder into cold S3 storage, and then place the metadata file into hot S3 storage. That would work like a dream. It's something I've wanted for years. But it seems slightly out of reach for me right now.

Again unless anyone knows what I mean and already has this solution working for themselves?

I think you are confused about a few things.


gzip = GNU zip


gzip is a program; it has nothing to do with Amazon, cloud, etc...


Correct.
I have been using Deep Glacier for 7+ years.
I only upload large backup files that I plan to never access except in a nightmare emergency.

For example, I upload Veeam Backup & Replication files, 7-Zip files, etc.


Yes: rclone mount the remote and run gzip on the files inside the mount.
But that is not going to work with Glacier / Deep Glacier.
and pay attention to https://rclone.org/s3/#reducing-costs

I don't wish to use Deep Glacier, but instead Intelligent-Tiering.

It has its own Deep Archive tier, which is a tiny bit like Glacier but not quite; it gives most of the cost savings of Glacier, though.

Quote:

* S3 Intelligent-Tiering can store objects smaller than 128 KB, but auto-tiering has a minimum eligible object size of 128 KB. These smaller objects will not be monitored and will always be charged at the Frequent Access tier rates, with no monitoring and automation charge. For each object archived to the Archive Access tier or Deep Archive Access tier in S3 Intelligent-Tiering, Amazon S3 uses 8 KB of storage for the name of the object and other metadata (billed at S3 Standard storage rates) and 32 KB of storage for index and related metadata (billed at S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage rates).

In other words, I want to get a bunch of large files into S3 Intelligent-Tiering's Deep Archive Access tier, which will then be billed at "S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage rates."

At the same time “For each object archived to the Archive Access tier or Deep Archive Access tier in S3 Intelligent-Tiering, Amazon S3 uses 8 KB of storage for the name of the object and other metadata (billed at S3 Standard storage rates)”

So I will have a lot of large files in Deep Archive, a near-Glacier-like state, and each of these large files will generate a small amount of metadata stored at Amazon S3 HOT storage prices.
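To put rough numbers on that quoted overhead, here is a back-of-envelope sketch in Python. It is pure arithmetic from the 8 KB / 32 KB figures in the quote; the object counts are invented for illustration:

```python
# Per the quoted AWS docs, each object archived by Intelligent-Tiering
# carries 8 KB billed at S3 Standard (hot) rates plus 32 KB billed at
# archive rates, regardless of the object's own size.
KB = 1024

def archive_metadata_overhead(num_objects):
    hot_bytes = num_objects * 8 * KB    # object name + metadata, hot rates
    cold_bytes = num_objects * 32 * KB  # index + related metadata, cold rates
    return hot_bytes, cold_bytes

# A million loose tiny files (if they could even be archived):
hot, cold = archive_metadata_overhead(1_000_000)
print(hot // KB**2, "MiB hot,", cold // KB**2, "MiB cold")  # 7812 MiB hot, 31250 MiB cold

# The same data bundled into 1,000 big archives instead:
hot_b, cold_b = archive_metadata_overhead(1_000)
print(hot_b // KB**2, "MiB hot,", cold_b // KB**2, "MiB cold")  # 7 MiB hot, 31 MiB cold
```

Bundling a million tiny files into a thousand big objects cuts that fixed archival metadata overhead by a factor of 1,000, which is the whole argument for packing files before upload.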

PROBLEM: None of my data is sorted this way! And I have no idea what to do! My large files are almost randomly mixed together with millions of tiny files. I need some way to preserve my file/folder structure.

TL;DR: My data is organized as if I am utterly insane. Why? Google Drive Enterprise made me lazy! It sounds like MAYBE gzip could solve my problem, MAYBE rclone compress could solve my problem, and MAYBE S3 Intelligent-Tiering could solve my problem. But I actually have to make all of my nonsense fit together and work.

In summary: it sounds to me like gzip would fix my mixture of small and large files, but it would entirely clash with S3 Intelligent-Tiering, because the metadata Intelligent-Tiering creates is exterior metadata about the gzip object, rather than any sort of natural gzip metadata?

That makes me wonder, though: where is all this gzip metadata stored? Does rclone create a paired metadata file for every gzip file? If that's how it works, perhaps I could upload all my stuff to S3, then lifecycle all the gzip files into archive-tier storage and leave all the gzip metadata files in full-price HOT S3 storage? That would work, right? But maybe that is not how gzip works, because what I am describing is how Compress works.

Googling how gzip works, it feels to me like it will NOT work for me at all.

Gzip exposes metadata without opening the entire compressed stream by using a fixed-size header and footer. The header, placed before the compressed data, can carry the original filename, a timestamp, and optional extra fields; the footer, after the compressed content, carries a checksum and the original size. This lets tools peek at this info without decompressing the whole payload, using specialized utilities like……
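That header is easy to inspect directly. Here is a minimal Python sketch (standard library only) that builds a gzip in memory and parses just its fixed 10-byte header and 8-byte footer. Note it recovers exactly ONE stored filename, because gzip has no concept of a file list:

```python
import gzip
import io
import struct

# Build a small gzip in memory whose header records an original filename.
buf = io.BytesIO()
with gzip.GzipFile(filename="notes.txt", mode="wb", fileobj=buf) as gz:
    gz.write(b"hello world")
data = buf.getvalue()

# Fixed 10-byte header: magic, compression method, flags, mtime, xfl, os.
magic, method, flags, mtime, xfl, os_byte = struct.unpack("<HBBIBB", data[:10])
assert magic == 0x8B1F and method == 8  # gzip magic bytes, deflate

FNAME = 0x08  # flag bit: a null-terminated original filename follows the header
name = None
if flags & FNAME:
    end = data.index(b"\x00", 10)
    name = data[10:end].decode("latin-1")

# Fixed 8-byte footer: CRC32 and uncompressed size of the payload.
crc32, isize = struct.unpack("<II", data[-8:])

print(name, isize)  # notes.txt 11
```

A tar.gz is different again: the file list lives inside the tar, i.e. inside the compressed payload, so it cannot be read without decompressing.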

This means that if the gzip is placed in Glacier, its metadata goes with it. Also, the automatic 8 KB of S3 Intelligent-Tiering metadata left in hot storage will NOT contain the file or folder structure inside the gzip; it will only contain information about the gzip object itself.

As such, I think gzip will not work with S3 Intelligent-Tiering at all.

DOH.

Compress might work well with S3 Intelligent-Tiering, or even unintelligent tiering. I would need to be able to lifecycle all the compressed .gz files into the archive tier and leave all the small metadata files in hot storage. But the problem is, I think rclone compress compresses files, not folders? Unless I am mistaken? Meaning the .json files wouldn't contain folder structures; they would merely contain the normal metadata S3 Intelligent-Tiering already offers?

TL;DR: It feels like neither gzip nor Compress can fix my problem. I am now curious how other people solve this issue.

I have ruled out magically solving my problem. I now merely wish to emulate you.

@asdffdsa You place large backups into Deep Glacier archive-type storage. You use Veeam Backup & Replication files (which I assume are some form of full disk image, like my own Synology backups?). You also mention 7-Zip files, etc.

My issue with all of this is: I want a way to run, let's say, a full rclone lsl command on backups I never wish to retrieve. To do this, I would need all my metadata in S3 hot storage while all my actual data sits in something like Glacier. Have you achieved anything like this? Is Veeam keeping track of the metadata for you on its own somehow?

Does 7-Zip offer a feature to solve this problem? For example, does the 7-Zip program offer a method to export all the metadata for the contents of a 7z file into some sort of readable metadata file? I'd love to fill a deep archive with gigantic 7z files, export all their metadata into some rclone-readable format, and then dump that metadata collection into full-price HOT S3 storage.
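For what it's worth, `7z l -slt archive.7z` prints a technical listing with one `Key = Value` block per entry, which could itself become the hot-storage sidecar. Here is a hedged sketch; the `list_7z` wrapper, the `----------` separator handling, and the sample listing text are my assumptions about 7-Zip's output format, not verified details:

```python
import subprocess

def parse_slt(text):
    """Parse `7z l -slt` style `Key = Value` blocks separated by blank lines."""
    entries, current = [], {}
    for line in text.splitlines():
        if " = " in line:
            key, _, value = line.partition(" = ")
            current[key] = value
        elif not line.strip() and current:
            if "Path" in current:
                entries.append(current)
            current = {}
    if current.get("Path"):
        entries.append(current)
    return entries

def list_7z(archive):
    # Hypothetical wrapper: real -slt output starts with banner and
    # archive-info blocks, which (as I recall) end at a "----------" line
    # before the per-file entries this simple parser expects.
    out = subprocess.run(["7z", "l", "-slt", archive],
                         capture_output=True, text=True, check=True).stdout
    return parse_slt(out.split("----------", 1)[-1])

# Invented sample of the per-file part of a listing:
sample = """Path = backups/env.txt
Size = 1234
Modified = 2021-03-01 09:30:00

Path = media/video.mp4
Size = 5000000
Modified = 2021-03-01 09:31:12
"""
records = parse_slt(sample)
print(len(records), records[0]["Path"])  # 2 backups/env.txt
```

Dumping one such parsed listing per archive as a small text or JSON object next to the big .7z would give exactly the hot-metadata / cold-data split described above, just maintained by hand rather than by S3.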

It seems like maybe my desires are impossible, though? Do you at least get what I am driving at? Or have any ideas? In fact… actually, one sec.

I do, of course, have plaintext files. About once a year I tell rclone to do a full lsl-to-text-file dump from Google Drive Enterprise to my local hard drive. My last lsl text export is 900 megabytes. But I can't feed this text file back into rclone, I can't browse it with rclone ncdu, and I cannot use it when uploading new files to avoid creating cloud duplicates.
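That lsl dump is still machine-readable outside rclone, though. Here is a sketch of loading it into a local index to answer "is this path already in the cloud at this size?" before uploading. It assumes the standard `rclone lsl` line layout (size, date, time, then path), and the sample lines are invented:

```python
def load_lsl_index(lines):
    """Map path -> size from `rclone lsl` output lines."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # size, date, time, then the path (which may itself contain spaces)
        size, _date, _time, path = line.split(None, 3)
        index[path] = int(size)
    return index

def already_uploaded(path, size, index):
    return index.get(path) == size

sample = [
    "     1234 2021-03-01 09:30:00.000000000 backups/python/env.txt",
    "  5000000 2021-03-01 09:31:12.500000000 media/video.mp4",
]
idx = load_lsl_index(sample)
print(already_uploaded("media/video.mp4", 5_000_000, idx))  # True
```

With a real 900 MB dump you would stream the file line by line instead of holding a list in memory, but the dedupe check itself stays this simple.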

In other words, what I want should be possible, because a basic command-line output fed into a text file ALMOST does the job. The problem is that, as far as I know, there is no way to feed rclone false metadata.