I think rclone needs some sort of inverted chunker style overlay

left1000 · August 2, 2023, 12:42am

I think rclone needs some sort of inverted chunker style overlay.

Chunker takes very large files and breaks them down into smaller files. But what about people who have tons of very small files and want an overlay that somehow merges them into larger files?

This sounds pointless: who would want to download a 1 GB file in order to access a 1 KB file?

But if you're using Amazon deep glacier, you will hardly ever download any files. You might, however, be making too many list/head API calls and overpaying for them. I know about --update --use-server-modtime and --s3-no-head, but even at 1000 items per request with --fast-list, the way my data is currently stored would generate far too many API requests over time through the list API alone (at least that's my fear, and that's why I've never tried it).
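One rough way to size that fear is to count the paginated list requests a full scan would need. The sketch below only does the arithmetic: the 1000-items-per-request page size comes from the post, while the object count and scan frequency in the usage note are made-up placeholders, and no pricing is assumed.

```python
import math


def list_requests_per_scan(object_count: int, page_size: int = 1000) -> int:
    """Paginated list requests needed to enumerate object_count objects once."""
    if object_count < 0 or page_size <= 0:
        raise ValueError("object_count must be >= 0 and page_size must be > 0")
    return math.ceil(object_count / page_size)


def list_requests_per_year(object_count: int, scans_per_day: int,
                           page_size: int = 1000) -> int:
    """Total list requests in a 365-day year at a fixed scan frequency."""
    return list_requests_per_scan(object_count, page_size) * scans_per_day * 365
```

For a hypothetical 5,000,000 objects scanned once a day, `list_requests_per_scan(5_000_000)` gives 5000 requests per scan and `list_requests_per_year(5_000_000, 1)` gives 1,825,000 per year. Multiplying by the provider's current per-request price would turn that into a cost estimate.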

I've been looking at Amazon deep glacier, and it would be perfect for my use case, EXCEPT I have too many files. I've made my backups in a lazy fashion, with millions of tiny files just loose.

Since I started using Google's cloud service I've only ever downloaded roughly 0.1-1.0% of the data at most. But I am constantly checking metadata in order to update backups. I never use mount. I never use Plex. I often use rclone copy and target a very large directory, though. Like I said, there are just too many files in my cloud storage right now: too many small files generating too much metadata for Amazon deep glacier's request policies.

Maybe I just need to be a better user of rclone. Obviously if I was smarter I'd have already set all this up. But in my imagination if chunker can exist, why not invertedchunker? rclone does so much for me, without me having to know what I'm doing, what's one more thing?

Edit: The problem would be how to access the metadata for all the files inside the archive/zip; that's not a feature a standard archive.zip supports. A second file containing all the metadata for the archive would be needed. But in order to work on deep glacier, this second file would need to be stored on an alternate remote, or some such. It's an almost impossible problem to solve, I guess? The metadata files for each archive/invertedchunk could maybe be stored on normal S3, and then the deep glacier chunks would essentially never need to be touched. You'd never use up those API requests; instead you'd just be constantly downloading small metadata files from S3. These metadata files would be small, and would fit within S3's free bandwidth limits.
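To make the "bundle plus second metadata file" idea concrete, here is a minimal sketch that packs small files into one zip and emits a separate JSON index that could live on a different remote. It is not an rclone feature; the function names and the index layout (`path`, `size`, `mtime`) are hypothetical, and everything stays in memory for illustration.

```python
import io
import json
import zipfile


def build_bundle(files: dict[str, tuple[int, bytes]]) -> tuple[bytes, str]:
    """Pack {path: (mtime_epoch, data)} into one zip; return (bundle, index_json)."""
    buf = io.BytesIO()
    entries = []
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(files):
            mtime, data = files[path]
            zf.writestr(path, data)
            # The sidecar index records what a listing would need to report.
            entries.append({"path": path, "size": len(data), "mtime": mtime})
    return buf.getvalue(), json.dumps({"entries": entries})


def list_from_index(index_json: str) -> list[str]:
    """Answer a listing from the index alone, without touching the bundle."""
    return [entry["path"] for entry in json.loads(index_json)["entries"]]
```

The bundle would go to the request-restricted remote and the index to the cheap-to-query one; listings and modtime checks then only ever read the index.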

But this would be the most complicated overlay layer ever coded for rclone, because it would require two underlying remotes: one would be deep glacier, or any other API-restrictive remote (even ones that already have solutions in place in rclone, like box.com), and the other a secondary remote with limited storage but less limited API requests.

Sorry if this makes no sense. I'm just trying to think of a solution that would make Amazon deep glacier easier to use. I think this sort of solution is possible, but it's so far beyond me that maybe I'm describing it poorly. In essence I'd want an overlay that would convert lsd, lsl and checkbydate style commands targeted at Amazon deep glacier into downloads of small files from some other remote that didn't restrict API calls at all, but did restrict storage. Heck, even Google Drive instead of S3 could probably host all the tiny metadata files that would be created to pair with the inverted-chunker overlay.

left1000 · August 2, 2023, 1:31am

To further clarify the issue: my 134 TB Google Drive cloud storage contained so many files that running lsl with the output sent to a file left me with the text file fullloggoogledrivecrypt07112023, which is 470 megabytes.

But I could download 470 megabytes every time I wanted to do an lsl or lsd or even a copy (which checks the target to avoid uploading duplicate files).

Any remote on earth could easily store 470 megabytes and allow frequent downloading and editing of that file. And in my version, instead of 134 TB described by one 470 megabyte lsl metadata file, it would be split up like chunker does...

So maybe every 5 GB would have 1 metadata file. My remote would then have a total storage of around 500 megabytes across 26,800 metadata files. This overlay would simply need to know which metadata files went with which invertedchunks on glacier. It would also have to find some smart way of breaking up an upload into nothing but new inverted chunks, since deep glacier invertedchunks couldn't be edited, only deleted or replaced.
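The figures above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 10^12 bytes) and that the 470 megabyte listing would be spread evenly across the bundles, which is a simplification.

```python
TOTAL_BYTES = 134 * 10**12    # 134 TB of stored data
BUNDLE_BYTES = 5 * 10**9      # one metadata file per 5 GB bundle
LISTING_BYTES = 470 * 10**6   # size of the full lsl listing

# Number of bundles, and therefore of metadata files.
bundles = TOTAL_BYTES // BUNDLE_BYTES

# Average metadata file size if the listing were split evenly.
avg_index_bytes = LISTING_BYTES / bundles

print(bundles, round(avg_index_bytes))  # 26800 17537
```

So each metadata file would average about 17.5 KB, small enough to fetch frequently from almost any remote.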

Why do I have a plaintext copy of my lsl? Well, because it's useful. But it'd be nice if rclone had some sort of unified/integrated alternate metadata storage system.

Maybe instead of calling it an invertedchunker it could be called a metadatamirror? Take remoteA and remoteB and set it up so that all the metadata for remoteA is mirrored on remoteB as a file that stores purely metadata.

Then you could do something like:

rclone -v copy "localstuff" "mirrorinvertedchunkerwhateveryouwanttocallit"

It would then grab the metadata from remoteB to run all its --checkers against, then it would upload the data to remoteA and update all the relevant metadata files on remoteB.
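A minimal sketch of that check step, assuming the mirrored metadata is simply saved `rclone lsl` output (size, modification date, time, then path on each line). The helper names are hypothetical, and the comparison uses size only to keep it short; a real check would also consider modification times or hashes.

```python
import re

# One lsl line: right-aligned size, date, time, then the path (may contain spaces).
LSL_LINE = re.compile(r"^\s*(-?\d+) (\d{4}-\d{2}-\d{2}) (\S+) (.*)$")


def parse_lsl(text: str) -> dict[str, tuple[int, str]]:
    """Turn saved lsl output into {path: (size, 'date time')}."""
    listing = {}
    for line in text.splitlines():
        match = LSL_LINE.match(line)
        if match:
            size, date, time, path = match.groups()
            listing[path] = (int(size), f"{date} {time}")
    return listing


def needs_upload(local_sizes: dict[str, int],
                 mirrored: dict[str, tuple[int, str]]) -> list[str]:
    """Paths that are missing from the mirror or whose size differs (size-only check)."""
    return sorted(
        path for path, size in local_sizes.items()
        if path not in mirrored or mirrored[path][0] != size
    )
```

Only the paths returned by `needs_upload` would ever touch remoteA; everything else is decided from the listing held on remoteB.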

ncw · August 2, 2023, 2:20am

I did make a zip backend (not released yet) which allows you to read individual files out of the zip archive without downloading the whole thing. Rclone downloads the central directory which is normally quite small then it knows where all the files are in the zip file.
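The central-directory behaviour can be illustrated with Python's standard `zipfile` module. This is not the rclone zip backend, just a sketch of the same property of the format: the `CountingReader` class, the file names, and the sizes are all made up, and an in-memory buffer stands in for a remote object.

```python
import io
import zipfile


class CountingReader(io.BytesIO):
    """In-memory stand-in for a remote object that counts bytes actually read."""

    def __init__(self, data: bytes):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def make_archive() -> bytes:
    """Build a zip holding one large member and one tiny member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("big.bin", bytes(1_000_000))
        zf.writestr("notes/tiny.txt", b"hello")
    return buf.getvalue()


def fetch_one(archive: bytes, name: str):
    """List all members and read one, reporting how many bytes were touched."""
    src = CountingReader(archive)
    with zipfile.ZipFile(src) as zf:
        # infolist() is served from the central directory at the end of the file.
        listing = {info.filename: info.file_size for info in zf.infolist()}
        # read() seeks straight to the requested member's data.
        payload = zf.read(name)
    return listing, payload, src.bytes_read
```

Calling `fetch_one(make_archive(), "notes/tiny.txt")` returns the full listing and the tiny file's contents while reading only a small fraction of the roughly 1 MB archive, because the large member is never touched.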

That sounds like it could go part of the way towards solving your problem.

You could also use restic. Restic is good at chunking files up together - it makes lots of about 10MB files but they are compressed and encrypted. Restic has a fuse filing system to read individual files back. You can also use any of rclone's backends with restic.

This is kind of like a union with a cache. One day, when I finish the vfscache backend, you'll be able to wrap it over a remote and it would work very much like this. In the meantime you could investigate the cache backend, which works in a similar way.

left1000 · August 2, 2023, 7:53pm

I was using the cache backend ages ago but it didn't work that effectively. I stopped using it when you said you weren't working on it anymore. I could just cross my fingers, though, that you finish vfscache and/or the peek-inside-zip features before the unlimited cloud storage scene entirely combusts and we're all forced to move somewhere like Amazon deep glacier.

It sounds like you have a far better comprehension of the issue than I do, though. I guess I just wanted to make sure someone with expertise was aware of how hard (it feels virtually impossible) it would be for an inept user like me to set something up within Amazon deep glacier's draconian restrictions (policies which make total sense as cost-saving measures, of course).

TLDR: As a novice it's easy to generate a million unnecessary API calls by being foolish. Very easy. I do it constantly.

kapitainsky · August 2, 2023, 8:02pm

I think you are mixing things. Rclone is not backup software. If you want to use AWS Glacier you need some logic to manage its limitations. rclone can be helpful to upload or download files. But it is not a solution for backup purposes.

left1000 · August 2, 2023, 8:06pm

I do not need rclone for backup purposes. I run weekly shadowclone copies from my computer to my Synology NAS. This has proven extremely reliable and effective. The cloud files are far less important things.

But the Synology is so easy to use that I haven't learned any advanced backup lingo/programs/technology/habits/etc. Which is why I use rclone at such a total novice level.

The Synology only keeps backup versions for at most two years to save on space, though, and the cloud files go back 10-14 years. You can imagine that quite a lot of these files are incredibly unlikely to ever be needed again, but it can be useful to access their metadata when comparing against my memory and modern files.

system · October 1, 2023, 8:06pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.