Compression during upload?

When using scripts such as h5ai it's possible to tar files and directories on the fly. It doesn't force you to wait for the tar to finish before the download starts either.

With rclone would it be possible to tar any files under 5M and create tars with max size of 200MB? Or something similar. While uploading on the fly so you don't have to make a duplicate on the host and use up storage space.

The other reason for this is Google's limit of roughly 1-2 file transfers per second, which makes it take forever to transfer anything that isn't zipped or in a tar.

Yes, this would be very useful. The benefits of archiving tons of small files together are obvious just from doing it manually. 5M files aren't the ones that hurt the most, but when you go even smaller (which a surprising number of files are) it really kills performance badly.

A backend that could merge small files into a larger one would be great and I've had my mind on this idea for a long time. Ideally this would allow rclone to present the files "normally" and doing this job transparently for you.

Archiving the files would be one means to do this, and compression is often a good idea on small files as they tend to be quite compressible, but technically I don't think it would be required - so maybe these tasks should be kept as separate functions? As long as rclone knew where the individual files start in the large amalgamation, you could just glue the bits together into one long file. Seeking is apparently quite fast - it's opening and closing files that takes a long time and has pretty hard restrictions on many services. So it should be relatively quick to pull out a handful of files you need from a big package.
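To make the "glue the bits together" idea a bit more concrete, here is a minimal Go sketch (not rclone code - the bundle file name and all identifiers are made up) that appends a set of small files into one blob and records where each one starts, so any of them could later be recovered with a single seek and ranged read:

```go
// Minimal sketch: glue small files into one blob and remember where
// each one starts, so a single ranged read can recover it later.
package main

import (
	"fmt"
	"io"
	"os"
)

// entry records where one source file lives inside the combined blob.
type entry struct {
	name   string
	offset int64
	size   int64
}

func main() {
	out, err := os.Create("bundle.bin") // the combined blob (example name)
	if err != nil {
		panic(err)
	}
	defer out.Close()

	var index []entry
	var offset int64
	for _, p := range os.Args[1:] {
		f, err := os.Open(p)
		if err != nil {
			panic(err)
		}
		n, err := io.Copy(out, f) // append the file's bytes verbatim
		f.Close()
		if err != nil {
			panic(err)
		}
		index = append(index, entry{name: p, offset: offset, size: n})
		offset += n
	}

	// The offset table is all you need to pull any one file back out
	// later with a single seek + length-limited read.
	for _, e := range index {
		fmt.Printf("%s: bytes %d-%d\n", e.name, e.offset, e.offset+e.size-1)
	}
}
```

Uploading one bundle like this instead of thousands of tiny files is exactly what sidesteps the per-file open/close cost.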

I think the largest obstacle to this is how to handle the metadata needed smartly. Where files start and stop has to be saved somewhere, preferably on the remote itself. Do you use an index file per "package", or some sort of centralized database file? In the nitty-gritty details there are a lot of pros and cons to weigh there. But is it possible? Almost certainly.
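For illustration, a per-"package" index could be as simple as a small JSON sidecar stored next to each bundle. This is only one possible shape, sketched with hypothetical Go types:

```go
// One possible shape for a per-package index, stored as a small JSON
// sidecar next to the bundle on the remote. All names here are made up.
package bundle

// IndexEntry describes one small file stored inside a bundle.
type IndexEntry struct {
	Name    string `json:"name"`    // original path of the small file
	Offset  int64  `json:"offset"`  // byte offset inside the bundle
	Size    int64  `json:"size"`    // length in bytes
	ModTime string `json:"modtime"` // preserved metadata
	MD5     string `json:"md5"`     // optional per-file checksum
}

// Index is the sidecar for a single bundle.
type Index struct {
	Bundle  string       `json:"bundle"`  // name of the blob this describes
	Entries []IndexEntry `json:"entries"`
}
```

A sidecar per bundle keeps the damage small if one index is lost or corrupted, while a centralized database file makes listing faster but becomes a single point of contention - that is the sort of trade-off meant above.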

There is a "press remote" backend in the works currently that tackles the compression issue on its own, but it does not currently deal with combining files (which is the even more important issue here, I think). It would perhaps be a natural thing to extend with this functionality, though.

If you have coding experience (rclone uses Go, which is similar to C# and Java), then be aware that anyone can help contribute to rclone's codebase. NCW will no doubt help you along as needed. He is always very enthusiastic about assistance :slight_smile:

@ncw This seems relevant for you I think, for future feature plans and such :smiley:


That sounds good. The only issue would be how rclone handles duplicate files and checksums.

Well, that's one of several considerations you have to make :slight_smile:

One of the benefits of archiving is that you'd have baked-in metadata you could reference. If you "glued them together" you would not be able to have checksums for those individual files - although you'd have one for the large file.

For duplicate files I don't think that's the worst problem. The "index" (in whatever form you decide to make it) should contain the metadata for all the files in that bundle, so rclone could just reference this while it otherwise scans through normally. The backend should be able to deliver this data down the chain transparently I would think.
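Just to sketch what "transparently" could look like on the read side (hypothetical code, not anything that exists in rclone today): given the offset table for a bundle, a single file can be served with one seek plus a length-limited read of the big blob.

```go
// Rough sketch of the read path: serve one small file out of a bundle
// using only a seek and a length-limited read. All names are made up.
package bundle

import (
	"fmt"
	"io"
	"os"
)

// entry mirrors one line of the hypothetical per-bundle index.
type entry struct {
	name   string
	offset int64
	size   int64
}

// readCloser glues a limited reader to the underlying file's Close.
type readCloser struct {
	io.Reader
	io.Closer
}

// open returns a reader covering only the named file's bytes inside
// the bundle, so the caller never notices it was stored glued together.
func open(bundlePath string, index []entry, name string) (io.ReadCloser, error) {
	for _, e := range index {
		if e.name != name {
			continue
		}
		f, err := os.Open(bundlePath)
		if err != nil {
			return nil, err
		}
		if _, err := f.Seek(e.offset, io.SeekStart); err != nil {
			f.Close()
			return nil, err
		}
		// Limit the reader so only this file's bytes are visible.
		return readCloser{io.LimitReader(f, e.size), f}, nil
	}
	return nil, fmt.Errorf("%s not found in bundle index", name)
}
```

On remotes that support ranged requests, that seek could translate into a range read, so only the one file's bytes would ever need to be fetched.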

The worst problems are most likely related to how you keep all this synced properly. As soon as you introduce a new data structure like this it creates another layer of complexity where you have to be very careful - especially if you want it to stay consistent with multiple concurrent users and the like...

Bumping this. I really need some way to back up 10TB+ directories with thousands or hundreds of thousands of files without compressing the entire thing prior to upload.

Some kind of streaming compression that doesn't save anything on the host node and only writes to the remote node would be nice - like h5ai does. Rclone doesn't currently do this, but are there any third-party options that can be used with rclone?

To make it transparent I don't think there is any way around integrating into an rclone backend.

But of course it would be fairly easy to automate the compression before upload via a script. I think that is the best you could do as an external solution. The problem is you probably end up with a fairly "dumb" solution trying to do it all on your own, and it's a hassle trying to search or grab a single file somewhere.
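As a rough example of that kind of external script, here is a Go sketch that tars and gzips a directory straight into rclone's rcat command (which copies standard input to a file on the remote), so no temporary archive is ever written to local disk. The source directory and the remote path below are placeholders:

```go
// Stream a directory as a .tar.gz to a remote via `rclone rcat`,
// without writing a local copy of the archive first.
package main

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"os"
	"os/exec"
	"path/filepath"
)

func main() {
	src := "./smallfiles"                     // directory to archive (example)
	dst := "gdrive:backups/smallfiles.tar.gz" // remote target (example)

	pr, pw := io.Pipe()

	// rclone rcat reads from stdin and streams it to the remote.
	cmd := exec.Command("rclone", "rcat", dst)
	cmd.Stdin = pr
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Write the tar.gz stream into the pipe as we walk the tree.
	gz := gzip.NewWriter(pw)
	tw := tar.NewWriter(gz)
	err := filepath.Walk(src, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = path
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
	tw.Close()
	gz.Close()
	pw.CloseWithError(err) // signal EOF (or the walk error) to rcat

	if err := cmd.Wait(); err != nil {
		panic(err)
	}
}
```

The downside is exactly the one above: to search or grab a single file later you have to pull down and unpack the whole archive.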

I also feel like this is one of those "big hurdles" we should seek to overcome, as it would remove one of the biggest limiting factors on many cloud-host services - i.e. the API limits and/or file-access limits.

I don't think I have seen a concrete issue created for the whole "combine smaller files into an archive" idea, so I will leave myself a reminder here to try to get that done. I don't suspect this is something we are likely to see a quick and easy fix for, but I will at least do my best to get the ball rolling on this...

I mean, I don't particularly know how streaming compression works and will look into it on my own, but the goal would be to avoid a 1:1 copy on the host node prior to upload, since that thrashes the disks and gives me 1/3 the cluster performance... not to mention I don't have the room for a ~400TB copy on the cluster.

It's possible to do this one directory at a time and automate it with a script but that's a giant pita, especially if anything goes wrong.

I don't think that compressing on-the-fly is the main problem here. There are third-party solutions that do that. Presumably we could fit the compressed version of any single one of these small files into memory before upload if we had to, rendering that a largely moot point.

And as we said, compression itself may not even be the issue. The main limiter is file access (2-3 files/sec on a Gdrive), so just gluing files together into a single larger one would overcome that specific problem without any compression. The real problems are in the logistics of keeping track of it all (where archives might inherently be of use, as they have baked-in indexes). It is absolutely a solvable problem, just more complex than you might initially expect :slight_smile:

Should you by chance be familiar with coding and willing to learn Go basics (fairly simple, and similar to C# and Java), then by all means try your hand at implementing something for this. The code is open and anyone can make contributions. NCW will help you out as needed and is always glad for the help.

I have the exact same question, and it does not seem to have been solved for a very long time.

Any help will be appreciated.

Thanks!

I am curious as to how much you have to spend to have 10TB in Google storage?
Perhaps you could use another cloud provider that does not have all of Google's many limitations in terms of daily upload limits and uploads-per-second limits?

Google really has few limits that any normal user hits.

I have 100TB on my Google Drive for $12 a month.


Wow,
as per this https://one.google.com/storage, that does not seem possible?
Please share?

It's a Google business account (G Suite?). You're 'supposed' to have 5 users minimum to qualify for unlimited storage, but for now, Google doesn't seem to enforce any limits even for single users. With enough people storing 100TB+, that'll likely change sooner than later. :stuck_out_tongue:


Thanks for that link.
So when Google goes evil - well, it is already evil, I guess goes more evil - and does what Amazon and Microsoft did (100TB to be downloaded/transferred over 180 days), I would have to download 0.55TB each and every day.
I guess if most of that storage is just fluff, like movies, TV shows and Plex stuff, it would not matter too much to lose most of it.

Even with 5 users for 100TB, it's still insanely cheap.

Sometimes Google Drive does not let us download a file because the file has been downloaded many times. In that case, does Google limit downloads on other files too, or just that particular file?

That's really too broad a question, as there are at least a few items I know of that play into answering it.

- There is a 10TB daily download limit, so if you go over that, you hit a quota cap for that 24-hour period.
- Sharing files or shared stuff has different limits per file download. I've never personally seen specifics posted for that, as Google wants us to share video via YouTube rather than a Google Drive link.

It really depends. I've personally never been hit with quotas, but my use case is Plex with 5-6 users and the most I see downloaded in a day is 200GB or so.


Yes. Public links have a different download cap, but I could not find any info regarding how they cap it.

That is, as far as I know, largely unexplored. The best I have read on this issue is "hundreds of downloads in a few hours". That should be unlikely to occur in any personal use, and Gdrive is not meant for mass distribution.

Let us not de-rail this topic further please. This thread is about upload compression. There is already a project in-the-works for this.
