Compression during upload?

When using scripts such as h5ai, it's possible to tar files and directories on the fly. It doesn't force you to wait for the tar to finish before the download starts, either.

With rclone, would it be possible to tar any files under 5M and create tars with a max size of 200MB, or something similar, while uploading on the fly, so you don't have to make a duplicate on the host and use up storage space?

The other reason for this is Google's limit of 1-2 file transfers per second, which makes it take forever to archive anything not zipped or tarred.

Yes, this would be very useful. The benefits of archiving tons of small files together are obvious just from doing it manually. 5M files aren't the ones that hurt the most, but when you go even smaller (which a surprising number of files are), it really kills performance badly.

A backend that could merge small files into a larger one would be great, and I've had my mind on this idea for a long time. Ideally it would let rclone present the files "normally" while doing this job transparently for you.

Archiving the files would be one means to do this, and compression is often a good idea for small files as they tend to be quite compressible, but technically it wouldn't be required, I think - so maybe these tasks should be kept as separate functions? As long as rclone knew where the individual files start in the large amalgamation, you could just glue the bits together into one long file. Seeking is apparently quite fast - it's opening and closing files that takes a long time and has pretty hard restrictions on many services. So it should be relatively quick to pull out the handful of files you need from a big package.

I think the largest obstacle to this is how to handle the metadata needed smartly. Where files start and stop has to be saved somewhere, preferably on the remote itself. Do you use an index file per "package", or some sort of centralized database file? In the nitty-gritty details there are a lot of pros and cons to weigh there. But is it possible? Almost certainly.

There is a "press remote" backend in the works currently that tackles the compression issue on its own, but it does not yet deal with combining files (which is the even more important issue here, I think). It would perhaps be a natural thing to extend with this functionality, though.

If you have coding experience (rclone is written in Go, which is similar to C# and Java), be aware that anyone can contribute to rclone's codebase. NCW will no doubt help you along as needed. He is always very enthusiastic about assistance :slight_smile:

@ncw This seems relevant for you I think, for future feature plans and such :smiley:


That sounds good. The only issue would be how rclone handles duplicate files and checksums.

Well, that's one of several considerations you have to make :slight_smile:

One of the benefits of archiving is that you'd have baked-in metadata you could reference. If you "glued them together" you would not have checksums for those individual files - although you'd have one for the large file.

For duplicate files, I don't think that's the worst problem. The "index" (in whatever form you decide to make it) should contain the metadata for all the files in that bundle, so rclone could just reference it while otherwise scanning through normally. The backend should be able to deliver this data down the chain transparently, I would think.

The worst problems are most likely related to how you keep all this synced properly. As soon as you introduce a new data-structure like this it creates another layer of complexity where you have to be very careful. Especially if you want it to remain compatible for multiple concurrent users and the like...

Bumping this. I really need some way to back up 10TB+ directories with thousands or hundreds of thousands of files without compressing the entire thing prior to upload.

Some kind of streaming compression that doesn't save on the hostnode and only saves on the remote node would be nice, like h5ai. Rclone doesn't currently do this, but are there any third-party options that can be used with it?

To make it transparent, I don't think there is any way around integrating it into an rclone backend.

But of course it would be fairly easy to automate the compression before upload via a script; I think that is the best you could do as an external solution. The problem is you'd probably end up with a fairly "dumb" solution trying to do it all on your own, and it's a hassle to search for or grab a single file somewhere.

I also feel like this is one of those "big hurdles" we should seek to overcome, as it would remove one of the biggest limiting factors on many cloud-host services - i.e. the API limits and/or file-access limits.

I don't think I have seen a concrete issue created for the whole "combine smaller files into an archive" idea, so I will leave myself a reminder here to try to get that done. I don't suspect this is something we are likely to see a quick and easy fix for, but I will at least do my best to get the ball rolling on this...

I mean, I don't particularly know how streaming compression works and will look into it on my own, but the goal would be to avoid a 1:1 copy on the hostnode prior to upload, since that thrashes the disks and gives me a third of the cluster performance... not to mention I don't have room for a ~400TB copy on the cluster.

It's possible to do this one directory at a time and automate it with a script but that's a giant pita, especially if anything goes wrong.

I don't think that compressing on the fly is the main problem here. There are third-party solutions that do that. Presumably we could fit the compressed version of any one of these small files into memory before upload if we had to, rendering that a largely moot point.

And as we said, compression itself may not even be the issue. The main limiter is file access (2-3 files/sec on a Gdrive). So just gluing files together into a single larger one would overcome that specific problem without any compression. The real problems are in the logistics of keeping track of it all (where archives might inherently be of use, as they have baked-in indexes). It is absolutely a solvable problem, just more complex than you might initially expect :slight_smile:

Should you by chance be familiar with coding and willing to learn Go basics (fairly simple, and similar to C# and Java), then by all means try your hand at implementing something for this. The code is open and anyone can make contributions. NCW will help you out as needed and is always glad for the help.