Gzip for Google Cloud Storage

I know there was discussion of this in the past, but before I get too deep into something I know little about (the rclone source), has there been any conclusion on adding support for gzip transcoding uploads to GCS?

For those wondering what this is about:

This is documented here: https://cloud.google.com/storage/docs/transcoding

In summary, if you pass -z "exts" or -Z to gsutil cp, files are gzipped as they are streamed up and Content-Encoding is set to gzip. This allows pre-compression for storage and also compressed HTTP serving of files from a static website.
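For comparison, the gsutil form looks something like this (the bucket name here is just a placeholder):

gsutil cp -z html,css,js index.html gs://my-bucket/
gsutil cp -Z index.html gs://my-bucket/

where -z compresses only files with the listed extensions and -Z compresses everything.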

I just tested this with rclone manually:

gzip index.html
mv index.html.gz index.html
rclone copy ./index.html gcs:www.literature.org/test/ --header-upload "Content-Encoding: gzip"

shows as 2k instead of 5k in the browser. Then

rm index.html
wget https://www.literature.org/test/index.html
mv index.html index.html.gz
gunzip index.html.gz

and the file is intact. Accessing the page in a browser works as expected.

My website is - as should be obvious from the above - served this way. One of the analytics suggestions was to gzip files, since they are then served compressed and all the decompression happens client-side, and there are a couple of mid-sized JS files which could do with compression (only 16k for the main one, but still...).

So, at this point I think the "plumbing" is mostly about processing options and applying gzip stream compression either to files that match a pattern (I would think using rclone's include/exclude syntax rather than Google's own extension syntax) or to everything, while leaving other metadata (except Content-Length, which I'm not sure is required) alone?

The other question is what flag name(s) to use?

I've started looking at the codebase, but what may take me a week may take someone else 1/2 hour...

Peter


This is my really horrible and nasty quick proof of concept. It has worked for exactly one file so far; I had to disable the size and hash checks just to test - but I would guess there is a better way of knowingly ignoring these checks in real life?
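(For testing at least, I believe rclone's existing --ignore-size and --ignore-checksum flags skip the same post-transfer checks without patching operations.Copy, e.g. rclone copy --ignore-size --ignore-checksum ./index.html gcs:www.literature.org/test/ - though I haven't verified that they cover everything needed here.)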

PS I am not a Golang expert by any means. There are probably "right" ways of doing some of this.

diff --git a/backend/googlecloudstorage/googlecloudstorage.go b/backend/googlecloudstorage/googlecloudstorage.go
index 1ba2a945c..82bc91a4c 100644
--- a/backend/googlecloudstorage/googlecloudstorage.go
+++ b/backend/googlecloudstorage/googlecloudstorage.go
@@ -13,6 +13,7 @@ FIXME Patch/Delete/Get isn't working with files with spaces in - giving 404 erro
 */

 import (
+       "compress/gzip"
        "context"
        "encoding/base64"
        "encoding/hex"
@@ -1093,12 +1094,28 @@ func (o *Object) Update(ctx context.Context, in io.Reader, src fs.ObjectInfo, op
                        }
                }
        }
+
        var newObject *storage.Object
        err = o.fs.pacer.CallNoRetry(func() (bool, error) {
-               insertObject := o.fs.svc.Objects.Insert(bucket, &object).Media(in, googleapi.ContentType("")).Name(object.Name)
+               // hack to test gzip
+               object.ContentEncoding = "gzip"
+               gr, gw := io.Pipe()
+
+               go func() {
+                       defer gw.Close()
+                       gz := gzip.NewWriter(gw)
+                       if _, err := io.Copy(gz, in); err != nil {
+                               fs.Errorf(o, "gzip pipe failed: %q", err)
+                       }
+                       // Close (not just Flush) so the gzip trailer gets written
+                       gz.Close()
+               }()
+
+               insertObject := o.fs.svc.Objects.Insert(bucket, &object).Media(gr, googleapi.ContentType("")).Name(object.Name)
                if !o.fs.opt.BucketPolicyOnly {
                        insertObject.PredefinedAcl(o.fs.opt.ObjectACL)
                }
+
                newObject, err = insertObject.Context(ctx).Do()
                return shouldRetry(err)
        })
diff --git a/fs/operations/operations.go b/fs/operations/operations.go
index fe623df5a..408c750f1 100644
--- a/fs/operations/operations.go
+++ b/fs/operations/operations.go
@@ -468,7 +468,7 @@ func Copy(ctx context.Context, f fs.Fs, dst fs.Object, remote string, src fs.Obj
        }

        // Verify sizes are the same after transfer
-       if sizeDiffers(src, dst) {
+       if false && sizeDiffers(src, dst) {
                err = errors.Errorf("corrupted on transfer: sizes differ %d vs %d", src.Size(), dst.Size())
                fs.Errorf(dst, "%v", err)
                err = fs.CountError(err)
@@ -477,7 +477,7 @@ func Copy(ctx context.Context, f fs.Fs, dst fs.Object, remote string, src fs.Obj
        }

        // Verify hashes are the same after transfer - ignoring blank hashes
-       if hashType != hash.None {
+       if false && hashType != hash.None {
                // checkHashes has logged and counted errors
                equal, _, srcSum, dstSum, _ := checkHashes(ctx, src, dst, hashType)
                if !equal {

Seems to "work" and the website works too.

$ rclone size gcs:test.literature.org
Total objects: 12499
Total size: 80.313 MBytes (84214128 Bytes)

$ rclone size gcs:www.literature.org
Total objects: 12499
Total size: 190.324 MBytes (199568981 Bytes)

Now, when spare time allows, I shall be looking at how normal, real rclone code would allow for any of this - and how to use command-line options too. I am thinking that this could well be part of the remote config and not just per-run.

Great demo :slight_smile:

There are issues for this

I haven't implemented it as I've been unable to quite get straight in my head what the -z flag really does for uploads and downloads

I made a branch for making downloads of these compressed files work. Without a flag you get the compressed data. Maybe with the -z flag it should uncompress it for you? However that breaks the length and the checksum checks rclone does, so maybe the -z flag should only apply to uploads.

https://beta.rclone.org/branch/v1.52.2-133-gee28856f-fix-2658-gcs-gzip-beta/ (uploaded in 15-30 mins)

Stream of consciousness ramble incoming:

Should -z be interpreted by the backend or the frontend? If it is interpreted by the backend then each backend needs to implement it, whereas if it is interpreted by the frontend then we only need to do it once. We'd need to tell the backend that it needs to apply Content-Encoding: gzip - there is already a mechanism for adding headers.

So applying a -z flag on the frontend seems like a plan. This would need to happen in operations.Copy. It would probably need to call something like operations.Rcat but with a bit of extra gzipping, as the gzipping would make it a file of unknown size.

If you want to sync with the -z flag then any file which is eligible for compression effectively has an unknown size and hash.

That makes me think the way to implement this might be at a higher level still, at the directory listing level. So if -z is in effect then files in the source directory listing get wrapped in an Object which

  • returns a size of -1 to show the size is unknown
  • returns empty hashes to show the hashes are unknown
  • when read returns the gzipped data rather than the plain data

These objects would go into operations.Copy and do the right thing, though Copy would need to inject the Content-Encoding header if it found one.
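A very rough sketch of such a wrapping object might look like this (a sketch only - the names are hypothetical, it assumes rclone's fs.Object, fs.OpenOption and fs/hash types, and it is untested):

package gzipwrap // hypothetical package name

import (
    "compress/gzip"
    "context"
    "io"

    "github.com/rclone/rclone/fs"
    "github.com/rclone/rclone/fs/hash"
)

// gzipObject wraps a source fs.Object so that it reads back gzipped data
// with unknown size and no hashes. All other methods pass through.
type gzipObject struct {
    fs.Object // embedded source object; everything not overridden is delegated
}

// Size returns -1 to show the (compressed) size is unknown.
func (o *gzipObject) Size() int64 { return -1 }

// Hash returns an empty string to show the hashes are unknown.
func (o *gzipObject) Hash(ctx context.Context, t hash.Type) (string, error) {
    return "", nil
}

// Open returns the gzipped data rather than the plain data, compressing on
// the fly. Range/seek options can't really be honoured on a compressed
// stream, so they are just passed through to the source here.
func (o *gzipObject) Open(ctx context.Context, options ...fs.OpenOption) (io.ReadCloser, error) {
    in, err := o.Object.Open(ctx, options...)
    if err != nil {
        return nil, err
    }
    pr, pw := io.Pipe()
    go func() {
        gz := gzip.NewWriter(pw)
        _, err := io.Copy(gz, in)
        if cerr := gz.Close(); err == nil { // Close writes the gzip trailer
            err = cerr
        }
        in.Close()
        pw.CloseWithError(err)
    }()
    return pr, nil
}

The embedded fs.Object handles everything else (Remote, ModTime and so on) by delegation, which is why the sketch only overrides the three methods above.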

Syncing would then work.

That sounds like it might work.

Are we happy that -z does nothing on downloading files? That seems a bit asymmetric to me.

If the -z flag is general purpose then we could use it on Google Drive. Users would probably expect that files downloaded from Google Drive were decompressed. However Google Drive doesn't support Content-Encoding...

As described above, if you used the -z flag when copying stuff from GCS then it would compress it, so I think we need a different flag for decompressing stuff - argh!

So maybe we are heading towards two flags, --compress and --decompress, with an optional mask --compress-only "*.{html,txt}" or something like that (rclone filter syntax).

Do you think we need the full might of rclone filters here? With includes, excludes etc?

What do you think --decompress should do if it is asked to decompress a file which isn't a gzip file - just print a warning and output the file unchanged? That would mean that you don't need the same include list when you --decompress (I'm assuming that in general the backend won't know whether the file is compressed or not).
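A rough sketch of how that check might work, using only the standard library (the helper name is made up and this isn't rclone code):

package gzipwrap // hypothetical package name

import (
    "bufio"
    "compress/gzip"
    "io"
)

// maybeGunzip wraps r in a gzip reader if the data starts with the gzip
// magic bytes (0x1f 0x8b); otherwise it hands the data back unchanged so
// a warning can be printed and the file passed through as-is.
func maybeGunzip(r io.Reader) (io.Reader, error) {
    br := bufio.NewReader(r)
    magic, err := br.Peek(2)
    if err == nil && magic[0] == 0x1f && magic[1] == 0x8b {
        return gzip.NewReader(br)
    }
    // Not gzip (or too short to tell): pass through unchanged.
    return br, nil
}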

That scheme should work for backends which understand Content-Encoding: gzip and those which don't.

Apologies for the info dump, I hope some of it makes sense! I'd be interested to hear your comments.


Another idea... Perhaps on backends which don't support Content-Encoding we modify the filename to end in .gz. Then we would know for certain on all backends which files are compressed and which are not.

In reply to your stream of consciousness, I'll try some too - but I'm only just starting on my first cup of coffee!

This would have been easy had the HTTP standard also allowed an Accept-Encoding header to be sent by the server before requests, but we are where we are.

The Google document discusses the issues in general quite well and talks about their behaviour for downloads. They also push the terms "compressive" and "decompressive" transcoding quite forcefully, but in a good way. Perhaps we should pick up those terms?

My exposure to other backends - apart from S3, ssh and some Dropbox - is minimal so I have no idea how and what other storage engines may do related to this.

I like the idea of making this a general option in the front-end, and since I posted I have had a reasonable explore of operations.go and the Copy / Update / Put / PutStream functions. Like you say, the questions are around what to do when something is supported and when it is not.

My principal need is for better support of static web pages hosted in GCS. This will not be of any use - in fact it will be of negative use - to those storing pre-compressed data (audio, video, zip, tar.gz, encrypted, etc.), and while I like Google's extension selector I think it could go further. The obvious addition is a size selector - only apply to uncompressed files between N and M bytes in size. This could avoid trying to compress files bigger than a typical text or HTML file, or only try on files larger than a minimum size (compressing 100-byte text files would mostly be pointless, as serving those fits in a single TCP packet on most networks anyway and decompression is just overhead).

My problem is that the rclone codebase is huge and, while tidy, very convoluted and layered. For example I see the accounting layer counting the original file size for the transfer but as of now I have no idea how to tweak this to use the compressed size. That's just one example, not an explicit request for guidance just now!

Hashes are out the window - for Google it's explicit in their documentation - but size may still work IFF you pre-compress the file (in memory below a specific size, in a temp file if over?) and the algorithm used to do the compression produces deterministic results. My tendency would be to turn off hash checking for files that will be compressed and consider size comparisons, but how does that work for "sync" instead of just copy?

What happens - and so far I have only considered uploads - when you try to upload a compressed file over an uncompressed variant? Can we / should we check the remote metadata (Content-Encoding: none plus Content-Length and maybe even a hash when available) before deciding whether the file has been updated, or is the fact that the storage method is different enough to force a copy? Is there any circumstance in which we would NOT overwrite an uncompressed file with a compressed one? And the reverse, if you want to replace compressed files with raw versions later on?

What happens on remote-to-remote moves? No point decompressing and recompressing if both backends support this, is there? Just treat the files (and metadata) as constant?

If you have a pre-compressed file - perhaps from an earlier download - then is it enough for the mimetype check returning application/gzip to set Content-Encoding? But then do we have to in turn look inside the file to set the real Content-Type? But only when this is enabled. This also adds extra work for sending to a compressive backend without decompressing and recompressing, but then how do we check the file type?
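If we did end up having to look inside the file for the real Content-Type, the standard library could probably do the heavy lifting - a rough sketch (a made-up helper, not rclone code):

package gzipwrap // hypothetical package name

import (
    "compress/gzip"
    "io"
    "net/http"
)

// sniffGzippedContentType decompresses just the first 512 bytes of a
// gzipped stream and runs Go's standard content-type detection on them.
func sniffGzippedContentType(r io.Reader) (string, error) {
    gz, err := gzip.NewReader(r)
    if err != nil {
        return "", err // not valid gzip data
    }
    defer gz.Close()
    head := make([]byte, 512) // DetectContentType considers at most 512 bytes
    n, err := io.ReadFull(gz, head)
    if err != nil && err != io.ErrUnexpectedEOF && err != io.EOF {
        return "", err
    }
    return http.DetectContentType(head[:n]), nil
}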

In terms of implementation details I would suggest something like:

-z / --compressive - turn this on for uploads, turn off hash and size checks for all matching files

  • compressive implies upload, as per the Google docs, but what is an upload?
  • for "downloads" this should leave files untouched (still compressed, if they are) and un-renamed, caveat emptor - Content-Type/Content-Encoding will then be fun

--compressive-include - use the same logic as --include to match files
--compressive-exclude - ditto
--... and the same for loading from files, just like include/exclude options

--size-only forces pre-compression and comparison (but do we then ignore modtimes? I think so)
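Putting those together, a hypothetical invocation (none of these flags exist yet) might be:

rclone copy --compressive --compressive-include "*.{html,js,css}" ./site gcs:www.literature.org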

I'm running out of coffee...

:slight_smile:

s3/azureblob/swift/qingstor/b2 will support Content-Encoding in exactly the same way as GCS. I don't think any of the other backends support it directly as they aren't expecting to serve the files over http.

If we needed to know in advance whether a backend supports Content-Encoding then I'd make a feature flag.

Rclone has the filtering module which supports anything you could possibly want. It is rather a lot of flags, but each of them could have a prefix (say "compressive", though I'm not sure I like that name!).

That would require speculatively compressing the file just to read the size which I think would be rather a CPU drag even for small files. Note that files might be coming over the network too.

There are 3 attributes used when syncing or copying

  • size
  • hash
  • modtime

Rclone can do a sync with any combination of those. If we can't use size and hash then modtime will work pretty well.

Assuming the modtime does not match then rclone will overwrite it. If the modtime does match then you'd need an additional flag to force the overwrite.

I'd leave those decisions entirely to rclone's existing sync methods. (note that copy is essentially the same operation as sync, it just doesn't delete spare files on the destination).

Yes, just move the compressed data. I think that would "just work".

One thing I would like to work is that you can sync your S3 bucket to your GCS bucket and the Content-Encoding is synced too. That needs a bit more work.

I think the principle that, without the compress flags, the Content-Encoding is just another bit of metadata is a good one. Stopping GCS (and others) doing decompressive transcoding in this case is necessary, but I think that is the right decision.

Google say explicitly in their docs that if the file has application/gzip you should not set Content-Encoding. If you do then you are signalling that it is double compressed.

Rclone has a precise definition of upload: it is a call to the Put or Update methods of the backend.

OK

It occurs to me that maybe what we are describing here is a new backend... In rclone, backends can wrap other backends and intercept all the listing, uploading and downloading calls.

It would be straightforward to make a wrapping backend which implements this. No patches to rclone internals needed.

Configuring the backend is a little ugly, in that you have to make the backend and then point its remote parameter at the backend you want compressed.
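For illustration, the config would presumably look much like crypt does today (purely hypothetical - no such backend exists yet):

[gzipped]
type = gzip
remote = gcs:www.literature.org

rclone copy ./site gzipped: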

This could potentially be configured by the --compressive flag (say) with some sensible defaults - this would make a backend on the fly and wrap it around the destination.

Having a backend like this would be very useful, but it wouldn't be able to seek in compressed files easily which is a property all backends have at the moment.

I think writing a new backend would be the easiest way of experimenting with this and gives a nice place to put the Filter necessary.

I'd suggest the name gzip for the backend.

If you want to look at a backend which works like this then check out the crypt backend.

It does make it more complex for the user to configure but it does fit much better in to the rclone framework...

I note also that we have the press backend in development. This has pluggable compression methods and would support file sizes and seeking - it stores additional metadata for this and compresses the streams in chunks. I'd propose that the gzip backend wouldn't do this.

