Not sure I'm using --no-gzip-encoding correctly

PeterA · March 11, 2020, 6:26pm

What is the problem you are having with rclone?

I am trying to download content from Google Cloud Storage that is stored as compressed .gz files with Content-Encoding:gzip

What is your rclone version (output from `rclone version`)

rclone v1.51.0

Which OS you are using and how many bits (eg Windows 7, 64 bit)

OSX 10.15.3, 64 bit

Which cloud storage system are you using? (eg Google Drive)

Google Cloud Storage

The command you were trying to run (eg `rclone copy /tmp remote:tmp`)

rclone copy gs:[bucket]/[path]/ ~/Desktop/test -vv --no-gzip-encoding
rclone sync gs:[bucket]/[path]/ ~/Desktop/test -vv --no-gzip-encoding

A log from the command with the `-vv` flag (eg output from `rclone -vv copy /tmp remote:tmp`)

2020/03/11 11:07:48 DEBUG : rclone: Version "v1.51.0" starting with parameters ["rclone" "copy" "gs://" "/Users/peter/Desktop/test" "-vv" "--no-gzip-encoding"]
2020/03/11 11:07:48 DEBUG : Using config file from "/Users/peter/.config/rclone/rclone.conf"
2020/03/11 11:07:50 INFO : Local file system at /Users/peter/Desktop/test: Waiting for checks to finish
2020/03/11 11:07:50 INFO : Local file system at /Users/peter/Desktop/test: Waiting for transfers to finish
2020/03/11 11:07:50 ERROR : [file].csv.gz: corrupted on transfer: sizes differ 35952 vs 509855
2020/03/11 11:07:50 INFO : [file].csv.gz: Removing failed copy

If I add the --ignore-size flag, then the md5 check will fail. If I add the --ignore-checksum flag, then the download succeeds, but if I change my destination from my local drive to an S3 bucket, it fails again. It is my understanding that the --no-gzip-encoding flag should download the compressed file as is, without decompressing on the fly, and therefore the size and md5 should match, but it's possible I am misunderstanding that.

I also checked the file downloaded when I used both --ignore-size and --ignore-checksum flags (as well as --no-gzip-encoding), and while it has the .gz extension, if I remove it and just have it as a .csv, it opens fine in a text editor/excel/etc. So it feels like it's been uncompressed during the download.
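
For anyone following along, here is a minimal Python sketch of why both checks fail when a gzip-encoded object is decompressed in transit. It assumes the bucket reports the size and MD5 of the compressed bytes, which is consistent with the smaller and larger sizes in the log above. The CSV content is made up for illustration and this is not rclone's code.

```python
import gzip
import hashlib


def describe(label: str, payload: bytes) -> tuple[int, str]:
    """Print and return (size, md5 hex digest) for a payload."""
    size = len(payload)
    digest = hashlib.md5(payload).hexdigest()
    print(f"{label}: size={size} md5={digest}")
    return size, digest


# Hypothetical CSV content standing in for the real file.
csv_bytes = b"id,name,value\n" + b"".join(
    f"{i},row{i},{i * 3}\n".encode() for i in range(10_000)
)

# What the bucket stores (assumed to be what its listing describes).
stored = gzip.compress(csv_bytes)

# What a client ends up with if the transfer is decompressed on the fly.
received = gzip.decompress(stored)

stored_size, stored_md5 = describe("stored object", stored)
received_size, received_md5 = describe("decompressed ", received)

# Both comparisons a sync tool would make fail, although no data was lost.
print("sizes match:", stored_size == received_size)
print("md5s match: ", stored_md5 == received_md5)
```

The decompressed bytes are intact, they just no longer correspond to the metadata advertised for the stored object.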

My ultimate goal is to sync a Google Cloud Storage bucket with an S3 bucket as is. I know rclone is contemplating a -z or -Z flag that would simplify this process, but until then I'm trying to see if I can do as much as I can with rclone and add the extra steps manually.

ncw · March 12, 2020, 10:59am

How does it fail - is it the upload to S3 which fails?

I think that is correct: using the --no-gzip-encoding flag should cause rclone not to decompress incoming gzip files, which should cause the size and md5sum to match.

That obviously isn't working though...

Can you remind me how you upload a gzip encoded file to gcs? I'll have a go locally.

ncw · March 12, 2020, 12:06pm

I found I'd made a patch for this already here

https://beta.rclone.org/branch/v1.51.0-096-g4f467f45-fix-2658-gcs-gzip-beta/ (uploaded in 15-30 mins)

can you give that a go? It should enable files with gzip encoding to be downloaded without being decompressed.

See this issue here: https://github.com/rclone/rclone/issues/2658

I think that patch is safe to merge because it checks the content encoding first, so it only affects files that wouldn't be downloading properly now anyway.

I think however if you upload them to s3 then you'll need to manually set the content-encoding somehow.

What do you think?

PeterA · March 12, 2020, 3:07pm

The upload fails with the following error:

2020/03/12 07:48:58 DEBUG : pacer: low level retry 1/1 (error Put https://[bucket].s3.us-west-2.amazonaws.com/[file]?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=[credential]%2F20200312%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20200312T144858Z&X-Amz-Expires=900&X-Amz-SignedHeaders=content-md5%3Bcontent-type%3Bhost%3Bx-amz-acl%3Bx-amz-meta-mtime&X-Amz-Signature=[signature]: net/http: HTTP/1.x transport connection broken: http: ContentLength=14310 with Body length 63426)
2020/03/12 07:48:58 DEBUG : pacer: Rate limited, increasing sleep to 2s

It looks like even if the size is ignored on download, it is not ignored on upload.

I gave this beta a try and it worked as you said. Files were successfully transferred from Google Cloud to local with matching MD5s, and also successfully to S3 (although they lacked the Encoding Type metadata as you said they would). This gets me 90% of the way to where I'm trying to get to; I'll look into a script that can set the Encoding Type on the files in S3 once uploaded.
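
One possible shape for such a script, as a sketch only: it assumes boto3 is installed and credentials are configured, and the bucket and key names in the usage comment are placeholders. It copies each object onto itself with `MetadataDirective="REPLACE"`, which discards the existing metadata, so the content type and user metadata are carried over explicitly. Other properties such as ACLs or cache headers are not handled here, and a single `copy_object` call is limited to objects of up to 5 GB.

```python
def build_copy_args(bucket: str, key: str, head: dict) -> dict:
    """Build arguments for an in-place S3 copy that adds Content-Encoding: gzip.

    `head` is the response of head_object for the same key; its content
    type and user metadata are re-sent because REPLACE drops them otherwise.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "CopySource": {"Bucket": bucket, "Key": key},
        "MetadataDirective": "REPLACE",
        "ContentEncoding": "gzip",
        "ContentType": head.get("ContentType", "binary/octet-stream"),
        "Metadata": head.get("Metadata", {}),
    }


def set_gzip_encoding(bucket: str, key: str) -> None:
    """Rewrite one object's metadata in place (needs boto3 and credentials)."""
    import boto3  # imported lazily so build_copy_args works without boto3

    s3 = boto3.client("s3")
    head = s3.head_object(Bucket=bucket, Key=key)
    s3.copy_object(**build_copy_args(bucket, key, head))


# Hypothetical usage; the names are placeholders:
# set_gzip_encoding("my-mirror-bucket", "path/file.csv.gz")
```

Keeping the argument building separate from the network call makes it easy to inspect what would be sent before touching real objects.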

I'm looking to do this as part of an automated process. Is there a release with this patch expected soon and/or a way to install this beta version of rclone without manually downloading a .deb file?

Thanks again

ncw · March 12, 2020, 3:59pm

Ah, unfortunately AWS needs to know the size of the file before upload and rclone tells it the wrong size.
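
A toy Python simulation of that failure mode, assuming the declared length comes from the source listing (the compressed size) while the body being sent is the decompressed data. The function is hypothetical and only imitates the shape of the error in the log above; it is not how rclone or any HTTP client is implemented.

```python
import gzip


def put_fixed_length(body: bytes, declared_length: int) -> int:
    """Simulate a PUT whose Content-Length was fixed before sending."""
    if len(body) != declared_length:
        # Imitates the message format seen in the log; not a real client.
        raise ValueError(
            f"ContentLength={declared_length} with Body length {len(body)}"
        )
    return len(body)


# Hypothetical data standing in for the real object.
plain = b"id,value\n" + b"".join(f"{i},{i * 7}\n".encode() for i in range(5_000))
stored = gzip.compress(plain)

listed_size = len(stored)       # size taken from the source listing
body = gzip.decompress(stored)  # bytes actually read after decompression

try:
    put_fixed_length(body, listed_size)
    outcome = "ok"
except ValueError as err:
    outcome = str(err)

print(outcome)
```

Ignoring the size check on the download side does not help here, because the mismatch is between the declared length and the body of the upload itself.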

Great

I haven't merged this patch yet, as I still am not 100% sure it is the right thing to do.

Maybe we could discuss it.

Currently a file stored with Content-Encoding: gzip can't easily be downloaded.

I guess the choices are

1. rclone downloads it compressed as per the patch
   - this works well for syncing as sizes and md5sums are correct
   - this is perhaps confusing for the user if they download file.txt and try to open it and it is compressed
2. rclone downloads the file and decompresses it on the fly
   - it checks the md5sum and size of the compressed file somehow (this is likely quite hard and not at all how rclone works at the moment, and it wouldn't work with rclone mount because we need to know the size of the original file in the directory listing)
   - perhaps what the user expects
3. rclone could advertise the size of the file as -1, which means indeterminate, and the md5sum as empty, and decompress the file on download
   - this will work fine for syncing
   - this won't work terribly well for rclone mount
   - this produces uncompressed files which the user is expecting
   - the md5sum doesn't get checked (though potentially the backend could fill in the md5sum of the uncompressed data and the size once it is known)

So 1. is much more natural for rclone but 2. might be more natural for the user. 3 might be OK too and would be easy to implement.
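
The three options above can be sketched in Python as follows. This is an illustration under stated assumptions, not rclone's implementation: `Download`, `verify` and the `option_*` functions are hypothetical names, and the "advertised" size and MD5 stand for whatever the listing reports before the transfer.

```python
import gzip
import hashlib
from dataclasses import dataclass


@dataclass
class Download:
    data: bytes  # bytes handed to the destination
    size: int    # size advertised before the transfer; -1 means unknown
    md5: str     # hash advertised before the transfer; "" means unknown


def md5_hex(payload: bytes) -> str:
    return hashlib.md5(payload).hexdigest()


def option_1_raw(stored: bytes) -> Download:
    """Hand over the compressed blob with the stored size and MD5."""
    return Download(stored, len(stored), md5_hex(stored))


def option_2_decompress_keep_metadata(stored: bytes) -> Download:
    """Decompress, but still advertise the compressed size and MD5."""
    return Download(gzip.decompress(stored), len(stored), md5_hex(stored))


def option_3_decompress_unknown(stored: bytes) -> Download:
    """Decompress and advertise size and MD5 as unknown."""
    return Download(gzip.decompress(stored), -1, "")


def verify(d: Download) -> bool:
    """Naive post-transfer check that skips fields advertised as unknown."""
    size_ok = d.size == -1 or d.size == len(d.data)
    md5_ok = d.md5 == "" or d.md5 == md5_hex(d.data)
    return size_ok and md5_ok


stored = gzip.compress(b"a,b,c\n" * 1_000)
for option in (option_1_raw, option_2_decompress_keep_metadata,
               option_3_decompress_unknown):
    print(option.__name__, "verifies:", verify(option(stored)))
```

Option 2 fails the naive check because the advertised values describe the compressed bytes, which is why it would need a separate way to check the compressed stream. Option 3 passes only because nothing is checked.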

Maybe rclone should do 1 or 3 depending on a flag - say --no-decompress

Perhaps 3 should be the default (principle of least surprise).

Option 3 would fit in with a -z flag (which would cause rclone to gzip data as it uploads it and set the Content-Encoding).

You could then do your sync from gcs to s3 either with

rclone sync gcs:bucket s3:bucket -z
rclone sync gcs:bucket s3:bucket --no-decompress

The first would recompress the data, the second wouldn't (but doesn't set the Content-Encoding).
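
One consequence of recompressing is worth spelling out with a small Python sketch: the recompressed object can decompress to identical content and still differ byte for byte from the source object. The compression levels and data below are arbitrary assumptions chosen to show the effect, and say nothing about what a real -z flag would use.

```python
import gzip
import hashlib

# Hypothetical CSV content standing in for the real file.
original = b"id,name,value\n" + b"".join(
    f"{i},row{i},{i * 3}\n".encode() for i in range(10_000)
)

# The object as stored at the source (level 9 is an arbitrary choice).
source_blob = gzip.compress(original, compresslevel=9, mtime=0)

# Decompress on download, then compress again on upload with other settings.
recompressed = gzip.compress(
    gzip.decompress(source_blob), compresslevel=1, mtime=0
)

same_content = gzip.decompress(recompressed) == original
same_bytes = (
    hashlib.md5(source_blob).digest() == hashlib.md5(recompressed).digest()
)

print("same decompressed content:", same_content)
print("same stored bytes/MD5:    ", same_bytes)
```

So a recompressing sync preserves the data, but the destination is not guaranteed to match the source in size or MD5.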

I am trying to download content from Google Cloud Storage that is stored as compressed .gz files with Content-Encoding:gzip

So your .gz files are stored with Content-Encoding:gzip which will mean that they get decompressed by most browsers when they are downloaded. Is that correct? For these option 1 would be perfect but I don't think that most uses of Content-Encoding: gzip add a .gz to the file name.

Thoughts?

PeterA · March 12, 2020, 4:12pm

I can't speak for other use cases that want to serve compressed data directly to browsers, but our use case is that we have a client who dumps data into a Google Cloud bucket and we want to sync that bucket with a bucket in AWS that's under our control. We'd like our bucket to mirror theirs exactly, so having the files in the destination be exactly the same size, md5, compression, etc as the files in the source is what's desired for us. It looks like the -z flag would accomplish that.

ncw · March 12, 2020, 5:26pm

Thanks for the description of the use case - very helpful. Rclone should definitely support this as it is a "sync" operation.

So I think you'd want --no-decompress and that flag to set the Content-Encoding (that is the bit I haven't worked out yet!).

Which do you think should be the default?

rclone decompressing Content-Encoding: gzip files automatically
or not...

PeterA · March 12, 2020, 7:28pm

I can only speak for myself, but when I started using rclone (and gsutil before that) I expected commands like copy/cp and sync/rsync to transfer files unaltered. Decompressing something with a Content-Encoding feels more like the consequence of a "download" or "fetch" than a "copy" or a "sync."

But then others may have a different use case from mine and come at the question from a different perspective.

asdffdsa · March 12, 2020, 7:32pm

that is a good point.
imho, rclone should never modify the contents of a file, unless there is a specific flag requesting it.

ncw · March 13, 2020, 2:58pm

Ok that makes sense, so that would make the default to download the compressed blob, and an alternate flag to decompress

asdffdsa · March 13, 2020, 3:04pm

yes, the goal should always be, not to modify a file, unless requested with a flag.

thanks,

system · May 13, 2020, 11:04am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.