A log from the command with the -vv flag (eg output from rclone -vv copy /tmp remote:tmp)
2020/03/11 11:07:48 DEBUG : rclone: Version "v1.51.0" starting with parameters ["rclone" "copy" "gs://" "/Users/peter/Desktop/test" "-vv" "--no-gzip-encoding"]
2020/03/11 11:07:48 DEBUG : Using config file from "/Users/peter/.config/rclone/rclone.conf"
2020/03/11 11:07:50 INFO : Local file system at /Users/peter/Desktop/test: Waiting for checks to finish
2020/03/11 11:07:50 INFO : Local file system at /Users/peter/Desktop/test: Waiting for transfers to finish
2020/03/11 11:07:50 ERROR : [file].csv.gz: corrupted on transfer: sizes differ 35952 vs 509855
2020/03/11 11:07:50 INFO : [file].csv.gz: Removing failed copy
If I add the --ignore-size flag, then the md5 check will fail. If I add the --ignore-checksum flag, then the download succeeds, but if I change my destination from my local drive to an S3 bucket, it fails again. It is my understanding that the --no-gzip-encoding flag should download the compressed file as is, without decompressing on the fly, and therefore the size and md5 should match, but it's possible I am misunderstanding that.
I also checked the file downloaded when I used both the --ignore-size and --ignore-checksum flags (as well as --no-gzip-encoding), and while it has the .gz extension, if I remove it and just have it as a .csv, it opens fine in a text editor/Excel/etc. So it feels like it's been decompressed during the download.
My ultimate goal is to sync a Google Cloud Storage bucket with an S3 bucket as is. I know rclone is contemplating a -z or -Z flag that would simplify this process, but until then I'm trying to see if I can do as much as I can with rclone and add the extra steps manually.
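For reference, the bucket-to-bucket sync I'm ultimately after would look something like this (the remote and bucket names here are just placeholders for my configured GCS and S3 remotes):

rclone sync gcs:source-bucket s3:dest-bucket -vv --no-gzip-encoding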
How does it fail - is it the upload to S3 which fails?
I think that is correct: using the --no-gzip-encoding flag should stop rclone from decompressing incoming gzip files, which should make the size and md5sum match.
That obviously isn't working though...
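If it helps to narrow things down, you can compare what GCS reports against what lands on disk directly (the remote name and bucket below are placeholders for yours):

rclone md5sum gcs:bucket/file.csv.gz
md5sum /Users/peter/Desktop/test/file.csv.gz
rclone check gcs:bucket /Users/peter/Desktop/test -vv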
Can you remind me how you upload a gzip encoded file to gcs? I'll have a go locally.
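For reference (and this may well not be how your files were created), a gzip-encoded object is commonly produced with gsutil, either compressing on upload or uploading an existing .gz with the header set:

gsutil cp -Z file.csv gs://bucket/
gsutil -h "Content-Encoding:gzip" cp file.csv.gz gs://bucket/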
2020/03/12 07:48:58 DEBUG : pacer: low level retry 1/1 (error Put https://[bucket].s3.us-west-2.amazonaws.com/[file]?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=[credential]%2F20200312%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20200312T144858Z&X-Amz-Expires=900&X-Amz-SignedHeaders=content-md5%3Bcontent-type%3Bhost%3Bx-amz-acl%3Bx-amz-meta-mtime&X-Amz-Signature=[signature]: net/http: HTTP/1.x transport connection broken: http: ContentLength=14310 with Body length 63426)
2020/03/12 07:48:58 DEBUG : pacer: Rate limited, increasing sleep to 2s
It looks like even if the size is ignored on download, it is not ignored on upload.
I gave this beta a try and it worked as you said. Files were successfully transferred from Google Cloud to local with matching MD5s, and also successfully to S3 (although they lacked the Encoding Type metadata, as you said they would). This gets me 90% of the way to where I'm trying to go; I'll look into a script that can set the Encoding Type on the files in S3 once uploaded.
I'm looking to do this as part of an automated process. Is there a release with this patch expected soon, and/or a way to install this beta version of rclone without manually downloading a .deb file?
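(On the automated side: I believe the rclone install script can pull the latest published beta directly, e.g. curl https://rclone.org/install.sh | sudo bash -s beta, though I assume a build of an unmerged branch like this one would still need to come from its own beta URL.)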
Ah, unfortunately AWS needs to know the size of the file before upload and rclone tells it the wrong size.
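Just to illustrate what goes wrong (this isn't rclone's code, and the byte counts are simply lifted from your log): Go's HTTP transport refuses to complete a request whose body doesn't match the Content-Length it was told to declare, which is exactly the error above. A minimal sketch, with a throwaway local server standing in for S3:

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Throwaway local server standing in for S3: it reads what it is sent,
	// then replies slowly, like a real remote would.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(io.Discard, r.Body)
		time.Sleep(200 * time.Millisecond)
	}))
	defer srv.Close()

	// The body is the decompressed stream (63426 bytes), but the size
	// declared up front is the compressed object size (14310 bytes).
	body := bytes.NewReader(make([]byte, 63426))
	req, err := http.NewRequest("PUT", srv.URL+"/file.csv.gz", body)
	if err != nil {
		panic(err)
	}
	req.ContentLength = 14310 // declared size disagrees with the actual body

	_, err = http.DefaultClient.Do(req)
	fmt.Println(err)
	// Put "http://127.0.0.1:...": net/http: HTTP/1.x transport connection
	// broken: http: ContentLength=14310 with Body length 63426
}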
Great
I haven't merged this patch yet, as I still am not 100% sure it is the right thing to do.
Maybe we could discuss it.
Currently a file stored with Content-Encoding: gzip can't easily be downloaded.
I guess the choices are:

1. rclone downloads it compressed as per the patch
   - this works well for syncing as sizes and md5sums are correct
   - this is perhaps confusing for the user if they download file.txt and try to open it and it is compressed
2. rclone downloads the file and decompresses it on the fly
   - it checks the md5sum and size of the compressed file somehow (this is likely quite hard, not at all how rclone works at the moment, and wouldn't work with rclone mount because we need to know the size of the original file in the directory listing)
   - perhaps what the user expects
3. rclone could advertise the size of the file as -1 (which means indeterminate) and the md5sum as empty, and decompress the file on download
   - this will work fine for syncing
   - this won't work terribly well for rclone mount
   - this produces uncompressed files, which is what the user is expecting
   - the md5sum doesn't get checked (though potentially the backend could fill in the md5sum of the uncompressed data and the size once it is known)
So 1. is much more natural for rclone but 2. might be more natural for the user. 3 might be OK too and would be easy to implement.
Maybe rclone should do 1 or 3 depending on a flag - say --no-decompress
Perhaps 3 should be the default (principle of least surprise).
Option 3 would fit in with a -z flag (which would cause rclone to gzip data as it uploads it and set the Content-Encoding).
You could then do your sync from gcs to s3 either with
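(sketching it with the proposed flags and placeholder remote and bucket names)

rclone sync gcs:bucket s3:bucket -z

or

rclone sync gcs:bucket s3:bucket --no-decompress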
The first would recompress the data, the second wouldn't (but doesn't set the Content-Encoding).
I am trying to download content from Google Cloud Storage that is stored as compressed .gz files with Content-Encoding:gzip
So your .gz files are stored with Content-Encoding:gzip which will mean that they get decompressed by most browsers when they are downloaded. Is that correct? For these option 1 would be perfect but I don't think that most uses of Content-Encoding: gzip add a .gz to the file name.
I can't speak for other use cases that want to serve compressed data directly to browsers, but our use case is that we have a client who dumps data into a Google Cloud bucket and we want to sync that bucket with a bucket in AWS that's under our control. We'd like our bucket to mirror theirs exactly, so having the files in the destination be exactly the same size, md5, compression, etc as the files in the source is what's desired for us. It looks like the -z flag would accomplish that.
I can only speak for myself, but when I started using rclone (and gsutil before that) I expected commands like copy/cp and sync/rsync to transfer files unaltered. Decompressing something with a Content-Encoding feels more like the consequence of a "download" or "fetch" than a "copy" or a "sync."
But then others may have a different use case from mine and come at the question from a different perspective.