Retried http download starts from scratch when copying to GCS

What is the problem you are having with rclone?

I'm streaming a 3793215965 byte (3.5 GiB) file from an https server directly to Google Cloud Storage (GCS). The https server rudely closes the connection before the download is finished, often around the 2.5 GiB mark.

wget does handle this download just fine: it sends a new request with a Range header and resumes where it left off.
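
(The server does advertise Accept-Ranges: bytes in its responses, so a ranged re-request can be tested by hand. Sketch below; the offset is just an illustrative number around the 2.5 GiB mark, not the exact byte where my transfers die.)

# 206 Partial Content here would confirm that resuming mid-file is possible
$ curl -s -o /dev/null -w '%{http_code}\n' \
      -H 'Range: bytes=2684354560-2684355583' \
      https://trein.fwrite.org/DE-Co/KV78_2024-11-14.csv.gz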

rclone seems to handle this by restarting the download from scratch, but the connection gets cut again and the upload never finishes. I eventually aborted rclone after it had transferred 6.6 GiB. Strangely, --dump headers doesn't show a new request being sent, so I'm not entirely certain that my interpretation is correct.

When copying to the local file system instead of GCS, rclone downloads the file in chunks, and succeeds. However, in practice I'm transferring terabytes of data and I don't have that much local disk space on my VM, so I need to stream directly to GCS. Caching only in-flight files locally would be fine, though, if that helps.
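
(As a stopgap I could probably pipe the raw bytes straight into GCS with rclone rcat, which reads from stdin; something like the sketch below. That would sidestep the disk-space constraint, though it obviously still wouldn't survive the server dropping the connection. curl without --compressed writes the bytes through verbatim.)

$ curl -sf https://trein.fwrite.org/DE-Co/KV78_2024-11-14.csv.gz \
      | rclone rcat :gcs:my-bucket-name/output-path/KV78_2024-11-14.csv.gz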

Run the command 'rclone version' and share the full output of the command.

rclone v1.68.2
- os/version: arch (64 bit)
- os/kernel: 6.12.4-arch1-1 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.23.3
- go/linking: dynamic
- go/tags: none

Which cloud storage system are you using?

http, gcs

The command you were trying to run

(In reality I'm using rclone copy, but the command below is a smaller repro case with the same behaviour.)

rclone copyurl \
    https://trein.fwrite.org/DE-Co/KV78_2024-11-14.csv.gz \
    :gcs:my-bucket-name/output-path \
    --auto-filename -vv --dump headers --progress

Please run 'rclone config redacted' and share the full output

; empty config

A log from the command that you were trying to run with the -vv flag

2024/12/20 11:18:08 NOTICE: Automatically setting -vv as --dump is enabled
2024/12/20 11:18:08 DEBUG : rclone: Version "v1.68.2" starting with parameters ["rclone" "copyurl" "https://trein.fwrite.org/DE-Co/KV78_2024-11-14.csv.gz" ":gcs:my-bucket-name/output-path" "--auto-filename" "-vv" "--dump" "headers" "--progress"]
2024/12/20 11:18:08 DEBUG : Creating backend with remote ":gcs:my-bucket-name/output-path"
2024/12/20 11:18:08 NOTICE: Config file "/home/thomas/.config/rclone/rclone.conf" not found - using defaults
2024/12/20 11:18:08 DEBUG : You have specified to dump information. Please be noted that the Accept-Encoding as shown may not be correct in the request and the response may not show Content-Encoding if the go standard libraries auto gzip encoding was in effect. In this case the body of the request will be gunzipped before showing it.
2024/12/20 11:18:08 DEBUG : You have specified to dump information. Please be noted that the Accept-Encoding as shown may not be correct in the request and the response may not show Content-Encoding if the go standard libraries auto gzip encoding was in effect. In this case the body of the request will be gunzipped before showing it.
2024/12/20 11:18:08 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2024/12/20 11:18:08 DEBUG : HTTP REQUEST (req 0xc000c34000)
2024/12/20 11:18:08 DEBUG : GET /DE-Co/KV78_2024-11-14.csv.gz HTTP/1.1
Host: trein.fwrite.org
User-Agent: rclone/v1.68.2
Accept-Encoding: gzip
2024/12/20 11:18:08 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2024/12/20 11:18:08 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2024/12/20 11:18:08 DEBUG : HTTP RESPONSE (req 0xc000c34000)
2024/12/20 11:18:08 DEBUG : HTTP/1.1 200 OK
Accept-Ranges: bytes
Connection: keep-alive
Content-Type: application/x-gzip
Date: Fri, 20 Dec 2024 10:18:08 GMT
Etag: "e217e1dd-626e768f623cc"
Last-Modified: Thu, 14 Nov 2024 23:00:00 GMT
Server: nginx
Set-Cookie: uid=oWFP22dlRGAywj7rA/7CAg==; path=/
Strict-Transport-Security: max-age=15768000
X-Cache: HIT
X-Cache-Date: Mon, 18 Nov 2024 07:57:00 GMT
2024/12/20 11:18:08 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2024/12/20 11:18:08 DEBUG : KV78_2024-11-14.csv.gz: File name found in url
Transferred:        6.672 GiB / 6.672 GiB, 100%, 14.568 MiB/s, ETA 0s
Transferred:            0 / 1, 0%
Elapsed time:      7m34.1s
Transferring:
 *                        KV78_2024-11-14.csv.gz:  0% /off, 14.568Mi/s, -^C

Wait, I just noticed this is a .csv.gz file (most of them are .csv.xz). gzip is also a standard HTTP content coding, the server doesn't offer any Content-Encoding except gzip (not even identity!), and rclone prints that warning about "auto gzip encoding", so maybe the encoding is playing tricks here. Is rclone decompressing the content and counting decompressed bytes, instead of just transferring the gzip-compressed bytes verbatim to GCS?
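
If that theory is right, then pinning the header explicitly might make rclone pass the gzip bytes through untouched, since the Go transport only auto-decompresses responses when it added the Accept-Encoding: gzip header itself. Untested sketch, assuming --header (which applies to all transactions, so it would also hit the GCS API calls) reaches the copyurl download request:

rclone copyurl \
    https://trein.fwrite.org/DE-Co/KV78_2024-11-14.csv.gz \
    :gcs:my-bucket-name/output-path \
    --auto-filename --header "Accept-Encoding: gzip" -vv --progress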

Edit: Seems like the server is stubborn as well as rude: even if I explicitly request Accept-Encoding: identity, it still responds with Content-Encoding: gzip.

$ curl -H 'Accept-Encoding: identity' -I https://trein.fwrite.org/DE-Co/KV78_2024-11-14.csv.gz
HTTP/2 200 
server: nginx
date: Fri, 20 Dec 2024 10:46:28 GMT
content-type: application/x-gzip
content-length: 3793215965
last-modified: Thu, 14 Nov 2024 23:00:00 GMT
etag: "e217e1dd-626e768f623cc"
content-encoding: gzip
strict-transport-security: max-age=15768000
x-cache: HIT
x-cache-date: Mon, 18 Nov 2024 07:57:00 GMT
set-cookie: uid=oWFP22dlSwQy7z7sA4aDAg==; path=/
accept-ranges: bytes

This is still within the HTTP spec: when none of the available representations have an acceptable content coding, RFC 9110 only says the server "SHOULD send a response without any content coding unless the identity coding is indicated as unacceptable", and a SHOULD is not a MUST.

So if my theory is correct, rclone is doing the right thing by decompressing the response, and wget isn't. Maybe wget is assuming that identity encoding is always supported by the server, and doesn't check the response header at all.
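
A cheap check of the wget side of that theory (sketch, assuming the file wget produced is still lying around): if wget really wrote the compressed bytes through untouched, the local file should be gzip data of exactly the advertised Content-Length.

$ wc -c < KV78_2024-11-14.csv.gz
# 3793215965 if the compressed bytes were stored verbatim
$ file KV78_2024-11-14.csv.gz
# should say "gzip compressed data" rather than plain CSV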

The compression factor on these files is over 20x, so I'd rather not decompress them until needed :smiley: