Missing hash when using rcat

When streaming to AWS S3 with rclone rcat and the file size is over 10 MB (so a multipart upload is used), the object does not get a hash set. The ETag is something like 7e1d484c8fafe880099947d7b9d9fb82-1 and there is no X-Amz-Meta-Md5chksum metadata tag.

To reproduce:

rclone config:

[backup]
type = s3
provider = aws
env_auth = true
acl = private
region = eu-west-1

Commands:

$ tar -zcvf - "/my-dir" | rclone rcat -Pv backup:my-bucket/my-dir.tar.gz
$ rclone hashsum md5 backup:my-bucket/
                                my-dir.tar.gz
dc49391bf8a63cc7f93e7ffcf9d1c82e  other-file.txt
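
As a cross-check (assuming a recent enough rclone where hashsum has a --download flag), the md5 can still be computed by downloading the object and hashing it locally; it is only the stored hash that is missing:

$ rclone hashsum md5 --download backup:my-bucket/my-dir.tar.gz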

However, if we modify the object in any way that creates a new version of it, for example by adding any metadata to it, the new ETag is a correct md5 checksum.
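
For illustration (a sketch using the AWS CLI rather than rclone, with the bucket and key names from above), a self-copy with replaced metadata is one way to trigger such a new version:

$ aws s3api copy-object \
    --bucket my-bucket \
    --key my-dir.tar.gz \
    --copy-source my-bucket/my-dir.tar.gz \
    --metadata-directive REPLACE \
    --metadata touched=1

The ETag of the copy is then the plain md5 of the content (for unencrypted objects small enough to be copied in a single part).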

I tried to use the hasher backend with rcat, but it did not change anything:

[hasher]
type = hasher
remote = backup:
hashes = md5
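
i.e. pointing the same commands at the hasher remote instead of at the bucket directly (the exact invocation here is illustrative):

$ tar -zcvf - "/my-dir" | rclone rcat -Pv hasher:my-bucket/my-dir.tar.gz
$ rclone hashsum md5 hasher:my-bucket/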

I'm not sure if that's a bug or maybe a feature request.


My use case:

  • upload a directory bundling it in a single .tar.gz archive on the fly
  • check if content changed before uploading a new version next time (see the sketch below)
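
A sketch of one way to do the change check, assuming GNU tar and gzip and a small local state file (the path below is made up):

# hash the archive stream first and only upload when it differs from last time;
# gzip -n keeps the output byte-identical for identical input (no timestamp),
# at the cost of running tar twice
new=$(tar -cf - "/my-dir" | gzip -n | md5sum | cut -d' ' -f1)
old=$(cat /var/tmp/my-dir.tar.gz.md5 2>/dev/null)
if [ "$new" != "$old" ]; then
    tar -cf - "/my-dir" | gzip -n | rclone rcat -P backup:my-bucket/my-dir.tar.gz
    echo "$new" > /var/tmp/my-dir.tar.gz.md5
fi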

Missed a lot of template stuff :frowning:

What version are you running?
Can you reproduce with a rclone debug log (-vvv)?

felix@gemini:~$ cat /etc/hosts | rclone rcat DB:hosts -vvv
2022/02/06 08:00:17 DEBUG : Setting --config "/opt/rclone/rclone.conf" from environment variable RCLONE_CONFIG="/opt/rclone/rclone.conf"
2022/02/06 08:00:17 DEBUG : rclone: Version "v1.57.0" starting with parameters ["rclone" "rcat" "DB:hosts" "-vvv"]
2022/02/06 08:00:17 DEBUG : Creating backend with remote "DB:"
2022/02/06 08:00:17 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2022/02/06 08:00:17 DEBUG : Dropbox root '': File to upload is small (107 bytes), uploading instead of streaming
2022/02/06 08:00:17 DEBUG : hosts: Uploading chunk 1/1
2022/02/06 08:00:18 DEBUG : hosts: Uploading chunk 2/1
2022/02/06 08:00:18 DEBUG : Dropbox root '': Adding "/hosts" to batch
2022/02/06 08:00:19 DEBUG : Dropbox root '': Batch idle for 500ms so committing
2022/02/06 08:00:19 DEBUG : Dropbox root '': Committing sync batch length 1 starting with: /hosts
2022/02/06 08:00:19 DEBUG : Dropbox root '': Upload batch completed in 65.410803ms
2022/02/06 08:00:19 DEBUG : Dropbox root '': Committed sync batch length 1 starting with: /hosts
2022/02/06 08:00:19 DEBUG : hosts: dropbox = c4f256b92ec94ab03256dfb14d89d73fa9b6de6d90e077b915660e54400d46d6 OK
2022/02/06 08:00:19 INFO  : hosts: Copied (new)
2022/02/06 08:00:19 DEBUG : 11 go routines active
2022/02/06 08:00:19 INFO  : Dropbox root '': Commiting uploads - please wait...

hello and welcome to the forum,

i can confirm your output and @Animosity022's output,
and that might explain why the OP's command did not store the hash whereas @Animosity022's command did.

@Animosity022's command was small enough that rclone logged uploading instead of streaming,
the OP's command was large enough that it was streamed.

imho, odds are this is not a bug, rather a feature that is not currently implemented.
and even if the feature were implemented, it still would not work with s3.

with s3, once a file is uploaded, its metadata cannot be changed.
when rclone copies to s3, it has to calculate the hash BEFORE the upload starts.

and rclone rcat cannot know the hash until the streaming ends.
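
a quick way to see the difference (just a sketch, the file and remote names are examples, and the test file needs to be larger than the streaming cutoff so rcat actually streams it):

# hash known before the upload starts, so it is stored with the object
$ rclone copyto /tmp/sample.bin backup:my-bucket/sample-copyto.bin
# hash unknown until the stream ends, so nothing is stored
$ cat /tmp/sample.bin | rclone rcat backup:my-bucket/sample-rcat.bin
$ rclone hashsum md5 backup:my-bucket/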

and take a read of this
https://forum.rclone.org/t/rclone-copy-entire-drive/25155


Thanks for the answers and explanation.

@Animosity022 sorry, I started this as a feature request topic, then changed it to "suspected bug", and the template didn't load since I had already written something, I guess.

@asdffdsa I understand. I had assumed that rclone would always save the hash. In this case it would probably need to calculate it on the fly from the stream and update the object metadata after the upload completes. Probably not worth it.

sure, the main thing is this is not a rclone bug.
with your permission, i will change from suspected bug to help and support, ok?

as i understand it, that is not possible with s3.
to update metadata, s3 has to do a server-side copy.

sure!

I think you are right. So it is even worse than I thought.

Strictly speaking, this COULD be implemented, at the cost of a longer upload and extra local disk (and thus optional, behind a flag), by spooling the streamed-in data to a temporary local file while calculating the hash, then uploading upstream with the hash set in the s3 metadata.
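
A rough manual equivalent of that idea, done outside rclone (the paths are just examples), would be:

# spool the stream to a temporary local file so the size and md5 are known
# before the upload starts; costs extra disk space and time
tmp=$(mktemp)
tar -zcf - "/my-dir" > "$tmp"
rclone copyto -P "$tmp" backup:my-bucket/my-dir.tar.gz
rm -f "$tmp"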

Cloud storage backends frequently skip the hash for streaming or multipart uploads for similar reasons (to keep spool space bounded). Thus, it's an unimplemented niche feature, definitely not a bug.

An old version of rclone did do exactly that. However, adding a hash to an object after it has been uploaded requires server-side copying it, which is an expensive operation, so I removed it after user complaints!

This should maybe be an optional flag, I'm not sure.

Note that you can set this flag larger

  --streaming-upload-cutoff SizeSuffix   Cutoff for switching to chunked upload if file size is unknown, upload starts after reaching cutoff or when file ends (default 100Ki)

Below that limit, rclone will buffer the stream in memory and then upload it with a checksum.
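
For example (the size is illustrative; the whole buffer is held in RAM, so keep it within available memory):

$ tar -zcvf - "/my-dir" | rclone rcat -Pv --streaming-upload-cutoff 256Mi backup:my-bucket/my-dir.tar.gz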

That is not a bad idea, so maybe a --streaming-upload-disk-cutoff or something like that.
