Hashsum not re-downloading files if an error happens (at least on 502?)

What is the problem you are having with rclone?

The hashsum command doesn't try to re-download files if it fails during the hashing.

We are doing quite a big hashsum. The server occasionally returns 50X errors. When a problem happens I get:

ERROR  file_path/file_name.txt

The files are readable when I try again later; the server just has a "bad moment". I think (correct me if I'm wrong!) that rclone doesn't try to re-download the files during the hashsum operation, which would be fantastic if it did. Let me know if you need the -vv log: I have a few more large hashsums to do over the next two weeks.

What is your rclone version (output from rclone version)

rclone v1.54.0
- os/arch: linux/amd64
- go version: go1.15.7

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Linux 64 bit (Ubuntu 20.04.2 LTS, rclone installed from the .zip)

Which cloud storage system are you using? (eg Google Drive)

S3 based storage (based on Scality)

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone hashsum MD5 spibucket2-readonly:bucket_name --download --output-file verifications/bucket_name.txt --progress --transfers=30

The rclone config contents with secrets removed.

[spibucket2-readonly]
type = s3
env_auth = false
access_key_id = xxxx
secret_access_key = xxxx
region = 
endpoint = the_end_point
location_constraint = 
acl = private
server_side_encryption = 
storage_class = 

A log from the command with the -vv flag

I don't have the output with -vv, and since it's a server-side thing I'm not able to reproduce it right now.

I still have part of the output:

2021-02-18 23:10:31 ERROR : a_file_path/file_name.bin: Failed to open file file_path... : SerializationError: failed to unmarshal error message
        status code: 502, request id: , host id:
caused by: UnmarshalError: failed to unmarshal error message
        00000000  3c 68 74 6d 6c 3e 0d 0a  3c 68 65 61 64 3e 3c 74  |<html>..<head><t|
00000010  69 74 6c 65 3e 35 30 32  20 42 61 64 20 47 61 74  |itle>502 Bad Gat|
00000020  65 77 61 79 3c 2f 74 69  74 6c 65 3e 3c 2f 68 65  |eway</title></he|
00000030  61 64 3e 0d 0a 3c 62 6f  64 79 20 62 67 63 6f 6c  |ad>..<body bgcol|
00000040  6f 72 3d 22 77 68 69 74  65 22 3e 0d 0a 3c 63 65  |or="white">..<ce|
00000050  6e 74 65 72 3e 3c 68 31  3e 35 30 32 20 42 61 64  |nter><h1>502 Bad|
00000060  20 47 61 74 65 77 61 79  3c 2f 68 31 3e 3c 2f 63  | Gateway</h1></c|
00000070  65 6e 74 65 72 3e 0d 0a  3c 68 72 3e 3c 63 65 6e  |enter>..<hr><cen|
00000080  74 65 72 3e 6e 67 69 6e  78 3c 2f 63 65 6e 74 65  |ter>nginx</cente|
00000090  72 3e 0d 0a 3c 2f 62 6f  64 79 3e 0d 0a 3c 2f 68  |r>..</body>..</h|
000000a0  74 6d 6c 3e 0d 0a                                 |tml>..|

caused by: expected element type <Error> but have <html>

Thank you very much!

I had a look at the code for this and it looks like rclone should be doing retries.

In particular, if it gets an error it should reopen the file and seek to the point it had got to.

In order to diagnose this a log with -vv would be really helpful - thanks!

Tonight I need to start a hashsum of 18TB and 4 million files. I'll make sure to keep the log and hope for an error.

Question: would rclone retry the same file right after the failure, or queue it for later? I've seen that in the object storage I'm working with there are sometimes a few seconds where all requests fail and then they work again. So for me it would be useful if it retried later on, not just right after the failure.

Thanks!

hi,

I have no idea, but perhaps this might work better in your use case?
https://rclone.org/commands/rclone_md5sum/

I'm quite sure (but I'll double-check) that rclone md5sum is exactly the same as rclone hashsum MD5. It seems that I like typing! :slight_smile:

Thanks!

Quick question: once rclone has written a line like:

                           ERROR  file_path/file_name

in the hash output, will it try again? Or is that it, and it has given up on this file?

Why am I asking? I know that it wrote this line (actually 47 of them), but once rclone had finished I pressed arrow-up and enter and overwrote the hashes file :frowning: I have the verbose log (it's not overwritten by default), but I don't know whether rclone might have retried downloading those files later, succeeded, and added another line to the output file. I can provide the interesting parts of the verbose log to see the exact failures.

Thank you!

I executed the hashsum MD5 --download and I have errors in the log file. I'll post them here this afternoon.

I think for hashsum --download it will attempt to re-open the file up to --low-level-retries times, then give up. It doesn't do a high-level retry (--retries) as it is streaming the hash output.

It could potentially try harder though.
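One thing that might help in the meantime is increasing --low-level-retries (it defaults to 10), so the re-open attempts cover a longer window if the server's "bad moment" lasts a while. A sketch, just your original command with the flag added:

rclone hashsum MD5 spibucket2-readonly:bucket_name --download --output-file verifications/bucket_name.txt --progress --transfers=30 --low-level-retries 20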

If it has gotten as far as writing ERROR then it has given up for that file.

You could potentially grep those lines out and then feed them back into rclone with --files-from.
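A rough sketch of that, assuming the failed entries in the output file look like the padded ERROR line shown earlier (failed.txt and the _retry output name are just example names):

grep '^ *ERROR ' verifications/bucket_name.txt | sed 's/^ *ERROR *//' > failed.txt
rclone hashsum MD5 spibucket2-readonly:bucket_name --download --files-from failed.txt --output-file verifications/bucket_name_retry.txt --progress --transfers=30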

It would be nice if rclone did this automatically on a high-level retry though!

Yes please!

It's good that rclone is streaming: some of the files don't fit in the local storage of the machine where rclone is running.

Good tip! I had used rclone hashsum MD5 with the specific file_path/file_name, but then the output of hashsum was the filename only... (I did a bit of scripting). Next time I'll use --files-from!

For the last run I had 3 files with an error out of 4 million files and 18 TB.
(By the way, earlier this week I found a real error thanks to hashsum --download: a file was listed in the object storage, with its metadata, but it was not retrievable.)

I need to look at two of the files a bit more closely. One case is below: it seems it was a low-level error and rclone's retries were so close together in time that none of them succeeded:

2021/02/20 22:46:51 DEBUG : pacer: low level retry 1/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:51 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:51 DEBUG : pacer: low level retry 2/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:51 DEBUG : pacer: Rate limited, increasing sleep to 20ms
2021/02/20 22:46:51 DEBUG : pacer: low level retry 3/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:51 DEBUG : pacer: Rate limited, increasing sleep to 40ms
2021/02/20 22:46:51 DEBUG : pacer: low level retry 4/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:51 DEBUG : pacer: Rate limited, increasing sleep to 80ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 60ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 45ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 33.75ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 25.3125ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 18.984375ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 14.238281ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 10.67871ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
2021/02/20 22:46:52 DEBUG : pacer: low level retry 5/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:52 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
2021/02/20 22:46:52 DEBUG : pacer: low level retry 6/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:52 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
2021/02/20 22:46:52 DEBUG : pacer: low level retry 7/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:52 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
2021/02/20 22:46:52 DEBUG : pacer: low level retry 8/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:52 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
2021/02/20 22:46:52 DEBUG : pacer: low level retry 9/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:52 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
2021/02/20 22:46:52 DEBUG : pacer: low level retry 10/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:52 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:52 ERROR : directory1/directory2/directory3/file_name: Failed to open file directory1/directory2/directory3/file_name: RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s

File size was about 240 KB.

This "too many open files opened": is it in the rclone side?
If I do ulimit -n returns 1024. I don't know if it has anything to do with this limit? I'm using --transfers=30 because of many small files but I haven't checked if 20 would be enough.

Yes it is from the rclone side.

Yes it does. That means you can have 1024 file descriptors open at any one time (which includes network connections).

You should need approximately --checkers + --transfers file descriptors, or maybe a small multiple of that, which shouldn't get anywhere near 1024.
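For example (a rough estimate, assuming --checkers was left at its default of 8): 8 checkers + 30 transfers is roughly 38 descriptors, and even a few times that is nowhere near 1024. As a stop-gap you could raise the per-session limit before starting rclone, e.g.

ulimit -n 4096

but with those numbers it shouldn't normally be needed.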

I wonder if rclone has a socket leak somewhere...

If you run lsof -c rclone while it is running, do you see the number of sockets increasing?
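Something like this (a rough sketch; the grep pattern may need adjusting for your lsof output) prints a TCP socket count every 5 seconds while the hashsum is running:

watch -n 5 'lsof -c rclone | grep -c TCP'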

If so, then if you run rclone with the --rc flag you can do some debugging. The thing to look at first would be the number of goroutines: https://rclone.org/rc/#debugging-go-routine-leaks as sockets are very often leaked along with goroutines.
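Something along these lines, assuming you haven't changed the default rc address of localhost:5572:

rclone hashsum MD5 spibucket2-readonly:bucket_name --download --output-file verifications/bucket_name.txt --progress --transfers=30 --rc
curl 'http://localhost:5572/debug/pprof/goroutine?debug=1' | head -n 50

If the goroutine count keeps climbing as the run progresses, that points at a leak.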

Next time I do this (in a couple of days) I'll keep an eye on this to see if it grows more than expected, and I'll try the --rc flag.
