The hashsum command doesn't try to re-download files if it fails during the hashing.
We are doing quite a big hashsum. The server occasionally returns 50X errors. When a problem happens I get lines like:
ERROR file_path/file_name.txt
The files are readable when I try again later; the server just has a "bad moment". I think (correct me if I'm wrong!) that rclone doesn't try to re-download the files during the hashsum operation, which would be fantastic if it did. Let me know if you need the "-vv" log: I have a few more large hashsums to do over the next two weeks.
What is your rclone version (output from rclone version)
rclone v1.54.0
- os/arch: linux/amd64
- go version: go1.15.7
Which OS you are using and how many bits (eg Windows 7, 64 bit)
Linux 64 bit (Ubuntu 20.04.2 LTS, rclone installed from the .zip)
Which cloud storage system are you using? (eg Google Drive)
S3 based storage (based on Scality)
The command you were trying to run (eg rclone copy /tmp remote:tmp)
Tonight I need to start a hashsum of 18TB and 4 million files. I'll make sure to keep the log and hope for an error.
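For reference, the command is roughly like this (the remote and file names here are placeholders, not the real ones):
rclone hashsum MD5 --download --transfers 30 -vv --log-file=hashsum.log remote:bucket > hashes.md5
The hashes go to stdout (redirected into hashes.md5) and the verbose log goes to hashsum.log, which rclone appends to rather than overwrites.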
Question: would rclone retry the same file right after it fails, or queue it to retry much later? I've seen that in the object storage I'm working with there are sometimes a few seconds where all requests fail and then they work again. So for me it would be useful if it retried later on, not just right after the failure.
Quick question: once rclone has written a line like:
ERROR file_path/file_name
in the hash output: will it try again? Or is that all, and it has given up on this file?
Why am I asking? I know that it had written this line (actually 47 of those), but once rclone had finished I pressed arrow-up and enter and overwrote the hashes file. I still have the verbose log (it's not overwritten by default), but I don't know whether rclone might have retried the downloads, succeeded, and added another line in the output file. I can provide the interesting parts of the verbose log to show the exact failures.
I think for hashsum --download it will attempt to re-open the file up to --low-level-retries times and then give up. It doesn't do a high-level retry (--retries) as it is streaming the hash outputs.
It could potentially try harder though.
If it has gotten as far as writing ERROR then it has given up for that file.
You could potentially grep those lines out and then feed them back into rclone with --files-from (a rough sketch of that is below).
It would be nice if rclone did this automatically on a high-level retry though!
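Something along these lines, for example (the hash output file name and remote are made up, and the exact spacing of the ERROR lines may differ, so treat this as a sketch rather than a ready-made recipe):
# pull the failed paths out of the hash output (they are relative to the remote root)
grep '^ERROR' hashes.md5 | sed 's/^ERROR *//' > failed.txt
# re-hash just those files, with more low level retries, appending to the original output
rclone hashsum MD5 --download --low-level-retries 20 --files-from failed.txt remote:bucket >> hashes.md5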
Good that rclone is streaming: some of the files don't fit in the local storage of the machine where rclone is running.
Good tip! I had used rclone hashsum MD5 with the specific file_path/file_name, but then the output of hashsum was the filename only... (I did a bit of scripting). Next time I'll use --files-from!
For the last run I had 3 files with an error out of 4 million files, 18TB.
(by the way, earlier this week I found a real error thanks to the hashsum --download: a file was listed in the object storage, with its metadata, but it was not retrievable)
Two of the files I still need to look at a bit more closely. One case is below: it seems it was a low-level error, and rclone retried so many times in quick succession that none of the attempts succeeded:
2021/02/20 22:46:51 DEBUG : pacer: low level retry 1/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:51 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:51 DEBUG : pacer: low level retry 2/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:51 DEBUG : pacer: Rate limited, increasing sleep to 20ms
2021/02/20 22:46:51 DEBUG : pacer: low level retry 3/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:51 DEBUG : pacer: Rate limited, increasing sleep to 40ms
2021/02/20 22:46:51 DEBUG : pacer: low level retry 4/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:51 DEBUG : pacer: Rate limited, increasing sleep to 80ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 60ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 45ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 33.75ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 25.3125ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 18.984375ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 14.238281ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 10.67871ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
2021/02/20 22:46:52 DEBUG : pacer: low level retry 5/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:52 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
2021/02/20 22:46:52 DEBUG : pacer: low level retry 6/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:52 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
2021/02/20 22:46:52 DEBUG : pacer: low level retry 7/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:52 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
2021/02/20 22:46:52 DEBUG : pacer: low level retry 8/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:52 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
2021/02/20 22:46:52 DEBUG : pacer: low level retry 9/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:52 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
2021/02/20 22:46:52 DEBUG : pacer: low level retry 10/10 (error RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files)
2021/02/20 22:46:52 DEBUG : pacer: Rate limited, increasing sleep to 10ms
2021/02/20 22:46:52 ERROR : directory1/directory2/directory3/file_name: Failed to open file directory1/directory2/directory3/file_name: RequestError: send request failed
caused by: Get "https://a_ceph_server/bucket_name/directory1/directory2/directory3/file_name": dial tcp [some:ipv6::214]:443: socket: too many open files
2021/02/20 22:46:52 DEBUG : pacer: Reducing sleep to 0s
File size was about 240 KB.
This "too many open files opened": is it in the rclone side?
If I do ulimit -n returns 1024. I don't know if it has anything to do with this limit? I'm using --transfers=30 because of many small files but I haven't checked if 20 would be enough.
Yes it does. That means you can have 1024 file descriptors open at any one time (which includes network connections).
You should need approximately --checkers + --transfers file descriptors, or maybe a small multiple of that, which shouldn't get anywhere near 1024.
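If you do want to rule the limit out, a quick sketch (4096 is just an arbitrary example value) is to raise the soft limit in the shell you start rclone from:
ulimit -Sn        # show the current soft limit (1024 in your case)
ulimit -Hn        # show the hard limit the soft limit can be raised up to
ulimit -n 4096    # raise the soft limit for this shell, then run rclone from it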
I wonder if rclone has a socket leak somewhere...
If you run lsof -c rclone while it is running, do you see the number of sockets increasing?
If so, then if you run rclone with the --rc flag you can do some debugging. The thing to look at first would be the number of goroutines: https://rclone.org/rc/#debugging-go-routine-leaks as sockets are very often leaked along with goroutines.
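For example, something like this, assuming the remote control is listening on its default address of localhost:5572 (the remote name is a placeholder):
# watch the descriptor/socket count while the hashsum is running
watch -n 10 'lsof -c rclone | wc -l'
# start the hashsum with the remote control enabled
rclone hashsum MD5 --download --rc remote:bucket > hashes.md5
# from another shell, dump the goroutine stacks to see if the count keeps growing
curl 'http://localhost:5572/debug/pprof/goroutine?debug=1'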
I haven't been able to reproduce the problem. I know that the Ceph object storage had some problems at the time (it was unavailable at certain times every hour) and I think that may have been the cause. I'll keep an eye on this though, and will report back in the future with the --rc flag if I can reproduce it.