I am seeing an issue with rclone hanging when a TCP connection stops sending traffic. This seems very similar to a couple of previously reported issues, but not exactly the same. We are seeing it on downloads from an S3-compatible system to local disk with high concurrency (e.g. 40 transfers) and high throughput (e.g. 10 Gbps).
I found that setting --timeout=1m works around the issue. With this setting rclone deletes the partial download and restarts it. What I don’t understand is why, with the default timeout (5 minutes), rclone hangs and never deletes and retries these partial downloads. I suspect the root cause is in the Go library, since we have another S3 client written in Go that has the exact same issue. I’m interested to see if others are seeing this issue and if anyone has ideas on where the bug is.
As for the root cause of the TCP connection not sending any data, I suspect an issue in the network, but a client still shouldn’t hang if a connection stops sending traffic.
Since this is my first post here, I will say rclone is awesome. It has allowed me to do rapid testing of our S3-like system without needing to write code, and also to grab files from other oddball systems like box.com. I’m astounded at how useful this tool is…
Interesting… I’ve had the suspicion over the years that there is something wrong with the timeout mechanism in rclone, but I’ve never managed to reproduce a problem in any of my tests.
Doubly interesting that --timeout 1m works but --timeout 5m doesn’t.
rclone implements an idle timeout on its data channels, so the idea is that it is supposed to time out after 5 minutes of the channel being idle if it is in the middle of a transfer. That isn’t a standard Go feature - it is implemented in rclone.
That is the code I suspect of not always working, but I haven’t been able to figure out why.
I agree 100%.
I did a little test to see if the timeout was working…
I started a transfer from S3, then used netstat to find the IP address in use and added iptables rules to block input from and output to that IP. After 5 minutes I saw
Thanks for the reply @ncw.
I created an issue: https://github.com/ncw/rclone/issues/2057
Attached to the issue are verbose log files from runs with both the 1m and 5m timeout settings. With the 5m setting it detects some dead connections but not all of them, which results in the hang.
I’m wondering if there is enough keepalive traffic in the 5m window for rclone to think the connection is still alive (out of my depth here, so this may be way off target…)