Tcp timeout issue

tekkiwi · February 9, 2018, 12:07am

I am seeing an issue with rclone hanging in the case when a tcp connection stops sending traffic. This seems to be very similar to a couple of similar reported issues but not exactly the same. We are seeing the issue on downloads from a S3 compatible system to local disk with high concurrency (e.g. 40 transfers) and high throughput (e.g. 10Gbps)

I found that setting --time-out=1m works around the issue. rclone will delete the partial download and restart the download again with this setting. What I don’t understand is that the default time-out setting (5 minutes) causes rclone to hang and never delete and retry these partial downloads. I suspect the root cause is in the Go library since we have another S3 client written in Go that has the exact same issue. I’m interested to see if others are seeing this issue and if anyone has ideas on where the bug is.
As for the root cause of the tcp connection not sending any data I suspect an issue in the network, but a client still shouldn’t hang if a connection stops sending traffic.

Since this is my first post here I will say rclone is awesome and has allowed me to do rapid testing of our S3 alike system without needing to write code, and also grab files from other oddball systems like box.com, I’m astounded at how useful this tool is…

ncw · February 9, 2018, 9:46am

Interesting... I've had the suspicion over the years that there is something wrong with the timeout mechanism in rclone, but I've never managed to reproduce a problem in any of my tests.

Doubly interesting that --timeout 1m works but --timeout 5m doesn't.

rclone implements an idle timeout on its data channels, so the idea is that it is supposed to timeout after 5 minutes of the channel being idle if it is in the middle of a transfer. That isn't a standard go feature - it is implemented in rclone.

That is the code I'm suspicious of not always working but I haven't been able to figure out why.

I agree 100%.

I did a little test to see if the timeout was working...

I started a transfer from s3 then I used netstat to find the IP address in use and put in iptables rules to block the input and output to that IP. After 5 minutes I saw

2018/02/09 09:40:36 NOTICE: 100M: Removing partially written file on error: read tcp 10.x.y.z:43042->52.95.150.20:443: i/o timeout
2018/02/09 09:40:36 DEBUG : 100M: Received error: read tcp 10.x.y.z:43042->52.95.150.20:443: i/o timeout - low level retry 1/10

Commands were these if you want to experiment further

# Start Transfer
rclone copy -vv --buffer-size 0 --bwlimit 1M --stats 1s  s3:rclone-big/100M /tmp/big 2>&1 | tee timeout-s3.log
# In another terminal find the IP
netstat -tuanp | grep rclone
# Block it
sudo iptables -I INPUT 1 -s 52.95.150.20 -j DROP ; sudo iptables -I OUTPUT 1 -d 52.95.150.20 -j DROP
# Remove the block when finished
sudo iptables -D INPUT 1 ; sudo iptables -D OUTPUT 1

So the timeout code is working in some situations!

I wonder whether the problem is when the connection never establishes itself. I set all the timeouts in http.Transport and I pass that transport into the s3 sdk.

To debug this further I'd really like to see

logs (with -vv) of it going wrong.
tcpdump of the problem (not sure how exactly you'd capture this though in a multithreaded transfer)
a way of reproducing the problem myself!

If you'd like to continue to debug this then can you please make a new issue on github with as much of the above as you can find!

Thanks

tekkiwi · February 9, 2018, 4:37pm

Thanks for the reply @ncw.
I created an issue: https://github.com/ncw/rclone/issues/2057
Attached to the issue is a verbose log file from both 1m and 5m timeout setting. In the 5m setting it detects some dead connections but not all of them which results in the hang.
I’m wondering if there is enough keepalive traffic in the 5m windows for rclone to think the connection is still alive (out of my depth here, so this may be way off target…)

ncw · February 10, 2018, 9:16am

Thanks - will reply on the issue when I've analysed the logs!