Rclone stucks at the end of a big file upload


#1

Hello!

I’m using rclone to upload a lot of files from my NAS to the Yandex Disk cloud. I’m also using “crypt” option to encrypt them before transfer (“crypt” is set on top of “yandex” config). My internet connection speed is about 100Mbps, speed displayed by rclone is something like “10,xxx Mbytes/s” at almost any moment.

At first a had a problem where after finishing to upload a big file (bigger than about 10GB) rclone had just stopped for a few minutes and then started to upload the same file again, it repeated the process infinite amount of times and never actually uploaded the file. I collected a debug log and it was like this:

2018/12/27 21:52:11 DEBUG : rclone: Version "v1.45" starting with parameters ["rclone" "copy" "/mnt/syno/FILE.mkv" "yandex-crypt:syno" "--stats" "1s" "--progress" "--max-backlog" "50000" "--log-file=/home/user1/rclone-logs/rclone-backup.log" "--log-level" "DEBUG"]
2018/12/27 21:52:11 DEBUG : Using config file from "/home/user1/.config/rclone/rclone.conf"
2018/12/27 21:52:13 DEBUG : FILE.mkv: Couldn't find file - need to transfer
2018/12/27 22:41:50 DEBUG : pacer: Rate limited, increasing sleep to 20ms
2018/12/27 22:41:50 DEBUG : pacer: low level retry 1/10 (error Put https://uploader22g.disk.yandex.net:443/upload-target/[cleaned]: net/http: timeout awaiting response headers)
2018/12/27 22:41:50 DEBUG : Couldn't decode error response: EOF
2018/12/27 22:41:50 DEBUG : pacer: Rate limited, increasing sleep to 40ms
2018/12/27 22:41:50 DEBUG : pacer: low level retry 2/10 (error [500 - ]  (500 Server Error))
2018/12/27 22:41:50 DEBUG : Couldn't decode error response: EOF
2018/12/27 22:41:50 DEBUG : pacer: Rate limited, increasing sleep to 80ms
2018/12/27 22:41:50 DEBUG : pacer: low level retry 3/10 (error [500 - ]  (500 Server Error))
2018/12/27 22:41:50 DEBUG : Couldn't decode error response: EOF
2018/12/27 22:41:50 DEBUG : pacer: Rate limited, increasing sleep to 160ms
2018/12/27 22:41:50 DEBUG : pacer: low level retry 4/10 (error [500 - ]  (500 Server Error))
2018/12/27 22:41:50 DEBUG : Couldn't decode error response: EOF
2018/12/27 22:41:50 DEBUG : pacer: Rate limited, increasing sleep to 320ms
2018/12/27 22:41:50 DEBUG : pacer: low level retry 5/10 (error [500 - ]  (500 Server Error))
2018/12/27 22:41:51 DEBUG : Couldn't decode error response: EOF
2018/12/27 22:41:51 DEBUG : pacer: Rate limited, increasing sleep to 640ms
2018/12/27 22:41:51 DEBUG : pacer: low level retry 6/10 (error [500 - ]  (500 Server Error))
2018/12/27 22:41:51 DEBUG : Couldn't decode error response: EOF
2018/12/27 22:41:51 DEBUG : pacer: Rate limited, increasing sleep to 1.28s
2018/12/27 22:41:51 DEBUG : pacer: low level retry 7/10 (error [500 - ]  (500 Server Error))
2018/12/27 22:41:52 DEBUG : Couldn't decode error response: EOF
2018/12/27 22:41:52 DEBUG : pacer: Rate limited, increasing sleep to 2s
2018/12/27 22:41:52 DEBUG : pacer: low level retry 8/10 (error [500 - ]  (500 Server Error))
2018/12/27 22:41:53 DEBUG : Couldn't decode error response: EOF
2018/12/27 22:41:53 DEBUG : pacer: low level retry 9/10 (error [500 - ]  (500 Server Error))
2018/12/27 22:41:55 DEBUG : Couldn't decode error response: EOF
2018/12/27 22:41:55 DEBUG : pacer: low level retry 10/10 (error [500 - ]  (500 Server Error))
2018/12/27 22:41:55 DEBUG : FILE.mkv: Received error: [500 - ]  (500 Server Error) - low level retry 1/10
2018/12/27 22:41:57 DEBUG : pacer: Reducing sleep to 1.5s
2018/12/27 22:41:59 DEBUG : pacer: Reducing sleep to 1.125s

After a little bit of googling I’ve found a topic where people had similar problems and it was recommended there to increase “–timeout” to some significant value. I’ve tried with “–timeout 300m” and it sort of helped.

I’m saying “sort of” because now after rclone finishes to upload a big file it’s just stucks at 100% for some time before the transfer finally completes. This time can be long, i.e. about 30-40 minutes for a file of about 30GB.

Also I’m occasionally seeing strings like this in the log after transferring a big files, even though the files seems to be eventually transferred successfully:

2018/12/28 03:42:20 DEBUG : pacer: Rate limited, increasing sleep to 20ms
2018/12/28 03:42:20 DEBUG : pacer: low level retry 1/10 (error [500 - InternalServerError] Internal Server Error (Ошибка сервера.))
2018/12/28 03:42:20 DEBUG : pacer: Reducing sleep to 15ms
2018/12/28 03:42:20 DEBUG : pacer: Reducing sleep to 11.25ms
2018/12/28 03:42:20 DEBUG : pacer: Reducing sleep to 10ms

So, with this being said I wanted to ask a few questions, some for educational and some for practical purposes:

  1. Why is rclone stucks after transferring a big file and what exactly is it doing while being stucked? Is there some checking going on or something else?

  2. Is there any other way to make sure that big files will be transferred successfully without increasing the timeout? And isn’t it considered an rclone bug that without increasing timeout value transfers fail?

  3. Could there be any side effects of increasing timeout to some extreme values? I’m currently have a timeout set to 1440 minutes (24 hours) just because I want to be sure that all my files will be transferred without errors. while I’m not keeping eye on the process.


#2

That error doesn’t look like a timeout but a 500 from Yandex. I’m not sure increasing the timeout would do anything.

Are they have a problem on their end?


#3

It doesn’t seem that they have any problems, and I also don’t think that the issue is caused by problems on their side.

As I said, once I increased timeout value the file was transferred successfully and with default timeout value transfer always failed (my server tried to upload the same file all night before I noticed the problem in the morning).

Here is the log of successful transfer with timeout set to 300m:

2018/12/27 22:50:06 DEBUG : rclone: Version "v1.45" starting with parameters ["rclone" "copy" "/mnt/syno/FILE.mkv" "yandex-crypt:syno" "--stats" "1s" "--progress" "--max-backlog" "50000" "--timeout" "300m" "--log-file=/home/user1/rclone-logs/rclone-backup.log" "--log-level" "DEBUG"]
2018/12/27 22:50:06 DEBUG : Using config file from "/home/user1/.config/rclone/rclone.conf"
2018/12/27 22:50:07 DEBUG : FILE.mkv: Couldn't find file - need to transfer
2018/12/28 00:00:38 INFO  : FILE.mkv: Copied (new)
2018/12/28 00:00:38 INFO  : 
Transferred:   	   28.754G / 28.754 GBytes, 100%, 6.958 MBytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            1 / 1, 100%
Elapsed time:  1h10m31.8s

2018/12/28 00:00:38 DEBUG : 4 go routines active
2018/12/28 00:00:38 DEBUG : rclone: Version "v1.45" finishing with parameters ["rclone" "copy" "/mnt/syno/FILE.mkv" "yandex-crypt:syno" "--stats" "1s" "--progress" "--max-backlog" "50000" "--timeout" "300m" "--log-file=/home/user1/rclone-logs/rclone-backup.log" "--log-level" "DEBUG"]

The only problem with this transfer is that before at was actually finished rclone displayed its status like that:

* syno/FILE.mkv: 100% /28.754G, 0/s -

for almost 40 minutes


#4

This part of the error will definitely be improved by increasing --timeout

rclone is waiting for Yandex to respond that the file has been uploaded.

Yandex is probably calculating the MD5SUM at that point - that is exactly what Amazon Drive does and it has the same long pause at the end.

That is probably just Yandex being overloaded - rclone will try again.

I think I answered that above

Hmm, yes it could be considered a bug definitely. Or at very least there should be a note in the documentation.

Do you fancy submitting a patch to the yandex docs explaining that timeout might need to be raised?

It will mean that rclone will not notice if a connection goes bad for longer - that is all.


#5

Thanks a lot for the answers! I don’t mind to submit a patch, but I’m not really sure what that means and how to do it :grinning:


#6

No worries!

I’ve added this to the docs using your numbers above to make a formula for a timeout I think should work.

Limitations

When uploading very large files (bigger than about 5GB) you will need
to increase the --timeout parameter. This is because Yandex pauses
(perhaps to calculate the MD5SUM for the entire file) before returning
confirmation that the file has been uploaded. The default handling of
timeouts in rclone is to assume a 5 minute pause is an error and close
the connection - you’ll see net/http: timeout awaiting response headers errors in the logs if this is happening. Setting the timeout
to twice the max size of file in GB should be enough, so if you want
to upload a 30GB file set a timeout of 2 * 30 = 60m, that is
--timeout 60m.


#7

Thant’s great! Thanks a lot again!