Rclone Docker Volume - How to avoid freezing the container when the source is unavailable

What is the problem you are having with rclone?

I am prototyping a system design and it's looking very nice: an SFTP server (A) and another server (B) where rclone presents the SFTP remote to the Docker containers as a volume. It seems to work well until I kill the SFTP server; then, instead of getting a timeout or an error on the file IO, the terminal just hangs... it comes back about two minutes or so after I restore the SFTP server.

What I would like is for the IO to time out within a second, and then my application inside the container will go to another rclone Docker volume, backed by a backup location on B2, to look for the files.

Is there a way, on the rclone side or on the Docker container side, to make the IO time out?
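For context, this is roughly the fallback logic I have in mind inside the container. It's only a shell sketch with made-up names: /n3 is the SFTP-backed volume as above, while /b2 and the output path are placeholders for the B2-backed backup volume I haven't built yet:

```
#!/bin/sh
# Hypothetical fallback: try the primary (SFTP-backed) volume with a 1-second
# timeout; if that fails or hangs, read the same file from the backup volume.
FILE=test1.img            # placeholder file name
if timeout 1 cat "/n3/$FILE" > /tmp/copy 2>/dev/null; then
    echo "read $FILE from primary volume"
else
    echo "primary volume unavailable, falling back to backup volume"
    cat "/b2/$FILE" > /tmp/copy
fi
```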

What is your rclone version (output from rclone version)

rclone v1.56.0

  • os/version: ubuntu 20.04 (64 bit)
  • os/kernel: 5.4.0-80-generic (x86_64)
  • os/type: linux
  • os/arch: amd64
  • go/version: go1.16.5
  • go/linking: static
  • go/tags: none

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Ubuntu 20.04

Which cloud storage system are you using? (eg Google Drive)

SFTP

The command you were trying to run (eg rclone copy /tmp remote:tmp)

docker volume create n3cached2 -d rclone -o remote=node3sftp: -o poll-interval=0 -o vfs-cache-mode=full -o allow-other=true -o vfs-cache-max-size=15G -o dir-cache-time=0

docker run --rm -it -v n3cached2:/n3 --workdir /n3 ubuntu:latest bash

The rclone config contents with secrets removed.

/var/lib/docker-plugins/rclone/config/rclone.conf

[node3sftp]
type = sftp
host = IP Address
user = user
pass = password
md5sum_command = md5sum
sha1sum_command = sha1sum

A log from the command with the -vv flag


Interesting: there is some kind of retry happening somewhere, because when I reconnect server A while a transfer is running inside the container, it eventually notices and resumes the transfer after maybe 4 minutes.

I also note that this works for the rclone mount:
rclone mount node3sftp: /mnt/2 --daemon-timeout=1s

but not here:

docker volume create n3cached3 -d rclone -o remote=node3sftp: -o poll-interval=0 -o vfs-cache-mode=full -o allow-other=true -o vfs-cache-max-size=15G -o dir-cache-time=0 -o daemon-timeout=1s

*Maybe it was a fluke that the above worked, as I can't replicate it again?

In the container,

for mount:

node3sftp: on /n3 type fuse.rclone (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)

maybe I need to somehow get daemon-timeout in there?

The --daemon-timeout name is misleading. It would be better named --macos-mount-kernel-timeout, but it's called what it's called and we'll keep the name for compatibility. This option only works on macOS, where it sets a timeout in the kernel mount handler; it has no effect on other systems.

I'm not a guru of kernel mount APIs, but AFAIK the FUSE API does not give the mount provider much means to signal a critical condition or ask to terminate. Once the mount is established, rclone is left to retry forever on all errors. All you can do is "docker kill" your container.
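For example (the container name here is only illustrative):

docker kill my-n3-container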

Thanks for your reply, I have some verbose logs now:

I found some new global flags that looked interesting but don't help:

rclone mount node3sftp: /mnt/2 --timeout=1s -vv --contimeout=1s --retries=3

```
2021/08/28 13:19:34 DEBUG : &{test1.img (r)}: Read: len=131072, offset=1067974656
2021/08/28 13:19:34 DEBUG : &{test1.img (r)}: >Read: read=131072, err=
2021/08/28 13:19:34 DEBUG : &{test1.img (r)}: Read: len=131072, offset=1068105728
2021/08/28 13:19:34 DEBUG : &{test1.img (r)}: >Read: read=131072, err=
2021/08/28 13:19:34 DEBUG : &{test1.img (r)}: Read: len=131072, offset=1068236800
2021/08/28 13:19:34 DEBUG : &{test1.img (r)}: >Read: read=131072, err=
2021/08/28 13:19:34 DEBUG : &{test1.img (r)}: Read: len=131072, offset=1068367872
2021/08/28 13:19:34 ERROR : test1.img: ReadFileHandle.Read error: low level retry 1/10: connection lost
2021/08/28 13:19:34 DEBUG : test1.img: ReadFileHandle.seek from 1068367872 to 1068367872
2021/08/28 13:19:34 DEBUG : test1.img: ReadFileHandle.Read seek close old failed: connection lost
2021/08/28 13:19:34 DEBUG : test1.img: ChunkedReader.RangeSeek from -1 to 1068367872 length -1
2021/08/28 13:19:34 DEBUG : test1.img: ChunkedReader.openRange at 1068367872 length 134217728
2021/08/28 13:19:34 ERROR : sftp://remoterclone@10.0.0.4:22/: Discarding closed SSH connection: read tcp 10.0.0.3:42538->10.0.0.4:22: i/o timeout
2021/08/28 13:19:35 DEBUG : pacer: low level retry 1/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:35 DEBUG : pacer: Rate limited, increasing sleep to 200ms
2021/08/28 13:19:36 DEBUG : pacer: low level retry 2/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:36 DEBUG : pacer: Rate limited, increasing sleep to 400ms
2021/08/28 13:19:37 DEBUG : pacer: low level retry 3/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:37 DEBUG : pacer: Rate limited, increasing sleep to 800ms
2021/08/28 13:19:38 DEBUG : pacer: low level retry 4/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:38 DEBUG : pacer: Rate limited, increasing sleep to 1.6s
2021/08/28 13:19:39 DEBUG : pacer: low level retry 5/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:39 DEBUG : pacer: Rate limited, increasing sleep to 2s
2021/08/28 13:19:41 DEBUG : pacer: low level retry 6/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:43 DEBUG : pacer: low level retry 7/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:45 DEBUG : pacer: low level retry 8/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:47 DEBUG : pacer: low level retry 9/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:49 DEBUG : pacer: low level retry 10/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:49 DEBUG : test1.img: ReadFileHandle.Read seek failed: Open: couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout
2021/08/28 13:19:49 ERROR : test1.img: ReadFileHandle.Read error: low level retry 2/10: Open: couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout
2021/08/28 13:19:49 DEBUG : test1.img: ReadFileHandle.seek from 1068367872 to 1068367872
2021/08/28 13:19:49 DEBUG : test1.img: ReadFileHandle.Read seek close old failed: file already closed
2021/08/28 13:19:49 DEBUG : test1.img: ChunkedReader.RangeSeek from -1 to 1068367872 length -1
2021/08/28 13:19:49 DEBUG : test1.img: ChunkedReader.openRange at 1068367872 length 134217728
2021/08/28 13:19:51 DEBUG : pacer: low level retry 1/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:53 DEBUG : pacer: low level retry 2/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:55 DEBUG : pacer: low level retry 3/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
2021/08/28 13:19:57 DEBUG : pacer: low level retry 4/10 (error couldn't connect SSH: dial tcp 10.0.0.4:22: i/o timeout)
```

So why does it retry 10 times, and is there somewhere I can change that?

Ah! This works for the rclone mount: it gives an I/O error and then answers file requests again once the source comes back (see the combined command after the quoted docs below).

--low-level-retries NUMBER

This controls the number of low level retries rclone does.

A low level retry is used to retry a failing operation - typically one HTTP request. This might be uploading a chunk of a big file for example. You will see low level retries in the log with the -v flag.

This shouldn't need to be changed from the default in normal operations. However, if you get a lot of low level retries you may wish to reduce the value so rclone moves on to a high level retry (see the --retries flag) quicker.

Disable low level retries with --low-level-retries 1.
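So, combining this with the flags I tried earlier, something like the following works for the mount (the exact values are just what I'm experimenting with, not a recommendation):

rclone mount node3sftp: /mnt/2 --low-level-retries 1 --timeout 1s --contimeout 1s --retries 3 -vv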

Is there any way to support the same in the Docker volume driver?

Got it!

docker plugin install rclone/docker-volume-rclone:latest --grant-all-permissions --alias rclone

docker plugin disable rclone

docker plugin set rclone RCLONE_VERBOSE=2 args="--vfs-cache-mode=writes --allow-other --low-level-retries=1 --timeout=1s --contimeout=1s --retries=2"

docker plugin enable rclone


docker volume create n3cached4 -d rclone -o remote=node3sftp: -o poll-interval=0 -o vfs-cache-mode=full -o allow-other=true -o vfs-cache-max-size=15G -o dir-cache-time=0

docker run --rm -it -v n3cached4:/n3 --workdir /n3 ubuntu:latest bash
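To double-check that the plugin actually picked up the new settings, I believe docker plugin inspect will show them (rclone being the alias I installed the plugin under):

docker plugin inspect rclone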


*I will look at those retry values again with fresh eyes and see if they need tweaking, but it's working and instantly responds with an I/O error in the Docker container :slight_smile:
