Using rclone with parallel file systems (glusterfs) and docker swarm

What is the problem you are having with rclone?

Not so much a problem in itself; I'm mostly wondering whether I'm spinning my wheels with this idea because of things I don't know I don't know.

Background

I've got a little raspberry pi cluster set up with docker swarm, and a glusterfs-mounted filesystem for shared storage between the worker nodes.

What I'd like to do is sync the data on that glusterfs mount with google drive using rclone, so that the worker containers have access to the data in google drive, changes they make are synced back to google drive, and changes made from other systems are synced to google drive and seen by the workers as well.

What I was intending to do was use the rclone docker image and make it part of the swarm, so that if that container/worker dies, it just gets picked up by one of the other nodes.

What is currently a problem:

The data is only visible on the worker where the rclone container is running, because it's a mount.

What I think may also be a problem with that idea:

When that rclone container dies, it's going to (probably?) leave the mount in a bad state. Other workers won't actually be able to access the data there anymore, so they'll die. When the rclone container moves to another worker, it'll try to mount again, but the mount might be in a bad state.

Run the command 'rclone version' and share the full output of the command.

docker run --rm rclone/rclone:latest version
rclone v1.61.1

  • os/version: alpine 3.17.0 (64 bit)
  • os/kernel: 5.15.84-v8+ (aarch64)
  • os/type: linux
  • os/arch: arm64
  • go/version: go1.19.4
  • go/linking: static
  • go/tags: none

Which cloud storage system are you using? (eg Google Drive)

Google Drive

The command you were trying to run (eg rclone copy /tmp remote:tmp)

docker run --rm \
--user $(id -u):$(id -g) \
--volume /mnt/gfs.lich/.config/rclone:/config/rclone \
--volume /mnt/gfs.lich/.rclone-cache:/.rclone-cache \
--volume /mnt/gfs.lich/data:/data:shared \
--volume /etc/passwd:/etc/passwd:ro \
--volume /etc/group:/etc/group:ro \
--device /dev/fuse \
--cap-add SYS_ADMIN \
--security-opt apparmor:unconfined \
rclone/rclone mount gdrive:lich /data/lich \
--cache-dir /.rclone-cache/lich \
--allow-other \
--vfs-cache-mode full \
--vfs-case-insensitive \
--dir-cache-time 1000h \
--vfs-cache-max-age 1000h \
--poll-interval 15s \
--vfs-cache-poll-interval 15s

The rclone config contents with secrets removed.

[gdrive]
type = drive
client_id = [REDACTED]
client_secret = [REDACTED]
scope = drive
token = {"access_token":"[REDACTED]","token_type":"Bearer","refresh_token":"[REDACTED]","expiry":"2023-01-23T18:46:46.490186948Z"}
team_drive =

A log from the command with the -vv flag

N/A

Rclone mounts are unlikely to be left in a bad state so you don't need to worry about that.

They don't keep state locally (only cache depending on --vfs-cache-mode).

This happened:

pi@lichswarm1:/mnt/gfs.lich/data $ docker run --rm \
--user $(id -u):$(id -g) \
--volume /mnt/gfs.lich/.config/rclone:/config/rclone \
--volume /mnt/gfs.lich/.rclone-cache:/.rclone-cache \
--volume /mnt/gfs.lich/data:/data:shared \
--volume /etc/passwd:/etc/passwd:ro \
--volume /etc/group:/etc/group:ro \
--device /dev/fuse \
--cap-add SYS_ADMIN \
--security-opt apparmor:unconfined \
rclone/rclone mount gdrive:lich /data/lich \
--cache-dir /.rclone-cache/lich \
--allow-other \
--vfs-cache-mode full \
--vfs-case-insensitive \
--dir-cache-time 1000h \
--vfs-cache-max-age 1000h \
--poll-interval 15s \
--vfs-cache-poll-interval 15s
**^C2023/01/24 20:53:53 ERROR : /data/lich: Failed to unmount: exit status 1: fusermount: failed to unmount /data/lich: Resource busy**
pi@lichswarm1:/mnt/gfs.lich/data $ ls -al
ls: cannot access 'lich': Transport endpoint is not connected
total 0
drwxr-xr-x 3 pi pi  18 Jan 24 15:53 .
drwxr-xr-x 7 pi pi 143 Jan 23 12:13 ..
d????????? ? ?  ?    ?            ? lich
pi@lichswarm1:/mnt/gfs.lich/data $

pi@lichswarm1:/mnt/gfs.lich/data $ docker run --rm --user $(id -u):$(id -g) --volume /mnt/gfs.lich/.config/rclone:/config/rclone --volume /mnt/gfs.lich/.rclone-cache:/.rclone-cache --volume /mnt/gfs.lich/data:/data:shared --volume /etc/passwd:/etc/passwd:ro --volume /etc/group:/etc/group:ro --device /dev/fuse --cap-add SYS_ADMIN --security-opt apparmor:unconfined rclone/rclone mount gdrive:lich /data/lich --cache-dir /.rclone-cache/lich --allow-other --vfs-cache-mode full --vfs-case-insensitive --dir-cache-time 1000h --vfs-cache-max-age 1000h --poll-interval 15s --vfs-cache-poll-interval 15s
**2023/01/24 20:55:45 Fatal error: directory already mounted, use --allow-non-empty to mount anyway: /data/lich**

pi@lichswarm1:/mnt/gfs.lich/data $ docker run --rm --user $(id -u):$(id -g) --volume /mnt/gfs.lich/.config/rclone:/config/rclone --volume /mnt/gfs.lich/.rclone-cache:/.rclone-cache --volume /mnt/gfs.lich/data:/data:shared --volume /etc/passwd:/etc/passwd:ro --volume /etc/group:/etc/group:ro --device /dev/fuse --cap-add SYS_ADMIN --security-opt apparmor:unconfined rclone/rclone mount gdrive:lich /data/lich --cache-dir /.rclone-cache/lich --allow-other --vfs-cache-mode full --vfs-case-insensitive --dir-cache-time 1000h --vfs-cache-max-age 1000h --poll-interval 15s --vfs-cache-poll-interval 15s --allow-non-empty
**2023/01/24 20:56:13 mount helper error: fusermount: failed to access mountpoint /data/lich: Socket not connected**
**2023/01/24 20:56:13 Fatal error: failed to mount FUSE fs: fusermount: exit status 1**

You need to look in the rclone log to see what happened.

Generally, that means the rclone process was killed while there was still IO on the mount, so the mount could not be released.

You'd want to terminate all the IO on the mount before killing the process.

In normal circumstances, agreed. This was for testing for fault tolerance (if the docker container or worker becomes unavailable). The context for this was regarding:

I think in the end I'm just going to need to run rclone bisync regularly, so I'll probably build on the rclone docker image and add cron to run it on a schedule. Then if the container/worker running it goes belly up, swarm will handle bringing it up on another worker, and I think it should be ok.
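
For what it's worth, a rough sketch of what one scheduled bisync run could look like. This is not a tested setup: the script layout, the lock directory, and the work directory path are all assumptions for illustration. It puts the bisync work directory on the shared glusterfs mount via --workdir so that the sync state would follow the container if swarm moves it to another worker, and it uses a simple mkdir lock so that a slow run is not overlapped by the next cron tick on the same host.

```shell
#!/bin/sh
# Sketch only: all paths and variable names here are assumptions, adjust to taste.
RCLONE="${RCLONE:-rclone}"
REMOTE="${REMOTE:-gdrive:lich}"
LOCAL="${LOCAL:-/mnt/gfs.lich/data/lich}"
# Keep bisync's state on shared storage so another worker can pick it up.
WORKDIR="${WORKDIR:-/mnt/gfs.lich/.rclone-bisync}"
# Local lock; it only guards against overlapping runs on the same host.
LOCKDIR="${LOCKDIR:-/tmp/rclone-bisync.lock}"

run_bisync() {
    # mkdir either creates the directory or fails, so it works as a basic lock.
    if ! mkdir "$LOCKDIR" 2>/dev/null; then
        echo "bisync already running, skipping"
        return 0
    fi
    "$RCLONE" bisync "$REMOTE" "$LOCAL" --workdir "$WORKDIR"
    status=$?
    # Release the lock whether or not the sync succeeded.
    rmdir "$LOCKDIR"
    return "$status"
}
```

A cron entry would then call run_bisync at whatever interval suits. Note that the very first bisync run has to be done once by hand with --resync to establish the baseline, and if the container is killed mid-run the lock directory is only cleaned up automatically when it lives on the container's own ephemeral filesystem.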

It's super easy to reproduce that.

[felix@gemini ~]$ cd test
[felix@gemini test]$ ps -ef | grep test | grep rclone
felix    1474893  809279  0 12:14 pts/0    00:00:00 rclone mount GD: /home/felix/test
[felix@gemini test]$ ls
2GB.bin  blah  blah2  crypt  Dupes  Joeisms.docx  linkdir  test  test2  test3  testshare

[felix@gemini test]$ kill 1474893
[felix@gemini test]$ ls
ls: cannot open directory '.': Transport endpoint is not connected
[felix@gemini test]$

If you are sure the processes are down, you can do something like:

[felix@gemini test]$ cd
[felix@gemini ~]$ cd test
-bash: cd: test: Transport endpoint is not connected
[felix@gemini ~]$ fusermount -uz /home/felix/test
[felix@gemini ~]$ cd test
[felix@gemini test]$ ls
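
If you want to automate that cleanup before a remount (for example when the container comes up on a worker that still has a dead mount), something along these lines might work. Treat it as a sketch under assumptions: the function names are made up, and it treats "listed in /proc/mounts but ls fails" as the sign of a stale mount, which matches the "Transport endpoint is not connected" case shown above but is not an official rclone check.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; fusermount -uz is the command used above.
MOUNTS_FILE="${MOUNTS_FILE:-/proc/mounts}"
FUSERMOUNT="${FUSERMOUNT:-fusermount}"

is_listed_mount() {
    # The second field of each /proc/mounts line is the mountpoint.
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "$MOUNTS_FILE"
}

cleanup_stale_mount() {
    mp="$1"
    # Still listed as mounted, but listing it fails: the process behind it is gone.
    if is_listed_mount "$mp" && ! ls "$mp" >/dev/null 2>&1; then
        echo "stale mount at $mp, lazily unmounting"
        "$FUSERMOUNT" -uz "$mp"
    fi
}
```

You would call it as cleanup_stale_mount /home/felix/test (or whatever the host-side mountpoint is) right before starting rclone mount again. It does nothing for a healthy mount or for a path that is not mounted at all. Mountpoints containing spaces are escaped in /proc/mounts, so this simple comparison would miss them.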