My rclone mount is dropping

The next step would be to figure out why systemd seems to be stopping rclone, based on your information so far.

You can add:

KillMode=none
RestartSec=5

That keeps systemd from SIGKILLing rclone on stop, and makes it wait 5 seconds before attempting a restart. I'm on Ubuntu, so /etc/systemd/system.conf has a log level of info by default; you can change that to debug and see what's going on.
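For reference, a minimal sketch of where those lines live, assuming a standard Ubuntu layout: the two directives go under [Service] in your unit file, and the manager log level sits in /etc/systemd/system.conf (re-exec systemd after changing it).

```ini
# In the rclone unit file, under [Service]:
[Service]
KillMode=none
RestartSec=5

# In /etc/systemd/system.conf, then run `sudo systemctl daemon-reexec`:
[Manager]
LogLevel=debug
```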

You should get some logging like this:

Jul 14 17:24:25 gemini systemd[1]: Stopping RClone Service...
Jul 14 17:24:25 gemini systemd[1]: test.service: User lookup succeeded: uid=1000 gid=1000
Jul 14 17:24:25 gemini systemd[27410]: test.service: Executing: /bin/fusermount -uz /Test
Jul 14 17:24:25 gemini systemd[1]: test.service: Child 27410 belongs to test.service.
Jul 14 17:24:25 gemini systemd[1]: test.service: Control process exited, code=exited, status=0/SUCCESS
Jul 14 17:24:25 gemini systemd[1]: test.service: Got final SIGCHLD for state stop.
Jul 14 17:24:25 gemini systemd[1]: test.service: Changed stop -> stop-sigterm
Jul 14 17:24:25 gemini systemd[1]: test.service: Got notification message from PID 27371 (STOPPING=1)
Jul 14 17:24:25 gemini systemd[1]: test.service: Child 27371 belongs to test.service.
Jul 14 17:24:25 gemini systemd[1]: test.service: Main process exited, code=exited, status=1/FAILURE
Jul 14 17:24:25 gemini systemd[1]: test.service: Failed with result 'exit-code'.
Jul 14 17:24:25 gemini systemd[1]: test.service: Changed stop-sigterm -> failed
Jul 14 17:24:25 gemini systemd[1]: test.service: Job 4186 test.service/stop finished, result=done
Jul 14 17:24:25 gemini systemd[1]: Stopped RClone Service.
Jul 14 17:24:25 gemini systemd[1]: test.service: Unit entered failed state.

in your journalctl for that service. Based on your logs so far, though, it seems something SIGKILLed rclone; that makes systemd restart it, and the restart fails because the mount was never actually unmounted after the SIGKILL.

I think plexdrive has a teamdrive fix if you build from source as well. Use the best tool for your use case.

This part -> https://github.com/dweidenfeld/plexdrive/commit/bf61a476178d450e969bbf8fcc910c4c4c2aa25b

I don't know how to build a Go application. I tried, but it seems to have a lot of Go dependencies I have no idea how to get, lol.

But thanks for the support.

Ok I was able to reproduce the issue again.

My script runs ffprobe on ~4k files to check they are OK.
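The script itself wasn't posted, so here's a hedged sketch of that kind of integrity check, assuming a loop over the mount with ffprobe; the probe command is a parameter so you can swap in another tool (the mount path is taken from later in the thread):

```shell
# check_files DIR PROBE  -- run PROBE on every file under DIR and report
# any file the probe rejects (non-zero exit). With PROBE=ffprobe this is
# the kind of "are my files OK" sweep described above.
# NOTE: assumes newline-free filenames; for full robustness use
# `find -print0` with bash's `read -d ''`.
check_files() {
  find "$1" -type f | while IFS= read -r f; do
    "$2" "$f" >/dev/null 2>&1 || echo "BAD: $f"
  done
}

# usage against the mount (assumed path):
#   check_files /mnt/remote ffprobe
```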

I changed my systemd log level to debug and collected debug logs. This is what I see with systemctl status vfs:

● vfs.service - Rclone VFS Mount
Loaded: loaded (/etc/systemd/system/vfs.service; enabled; vendor preset: enabled)
Active: failed (Result: signal) since Tue 2019-09-10 06:59:09 UTC; 47min ago
Process: 17146 ExecStop=/bin/fusermount -uz /mnt/remote (code=exited, status=0/SUCCESS)
Process: 22068 ExecStart=/usr/bin/rclone mount --user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36 --config=/home/xtre
Process: 21931 ExecStartPre=/bin/sleep 10 (code=exited, status=0/SUCCESS)
Main PID: 22068 (code=killed, signal=KILL)

Sep 10 05:26:50 L1 systemd[1]: vfs.service: Got notification message from PID 22068 (READY=1)
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Changed start -> running
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Job vfs.service/start finished, result=done
Sep 10 05:26:50 L1 systemd[1]: Started Rclone VFS Mount.
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Failed to send unit change signal for vfs.service: Connection reset by peer
Sep 10 06:59:09 L1 systemd[1]: vfs.service: Child 22068 belongs to vfs.service.
Sep 10 06:59:09 L1 systemd[1]: vfs.service: Main process exited, code=killed, status=9/KILL
Sep 10 06:59:09 L1 systemd[1]: vfs.service: Failed with result 'signal'.
Sep 10 06:59:09 L1 systemd[1]: vfs.service: Changed running -> failed
Sep 10 06:59:09 L1 systemd[1]: vfs.service: Unit entered failed state.

And looking at my rclone debug logs I see the same behavior: everything is fine, then rclone just stops without writing anything.

I have this in my systemd unit after ExecStart:

ExecStop=/bin/fusermount -uz /mnt/remote
KillMode=none
RestartSec=5

Why would systemd kill rclone?

Can you share the full log with the journalctl command?

The status is only showing when it tried to start it again.

Too late now, they are long gone.

I'll reproduce the issue later.

All I can say is that after changing to plexdrive, I can run the same script and my mount doesn't break.

What would be the journalctl command?

It would look like this:

felix@gemini:~$ journalctl -u rclone
-- Logs begin at Fri 2019-08-30 07:26:42 EDT, end at Tue 2019-09-10 12:46:03 EDT. --
Sep 06 12:25:02 gemini systemd[1]: Starting RClone Service...
Sep 06 12:25:03 gemini systemd[1]: Started RClone Service.
Sep 08 07:27:58 gemini systemd[1]: rclone.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Sep 08 19:12:31 gemini systemd[1]: Stopping RClone Service...
Sep 08 19:12:31 gemini systemd[1]: Stopped RClone Service.
-- Reboot --
Sep 08 19:13:00 gemini systemd[1]: Starting RClone Service...
Sep 08 19:13:01 gemini systemd[1]: Started RClone Service.
Sep 09 14:23:45 gemini systemd[1]: Stopping RClone Service...
Sep 09 14:23:45 gemini systemd[1]: Stopped RClone Service.
-- Reboot --
Sep 09 14:24:29 gemini systemd[1]: Starting RClone Service...
Sep 09 14:24:31 gemini systemd[1]: Started RClone Service.
Sep 09 14:30:51 gemini systemd[1]: Stopping RClone Service...
Sep 09 14:30:51 gemini systemd[1]: Stopped RClone Service.
-- Reboot --
Sep 09 14:31:21 gemini systemd[1]: Starting RClone Service...
Sep 09 14:31:22 gemini systemd[1]: Started RClone Service.
Sep 09 15:39:02 gemini systemd[1]: Stopping RClone Service...
Sep 09 15:39:02 gemini fusermount[18044]: /bin/fusermount: entry for /GD not found in /etc/mtab
Sep 09 15:39:02 gemini systemd[1]: rclone.service: Control process exited, code=exited status=1
Sep 09 15:39:02 gemini systemd[1]: rclone.service: Failed with result 'exit-code'.
Sep 09 15:39:02 gemini systemd[1]: Stopped RClone Service.
-- Reboot --
Sep 09 15:39:52 gemini systemd[1]: Starting RClone Service...
Sep 09 15:39:53 gemini systemd[1]: Started RClone Service.
-- Reboot --
Sep 10 06:56:19 gemini systemd[1]: Starting RClone Service...
Sep 10 06:56:20 gemini systemd[1]: Started RClone Service.

My service is named rclone.service.

Why would the logs be gone?

Running ffprobe on ~4k files is really nothing. I've stressed my mount with much more than that, and since your logs show you aren't actually crashing rclone, it's got to be something else.

The else is what has me curious.
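One possibility worth ruling out (an assumption on my part, not something the status output confirms): a main process killed with status=9/KILL while KillMode=none is set often points at the kernel OOM killer rather than systemd itself. The kernel log records those kills:

```shell
# Look for OOM-killer activity around the crash time; the kernel logs a
# "Killed process <pid> (rclone)" line whenever it SIGKILLs a process.
dmesg -T 2>/dev/null | grep -iE 'out of memory|killed process' \
  || echo "no OOM kills found in the kernel log"
```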

My service is called vfs so I did journalctl -r -u vfs

See if you can grab anything useful; I'm getting Sep 10 output only:

Sep 10 07:53:54 L1 systemd[1]: vfs.service: Job vfs.service/stop finished, result=done
Sep 10 07:53:54 L1 systemd[1]: vfs.service: Enqueued job vfs.service/stop as 207803
Sep 10 07:53:54 L1 systemd[1]: vfs.service: Installed new job vfs.service/stop as 207803
Sep 10 07:53:54 L1 systemd[1]: vfs.service: Trying to enqueue job vfs.service/stop/replace
Sep 10 06:59:09 L1 systemd[1]: vfs.service: Unit entered failed state.
Sep 10 06:59:09 L1 systemd[1]: vfs.service: Changed running -> failed
Sep 10 06:59:09 L1 systemd[1]: vfs.service: Failed with result 'signal'.
Sep 10 06:59:09 L1 systemd[1]: vfs.service: Main process exited, code=killed, status=9/KILL
Sep 10 06:59:09 L1 systemd[1]: vfs.service: Child 22068 belongs to vfs.service.
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Failed to send unit change signal for vfs.service: Connection reset by peer
Sep 10 05:26:50 L1 systemd[1]: Started Rclone VFS Mount.
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Job vfs.service/start finished, result=done
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Changed start -> running
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Got notification message from PID 22068 (READY=1)
Sep 10 05:26:50 L1 systemd[22068]: vfs.service: Executing: /usr/bin/rclone mount '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131
Sep 10 05:26:50 L1 systemd[1]: vfs.service: User lookup succeeded: uid=1000 gid=1000
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Changed start-pre -> start
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Forked /usr/bin/rclone as 22068
Sep 10 05:26:50 L1 systemd[1]: vfs.service: About to execute: /usr/bin/rclone mount '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.1
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Passing 0 fds to service
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Got final SIGCHLD for state start-pre.
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Control process exited, code=exited status=0
Sep 10 05:26:50 L1 systemd[1]: vfs.service: Child 21931 belongs to vfs.service.
Sep 10 05:26:40 L1 systemd[21931]: vfs.service: Executing: /bin/sleep 10
Sep 10 05:26:40 L1 systemd[1]: vfs.service: User lookup succeeded: uid=1000 gid=1000
Sep 10 05:26:40 L1 systemd[1]: Starting Rclone VFS Mount...
Sep 10 05:26:40 L1 systemd[1]: vfs.service: Changed failed -> start-pre
Sep 10 05:26:40 L1 systemd[1]: vfs.service: Forked /bin/sleep as 21931
Sep 10 05:26:40 L1 systemd[1]: vfs.service: About to execute: /bin/sleep 10
Sep 10 05:26:40 L1 systemd[1]: vfs.service: Enqueued job vfs.service/start as 207683
Sep 10 05:26:40 L1 systemd[1]: vfs.service: Installed new job vfs.service/start as 207683
Sep 10 05:26:40 L1 systemd[1]: vfs.service: Trying to enqueue job vfs.service/start/replace
Sep 10 05:24:08 L1 systemd[1]: vfs.service: cgroup is empty
Sep 10 05:24:08 L1 systemd[1]: vfs.service: Child 22069 belongs to vfs.service.
Sep 10 05:24:08 L1 systemd[1]: vfs.service: Got notification message from PID 22069, but reception only permitted for main PID which is currently not known
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Unit entered failed state.
Sep 10 05:23:42 L1 systemd[1]: Failed to start Rclone VFS Mount.
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Job vfs.service/start finished, result=failed
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Failed to destroy cgroup /system.slice/vfs.service, ignoring: Device or resource busy
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Changed start -> failed
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Failed with result 'exit-code'.
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Main process exited, code=exited, status=1/FAILURE
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Child 18243 belongs to vfs.service.
Sep 10 05:23:42 L1 systemd[18243]: vfs.service: Executing: /usr/bin/rclone mount '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131
Sep 10 05:23:42 L1 systemd[1]: vfs.service: User lookup succeeded: uid=1000 gid=1000
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Changed start-pre -> start
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Forked /usr/bin/rclone as 18243
Sep 10 05:23:42 L1 systemd[1]: vfs.service: About to execute: /usr/bin/rclone mount '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.1
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Passing 0 fds to service
Sep 10 05:23:42 L1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Found left-over process 22069 (rclone) in control group while starting unit. Ignoring.
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Got final SIGCHLD for state start-pre.
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Control process exited, code=exited status=0
Sep 10 05:23:42 L1 systemd[1]: vfs.service: Child 18149 belongs to vfs.service.
Sep 10 05:23:32 L1 systemd[18149]: vfs.service: Executing: /bin/sleep 10
Sep 10 05:23:32 L1 systemd[1]: vfs.service: User lookup succeeded: uid=1000 gid=1000
Sep 10 05:23:32 L1 systemd[1]: Starting Rclone VFS Mount...
Sep 10 05:23:32 L1 systemd[1]: vfs.service: Changed failed -> start-pre
Sep 10 05:23:32 L1 systemd[1]: vfs.service: Forked /bin/sleep as 18149
Sep 10 05:23:32 L1 systemd[1]: vfs.service: About to execute: /bin/sleep 10
Sep 10 05:23:32 L1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 10 05:23:32 L1 systemd[1]: vfs.service: Found left-over process 22069 (rclone) in control group while starting unit. Ignoring.
Sep 10 05:23:32 L1 systemd[1]: vfs.service: Enqueued job vfs.service/start as 207618
Sep 10 05:23:32 L1 systemd[1]: vfs.service: Installed new job vfs.service/start as 207618
Sep 10 05:23:32 L1 systemd[1]: vfs.service: Trying to enqueue job vfs.service/start/replace
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Unit entered failed state.
Sep 10 05:22:19 L1 systemd[1]: Failed to start Rclone VFS Mount.
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Job vfs.service/start finished, result=failed
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Failed to destroy cgroup /system.slice/vfs.service, ignoring: Device or resource busy
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Changed start -> failed
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Failed with result 'exit-code'.
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Main process exited, code=exited, status=1/FAILURE
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Child 17295 belongs to vfs.service.
Sep 10 05:22:19 L1 systemd[17295]: vfs.service: Executing: /usr/bin/rclone mount '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131
Sep 10 05:22:19 L1 systemd[1]: vfs.service: User lookup succeeded: uid=1000 gid=1000
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Changed start-pre -> start
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Forked /usr/bin/rclone as 17295
Sep 10 05:22:19 L1 systemd[1]: vfs.service: About to execute: /usr/bin/rclone mount '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.1
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Passing 0 fds to service
Sep 10 05:22:19 L1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Found left-over process 22069 (rclone) in control group while starting unit. Ignoring.
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Got final SIGCHLD for state start-pre.
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Control process exited, code=exited status=0
Sep 10 05:22:19 L1 systemd[1]: vfs.service: Child 17202 belongs to vfs.service.
Sep 10 05:22:09 L1 systemd[17202]: vfs.service: Executing: /bin/sleep 10
Sep 10 05:22:09 L1 systemd[1]: vfs.service: User lookup succeeded: uid=1000 gid=1000
Sep 10 05:22:09 L1 systemd[1]: Starting Rclone VFS Mount...
Sep 10 05:22:09 L1 systemd[1]: vfs.service: Changed dead -> start-pre
Sep 10 05:22:09 L1 systemd[1]: vfs.service: Forked /bin/sleep as 17202
Sep 10 05:22:09 L1 systemd[1]: vfs.service: About to execute: /bin/sleep 10
Sep 10 05:22:09 L1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 10 05:22:09 L1 systemd[1]: vfs.service: Found left-over process 22069 (rclone) in control group while starting unit. Ignoring.
Sep 10 05:22:09 L1 systemd[1]: vfs.service: Enqueued job vfs.service/start as 207558
Sep 10 05:22:09 L1 systemd[1]: vfs.service: Installed new job vfs.service/start as 207558
Sep 10 05:22:09 L1 systemd[1]: vfs.service: Trying to enqueue job vfs.service/start/replace
Sep 10 05:22:04 L1 systemd[1]: Stopped Rclone VFS Mount.
Sep 10 05:22:04 L1 systemd[1]: vfs.service: Job vfs.service/stop finished, result=done
Sep 10 05:22:04 L1 systemd[1]: vfs.service: Failed to destroy cgroup /system.slice/vfs.service, ignoring: Device or resource busy
Sep 10 05:22:04 L1 systemd[1]: vfs.service: Changed stop -> dead
Sep 10 05:22:04 L1 systemd[1]: vfs.service: Got final SIGCHLD for state stop.
Sep 10 05:22:04 L1 systemd[1]: vfs.service: Control process exited, code=exited status=0
Sep 10 05:22:04 L1 systemd[1]: vfs.service: Child 17146 belongs to vfs.service.
Sep 10 05:22:04 L1 systemd[17146]: vfs.service: Executing: /bin/fusermount -uz /mnt/remote
Sep 10 05:22:04 L1 systemd[1]: vfs.service: User lookup succeeded: uid=1000 gid=1000
Sep 10 05:22:04 L1 systemd[1]: Stopping Rclone VFS Mount...
Sep 10 05:22:04 L1 systemd[1]: vfs.service: Changed running -> stop
Sep 10 05:22:04 L1 systemd[1]: vfs.service: Forked /bin/fusermount as 17146
Sep 10 05:22:04 L1 systemd[1]: vfs.service: About to execute: /bin/fusermount -uz /mnt/remote
Sep 10 05:22:04 L1 systemd[1]: vfs.service: Enqueued job vfs.service/stop as 207557
Sep 10 05:22:04 L1 systemd[1]: vfs.service: Installed new job vfs.service/stop as 207557
Sep 10 05:22:04 L1 systemd[1]: vfs.service: Trying to enqueue job vfs.service/stop/replace
Sep 10 05:19:14 L1 systemd[1]: vfs.service: Unknown serialization key: ref-gid

The issue is I have too many logs at this point. I probably should wipe everything and crash it again.

I think the lines just below that are your key, as a person/process/something issued a stop on rclone.

Something 'stopped' it based on that log.

What's the current service file you are running with that service?

(Sorry, I pasted the wrong line in the first time and had to edit my post)

It could be me stopping / starting it after it crashed? Hard to know...

[Unit]
Description=Rclone VFS Mount
After=network-online.target

[Service]
User=user
Group=user
Type=notify
ExecStartPre=/bin/sleep 10
ExecStart=/usr/bin/rclone mount \
  --user-agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36' \
  --config=/home/user/.config/rclone/rclone.conf \
  --allow-other \
  --rc \
  --rc-addr=localhost:5572 \
  --drive-skip-gdocs \
  --vfs-read-chunk-size=10M \
  --vfs-read-chunk-size-limit=0 \
  --buffer-size=20M \
  --poll-interval=1m \
  --no-modtime \
  --dir-cache-time=24h \
  --timeout=10m \
  --drive-chunk-size=64M \
  --umask=002 \
  --log-level DEBUG \
  --log-file /opt/rclone/logs/rclone.log \
  1: /mnt/remote
ExecStop=/bin/fusermount -uz /mnt/remote
KillMode=none
RestartSec=5

[Install]
WantedBy=default.target

If you can timestamp when you are starting it, that might be helpful as well.
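A simple way to do that (a hypothetical wrapper, assuming the vfs unit name from earlier):

```shell
# Print a wall-clock marker, then start the unit; the marker can later be
# lined up against `journalctl -u vfs` output.
date -u '+started vfs at %Y-%m-%d %H:%M:%S UTC'
# sudo systemctl start vfs    # commented out here: needs root and systemd
```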

The only thing I've seen that causes systemd to stop a service otherwise would be a Requires= dependency.

Like I do a requires so my torrent stuff stops if my mount goes away:

Requires=gmedia.service
After=network-online.target gmedia.service

But you don't have that in your service file so nothing should stop it.

It is indeed a mystery...

I cleaned my logs with:

journalctl --rotate
journalctl --vacuum-time=1s

In about 12 hours I can try to crash the mount again; with empty logs it will be easier to find what's happening.

I feel like if we find the end of this tunnel, we should play the lottery.