Hi,
I ran into the following problem when using an rclone FUSE mount as a back-end for the borg backup program, with the VFS cache enabled. The problem may also exist without the cache, but I was not able to test that for my particular application, which needs the VFS cache.
Some background: I can do this backup operation fairly reliably using Google Drive File Stream, so it would be great if we could get things reliable with rclone mount as well.
Here's what I found using rclone with FUSE mounts when trying a borg backup.
I tried different caching strategies but still hit the same basic issue; it is illustrated below with the VFS cache:
PROBLEM:
After successfully writing some smaller files, and then a larger 501 MB file, the osxfuse mount's contents disappear after a few minutes, and borg is confused as to why everything has vanished.
The basic error seems to be:
"OSError: [Errno 6] Device not configured:"
After the error, the OS still shows the mount point
$ mount
gdrive_team_zzBackups:BorgBackup on /mnt/GD (osxfuse, nodev, nosuid, synchronous, mounted by xxx)
Yet nothing is there any longer:
$ ls /mnt/GD
ls: GD: Device not configured
We can cd into the directory and see nothing is there...
$ cd /mnt/GD
$ ls
.
Note that the kext is still loaded... e.g.
$ kextstat | grep osxfuse
166 0 0xffffff7f84efe000 0x19000 0x19000 com.github.osxfuse.filesystems.osxfuse (3.9.2) DA35320C-B016-3B0A-95A5-859B96D5852C <8 6 5 3 1>
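For what it's worth, once the mount is in this state, the only way forward seems to be to clear it and remount. A minimal recovery sketch (illustrative only -- plain umount can fail with "Resource busy" on a wedged osxfuse mount, in which case diskutil's force unmount usually works):
$ umount /mnt/GD
$ diskutil unmount force /mnt/GD
$ ./rclone_mount_VFS_only.sh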
BASIC SYSTEM INFO:
macOS 10.14.5 Mojave (latest at moment of writing)
OSXFuse 3.9.2 (latest at moment writing)
-- I also tried with OSXFuse 3.8.3 to see if the behavior was different. Behavior is same.
rclone v1.48.0
- os/arch: darwin/amd64
- go version: go1.12.6
rclone setup using VFS write-cache.
NOT using rclone backend cache for this particular test.
As you can see below, I'm basically using default values.
Test appears to be 100% reproducible.
Let's walk through the test...
Mount command contained in script (rclone_mount_VFS_only.sh) below:
#!/bin/sh
/usr/local/bin/rclone mount \
    gdrive_team_zzBackups:/BorgBackup /mnt/GD \
    --vfs-cache-mode=writes \
    --cache-dir=/Volumes/zvol01e_apfs/zRcloneCacheVFS \
    --log-file /Users/madison/rclone.log \
    --log-level DEBUG \
    --debug-fuse
As you can see, both rclone and FUSE are set for debug logging.
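For completeness, everything else is left at rclone defaults. If it helps with diagnosis, a variant I could try next simply adds a few of rclone's stock VFS/dir-cache/Drive knobs on top of the same command (illustrative only -- this is NOT what was used in the test below):
/usr/local/bin/rclone mount \
    gdrive_team_zzBackups:/BorgBackup /mnt/GD \
    --vfs-cache-mode=writes \
    --cache-dir=/Volumes/zvol01e_apfs/zRcloneCacheVFS \
    --dir-cache-time 1h \
    --vfs-cache-max-age 24h \
    --drive-chunk-size 64M \
    --log-file /Users/madison/rclone.log \
    --log-level DEBUG \
    --debug-fuse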
Here we start the mount:
$ date; ./rclone_mount_VFS_only.sh; date
Fri Jun 21 14:25:21 PDT 2019
Fri Jun 21 14:30:47 PDT 2019
This means:
14:25:21 PDT -- the mount was started.
14:30:47 PDT -- the mount halted on its own on the FUSE side; I did not kill the mount.
rclone log started fresh with this mount invocation.
I can provide the actual log via PM on request.
Initialize borg backup...
$ borg init --encryption=keyfile-blake2 ::
Result: a few files are created in the local VFS cache and are successfully uploaded to Google Drive (G Suite).
We're seemingly off to a good start.
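I did not capture the exact listing, but an easy way to confirm the init really landed on the remote, independent of the mount, is rclone's own listing against the same remote, e.g.:
$ rclone ls gdrive_team_zzBackups:/BorgBackup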
Now, let's start an actual backup -- I run a script to do this.
The full script can be provided via private message; the core borg invocation is reconstructed from the traceback output further below.
$ ./borg_bak_proj_misc_tester.sh
Fri Jun 21 14:26:46 PDT 2019 Starting backup
Approximately 3 minutes later, the borg process halts.
borg errors out with "OSError: [Errno 6] Device not configured: '/mnt/GD/proj_misc_BorgBackup_start_20190621a/lock.roster'", as you can see in the excerpt below from borg's output.
If one tries an "ls" of the FUSE mount, e.g. "ls /mnt/GD", there is nothing there.
It's now an "empty" mount. Nothing but a ghost.
Here is the borg output when the issue occurs, presumably as it's trying to manage a "lock" file called "lock.roster", which is used by borg's locking system to manage shared and exclusive locks:
====================================================================
Exception ignored in: <bound method Repository.del of <Repository /mnt/GD/proj_misc_BorgBackup_start_20190621a>>
Traceback (most recent call last):
File "borg/repository.py", line 178, in del
File "borg/repository.py", line 427, in close
File "borg/locking.py", line 383, in release
File "borg/locking.py", line 274, in modify
File "borg/locking.py", line 233, in load
OSError: [Errno 6] Device not configured: '/mnt/GD/proj_misc_BorgBackup_start_20190621a/lock.roster'
Local Exception
Traceback (most recent call last):
File "borg/archiver.py", line 166, in wrapper
File "borg/archiver.py", line 573, in do_create
File "borg/archiver.py", line 536, in create_inner
File "borg/archiver.py", line 650, in _process
File "borg/archiver.py", line 650, in _process
File "borg/archiver.py", line 650, in _process
File "borg/archiver.py", line 624, in _process
File "borg/archive.py", line 1063, in process_file
File "borg/archive.py", line 991, in chunk_file
File "borg/archive.py", line 979, in chunk_processor
File "borg/cache.py", line 898, in add_chunk
File "borg/repository.py", line 1101, in put
File "borg/repository.py", line 1502, in write_put
File "borg/repository.py", line 1276, in get_write_fd
File "borg/repository.py", line 1332, in close_segment
File "borg/platform/base.py", line 138, in close
OSError: [Errno 57] Socket is not connected
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "borg/archiver.py", line 4501, in main
File "borg/archiver.py", line 4433, in run
File "borg/archiver.py", line 166, in wrapper
File "borg/repository.py", line 203, in exit
File "borg/repository.py", line 427, in close
File "borg/locking.py", line 383, in release
File "borg/locking.py", line 274, in modify
File "borg/locking.py", line 233, in load
OSError: [Errno 6] Device not configured: '/mnt/GD/proj_misc_BorgBackup_start_20190621a/lock.roster'
Platform: Darwin pov2.leti 18.6.0 Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64 x86_64
Borg: 1.1.10 Python: CPython 3.5.7 msgpack: 0.5.6
PID: 40793 CWD: /Users/madison
sys.argv: ['borg', 'create', '--debug', '--verbose', '--files-cache', 'ctime,size', '--filter', 'AME', '--list', '--stats', '--show-rc', '--compression', 'lz4', '--exclude-caches', '--exclude-from', '/Users/madison/BorgExcludes_proj_misc.txt', '--exclude-if-present', '.nobackup', '--one-file-system', '--checkpoint-interval', '600', '::{hostname}-{user}-{now}', '/Volumes/zvol01e_apfs/Backups/ChronoSync/proj_misc']
SSH_ORIGINAL_COMMAND: None
terminating with error status, rc 2
Fri Jun 21 14:29:12 PDT 2019 Pruning repository
Repository /mnt/GD/proj_misc_BorgBackup_start_20190621a does not exist.
terminating with error status, rc 2
Fri Jun 21 14:29:13 PDT 2019 Backup and/or Prune finished with errors
====================================================================
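For reference, so the backup script itself does not have to be shared, the borg invocation can be reconstructed from the sys.argv line in the traceback above. It is roughly the following, with BORG_REPO pointing at /mnt/GD/proj_misc_BorgBackup_start_20190621a:
$ borg create --debug --verbose \
    --files-cache ctime,size \
    --filter AME --list --stats --show-rc \
    --compression lz4 --exclude-caches \
    --exclude-from /Users/madison/BorgExcludes_proj_misc.txt \
    --exclude-if-present .nobackup \
    --one-file-system --checkpoint-interval 600 \
    '::{hostname}-{user}-{now}' \
    /Volumes/zvol01e_apfs/Backups/ChronoSync/proj_misc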
In the meantime, rclone continues to run until it has finished writing a 501 MB repository chunk (filename "2") to the VFS cache and uploading it successfully to Google Drive.
Only AFTER that transfer is finished (see logs) does FUSE seem to hit a problem.
Excerpt from rclone log that illustrates this:
====================================================================
2019/06/21 14:30:43 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 520093696 length 5331034
2019/06/21 14:30:47 INFO : proj_misc_BorgBackup_start_20190621a/data/0/2: Copied (new)
2019/06/21 14:30:47 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: transferred to remote
2019/06/21 14:30:47 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2(0xc00032cc00): >close: err=
2019/06/21 14:30:47 DEBUG : &{proj_misc_BorgBackup_start_20190621a/data/0/2 (rw)}: >Flush: err=
2019/06/21 14:30:47 DEBUG : fuse: -> [ID=0x8] Flush
2019/06/21 14:30:47 DEBUG : fuse: kernel write error: error="socket is not connected" stack=
goroutine 1566 [running]:
github.com/ncw/rclone/vendor/bazil.org/fuse.stack(...)
/Users/travis/gopath/src/github.com/ncw/rclone/vendor/bazil.org/fuse/debug.go:9
github.com/ncw/rclone/vendor/bazil.org/fuse.(*Conn).respond(0xc00032d260, 0xc0006a4350, 0x10, 0x10)
/Users/travis/gopath/src/github.com/ncw/rclone/vendor/bazil.org/fuse/fuse.go:1113 +0xde
github.com/ncw/rclone/vendor/bazil.org/fuse.(*Header).respond(0xc0003559f0, 0xc0006a4350, 0x10, 0x10)
/Users/travis/gopath/src/github.com/ncw/rclone/vendor/bazil.org/fuse/fuse.go:313 +0x61
github.com/ncw/rclone/vendor/bazil.org/fuse.(*FlushRequest).Respond(0xc0003559f0)
/Users/travis/gopath/src/github.com/ncw/rclone/vendor/bazil.org/fuse/fuse.go:2092 +0x6c
github.com/ncw/rclone/vendor/bazil.org/fuse/fs.(*Server).handleRequest(0xc0001e40c0, 0x2385960, 0xc003debac0, 0x235e7a0, 0xc000010030, 0xc003f9e680, 0x237b520, 0xc0003559f0, 0xc0016fdef8, 0xc0005c1701, ...)
/Users/travis/gopath/src/github.com/ncw/rclone/vendor/bazil.org/fuse/fs/serve.go:1287 +0x2b90
github.com
2019/06/21 14:30:47 DEBUG : rclone: Version "v1.48.0" finishing with parameters ["/usr/local/bin/rclone" "mount" "gdrive_team_zzBackups:/BorgBackup" "/mnt/GD" "--vfs-cache-mode=writes" "--cache-dir=/Volumes/zvol01e_apfs/zRcloneCacheVFS" "--log-file" "/Users/madison/rclone.log" "--log-level" "DEBUG" "--debug-fuse"]
====================================================================
OK, and from the above we can see that 14:30:47 is indeed when the mount script (rclone_mount_VFS_only.sh) exited. That's consistent with the activity in the log.
Where borg failed, however, at 14:29:12, I don't see much of note in the rclone log. Excerpt below:
2019/06/21 14:29:05 DEBUG : : Statfs:
2019/06/21 14:29:05 DEBUG : : >Statfs: stat={Blocks:4294967295 Bfree:4294967295 Bavail:4294967295 Files:1000000000 Ffree:1000000000 Bsize:4096 Namelen:255 Frsize:4096}, err=
2019/06/21 14:29:05 DEBUG : fuse: -> [ID=0xa] Statfs blocks=4294967295/4294967295/4294967295 files=1000000000/1000000000 bsize=4096 frsize=4096 namelen=255
2019/06/21 14:29:06 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 276824064 length 8388608
2019/06/21 14:29:09 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 285212672 length 8388608
2019/06/21 14:29:12 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 293601280 length 8388608
2019/06/21 14:29:15 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 301989888 length 8388608
2019/06/21 14:29:19 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 310378496 length 8388608
2019/06/21 14:29:22 DEBUG : Google drive root 'BorgBackup': Checking for changes on remote
2019/06/21 14:29:22 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: updateTime: setting atime to 2019-06-21 14:29:22.279461742 -0700 PDT
2019/06/21 14:29:22 INFO : Cleaned the cache: objects 18 (was 18), total size 501.130M (was 501.130M)
2019/06/21 14:29:22 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 318767104 length 8388608
2019/06/21 14:29:26 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 327155712 length 8388608
That's the end of the test.
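For anyone who wants more context around these excerpts, the relevant windows can be pulled from the full debug log by timestamp, e.g.:
$ grep '2019/06/21 14:29' /Users/madison/rclone.log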
Other considerations:
CPU "wake-ups"
I've seen the macOS complaint about rclone waking up the CPU too much, but haven't been able to correlate this with the above behavior. BTW, I've also seen macOS complain about the same sort of thing from Google Drive File Stream.
"Event: wakeups
Action taken: none
Wakeups: 45001 wakeups over the last 155 seconds (291 wakeups per second average), exceeding limit of 150 wakeups per second over 300 seconds
"
See also:
The behavior in the test I'm reporting seems to be similar to the following posting:
In other macOS logging, I do see references to borg (found by opening the macOS Console app and searching for "borg"); however, these might be false flags -- symptoms rather than causes, though I could be wrong. Since there is a macOS complaint about a sandbox policy violation in the log below, I temporarily disabled SIP (System Integrity Protection) to see if that changed things. It did not.
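As an aside on the SIP experiment: SIP has to be toggled from the recovery environment, and the current state can be confirmed before a re-run with the stock tool:
$ csrutil status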
====================================================================
error 14:25:38.016290 -0700 borg nw_path_close_fd Failed to close guarded necp fd 4 [9: Bad file descriptor]
error 14:25:50.384942 -0700 borg nw_path_close_fd Failed to close guarded necp fd 4 [9: Bad file descriptor]
error 14:26:47.395986 -0700 borg nw_path_close_fd Failed to close guarded necp fd 4 [9: Bad file descriptor]
error 14:26:53.919401 -0700 borg nw_path_close_fd Failed to close guarded necp fd 4 [9: Bad file descriptor]
error 14:29:13.580160 -0700 borg nw_path_close_fd Failed to close guarded necp fd 4 [9: Bad file descriptor]
error 14:29:13.950054 -0700 sandboxd Sandbox: borg(40944) System Policy: deny(6) file-read-metadata /mnt/GD/proj_misc_BorgBackup_start_20190621a
Violation: System Policy: deny(6) file-read-metadata /mnt/GD/proj_misc_BorgBackup_start_20190621a
Process: borg [40944]
Path: /usr/local/bin/borg
Load Address: 0x109ecd000
Identifier: borg
Version: ??? (???)
Code Type: x86_64 (Native)
Parent Process: borg [40943]
Responsible: /Applications/Utilities/Terminal.app/Contents/MacOS/Terminal [848]
User ID: 29200
Date/Time: 2019-06-21 14:29:13.940 PDT
OS Version: Mac OS X 10.14.5 (18F132)
Report Version: 8
MetaData: {"target":"/mnt/GD/proj_misc_BorgBackup_start_20190621a","platform-policy":true,"operation":"file-read-metadata","platform_binary":"no","profile-flags":0,"uid":29200,"pid":40944,"normalized_target":["mnt","GD","proj_misc_BorgBackup_start_20190621a"],"platform-binary":false,"profile":"platform","hardware":"Mac","errno":6,"summary":"deny(6) file-read-metadata /mnt/GD/proj_misc_BorgBackup_start_20190621a","primary-filter":"path","process":"borg","action":"deny","path":"/mnt/GD/proj_misc_BorgBackup_start_20190621a","build":"Mac OS X 10.14.5 (18F132)","process-path":"/usr/local/bin/borg","responsible-process-path":"/Applications/Utilities/Terminal.app/Contents/MacOS/Terminal","flags":5,"responsible-process-pid":848,"primary-filter-value":"/mnt/GD/proj_misc_BorgBackup_start_20190621a"}
Thread 0 (id: 198982):
0 libsystem_kernel.dylib 0x00007fff7e55eeda stat$INODE64 + 10
1 Python 0x000000010a7323c1 os_stat + 177
2 Python 0x000000010a66eb5c PyCFunction_Call + 60
3 Python 0x000000010a6e8efb PyEval_EvalFrameEx + 24059
4 Python 0x000000010a6ee0e0 fast_function + 288
5 Python 0x000000010a6e8dc5 PyEval_EvalFrameEx + 23749
6 Python 0x000000010a6ed8bb _PyEval_EvalCodeWithName + 3227
7 Python 0x000000010a6ee162 fast_function + 418
8 Python 0x000000010a6e8dc5 PyEval_EvalFrameEx + 23749
9 Python 0x000000010a6ed8bb _PyEval_EvalCodeWithName + 3227
10 Python 0x000000010a6e30de PyEval_EvalCodeEx + 78
11 Python 0x000000010a651170 function_call + 384
12 Python 0x000000010a629261 PyObject_Call + 97
13 Python 0x000000010a63d78d method_call + 141
14 Python 0x000000010a629261 PyObject_Call + 97
15 Python 0x000000010a6243d9 PyObject_CallFunctionObjArgs + 393
16 Python 0x000000010a6e86c6 PyEval_EvalFrameEx + 21958
17 Python 0x000000010a6ed8bb _PyEval_EvalCodeWithName + 3227
18 Python 0x000000010a6ee162 fast_function + 418
19 Python 0x000000010a6e8dc5 PyEval_EvalFrameEx + 23749
20 Python 0x000000010a6ee0e0 fast_function + 288
21 Python 0x000000010a6e8dc5 PyEval_EvalFrameEx + 23749
22 Python 0x000000010a6ee0e0 fast_function + 288
23 Python 0x000000010a6e8dc5 PyEval_EvalFrameEx + 23749
24 Python 0x000000010a6ed8bb _PyEval_EvalCodeWithName + 3227
25 Python 0x000000010a6e3081 PyEval_EvalCode + 81
26 borg 0x0000000109ece8d0
27 borg 0x0000000109ecedf9
28 borg 0x0000000109ecd8a4
29 borg 0x000000000000000c
Binary Images:
0x109ecd000 - 0x109ed2ff7 borg <55b994e1-1614-3e85-b704-22c6158bb3d8> /usr/local/bin/borg
0x10a61b000 - 0x10a7c5ff7 Python (0) <053a97b6-179c-356e-acaf-1ec4b8d3c023> /var/folders/64/d3prsg9d4t3dz9g6mtyc5gmh000whh/T/_MEISUDpd6/Python
0x7fff7e55c000 - 0x7fff7e584ff7 libsystem_kernel.dylib (4903.261.4) <7b0f52c4-4cc9-3282-8457-c18c6fe1b99e> /usr/lib/system/libsystem_kernel.dylib
====================================================================
Any ideas for making this arrangement work would be appreciated. A big thanks to Nick, who helped answer questions getting to this point.
Thanks!
-- madison