macOS FUSE mount contents disappear after writes (while using VFS cache)

Hi,

I ran into the following problem when using an rclone FUSE mount as a back-end for the borg backup program with the VFS cache enabled. The problem may exist without the cache as well, but I was not able to test my particular application without the VFS cache.

Some background: I can do this backup operation fairly reliably using Google Drive File Stream, so it would be great if we could get things reliable with rclone mount as well.

Here's what I found using rclone with FUSE mounts when trying a borg backup.
I tried different caching strategies, but the same basic issue below persists; it is illustrated here with the VFS cache:

PROBLEM:
After successfully writing some smaller files and then a larger 501 MB file, the osxfuse mount's contents disappear after a few minutes, and borg is confused as to why everything has vanished.

The basic error seems to be:
"OSError: [Errno 6] Device not configured:"

After the error, the OS still shows the mount point
$ mount
gdrive_team_zzBackups:BorgBackup on /mnt/GD (osxfuse, nodev, nosuid, synchronous, mounted by xxx)

Yet nothing is there any longer:
$ ls /mnt/GD
ls: GD: Device not configured

We can cd into the directory and see nothing is there...
$ cd /mnt/GD
$ ls
.

Note that the kext is still loaded... e.g.
$ kextstat | grep osxfuse
166 0 0xffffff7f84efe000 0x19000 0x19000 com.github.osxfuse.filesystems.osxfuse (3.9.2) DA35320C-B016-3B0A-95A5-859B96D5852C <8 6 5 3 1>
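
(Incidentally, once the mount is in this state, one of the standard macOS forced-unmount commands should usually clear it before retrying -- e.g. either of:)
$ sudo umount -f /mnt/GD
$ diskutil unmount force /mnt/GD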

BASIC SYSTEM INFO:
macOS 10.14.5 Mojave (latest at moment of writing)
OSXFuse 3.9.2 (latest at moment writing)
-- I also tried with OSXFuse 3.8.3 to see if the behavior was different. Behavior is same.
rclone v1.48.0

  • os/arch: darwin/amd64
  • go version: go1.12.6

rclone is set up using the VFS write cache.

I am NOT using the rclone backend cache for this particular test.
As you can see below, I'm basically using default values.
The test appears to be 100% reproducible.

Let's walk through the test...

Mount command contained in script (rclone_mount_VFS_only.sh) below:
#!/bin/sh
/usr/local/bin/rclone mount \
  gdrive_team_zzBackups:/BorgBackup /mnt/GD \
  --vfs-cache-mode=writes \
  --cache-dir=/Volumes/zvol01e_apfs/zRcloneCacheVFS \
  --log-file /Users/madison/rclone.log \
  --log-level DEBUG \
  --debug-fuse

As you can see, both rclone and FUSE are set for debug logging.

Here we start the mount:

$ date; ./rclone_mount_VFS_only.sh; date
Fri Jun 21 14:25:21 PDT 2019
Fri Jun 21 14:30:47 PDT 2019

This means:
14:25:21 PDT: the mount was started.
14:30:47 PDT: the mount was halted on its own by FUSE -- I did not kill the mount.

The rclone log was started fresh with this mount invocation.
I can provide the actual log via PM on request.

First, initialize the borg repository...

$ borg init --encryption=keyfile-blake2 ::

Result: A few files are created in the local VFS cache and successfully uploaded to Google Drive (G Suite).
We're seemingly off to a good start.

Now, let's start an actual backup -- I run a script to do this.
Contents can be provided via private message.
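
For reference, the core borg create invocation inside the script can be read straight off the sys.argv line in the traceback further below. Roughly (a sketch; it assumes BORG_REPO is exported to point at the repository under /mnt/GD, which is how the bare "::" works):

$ borg create --debug --verbose \
    --files-cache ctime,size \
    --filter AME --list --stats --show-rc \
    --compression lz4 \
    --exclude-caches \
    --exclude-from /Users/madison/BorgExcludes_proj_misc.txt \
    --exclude-if-present .nobackup \
    --one-file-system \
    --checkpoint-interval 600 \
    '::{hostname}-{user}-{now}' \
    /Volumes/zvol01e_apfs/Backups/ChronoSync/proj_misc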

$ ./borg_bak_proj_misc_tester.sh

Fri Jun 21 14:26:46 PDT 2019 Starting backup

Approximately 3 minutes later, the borg process halts.

borg errors out with "OSError: [Errno 6] Device not configured: '/mnt/GD/proj_misc_BorgBackup_start_20190621a/lock.roster'", as you can see in the excerpt below from borg's output.

If one tries to do an "ls" of the FUSE mount, e.g. "ls /mnt/GD", there is nothing there.
It's now an "empty" mount. Nothing but a ghost.

Here is the borg output when there is an issue, presumably as it's trying to manage a "lock" file called "lock.roster", which is used by borg's locking system to manage shared and exclusive locks:

====================================================================
Exception ignored in: <bound method Repository.__del__ of <Repository /mnt/GD/proj_misc_BorgBackup_start_20190621a>>
Traceback (most recent call last):
File "borg/repository.py", line 178, in del
File "borg/repository.py", line 427, in close
File "borg/locking.py", line 383, in release
File "borg/locking.py", line 274, in modify
File "borg/locking.py", line 233, in load
OSError: [Errno 6] Device not configured: '/mnt/GD/proj_misc_BorgBackup_start_20190621a/lock.roster'
Local Exception
Traceback (most recent call last):
File "borg/archiver.py", line 166, in wrapper
File "borg/archiver.py", line 573, in do_create
File "borg/archiver.py", line 536, in create_inner
File "borg/archiver.py", line 650, in _process
File "borg/archiver.py", line 650, in _process
File "borg/archiver.py", line 650, in _process
File "borg/archiver.py", line 624, in _process
File "borg/archive.py", line 1063, in process_file
File "borg/archive.py", line 991, in chunk_file
File "borg/archive.py", line 979, in chunk_processor
File "borg/cache.py", line 898, in add_chunk
File "borg/repository.py", line 1101, in put
File "borg/repository.py", line 1502, in write_put
File "borg/repository.py", line 1276, in get_write_fd
File "borg/repository.py", line 1332, in close_segment
File "borg/platform/base.py", line 138, in close
OSError: [Errno 57] Socket is not connected

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "borg/archiver.py", line 4501, in main
File "borg/archiver.py", line 4433, in run
File "borg/archiver.py", line 166, in wrapper
File "borg/repository.py", line 203, in exit
File "borg/repository.py", line 427, in close
File "borg/locking.py", line 383, in release
File "borg/locking.py", line 274, in modify
File "borg/locking.py", line 233, in load
OSError: [Errno 6] Device not configured: '/mnt/GD/proj_misc_BorgBackup_start_20190621a/lock.roster'

Platform: Darwin pov2.leti 18.6.0 Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64 x86_64
Borg: 1.1.10 Python: CPython 3.5.7 msgpack: 0.5.6
PID: 40793 CWD: /Users/madison
sys.argv: ['borg', 'create', '--debug', '--verbose', '--files-cache', 'ctime,size', '--filter', 'AME', '--list', '--stats', '--show-rc', '--compression', 'lz4', '--exclude-caches', '--exclude-from', '/Users/madison/BorgExcludes_proj_misc.txt', '--exclude-if-present', '.nobackup', '--one-file-system', '--checkpoint-interval', '600', '::{hostname}-{user}-{now}', '/Volumes/zvol01e_apfs/Backups/ChronoSync/proj_misc']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2

Fri Jun 21 14:29:12 PDT 2019 Pruning repository

Repository /mnt/GD/proj_misc_BorgBackup_start_20190621a does not exist.
terminating with error status, rc 2

Fri Jun 21 14:29:13 PDT 2019 Backup and/or Prune finished with errors

====================================================================

In the meantime, rclone continues to run until the 501 MB repository chunk (filename "2") that borg wrote to the VFS cache has been successfully uploaded to Google Drive.

AFTER the transaction is finished (see logs) -- that's when FUSE seems to have a problem.
Excerpt from rclone log that illustrates this:

====================================================================
2019/06/21 14:30:43 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 520093696 length 5331034
2019/06/21 14:30:47 INFO : proj_misc_BorgBackup_start_20190621a/data/0/2: Copied (new)
2019/06/21 14:30:47 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: transferred to remote
2019/06/21 14:30:47 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2(0xc00032cc00): >close: err=
2019/06/21 14:30:47 DEBUG : &{proj_misc_BorgBackup_start_20190621a/data/0/2 (rw)}: >Flush: err=
2019/06/21 14:30:47 DEBUG : fuse: -> [ID=0x8] Flush
2019/06/21 14:30:47 DEBUG : fuse: kernel write error: error="socket is not connected" stack=
goroutine 1566 [running]:
github.com/ncw/rclone/vendor/bazil.org/fuse.stack(...)
/Users/travis/gopath/src/github.com/ncw/rclone/vendor/bazil.org/fuse/debug.go:9
github.com/ncw/rclone/vendor/bazil.org/fuse.(*Conn).respond(0xc00032d260, 0xc0006a4350, 0x10, 0x10)
/Users/travis/gopath/src/github.com/ncw/rclone/vendor/bazil.org/fuse/fuse.go:1113 +0xde
github.com/ncw/rclone/vendor/bazil.org/fuse.(*Header).respond(0xc0003559f0, 0xc0006a4350, 0x10, 0x10)
/Users/travis/gopath/src/github.com/ncw/rclone/vendor/bazil.org/fuse/fuse.go:313 +0x61
github.com/ncw/rclone/vendor/bazil.org/fuse.(*FlushRequest).Respond(0xc0003559f0)
/Users/travis/gopath/src/github.com/ncw/rclone/vendor/bazil.org/fuse/fuse.go:2092 +0x6c
github.com/ncw/rclone/vendor/bazil.org/fuse/fs.(*Server).handleRequest(0xc0001e40c0, 0x2385960, 0xc003debac0, 0x235e7a0, 0xc000010030, 0xc003f9e680, 0x237b520, 0xc0003559f0, 0xc0016fdef8, 0xc0005c1701, ...)
/Users/travis/gopath/src/github.com/ncw/rclone/vendor/bazil.org/fuse/fs/serve.go:1287 +0x2b90
github.com
2019/06/21 14:30:47 DEBUG : rclone: Version "v1.48.0" finishing with parameters ["/usr/local/bin/rclone" "mount" "gdrive_team_zzBackups:/BorgBackup" "/mnt/GD" "--vfs-cache-mode=writes" "--cache-dir=/Volumes/zvol01e_apfs/zRcloneCacheVFS" "--log-file" "/Users/madison/rclone.log" "--log-level" "DEBUG" "--debug-fuse"]

OK, and from the above we can see that 14:30:47 is indeed when the mount script (rclone_mount_VFS_only.sh) exited. That's consistent with the activity in the log.

Where borg failed, however, at 14:29:12, I don't see much of note in the rclone log. Excerpt below:

2019/06/21 14:29:05 DEBUG : : Statfs:
2019/06/21 14:29:05 DEBUG : : >Statfs: stat={Blocks:4294967295 Bfree:4294967295 Bavail:4294967295 Files:1000000000 Ffree:1000000000 Bsize:4096 Namelen:255 Frsize:4096}, err=
2019/06/21 14:29:05 DEBUG : fuse: -> [ID=0xa] Statfs blocks=4294967295/4294967295/4294967295 files=1000000000/1000000000 bsize=4096 frsize=4096 namelen=255
2019/06/21 14:29:06 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 276824064 length 8388608
2019/06/21 14:29:09 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 285212672 length 8388608
2019/06/21 14:29:12 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 293601280 length 8388608
2019/06/21 14:29:15 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 301989888 length 8388608
2019/06/21 14:29:19 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 310378496 length 8388608
2019/06/21 14:29:22 DEBUG : Google drive root 'BorgBackup': Checking for changes on remote
2019/06/21 14:29:22 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: updateTime: setting atime to 2019-06-21 14:29:22.279461742 -0700 PDT
2019/06/21 14:29:22 INFO : Cleaned the cache: objects 18 (was 18), total size 501.130M (was 501.130M)
2019/06/21 14:29:22 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 318767104 length 8388608
2019/06/21 14:29:26 DEBUG : proj_misc_BorgBackup_start_20190621a/data/0/2: Sending chunk 327155712 length 8388608

That's the end of the test.

Other considerations:

CPU "wake-ups"
I've seen the macOS complaint about rclone waking up the CPU too much, but haven't been able to correlate this with the above behavior. BTW, I've also seen macOS complain about the same sort of thing from Google Drive File Stream.

"Event: wakeups
Action taken: none
Wakeups: 45001 wakeups over the last 155 seconds (291 wakeups per second average), exceeding limit of 150 wakeups per second over 300 seconds
"
See also:

The behavior in the test I'm reporting seems to be similar to the following posting:

In other macOS logging, I do see references to borg in the logs found by opening the Mac Console app and searching for "borg"; however, these might be false flags (symptoms rather than causes, though I could be wrong). Since there seems to be a macOS complaint about a sandbox security issue in the log below, I temporarily disabled SIP (System Integrity Protection) to see if that changed things. It did not.

====================================================================
error 14:25:38.016290 -0700 borg nw_path_close_fd Failed to close guarded necp fd 4 [9: Bad file descriptor]
error 14:25:50.384942 -0700 borg nw_path_close_fd Failed to close guarded necp fd 4 [9: Bad file descriptor]
error 14:26:47.395986 -0700 borg nw_path_close_fd Failed to close guarded necp fd 4 [9: Bad file descriptor]
error 14:26:53.919401 -0700 borg nw_path_close_fd Failed to close guarded necp fd 4 [9: Bad file descriptor]
error 14:29:13.580160 -0700 borg nw_path_close_fd Failed to close guarded necp fd 4 [9: Bad file descriptor]
error 14:29:13.950054 -0700 sandboxd Sandbox: borg(40944) System Policy: deny(6) file-read-metadata /mnt/GD/proj_misc_BorgBackup_start_20190621a
Violation: System Policy: deny(6) file-read-metadata /mnt/GD/proj_misc_BorgBackup_start_20190621a
Process: borg [40944]
Path: /usr/local/bin/borg
Load Address: 0x109ecd000
Identifier: borg
Version: ??? (???)
Code Type: x86_64 (Native)
Parent Process: borg [40943]
Responsible: /Applications/Utilities/Terminal.app/Contents/MacOS/Terminal [848]
User ID: 29200

Date/Time: 2019-06-21 14:29:13.940 PDT
OS Version: Mac OS X 10.14.5 (18F132)
Report Version: 8

MetaData: {"target":"/mnt/GD/proj_misc_BorgBackup_start_20190621a","platform-policy":true,"operation":"file-read-metadata","platform_binary":"no","profile-flags":0,"uid":29200,"pid":40944,"normalized_target":["mnt","GD","proj_misc_BorgBackup_start_20190621a"],"platform-binary":false,"profile":"platform","hardware":"Mac","errno":6,"summary":"deny(6) file-read-metadata /mnt/GD/proj_misc_BorgBackup_start_20190621a","primary-filter":"path","process":"borg","action":"deny","path":"/mnt/GD/proj_misc_BorgBackup_start_20190621a","build":"Mac OS X 10.14.5 (18F132)","process-path":"/usr/local/bin/borg","responsible-process-path":"/Applications/Utilities/Terminal.app/Contents/MacOS/Terminal","flags":5,"responsible-process-pid":848,"primary-filter-value":"/mnt/GD/proj_misc_BorgBackup_start_20190621a"}

Thread 0 (id: 198982):
0 libsystem_kernel.dylib 0x00007fff7e55eeda stat$INODE64 + 10
1 Python 0x000000010a7323c1 os_stat + 177
2 Python 0x000000010a66eb5c PyCFunction_Call + 60
3 Python 0x000000010a6e8efb PyEval_EvalFrameEx + 24059
4 Python 0x000000010a6ee0e0 fast_function + 288
5 Python 0x000000010a6e8dc5 PyEval_EvalFrameEx + 23749
6 Python 0x000000010a6ed8bb _PyEval_EvalCodeWithName + 3227
7 Python 0x000000010a6ee162 fast_function + 418
8 Python 0x000000010a6e8dc5 PyEval_EvalFrameEx + 23749
9 Python 0x000000010a6ed8bb _PyEval_EvalCodeWithName + 3227
10 Python 0x000000010a6e30de PyEval_EvalCodeEx + 78
11 Python 0x000000010a651170 function_call + 384
12 Python 0x000000010a629261 PyObject_Call + 97
13 Python 0x000000010a63d78d method_call + 141
14 Python 0x000000010a629261 PyObject_Call + 97
15 Python 0x000000010a6243d9 PyObject_CallFunctionObjArgs + 393
16 Python 0x000000010a6e86c6 PyEval_EvalFrameEx + 21958
17 Python 0x000000010a6ed8bb _PyEval_EvalCodeWithName + 3227
18 Python 0x000000010a6ee162 fast_function + 418
19 Python 0x000000010a6e8dc5 PyEval_EvalFrameEx + 23749
20 Python 0x000000010a6ee0e0 fast_function + 288
21 Python 0x000000010a6e8dc5 PyEval_EvalFrameEx + 23749
22 Python 0x000000010a6ee0e0 fast_function + 288
23 Python 0x000000010a6e8dc5 PyEval_EvalFrameEx + 23749
24 Python 0x000000010a6ed8bb _PyEval_EvalCodeWithName + 3227
25 Python 0x000000010a6e3081 PyEval_EvalCode + 81
26 borg 0x0000000109ece8d0
27 borg 0x0000000109ecedf9
28 borg 0x0000000109ecd8a4
29 borg 0x000000000000000c

Binary Images:
0x109ecd000 - 0x109ed2ff7 borg <55b994e1-1614-3e85-b704-22c6158bb3d8> /usr/local/bin/borg
0x10a61b000 - 0x10a7c5ff7 Python (0) <053a97b6-179c-356e-acaf-1ec4b8d3c023> /var/folders/64/d3prsg9d4t3dz9g6mtyc5gmh000whh/T/_MEISUDpd6/Python
0x7fff7e55c000 - 0x7fff7e584ff7 libsystem_kernel.dylib (4903.261.4) <7b0f52c4-4cc9-3282-8457-c18c6fe1b99e> /usr/lib/system/libsystem_kernel.dylib

Any ideas for making this arrangement work are appreciated. A big thanks to Nick, who helped answer questions getting to this point.

Thanks!

-- madison

Further testing without borg, just copying files in the macOS Finder, has made some behaviors apparent.

I am able to copy smaller files (seemingly under ~279 MB) without problem if they have no xattrs associated with them. Larger files, or files with xattrs, produce errors; both throw this up in the Finder GUI:
"The operation can’t be completed because an unexpected error occurred (error code 100006)."

When I look up this error code, it's really a POSIX error, corresponding to:
kPOSIXErrorENXIO = 100006, /* Device not configured */

These errors are perhaps caused by the "DesktopServicesHelper" part of the macOS sandbox architecture. For example, by default the sandbox daemon won't allow "file-read-metadata" (in this case, reading xattrs) on /mnt directories. The behavior seems different if I set up the mount point in a home directory: I'm then able to copy smaller files with xattrs.
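
If xattrs are indeed the trigger, it's easy to check for them and to copy a file without them from Terminal. A quick sketch using the stock macOS tools (the file path here is hypothetical):

$ xattr -l /Volumes/zvol01e_apfs/somefile        # list any extended attributes on the source file
$ cp -X /Volumes/zvol01e_apfs/somefile /mnt/GD/  # -X tells the macOS cp not to copy xattrs or resource forks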

In addition, mounting the volume as root or enabling "--allow-other" may be necessary to satisfy the "DesktopServicesHelper" or "System Policy" checks enforced by the sandbox daemon (sandboxd).

See the following...
https://osxfuse.github.io/2015/09/25/OSXFUSE-2.8.1.html

I was able to use the rclone mount option "--allow-other" only after changing the osxfuse admin group to the GID of my primary group.

Found that info in the osxfuse mount options:

So, in my case:
$ sudo sysctl -w vfs.generic.osxfuse.tunables.admin_group=29222 # set
vfs.generic.osxfuse.tunables.admin_group: 80 -> 29222
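
In other words (a sketch driven from your own primary GID -- the sysctl name is the one shown above; note the setting may not persist across reboots):

$ id -g                                                        # numeric GID of your primary group (29222 in my case)
$ sudo sysctl -w vfs.generic.osxfuse.tunables.admin_group=$(id -g)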

IMPORTANT TO NOTE -- I am able to copy large files through the command line in Terminal.
These same files (400-500+ MB) I am NOT able to copy through the Finder; those copies still fail with "error code 100006" as above. This is despite changing the osxfuse admin group and keeping the mount point inside my user home directory.

Also, to be clear, the xattrs are not copied over to the Google Drive destination, but without this set up correctly I can't even copy the bare file without its extended attributes.

Try setting the daemon timeout parameter for OSXFUSE

  --daemon-timeout duration                Time limit for rclone to respond to kernel (not supported by all OSes).

If this is the problem then it will be fixed by: https://github.com/ncw/rclone/issues/3186

The default for that is off (which means use the kernel default, which I think is 30s), but it should probably be 5 minutes or so under macOS.

Aha! That fixed it.

I added "--daemon-timeout 30m".
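
For reference, the adjusted mount script looks roughly like this -- a sketch, just the original script from above with the timeout flag added:

#!/bin/sh
/usr/local/bin/rclone mount \
  gdrive_team_zzBackups:/BorgBackup /mnt/GD \
  --vfs-cache-mode=writes \
  --cache-dir=/Volumes/zvol01e_apfs/zRcloneCacheVFS \
  --daemon-timeout 30m \
  --log-file /Users/madison/rclone.log \
  --log-level DEBUG \
  --debug-fuse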

Borg is happily backing up. When it finishes writing its next backup segment (~ 500 MB), it waits for the upload to finish, then continues analyzing/backing up more data, then it waits for the next 500MB to upload. I will have to try a smaller dataset, a formal restore, etc.

If you add the #3186 functionality of a background upload queue to the VFS cache, then yes, that will make this really useful. Borg can then just back up to its heart's content, writing a lot of local data, and it will all be uploaded in the background. You could even start the next day's backup, so to speak, against the temporary local cache copy, and eventually everything makes its way to the cloud until the differentials are very small.

I will also try the VFS cache + backend cache as an experiment, though I'd hope not to have to run both caches once #3186 arrives.

In the meantime, I can probably use Google Drive File Stream if the VFS cache + backend cache combo runs into difficulties.

Thanks again. I'll report back as to how some of the other tests go.

BTW, some of the issues I've tackled in the setup for Mac systems should probably be written up in a HOWTO of sorts. Is there a doc anywhere that mac-specific material could be added to?

Great - glad that works :smiley:

There are some docs about Windows in the mount docs - perhaps a section for macOS would be appropriate?

I think I should also change the default for DaemonTimeout to 5 minutes (say). What do you think?

Would you like to propose a pull request for one or both of those? Or just writing some docs in this thread and I'll paste them into the docs? I think a paragraph or two would be fine. Anything more than that should probably be linked to in the rclone wiki.

Me too. :wink:

I did in fact try a test with the VFS cache (writes) + backend cache, and it worked for a while but eventually ran into difficulties. I'll put testing of that aside for the moment, try to complete some smaller tests with borg using the VFS cache only to see if it can fully complete and restore some backups, and report back. I was not able to do much of this over the past couple of days.

About the daemon timeout -- I think something larger than 5 minutes might be in order.
For example, say you were transferring a 500 MB file over a 10 Mbit/s upstream connection (my upstream is faster than that, but that might be typical for a random user). That transfer would take more than 6 minutes to complete. So my gut says at least 10 minutes, but you have more experience with your users' use-cases than I do.
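
Rough arithmetic behind that estimate (ignoring protocol overhead):
500 MB x 8 bits/byte = 4000 Mbit
4000 Mbit / 10 Mbit/s = 400 s, i.e. roughly 6.7 minutes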

I'll start writing up some notes for macOS as regards mount, and then let's see what it looks like -- if more appropriate for wiki or not. Thanks for your patience!

Yes you are probably right...

Can you try this, which should have the default set to 15 minutes for macOS? If you run rclone mount -h | grep daemon you should see it there.

https://beta.rclone.org/branch/v1.48.0-011-g704201d1-fix-daemon-timeout-beta/ (uploaded in 15-30 mins)

Great thanks!

Sure, Nick, I will try that build today, and get back to you.

Yes, indeed "rclone v1.48.0-011-g704201d1-fix-daemon-timeout-beta" works.

I thought this was going to be a quick test, but it was hard to even remember the behavior under the various conditions -- Terminal, Finder, etc.

BTW, as we probably already know:

In macOS Terminal:

  • No cache (VFS or backend), current release (v1.48.0) with the smaller daemon timeout default:
    can successfully copy large (> 500 MB) files.
  • VFS cache, current release (v1.48.0) with the smaller daemon timeout default:
    can NOT copy large (> 500 MB) files.
  • VFS cache, beta (v1.48.0-011-g704201d1-fix-daemon-timeout-beta) with the larger default timeout:
    can successfully copy large (> 500 MB) files without changing the default.

Also, in the Finder with the VFS cache enabled, the beta handles 500+ MB copies fine.
With the release, those copies wind up with the "error code 100006" message.

Thanks again!

Thank you for testing :smiley:

I've merged this to master now which means it will be in the latest beta in 15-30 mins and released in v1.49

I'll merge any notes on macOS separately.

Great, thanks again.

I think it's going to take me a little while to write something up, and I'm sure I'll have some questions for you or other macOS users on this forum.

For example, I had to set myself up yet again as a member of the osxfuse admin group in order to use "--allow-other" and successfully mount the volume. Perhaps the way I did it was not persistent across reboots? Will need to test. Either that, or the mount had to be done as root, which brings other issues with it.
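
One quick way to check whether the setting survived a reboot (same sysctl name as in the earlier post):

$ sysctl vfs.generic.osxfuse.tunables.admin_group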

In any case, I'm looking forward to getting a lot of these gotchas outta the way for myself and others.

