Mount over sftp and --vfs-cache-mode full causing data corruption

What is the problem you are having with rclone?

When mounting SSH/SFTP I occasionally see my files cannot be written or read anymore and they end up being corrupted. Especially when tagging mp3 files that were read ahead and loaded by a tool. I've been experimenting with --vfs-read-ahead and --backup-dir, but the former didn't help much and the latter isn't available for mount. Is there any other way to secure files from being corrupted / deleted?

What is your rclone version (output from rclone version)

1.53.1

Which OS you are using and how many bits (eg Windows 7, 64 bit)

macOS 10.15.17

Which cloud storage system are you using? (eg Google Drive)

SSH (own rpi storage)

The command you were trying to run (eg rclone copy /tmp remote:tmp)

./rclone cmount --daemon --vfs-cache-mode full --cache-dir cache --vfs-cache-max-age 120h --vfs-cache-poll-interval 15m --vfs-cache-max-size 1g --vfs-read-ahead 50m --vfs-write-wait 10s --log-file log.txt --log-level ERROR [ssh mount comes here]

A log from the command with the -vv flag

I see a series of those errors:

2020/10/18 19:16:30 ERROR : sounds_zero/34.mp3: vfs cache: failed to kick waiters: vfs reader: failed to write to cache file: Open failed: file does not exist
2020/10/18 19:16:30 ERROR : sounds_zero/34.mp3: vfs cache: failed to _ensure cache vfs reader: failed to write to cache file: Open failed: file does not exist
2020/10/18 19:16:30 ERROR : sounds_zero/34.mp3: vfs cache: failed to download: vfs reader: failed to write to cache file: Open failed: file does not exist

Can you try the latest beta and see if you can replicate the problem there?

If you can I'd really like to see a log with -vv showing the problem.

I've been testing it a bit today and yesterday and... so far so good. Can't replicate the error. I'll keep an eye on it and get back with more details if it's back.
PS. Testing v1.54.0-beta.4831.4e2b5389d

Thanks for testing - I'd like to see logs with -vv if it goes wrong again - thanks!

OK... so I managed to break it again. The context:

  • sftp is mounted to a folder on OS X
  • mp3 files there were tagged and saved
  • after that they were renamed / moved to a new folder within same mount

These files disappeared: 01-02-taylor deupree-northern.mp3 / 01-05-taylor deupree-haze it may be.mp3. The 2nd was rescued from the cache. The 1st is in the cache too, but corrupted: looks like its size is correct, but it's all empty (full of \x00??).

-vv log (a bit sanitized and zipped) available at ht_ps://dropfiles.org/O31ulvig

Thanks in advance for looking at it.

Thanks, that is a really useful log...

Here is what I think happened

The file was written and queued for upload

2020/10/24 19:16:25 INFO  : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: vfs cache: queuing for upload in 5s

The upload then started 5 seconds later

2020/10/24 19:16:30 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: vfs cache: starting upload

However something then opens the file for write

2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: Open: flags=O_RDWR
2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: newRWFileHandle: 
2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: >newRWFileHandle: err=<nil>
2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: >Open: fd=sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3 (rw), err=<nil>
2020/10/24 19:16:38 DEBUG : /sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: >OpenFile: fd=sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3 (rw), err=<nil>
2020/10/24 19:16:38 DEBUG : /sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: >OpenEx: errc=0, fh=0x1

And writes some data at the start of the file

2020/10/24 19:16:38 DEBUG : /sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: Write: ofst=0, fh=0x1
2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3(0xc001451140): _writeAt: size=4096, off=0
2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3(0xc001451140): openPending:

Meanwhile the upload is still running writing data to the actual file.

Now rclone checks the fingerprint of the remote file. This causes the server to run the md5sum command on the file. The remote file is in the process of uploading so the md5sum of the remote file has changed even though the upload isn't complete (this is unlike any other backend except local)

2020/10/24 19:16:38 DEBUG : sftp cmd = /something/sounds_zero/taylor\ deupree\ -\ northern/01-02-taylor\ deupree-northern.mp3
2020/10/24 19:16:38 DEBUG : sftp output = "b4febee818cadf8311e3e99bbcee7d7b  /something/sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3\n"
2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: vfs cache: checking remote fingerprint "23165404,2018-01-27 22:24:35 +0000 UTC,b4febee818cadf8311e3e99bbcee7d7b" against cached fingerprint "23165404,2018-01-27 22:24:35 +0000 UTC,4224518601e549258e053e6f180bae1e"

Now rclone sees that the remote has changed so the local data is out of date so deletes it. This is questionable - what rclone has detected here is a conflict - the local data got changed and (it thinks) the remote data got changed too.

I think this is the fundamental cause of the problem.

2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: vfs cache: removing cached entry as stale (remote fingerprint "23165404,2018-01-27 22:24:35 +0000 UTC,b4febee818cadf8311e3e99bbcee7d7b" != cached fingerprint "23165404,2018-01-27 22:24:35 +0000 UTC,4224518601e549258e053e6f180bae1e")

Meanwhile we cancel the in-progress upload since the file has been written to locally.

2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: vfs cache: cancelling writeback (uploading true) 0xc0004f4a10 item 2
2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: vfs cache: cancelling upload
2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: Removed after failed upload: context canceled
2020/10/24 19:16:38 ERROR : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: Failed to copy: Update ReadFrom failed: context canceled
2020/10/24 19:16:38 INFO  : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: vfs cache: upload canceled
2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: vfs cache: cancelled upload

And remove the cache file since rclone now is convinced the one on the sftp server is newer.

2020/10/24 19:16:38 INFO  : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: vfs cache: removed cache file as stale (remote is different)
2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: vfs cache: removed metadata from cache as stale (remote is different)
2020/10/24 19:16:38 DEBUG : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: vfs cache: truncate to size=23165404

However something got out of sync and we never created the file... What we should be doing is downloading the one from the server if there was a real conflict (not the one rclone just made up!)

2020/10/24 19:16:39 INFO  : sounds_zero/taylor deupree - northern/01-02-taylor deupree-northern.mp3: vfs cache: downloader: error count now 1: vfs reader: failed to write to cache file: Open failed: file does not exist

Further analysis

  • This could only happen on the SFTP or local backends. Fixing these to upload to a temporary name then rename it into place would fix this problem
  • Deleting the dirty data seems incorrect to me
  • You can probably work around this by increasing --vfs-write-back from its default of 5 seconds.

@leoluan since you are up to speed with VFS internals - what do you think - is that analysis correct?

1 Like

Thanks for the comprehensive check of logs :wink: I'll experiment with a longer write-back and report if it helps.

I am not familiar with all the backends Rclone supports. I guess you said that this problem only affects local and sftp because only these two backends do not support atomic updates to the remote files at Rclone's whole file update granularity? The use of the temporary name and rename scheme makes sense in this case, assuming the rename operations do not incur significant overhead (such as S3 backend).

Deleting the dirty data does seem incorrect. There seem to be two concurrency-semantics design decisions to be made here.

(1) Should the local overwrite be allowed while a write-back upload is ongoing?

Maybe the write-backup should be canceled upon new local writes?

(2) Should the write-back upload continue if the file in the backend is really changed by some other independent process?

It does not seem that Rclone supports a single system image across different ways of accessing the files. Mixing local updates with two different versions of the file in the backend would be as bad as discarding the local changes. Maybe the right thing to do in this case is to save the local copy in the cache for recovery without forcing a write-back?

That is correct. Unlike all the other backends you can actually modify an existing file. (I guess FTP might be like this too).

I think the write back upload should be cancelled here and it was cancelled. However due to the fingerprint of the remote file changing, rclone then believes the remote file has more current data and deletes the local data.

I guess this is a conflict - removing a stale cache file that has dirty data and maybe we should do something special here. Ideally if we added the write to temp file part then we would never get to this point unless the the remote file had changed and the local file had changed.

Good question... I don't think we've really defined the conflict behaviour for the VFS. I think it is implemented so that the remote always wins...

So I think there are 2 different problems that might need fixing here:

  • writes to backends that actually alter the existing file (local,SFTP,FTP) should be writing to a temporary file
  • we should detect conflicts (remote data modified and local cache file is dirty) and at very minimum log about them. Renaming the file to let the user deal with it and removing it from the cache seems a pragmatic solution and it isn't too hard.

It might be interesting to put this conflict detection in and see if it happens in the VFS torture tests...

Renaming the file in the cache upon detected conflict sounds right. Should this renamed cache item be also written back? Keeping this file for user recovery seems to make sense.

Hmm, as file.txt.conflict-randomnumber or something like that? Not sure...

Yes.

In the code, the cancellation (in item._dirty()) seems to be done after some data is already written to the cache. This means that the write-back upload can potentially include some of the newly written data and complete before the cancellation. Maybe the cancellation should be done before any new write to the cache item so that we have a close-to-open cache consistency semantics.

The cancellations are somewhat asynchronous so waiting for the cancellation to finish might be problematic...

But I see what you mean.

There is an open issue for that, waiting for contributors like @leoluan since 2017 :slight_smile: