Concurrent read accesses on the same file through Rclone VFS mount

We have a use case where a virtual machine image (a qcow2 file) is stored in S3 storage and needs to be accessed read-only by a QEMU process through a VFS interface. Rclone's fuse mount works. However, the fuse filehandle mutex prevents more than one IO from being served concurrently, and the performance is not as good as s3fs for concurrent random IOs. I guess the current VFS implementation of Rclone targets sequential IO for video playing and buffers a single stream controlled through the internal seek operation. Can we extend this design to allow multiple IOs to be served concurrently, maybe using multiple ChunkedReader objects per filehandle, maintained with an LRU policy to keep a finite number of streams within the same open file? Each new read would reuse an existing ChunkedReader if it continues one of the streams sequentially, or replace the least recently used ChunkedReader if it starts a new stream or is random.
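
To make the idea concrete, here is a rough sketch of what I have in mind. None of these names (streamReader, readerPool, openAt) come from rclone's source - they are just placeholders to show the stream-matching and LRU-eviction logic:

```go
// Hypothetical sketch, not rclone's actual API: a per-filehandle pool of
// readers, one per sequential stream, recycled with an LRU policy.
package vfssketch

import (
	"io"
	"sync"
	"time"
)

// streamReader tracks one sequential stream within an open file.
type streamReader struct {
	rc       io.ReadCloser // e.g. a ChunkedReader positioned at nextOff
	nextOff  int64         // offset the next sequential read of this stream starts at
	lastUsed time.Time
}

// readerPool keeps a finite number of streams open for one file.
type readerPool struct {
	mu         sync.Mutex
	streams    []*streamReader
	maxStreams int
	openAt     func(off int64) (io.ReadCloser, error) // opens the remote object at an offset
}

// acquire returns a reader positioned at off: it reuses a stream if the read
// continues it sequentially, otherwise it opens a new one, evicting the least
// recently used stream when the pool is full.
func (p *readerPool) acquire(off int64) (*streamReader, error) {
	p.mu.Lock()
	defer p.mu.Unlock()

	var lru *streamReader
	for _, s := range p.streams {
		if s.nextOff == off { // sequential continuation of an existing stream
			s.lastUsed = time.Now()
			return s, nil
		}
		if lru == nil || s.lastUsed.Before(lru.lastUsed) {
			lru = s
		}
	}

	rc, err := p.openAt(off) // random read: start a new stream
	if err != nil {
		return nil, err
	}
	fresh := &streamReader{rc: rc, nextOff: off, lastUsed: time.Now()}

	if len(p.streams) < p.maxStreams {
		p.streams = append(p.streams, fresh)
		return fresh, nil
	}
	_ = lru.rc.Close() // pool is full: replace the least recently used stream
	*lru = *fresh
	return lru, nil
}
```

A real implementation would also need to advance nextOff after each read completes and pin a stream while its IO is in flight; this only shows how a read could pick an existing stream or evict the least recently used one.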

I presume you are talking about --vfs-cache-mode off here?

I think that is an interesting idea!

As far as I'm aware, though, the kernel doesn't actually tell us the mapping between open file handles and reads; it just sends read(offset, size) commands to us. So we have to work that out ourselves, as you suggested.

What we could do is something like this

  • we get a read for offset X - open the file and return data
  • we get a read for offset Y which needs seeking
    • currently we close the reader and reopen at Y
    • instead we open a new reader for this
  • we can now read at both places. If a reader hasn't been read from for 5 seconds we can close it (see the sketch below).
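
The 5-second rule could be as simple as a timer per reader that gets reset on every read. A minimal sketch with made-up names (this is not the actual ChunkedReader code):

```go
// Illustrative only: wrap a reader so it closes itself after 5 seconds
// without a read (these names are not from rclone).
package vfssketch

import (
	"io"
	"sync"
	"time"
)

const readerIdleTimeout = 5 * time.Second

type timedReader struct {
	mu    sync.Mutex
	rc    io.ReadCloser
	timer *time.Timer
}

func newTimedReader(rc io.ReadCloser) *timedReader {
	t := &timedReader{rc: rc}
	t.timer = time.AfterFunc(readerIdleTimeout, t.closeIdle)
	return t
}

// Read resets the idle timer on every call, then delegates to the wrapped reader.
func (t *timedReader) Read(p []byte) (int, error) {
	t.mu.Lock()
	t.timer.Reset(readerIdleTimeout)
	t.mu.Unlock()
	return t.rc.Read(p)
}

// closeIdle fires when the timer expires, i.e. no read arrived in time.
func (t *timedReader) closeIdle() {
	t.mu.Lock()
	defer t.mu.Unlock()
	_ = t.rc.Close()
}
```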

This scheme is remarkably similar to the one I implemented for the new --vfs-cache-mode full, which I'm working on at the moment and which you can try here if you want - it does exactly that, but caches the data on disk in a sparse file.

https://beta.rclone.org/branch/v1.52.1-082-g017d7a53-vfs-beta/

I like the idea of doing this for cache mode off too. We'd need a way of telling the async reader not to carry on reading ahead until it gets another read request, otherwise this could potentially over-read data - that is the only complication I see.
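
Roughly what I mean by that - as a sketch only, not the actual async reader - is that the read-ahead goroutine would only run while it is within some window past the last offset the kernel actually asked for, and block otherwise:

```go
// Sketch of a read-ahead throttle (not rclone's async reader): the background
// goroutine may only read ahead within `window` bytes of the last offset that
// was actually requested, and blocks otherwise until a new request arrives.
package vfssketch

import "sync"

type throttle struct {
	mu        sync.Mutex
	cond      *sync.Cond
	requested int64 // highest offset the kernel has asked for so far
	window    int64 // how far past that we are allowed to read ahead
}

func newThrottle(window int64) *throttle {
	t := &throttle{window: window}
	t.cond = sync.NewCond(&t.mu)
	return t
}

// noteRequest records a read request and wakes up the read-ahead goroutine.
func (t *throttle) noteRequest(off int64) {
	t.mu.Lock()
	if off > t.requested {
		t.requested = off
	}
	t.mu.Unlock()
	t.cond.Broadcast()
}

// waitToReadAhead blocks until reading at off would stay inside the window.
func (t *throttle) waitToReadAhead(off int64) {
	t.mu.Lock()
	for off > t.requested+t.window {
		t.cond.Wait()
	}
	t.mu.Unlock()
}
```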

I think all the vfs-cache-modes will be interesting when supporting multiple streams mixed with random IOs. Currently I am experimenting with --vfs-cache-mode minimal. It would be great if we could get any of these modes to work for random IOs, if not all of them at the same time.

Can we collaborate on this? I can work on it, but I am new to Rclone and do not understand all its pieces yet. If you can provide a rough patch for this - e.g. how to refactor the concurrency control currently enforced through the filehandle mutex - it would expedite things significantly, and I can help carry it out.

I did try the new --vfs-cache-mode full beta. However, it does not help concurrent random IOs over data that is accessed only once. I think that's because the filehandle mutex still serializes all the concurrent random IO requests.

I agree. A throttle to speed up, slow down, or stop the reading ahead would be needed.

I'd love help! However, I'm in the middle of a massive VFS refactor at the moment, so let's do this after that has landed in master!

Ah, so that is the thing I'd like to fix first...

That should be relatively easy to fix I think - just a question of releasing the RWFileHandle mutex while we are doing the IO.
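
The shape of the change would be something like this sketch (placeholder types, not the real RWFileHandle): hold the lock only long enough to capture the state the read needs, drop it for the duration of the IO, then retake it for any bookkeeping:

```go
// Placeholder types, not the real RWFileHandle: the point is only that the
// handle mutex is not held across the IO itself.
package vfssketch

import (
	"io"
	"sync"
)

type fileHandle struct {
	mu  sync.Mutex
	src io.ReaderAt // whatever actually serves the bytes (e.g. the cache item)
}

// ReadAt drops the handle mutex while the (potentially slow) IO is in flight,
// so other reads on the same handle can run concurrently.
func (fh *fileHandle) ReadAt(p []byte, off int64) (int, error) {
	fh.mu.Lock()
	src := fh.src // capture the state the read needs under the lock
	fh.mu.Unlock()

	n, err := src.ReadAt(p, off) // do the IO without holding the mutex

	fh.mu.Lock()
	// ...update any per-handle bookkeeping here...
	fh.mu.Unlock()
	return n, err
}
```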

If I do that on the new VFS branch would you be willing to test it out? I don't want to implement it twice!

Sounds great!

Sure. I will certainly be willing to test it out for the concurrent random access performance.

"concurrent random access performance"

Me too, I think I have a good test for this.

I took another look at my use case. The mutex that serializes the concurrent random reads is the one in ReadFileHandle, not RWFileHandle. Will the fix you have in mind for releasing the RWFileHandle mutex work for ReadFileHandle as well?

If you use --vfs-cache-mode full then it will be the RWFileHandle.

For the ReadFileHandle the reads need to proceed sequentially at the moment so releasing the mutex won't help until we get the multiple readers.

Try this with --vfs-cache-mode full and see what you think.

https://beta.rclone.org/branch/v1.52.1-094-g6713f865-vfs-beta/ (uploaded in 15-30 mins)

I tried it with my test using the following mount command.
/tmp/rclone-linux-amd64 --no-check-certificate mount tucson-chunk:/ /mnt/ibmcos-cache --allow-other --buffer-size=0 --vfs-read-chunk-size=64k --vfs-cache-mode=full --log-level=DEBUG --log-file=/spdb/rclone.log --cache-dir=/mnt/ram_disk_cache

The performance did not improve, and it still caches the whole file by the end of the test (according to the du size of the cache dir), even though the test only randomly accessed parts of the file. Maybe the new full vfs-cache-mode code was somehow not used?

Is the source in the VFS branch? I can try using a debugger to see what's going on. I also have the log file if that helps here.

Try du with the --apparent-size argument to see if it reports a different value. With --apparent-size, du reports the full logical size of the sparse file that rclone creates with the new cache mode; without it, du reports only the blocks actually allocated on disk.

I think you'll need to set --buffer-size to see a significant improvement. It may be that the code is broken with --buffer-size 0 - I haven't tested with that.

How big is the file you are accessing?

Yes the source is in the VFS branch. Note I will rebase this branch regularly so be prepared!

Here is the latest build

https://beta.rclone.org/branch/v1.52.1-097-g8578d214-vfs-beta/ (uploaded in 15-30 mins)

du with and without --apparent-size reports the same full size for the file, even though the test only randomly accesses parts of it.

The file size is 7.8G. The sizes of the random reads vary from 4K to 128K. What --buffer-size do you suggest using?

"Here is the latest build: https://beta.rclone.org/branch/v1.52.1-097-g8578d214-vfs-beta/ (uploaded in 15-30 mins)"

I don't see a build for Linux (not sure if that was intended), but I would be interested in testing this out as well.

I'd try with the default of --buffer-size 16M

Looks like I broke stuff - I'll try again tomorrow morning!

I tried this version with --buffer-size=16M, which is identical to my chunker backend's chunk size. I got a significant improvement in performance, cutting the VM boot time in half compared to previous measurements.

However, the code does not seem to check --vfs-cache-max-size (3GB) early enough: it exhausted my ramdisk cache space (10GB) and hung once the amount of file data accessed went beyond the cache size. The file size is 12GB.

Nice!

That --vfs-cache-max-size is more of a soft limit. If you have a file open and you read more than that much data then you will exceed the limit.

The VFS cache can't remove chunks of data from files yet, which maybe it needs to - at the moment it's the whole file or nothing.

Before chunk-level cache eviction is implemented, can we have a stop-gap solution to purge the file cache entirely when the --vfs-cache-max-size limit is reached?

I think what you are asking for is: if the cache limit is breached, then remove open files.

At the moment only closed files are eligible for removal.

If I allowed it to remove open files then the backing file for the open files could be removed and recreated. That would allow it to stay under quota at all times.
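
As a sketch of that stop-gap (illustrative names only, not the actual VFS cache code): when total usage goes over --vfs-cache-max-size, remove and recreate the sparse backing files of items until we are back under the limit, and let later reads re-warm them from the remote:

```go
// Illustrative names only, not the actual VFS cache code: reset the sparse
// backing files of items until the cache is back under the size limit.
package vfssketch

import (
	"os"
	"sync"
)

// cacheItem stands in for one (possibly open) file's sparse backing file.
type cacheItem struct {
	mu   sync.Mutex
	path string // path of the sparse cache file on disk
	used int64  // bytes currently cached for this item
}

// reset throws away the cached data by removing and recreating the backing
// file as an empty sparse file; later reads re-warm it from the remote.
func (it *cacheItem) reset() error {
	it.mu.Lock()
	defer it.mu.Unlock()
	if err := os.Remove(it.path); err != nil && !os.IsNotExist(err) {
		return err
	}
	f, err := os.Create(it.path)
	if err != nil {
		return err
	}
	it.used = 0
	return f.Close()
}

// purgeOverQuota resets items in turn until the total usage fits under maxSize.
func purgeOverQuota(items []*cacheItem, maxSize int64) error {
	var total int64
	for _, it := range items {
		total += it.used
	}
	for _, it := range items {
		if total <= maxSize {
			break
		}
		freed := it.used
		if err := it.reset(); err != nil {
			return err
		}
		total -= freed
	}
	return nil
}
```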

Is that the kind of thing you were thinking of?

I think by "backing file" you mean the sparse vfs cache file, right? Yes, I was suggesting that, as a stop-gap solution before chunk-level cache eviction is implemented, the VFS cache code can remove the cache file when the --vfs-cache-max-size limit is reached and restart warming the empty cache afterward.

I am not sure what you mean by "remove open files". You mean "remove the cached data of open files", right?