We have a use case where a virtual machine image (a qcow2 file) is stored in S3 storage and needs to be accessed read-only by a QEMU process through the VFS interface. Rclone's FUSE mount works, but the FUSE filehandle mutex prevents more than one IO from being served concurrently, so performance is not as good as s3fs for concurrent random IOs. I suspect the current VFS implementation of rclone targets sequential IO (e.g. video playback) and buffers a single stream controlled through the internal seek operation. Could we extend this design to serve multiple IOs concurrently, perhaps using multiple ChunkedReader objects per filehandle, maintained with an LRU policy so that a finite number of streams is kept within the same open file? Each new read would use an existing ChunkedReader if it is sequential within that reader's stream, or replace the least recently used ChunkedReader if it starts a new stream or is a random read.
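To make the idea concrete, here is a rough Go sketch of the kind of per-filehandle reader pool I have in mind. It is illustrative only - the names (streamReader, streamPool, openAt) are made up and are not rclone's actual types; rclone's ChunkedReader would play the role of streamReader here.

```go
// Sketch only: a bounded pool of concurrent read streams for one open file.
package readerpool

import (
	"io"
	"sync"
	"time"
)

// streamReader stands in for a ChunkedReader: a remote stream positioned at
// some offset that advances as it is read.
type streamReader struct {
	rc       io.ReadCloser // underlying remote stream
	offset   int64         // next offset this stream will return
	lastUsed time.Time     // for LRU eviction
}

// streamPool keeps a bounded number of idle streams for one open file.
// openAt is a hypothetical callback that opens the object at a given offset.
type streamPool struct {
	mu         sync.Mutex
	maxStreams int
	openAt     func(offset int64) (io.ReadCloser, error)
	idle       []*streamReader
}

// readAt serves one read(offset, size) request. It checks out a stream whose
// current position matches the request (a sequential continuation), or opens
// a new one; on check-in the least recently used idle stream is evicted if
// the pool is over its limit. Only the checkout/check-in is serialised, so
// reads on different streams can run in parallel.
func (p *streamPool) readAt(buf []byte, offset int64) (int, error) {
	s, err := p.checkout(offset)
	if err != nil {
		return 0, err
	}
	n, err := io.ReadFull(s.rc, buf)
	s.offset = offset + int64(n)
	s.lastUsed = time.Now()
	p.checkin(s)
	return n, err
}

func (p *streamPool) checkout(offset int64) (*streamReader, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i, s := range p.idle {
		if s.offset == offset { // sequential within an existing stream
			p.idle = append(p.idle[:i], p.idle[i+1:]...)
			return s, nil
		}
	}
	rc, err := p.openAt(offset) // random read: start a new stream
	if err != nil {
		return nil, err
	}
	return &streamReader{rc: rc, offset: offset}, nil
}

func (p *streamPool) checkin(s *streamReader) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.idle = append(p.idle, s)
	if len(p.idle) > p.maxStreams { // evict the least recently used stream
		lru := 0
		for i := range p.idle {
			if p.idle[i].lastUsed.Before(p.idle[lru].lastUsed) {
				lru = i
			}
		}
		p.idle[lru].rc.Close()
		p.idle = append(p.idle[:lru], p.idle[lru+1:]...)
	}
}
```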
I presume you are talking about --vfs-cache-mode off here?
I think that is an interesting idea!
As far as I'm aware though, the kernel doesn't actually tell us the mapping between open file handles and reads - it just sends read(offset, size) commands to us. So we have to work that out ourselves, as you suggested.
What we could do is something like this:
- we get a read for offset X - open the file and return data
- we get a read for offset Y which needs seeking
- currently we close the reader and reopen at Y
- instead we open a new reader for this
- we can now read at both places. If a reader has not been read from for 5 seconds we can close it.
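As a rough illustration (building on the hypothetical streamPool sketch above, not real rclone code), the 5-second rule could just be a periodic sweep over the pool, driven for example by a time.Ticker while the file handle is open:

```go
// closeIdleStreams closes and drops any idle stream that has not been read
// from within maxIdle (e.g. 5 * time.Second), so abandoned streams do not
// pile up holding connections open.
func (p *streamPool) closeIdleStreams(maxIdle time.Duration) {
	p.mu.Lock()
	defer p.mu.Unlock()
	keep := p.idle[:0]
	for _, s := range p.idle {
		if time.Since(s.lastUsed) > maxIdle {
			s.rc.Close() // stream abandoned - drop it
			continue
		}
		keep = append(keep, s)
	}
	p.idle = keep
}
```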
This scheme is remarkably similar to the one I implemented for the new --vfs-cache-mode full which I'm working on at the moment - you can try it here if you want. It does exactly that, but caches the data on disk in a sparse file.
I like the idea of doing this for cache mode off too. We'd need a way of telling the async reader not to carry on reading ahead until it gets another read request, otherwise this could potentially over-read data - that is the only complication I see.
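Roughly what I mean, as a sketch only (this is not the existing async reader code, and the names are made up): the read path pokes a gate on every incoming request, and the background read-ahead loop waits on that gate once it has buffered far enough past the last requested offset.

```go
// readAheadGate lets a background read-ahead loop pause until another read
// request arrives, so an untouched stream stops downloading data that may
// never be wanted. Sketch only.
type readAheadGate struct {
	demand chan struct{} // capacity 1: at most one pending wake-up
}

func newReadAheadGate() *readAheadGate {
	return &readAheadGate{demand: make(chan struct{}, 1)}
}

// noteRead is called on every read(offset, size) request from the kernel.
func (g *readAheadGate) noteRead() {
	select {
	case g.demand <- struct{}{}:
	default: // a wake-up is already pending
	}
}

// waitForRead blocks the read-ahead goroutine until the next read arrives or
// the stream is closed; it returns false when the stream is closed.
func (g *readAheadGate) waitForRead(closed <-chan struct{}) bool {
	select {
	case <-g.demand:
		return true
	case <-closed:
		return false
	}
}
```

The read-ahead loop would keep fetching while it is within, say, --buffer-size of the last requested offset, and call waitForRead before going any further.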
I think all the vfs-cache-modes will be interesting for supporting multiple streams mixed with random IOs. Currently I am experimenting with --vfs-cache-mode minimal. It would be great if we could get any of these modes to work for random IOs, if not all of them at once.
Can we collaborate on this? I can work on it, but I am new to rclone and do not understand all its pieces yet. If you could provide a rough patch - e.g. showing how to refactor the concurrency control currently enforced through the filehandle mutex - it would help expedite this significantly, and I can help carry it out.
I did try the new --vfs-cache-mode full beta. However, it does not help concurrent random IOs that touch each piece of data only once. I think that's because the filehandle mutex still serializes all the concurrent random IO requests.
I agree. A throttle to speed up, slow down, or stop the reading ahead would be needed.
I took another look at my use case. The mutex that serializes the concurrent random reads is in ReadFileHandle, not RWFileHandle. Will the fix you have in mind for relaxing the RWFileHandle locking work for ReadFileHandle as well?
I tried it with my test using the following mount command.
/tmp/rclone-linux-amd64 --no-check-certificate mount tucson-chunk:/ /mnt/ibmcos-cache --allow-other --buffer-size=0 --vfs-read-chunk-size=64k --vfs-cache-mode=full --log-level=DEBUG --log-file=/spdb/rclone.log --cache-dir=/mnt/ram_disk_cache
The performance did not improve, and it still caches the whole file by the end of the test (according to the du size of the cache dir), even though the test only randomly accessed parts of the file. Maybe the new full vfs-cache-mode code was somehow not used?
Is the source in the VFS branch? I can try using a debugger to see what's going on. I also have the log file if that helps.
Try du with the --apparent-size argument and compare it with the default output. du by default reports actual disk usage, so for the sparse file rclone creates with the new cache mode it should only count the blocks that have actually been written, whereas --apparent-size reports the full file size.
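For example (assuming the --cache-dir from your mount command above), comparing the two should show whether the backing file is really sparse:
du -sh /mnt/ram_disk_cache
du -sh --apparent-size /mnt/ram_disk_cache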
I think you'll need to set --buffer-size to see a significant improvement. It may be that the code is broken with --buffer-size 0 - I haven't tested with that.
How big is the file you are accessing?
Yes the source is in the VFS branch. Note I will rebase this branch regularly so be prepared!
I tried this version with --buffer-size=16M, which matches my chunker backend's chunk size, and got a significant improvement in performance, cutting the VM boot time in half compared to my previous measurements.
However, the code does not seem to check --vfs-cache-max-size (3GB) early enough: it exhausted my ramdisk cache space (10GB) and hung once the amount of file data accessed went beyond the cache size. The file size is 12GB.
Before chunk-level cache eviction is implemented, can we have a stop-gap solution that purges the file's cache entirely when the --vfs-cache-max-size limit is reached?
I think what you are asking for is: if the cache limit is breached, then remove open files.
At the moment only closed files are eligible for removal.
If I allowed it to remove open files then the backing file for the open files could be removed and recreated. That would allow it to stay under quota at all times.
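As a very rough sketch of that stop-gap (illustrative only - the types, fields and function are hypothetical, not the actual VFS cache code): remove closed items first, and if still over quota, empty the sparse backing file of open items so they start warming again from scratch.

```go
// Sketch only: keep the cache under maxSize by removing closed items first,
// then resetting the backing files of open items.
package cachesketch

import (
	"os"
	"sort"
	"time"
)

// cacheItem is a hypothetical record of one cached file in --cache-dir.
type cacheItem struct {
	backingPath string    // sparse backing file on disk
	open        bool      // true while file handles are still open on it
	size        int64     // bytes currently stored on disk
	lastUsed    time.Time // for oldest-first removal
}

// purgeOverQuota frees space until total usage is at or below maxSize.
// Closed items are removed outright; open items have their backing file
// truncated to empty so the open handles keep working and the cache simply
// re-warms on demand. A real implementation would also have to reset the
// item's record of which byte ranges are present.
func purgeOverQuota(items []*cacheItem, maxSize int64) error {
	var used int64
	for _, it := range items {
		used += it.size
	}
	// Oldest first, and closed items before open ones.
	sort.Slice(items, func(i, j int) bool {
		if items[i].open != items[j].open {
			return !items[i].open
		}
		return items[i].lastUsed.Before(items[j].lastUsed)
	})
	for _, it := range items {
		if used <= maxSize {
			break
		}
		var err error
		if it.open {
			err = os.Truncate(it.backingPath, 0) // empty but keep the file
		} else {
			err = os.Remove(it.backingPath)
		}
		if err != nil {
			return err
		}
		used -= it.size
		it.size = 0
	}
	return nil
}
```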
I think by "backing file" you mean the sparse VFS cache file, right? Yes, I was suggesting that: as a stop-gap before chunk-level cache eviction is implemented, the VFS cache code could remove the cache file when the --vfs-cache-max-size limit is reached and start warming the empty cache again afterward.
I am not sure what you mean by "remove open files". You mean "remove the cached data of open files", right?