Concurrent read accesses on the same file through Rclone VFS mount

Yes that is right.

Yes that is right

It sounds like we are in agreement. The only tricky part about doing this is the locking as there are quite a few moving parts, but I think it should be possible to close, delete, reopen and sparse the backing file without having to close the download stream(s). Though it would be necessary to keep some data from the old file to make it worthwhile leaving the downloaders open, so maybe that is a complication too far.

I put this idea on the list to implement as I know you are not the only one who would like it!

This works fine for reading. What do you think rclone should do if it goes over quota while writing? It could empty the cache, but that still might not be enough. It would have to start returning disk full errors from the writes, I suppose?

At the moment the cache is only checked for being over quota every 60 seconds by default (this is changeable). Do you think that is acceptable or that I should be keeping a running total of cache usage?

Note that rclone uses the sum of the sizes of the written-to blocks in the file as the usage for each file in the backing store. This isn't 100% accurate as there is OS overhead it isn't taking account of.
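As a rough illustration of that bookkeeping (made-up names, not the actual cache code) - the per-file figure is just the sum of the written ranges, so OS block rounding and metadata don't get counted:

```go
package main

import "fmt"

// writtenRange marks a byte range of the backing file that has been
// written to (illustrative type, not the real cache item code).
type writtenRange struct {
	offset, size int64
}

// cacheUsage adds up the sizes of the written-to ranges. It deliberately
// ignores filesystem block rounding and metadata, which is why the figure
// can undercount the real disk usage slightly.
func cacheUsage(ranges []writtenRange) int64 {
	var total int64
	for _, r := range ranges {
		total += r.size
	}
	return total
}

func main() {
	ranges := []writtenRange{{0, 1 << 20}, {4 << 20, 2 << 20}} // 1 MiB + 2 MiB written
	fmt.Println("accounted usage:", cacheUsage(ranges))        // 3145728
}
```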

Can we let the current downloaders continue doing their work without writing data to the backing file? This way the on-going reads do not need to be suspended. Once all the downloaders are running in this mode, it should be safe to purge the backing file, right?

Thanks! This is very important for our use case, where the file data being read can far exceed the cache size.

Why would emptying the cache (assuming that flushes all the writes) still not be enough?

Since the cache (ramdisk) will not be huge, we may need to keep the check period smaller than 60 seconds if we can get very fast read throughput, and/or we need to be very conservative in setting the quota to be much smaller than the available space. Keeping a running total of cache usage and running the purge proactively inline, or having a way to recover from ENOSPC, would be valuable.

I guess allowing the cache full mode to fall back to not writing to the backing file can solve the corner cases where the size estimate is not enough and the downloader still encounters ENOSPC?

That is relatively easy. However it is likely that

  • the user is reading sequentially from the file
  • the next bit the user wants has been written to the file
  • the downloader is downloading ahead of what the user wants

So dropping the file while leaving the downloaders running would leave a gap in the data for the user so rclone would start another downloader to fill that gap.

Rclone can only flush whole files to the backing storage - that is the way cloud storage works. So for a file which is being written to (i.e. a dirty file) we need to keep it in the cache until it is ready to upload in its entirety.

The max cache size would therefore limit the size of the largest file that can be uploaded.

A running total would be the most reliable solution.

What do you mean by recovering from ENOSPC? Do you mean rclone takes this as a signal to prune the cache?

I guess the downloaders could note this and pause writing that block or something?

I guess this means a little hiccup at each cache size check interval. I think that is acceptable in our current use case initially.

We are using the chunker backend to keep the object size smaller (e.g., 16MB). Maybe Rclone could allow its dirty cache flush to be done along chunker boundaries? This should be beneficial for other use cases of the chunker backend as well, because chunker is designed for larger files, where cache size can be a pressure.

Yes, I mean rclone takes ENOSPC as a signal to prune the cache, and downloaders can pause what they do and wait for the space to become available to try again. Alternatively, I thought that the reads may be able to continue while pruning is going on since the downloaders are getting the data that fulfills the ongoing read requests. Although the data cannot be written to the cache file, it's in a memory buffer that can be used to fulfill the read requests, right? I guess the code is not currently structured that way. Just curious. :slight_smile:

OK :slight_smile:

I haven't really worked out how to punch holes in the sparse file yet. Doing it along 16 MB boundaries seems like a reasonable idea!

The code is structured so the data goes to disk first and then back to the application. This makes the read path identical for cache vs uncached data - we just need to make sure the data is cached first.

I did think of making a shortcut so data could go straight from the read buffer to the application as well as going to disk, but I think it would be quite complicated and my head nearly exploded getting the first iteration right - lots of concurrent parts are difficult to keep track of!

Maybe starting with Linux and file systems that support fallocate with the FALLOC_FL_PUNCH_HOLE flag?
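Something like this is what I had in mind on Linux - a minimal sketch using golang.org/x/sys/unix (the path, offset and length are just placeholders, not where rclone keeps its cache files):

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// punchHole deallocates [off, off+length) in the backing file, leaving a
// sparse hole behind. FALLOC_FL_PUNCH_HOLE has to be combined with
// FALLOC_FL_KEEP_SIZE so the apparent file size stays the same.
func punchHole(f *os.File, off, length int64) error {
	return unix.Fallocate(int(f.Fd()),
		unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE,
		off, length)
}

func main() {
	// "backing-file" is just a placeholder path for this sketch.
	f, err := os.OpenFile("backing-file", os.O_RDWR, 0600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	// Free one 16 MB chunk, aligned to the chunker boundary mentioned earlier.
	if err := punchHole(f, 0, 16<<20); err != nil {
		log.Fatal(err)
	}
}
```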

I appreciate your great work even without the fuse read directly out of the buffer. I hope to help code some of these when I get more familiar with the surrounding pieces.

I need to work out how to do that on Windows too...

You've convinced me it is a good idea for multiple readers without caching. I think that will work best for your application.

That's great! :slight_smile: I guess the seek func of ReadFileHandle will call something like
func (acc *Account) ReserveReader(in io.ReadCloser)
as opposed to the current callee
func (acc *Account) UpdateReader(in io.ReadCloser)

The Account structure maintains an array of readers.

When initializing each reader, maybe the 16 channel tokens should not be sent all at once? Only send one token so that we don't read too much data unnecessarily for small random VFS reads. I guess a variable number of additional tokens can be sent in fill() (called in Read) using some throttle that accelerates or decelerates the buffering?
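Roughly the shape I'm picturing, with completely hypothetical names (this is not the current asyncreader code): instead of queuing all 16 tokens up front, start with a read-ahead depth of one buffer and deepen it only while the reads stay sequential.

```go
package sketch

// Hypothetical token gate for the background fetcher - not the current
// AsyncReader code. A cold reader only gets one buffer of read-ahead;
// extra tokens are granted from fill()/Read() while reads stay sequential.
type readAheadGate struct {
	tokens chan struct{} // one token per buffer the fetcher may have in flight
	depth  int           // current read-ahead depth in buffers
	limit  int           // maximum depth (e.g. the current 16 buffers)
}

func newReadAheadGate(limit int) *readAheadGate {
	g := &readAheadGate{tokens: make(chan struct{}, limit), limit: limit}
	g.tokens <- struct{}{} // depth of one buffer to start with
	g.depth = 1
	return g
}

// acquire is called by the background fetch goroutine before it downloads
// the next buffer.
func (g *readAheadGate) acquire() { <-g.tokens }

// release is called when the reader has consumed a buffer, recycling its
// token so the fetcher can fill another one.
func (g *readAheadGate) release() { g.tokens <- struct{}{} }

// onSequentialRead would be called from fill() while the reads stay
// sequential; it doubles the depth (adding extra tokens) up to the limit.
// It is only called from the single reading goroutine, so depth needs no lock.
func (g *readAheadGate) onSequentialRead() {
	for extra := g.depth; extra > 0 && g.depth < g.limit; extra-- {
		g.tokens <- struct{}{}
		g.depth++
	}
}
```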

The least recently used reader can be updated using a refactored UpdateReader for a new stream.

Do these make sense?

This could be done in the ChunkedReader or in the ReadHandle...

Doing it in the accounting has a certain appeal but that code is rather complex already and is a source of bugs...

Do you mean in the async reader? Note that we soft start the async reader by making it read 512 bytes to start with and doubling until 1MB.

I really like the idea in this thread. We basically get all the benefits of the full mode without using local storage, is that right?

I also have lots of random reads and haven't tested the full cache mode yet, but I don't know how well it would behave when you have 20k+ files.

I think it really depends on your use case. If you have repeated read requests for the same files or the same parts of big files and you have a large-enough cache, the cache will help. In our use case, there is some repeated access, but the cache size is limited. So unless Rclone adds cache replacement or some stop-gap solution, we will not be able to use the full-cache mode.

I think if you have temporal locality in the access pattern and your cache size is larger than the size of the working set, then full cache mode will help. Nick also included a fix in the full cache (vfs) branch to allow random reads to be concurrently served for the same file. That helps concurrent random access to the same file tremendously. But if your access pattern does not have much temporal locality and you have large open files overrunning the cache, then it is like our use case where multi-stream buffered reads without using cache would work better.

I guess we can maintain an array (or an LRU'd list) of ChunkedReaders in the ReadFileHandle? How about the Account object? Should it be per ChunkedReader or per file?

Yes, I mean in the async reader. It looks like softStartInitial = 4 * 1024, not 512 bytes? I seem to see 16 tokens sent in a for loop. That means the goroutine inside the init function of AsyncReader will read non-stop 4K + 8K + 16K + 32K + 64K + 128K + 256K + 512K + 8 * 1M ~= 9MB at the beginning?

Normally we do one Account per transfer, so one per chunked reader.

Being able to kill off the Update code in the Account would be a win - it's complex, has caused bugs and it breaks encapsulation with the async reader...

Yes that's right.

It could read that much but remember it reads asynchronously so it depends on how fast the network sends it.

For this purpose it might be worth having a pause for the async reader so it stops its background reading until it is read from again, if we have to open a new reader.

I guess the account values can be extended/used for victim selection when we need to get rid of an idle, least recently accessed, or less valuable reader (one with less already-filled buffer). They seem to be used for ETA, speed, etc. statistics for sequential access today. I guess we could also use them for prefetch throttle control when sequential accesses are mixed with other independent sequential or random accesses.

How does the soft-start in AsyncReader affect the http backend (or in our case the chunker backend)? Why not let the first read be the size of the VFS read (as opposed to always 4K), pause, and then accelerate upon subsequent VFS reads that are sequential within the same stream?

The VFS read size is often very large (it defaults to 128M) - we don't actually read the data in chunks of that size; we request chunks of that size from the other end but read the data out in the smaller buffers in the AsyncReader.

I see. Maybe it's better to start from min(vfs request size, buffer size)? Just curious ...

For the multi-reader approach, can we start by changing the Account member of the ReadFileHandle to be an array of Account instances with a maximum length (an added member in ReadFileHandle)? Each Account instance has its own reader. Upon each read request, we look for an account/reader at the right offset and use it if it exists, thereby avoiding a seek in that case. If such a reader does not exist, we add a new reader if we have not reached the maximum number of readers yet. Otherwise, we find the reader with the least "number of bytes read since last measurement" and update that reader to be used for the new read. Does the Account mutex properly control the concurrency here? It looks like we need to acquire the Account mutex before releasing the ReadFileHandle mutex so that a reader is not disrupted by a new VFS read service thread that picks it as a victim to be replaced, or maybe add an in-use field in Account.
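To make that concrete, here is the rough shape I have in mind (all names hypothetical, not the real vfs types):

```go
package sketch

import (
	"sync"
	"time"
)

// Hypothetical sketch of the multi-reader bookkeeping - not the real vfs
// types. Each streamReader stands in for an Account + ChunkedReader pair.
type streamReader struct {
	offset    int64     // where the next sequential read would continue
	bytesRead int64     // bytes read since the last measurement
	lastUsed  time.Time // for expiring idle readers
	inUse     bool      // set while a VFS read is being served from it
}

type multiReadHandle struct {
	mu         sync.Mutex // stands in for the ReadFileHandle mutex
	readers    []*streamReader
	maxReaders int
}

// readerFor returns a reader positioned at off: reuse one at the right
// offset (no seek), open a new one if under the limit, otherwise recycle
// the least active reader that is not currently serving a read.
func (h *multiReadHandle) readerFor(off int64) *streamReader {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, r := range h.readers {
		if !r.inUse && r.offset == off {
			r.inUse = true
			r.lastUsed = time.Now()
			return r
		}
	}
	if len(h.readers) < h.maxReaders {
		r := &streamReader{offset: off, inUse: true, lastUsed: time.Now()}
		h.readers = append(h.readers, r)
		return r
	}
	var victim *streamReader
	for _, r := range h.readers {
		if r.inUse {
			continue
		}
		if victim == nil || r.bytesRead < victim.bytesRead {
			victim = r
		}
	}
	if victim != nil {
		// Here the real code would do the UpdateReader/reopen to move
		// the underlying stream to the new offset.
		victim.offset = off
		victim.bytesRead = 0
		victim.inUse = true
		victim.lastUsed = time.Now()
	}
	return victim // nil means every reader is busy; caller would wait or error
}
```

The inUse flag here is the "in-use field in Account" idea above, so a reader can't be picked as a victim while another VFS read service thread is in the middle of using it.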

Where do we release the ReadFileHandle mutex before doing the IO? Can we do the release and relocking in func (fh *ReadFileHandle) readAt after the potential seek?

fh.mu.Unlock()                // release the ReadFileHandle mutex
n, err = io.ReadFull(fh.r, p) // do the read without holding the lock
fh.mu.Lock()                  // re-acquire the ReadFileHandle mutex

The other change is to add a throttle to the readers. Just hoping to get the correct code structure identified first here ...

Can you please help correct/confirm my understanding of the relations between ReadFileHandle, Account, Transfer, AsyncReader, ChunkedReader?

It looks to me that each time a file is opened/reopened, a new Transfer is added and an Account is created too, and they have a one-to-one relationship. A ChunkedReader is added for this Account and an AsyncReader is added for larger files. Whenever a new read involves a seek, we try to use the existing AsyncReader to skip forward, or backward within a buffer. Failing that, we do a RangeSeek to get the ChunkedReader to the right position in the file. If there is any error, we do a reopen and create a new ChunkedReader/AsyncReader in the process.

I have been thinking that, when adding multiple readers for the same open file, each AsyncReader/ChunkedReader is going to belong to a separate Account tracking that reader's progress, e.g., tracking whether it's streaming a large on-going sequential read. If this is the right approach, what should we do with the Transfer in relation to multiple Accounts for the same open file? How is Transfer used today? Is it relevant to non-video files? Do we use one transfer for all accounts of the same open file?

You said something about getting rid of the update code in Accounting because it is buggy. Do you mean the reader updates during seeks? UpdateReader? RangeSeek, or SkipBytes?

Sure

I think that summary is very good!

The Transfers can outlive the Accounts, as if retries are needed a new Account is made from the existing Transfer.

I'll just explain what each bit is for - that should help understanding (there is a rough sketch of how they fit together after the list):

  • Account - monitors a single stream in a transfer. Is an io.Reader. Does accounting (adding up bytes transferred), rate limiting (--bwlimit), and quota (--max-transfer)
  • Transfer - this is entirely for producing the statistics, so the stats you see with --progress or --stats
  • ChunkedReader - this is a wrapper around fs.Object to enable seeking and chunked reading (which turns out to be preferred by cloud providers if you aren't going to read the whole file. If you ask for the whole file, some providers will account you for that even if you didn't read it all).
  • AsyncReader is a read ahead buffer. It is an io.Reader. It lets the writer run ahead of the reader. You optionally get one of these when you ask for an Account. You are required to have an Account for every transfer, but not an AsyncReader.
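And the rough sketch of how those pieces nest on the read path for a single open file - stub types only, not the actual rclone constructors:

```go
package sketch

import "io"

// Stub types standing in for the real ones, just to show how the pieces
// nest for a single open file being read - not the actual rclone source.
type (
	Transfer      struct{}            // produces the --progress / --stats numbers
	ChunkedReader struct{ io.Reader } // seekable, chunked reads of an fs.Object
	AsyncReader   struct{ io.Reader } // optional read-ahead buffer (io.Reader)
	Account       struct{ io.Reader } // accounting, --bwlimit, --max-transfer
)

// openForRead: the handle reads from the Account, which wraps the optional
// AsyncReader, which wraps the ChunkedReader around the remote object.
func openForRead(obj io.Reader) (*Transfer, *Account) {
	tr := &Transfer{}                 // one per transfer, purely for stats
	cr := &ChunkedReader{Reader: obj} // chunked/seekable access to the object
	ar := &AsyncReader{Reader: cr}    // read-ahead in front of the chunked reader
	acc := &Account{Reader: ar}       // the Account wraps the whole stream
	return tr, acc
}
```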

That is correct

That is a good question.

In the new VFS branch I made a new Transfer for each Account. This may not be the right approach though, as it counts a new transfer for each open.

https://github.com/rclone/rclone/blob/4d9ad98a8c76427432f9e0d3d4216f1a72e0da44/vfs/vfscache/downloaders/downloaders.go#L92 (uploaded in 15-30 mins)

I should probably change that to just have one Transfer per Downloaders object.

In fact you might want to take a look at that file as it is very similar to what needs to be implemented!

I mean the reader updates during seeks, in particular fh.r.UpdateReader in the vfs/read.go code. That does a whole lot of really nasty things!

Making a new reader for each seek will make things much easier!

I don't know if you've come across the waitSequential code in read.go yet?

We'll still need that code. Annoyingly, when using the VFS under a mount, reads can come in out of order. We don't want to start a new stream in that case; we wait for a short time for an in-order read.

Thanks! I looked at downloaders.go. I wonder whether we can extend it so that it can be used to copy the data into the destination buffer provided in the handle's ReadAt call? That is, can the ReadFileHandle path pass the destination buffer in the downloader struct and let the downloader's Write copy the read data into that buffer?

Alternatively or additionally, maybe we can modify the full cache mode to allow the read data to be passed back this way, in addition to writing the read data to the cache file? This eliminates the need to read the data back from the cache file, and it also allows the full cache mode to continue to operate when the cache space is full and we detect an ENOSPC error. The cache writes in downloader.Write() could resume after the purge or cleaner is able to reclaim cache space?

Today I merged the VFS branch to master so it is now available for pull requests!

I had a think about that and I think it would probably not be worth trying to re-use that code. downloaders.go tries very hard to read ahead of where the user is reading in an asynchronous way.

I think for our purposes a []ChunkedReader in the ReadFileHandle would probably be sufficient. We'd need to add an Offset method to the ChunkedReader so we could query where we would read from if we started reading now. I don't think anything more complicated than a slice is needed as there will only ever be a few ChunkedReaders so using a binary search is not necessary. I guess we also need to keep track of when the ChunkedReader was last read from for expiry purposes - we could push that into the ChunkedReader too.
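Something like this shape (hypothetical names - Offset doesn't exist on the ChunkedReader yet):

```go
package sketch

import "time"

// Hypothetical shape of the addition - not current code. Each entry in the
// ReadFileHandle's slice pairs a ChunkedReader with when it was last read
// from, and the ChunkedReader itself would grow an Offset method.
type chunkedReader interface {
	Read(p []byte) (int, error)
	Offset() int64 // proposed: where a Read would continue from right now
}

type readerEntry struct {
	cr       chunkedReader
	lastRead time.Time // used to expire readers nobody has touched for a while
}

// expired reports whether the reader has been idle longer than ttl and can
// be closed to free its connection.
func (e readerEntry) expired(ttl time.Duration) bool {
	return time.Since(e.lastRead) > ttl
}
```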

That would be a nice improvement.

What we could do is change the Downloaders.Download call to take a []byte buffer as well as the range. Then in Downloader.Write we could look through any waiters that might be present and write into the pending buffer. I guess we might need to keep a Ranges for the buffer so we know exactly what we have and haven't written. Then when the waiter returned we might have to read some additional stuff off the disk that was already there.
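Very roughly like this (hypothetical types - the real waiter and Ranges plumbing in vfs/vfscache is more involved than this):

```go
package sketch

// waiter is a pending Download call that supplied its own destination
// buffer covering [off, off+len(buf)) of the remote file.
type waiter struct {
	off    int64
	buf    []byte
	filled []byteRange // which parts of buf the stream has satisfied so far
}

type byteRange struct{ pos, size int64 }

// writeToWaiters is roughly what Downloader.Write could do with each chunk
// it receives at stream offset off: copy any overlap straight into the
// waiting buffers (in addition to writing the chunk to the cache file,
// which is not shown). Whatever is still missing when a waiter is released
// gets read back from the cache file as before.
func writeToWaiters(waiters []*waiter, off int64, p []byte) {
	end := off + int64(len(p))
	for _, w := range waiters {
		wEnd := w.off + int64(len(w.buf))
		lo, hi := maxInt64(off, w.off), minInt64(end, wEnd) // overlap of the two ranges
		if lo >= hi {
			continue // this chunk doesn't cover any of the waiter's buffer
		}
		copy(w.buf[lo-w.off:hi-w.off], p[lo-off:hi-off])
		w.filled = append(w.filled, byteRange{pos: lo, size: hi - lo})
	}
}

func maxInt64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

func minInt64(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}
```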

I don't think that would be too complicated.

Carrying on after ENOSPC could be the next addition after that.
