VFS is slow when reading in small (8-byte) chunks

What is the problem you are having with rclone?

Read speeds are too slow over rclone mount with several computer graphics applications and formats:

  • Alembic (.abc) and Autodesk Maya 2022
  • Pixar texture format (.tex) and SideFX Houdini 19.5.569

There are probably more affected cases, as some users blame rclone-mounted disks when working with Adobe Substance Painter projects (~20Gi).

Run the command 'rclone version' and share the full output of the command.

rclone v1.62.2
- os/version: Microsoft Windows 10 Pro 22H2 (64 bit)
- os/kernel: 10.0.19045.3208 Build 19045.3208.3208 (x86_64)
- os/type: windows
- os/arch: amd64
- go/version: go1.20.2
- go/linking: static
- go/tags: cmount

Which cloud storage system are you using? (eg Google Drive)

WebDAV hosted on the local network

Please look at the following log taken with Process Monitor:


You can see that there is a large file being read 8 bytes per call, and each call takes 0.5ms, so on large files (100Mi) this is very noticeable (compare with a 1Mi read, which takes 9ms).

We experience 3 times slower rendering with rclone-mounted disks compared to Windows shares mounted over the Samba protocol, no matter whether it is the first render or a subsequent one with the cache populated.

I think it makes sense to implement some kind of memory caching so that subsequent reads would be faster. Do you have any ideas on implementing such a feature, how the memory cache size should be calculated, and where it should be handled?

Thank you!

What about the following:

  1. Hook into vfscache.Item.readAt()
  2. Introduce a minimum VFS cache read size, say 1024 bytes
  3. Download and read the local cache in chunks of that size
  4. Keep at most 1024 chunks in memory, with an LRU eviction policy shared across all opened items
So memory consumption would not increase significantly, and a lot of per-read overhead would be avoided (see the sketch below).
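To make the idea concrete, here is a rough Go sketch of such a shared LRU chunk cache. All names in it (chunkcache, Cache, ReadAt, the readChunk callback) are hypothetical and not rclone's actual API; the real hook point would be somewhere around vfscache.Item.readAt():

// Package chunkcache is a hypothetical helper, not rclone's actual API:
// the idea is that vfscache.Item.readAt() would call Cache.ReadAt instead
// of hitting the cache file directly for every tiny read.
package chunkcache

import (
	"container/list"
	"io"
	"sync"
)

const (
	chunkSize = 1024 // minimum amount pulled from the cache file per miss
	maxChunks = 1024 // at most ~1 MiB of chunk memory across all open items
)

// key identifies one chunk of one cached item.
type key struct {
	itemID string // e.g. the cache file name (hypothetical)
	offset int64  // chunk-aligned offset within the file
}

type entry struct {
	k    key
	data []byte
}

// Cache is a small LRU of recently read chunks shared by all open items.
type Cache struct {
	mu    sync.Mutex
	lru   *list.List // front = most recently used
	index map[key]*list.Element
}

func New() *Cache {
	return &Cache{lru: list.New(), index: make(map[key]*list.Element)}
}

// ReadAt serves p from the chunk cache, calling readChunk (the existing
// cache-file read) only on a miss. Simplified: assumes a single read does
// not cross a chunk boundary.
func (c *Cache) ReadAt(itemID string, p []byte, off int64, readChunk func(buf []byte, off int64) (int, error)) (int, error) {
	base := off - off%chunkSize
	data, err := c.chunk(key{itemID, base}, readChunk)
	if err != nil {
		return 0, err
	}
	if off-base >= int64(len(data)) {
		return 0, io.EOF
	}
	return copy(p, data[off-base:]), nil
}

// chunk returns the cached chunk for k, reading and inserting it on a miss
// and evicting the least recently used chunk when the cache is full.
func (c *Cache) chunk(k key, readChunk func([]byte, int64) (int, error)) ([]byte, error) {
	c.mu.Lock()
	if el, ok := c.index[k]; ok {
		c.lru.MoveToFront(el)
		data := el.Value.(*entry).data
		c.mu.Unlock()
		return data, nil
	}
	c.mu.Unlock()

	// Miss: pull a whole chunk from the cache file.
	buf := make([]byte, chunkSize)
	n, err := readChunk(buf, k.offset)
	if err != nil && err != io.EOF {
		return nil, err
	}
	buf = buf[:n]

	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.index[k]; ok { // another reader filled it meanwhile
		c.lru.MoveToFront(el)
		return el.Value.(*entry).data, nil
	}
	c.index[k] = c.lru.PushFront(&entry{k: k, data: buf})
	if c.lru.Len() > maxChunks {
		oldest := c.lru.Back()
		c.lru.Remove(oldest)
		delete(c.index, oldest.Value.(*entry).k)
	}
	return buf, nil
}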

By the way, I found that reading the cache file requires a read permission check on every read (even an 8-byte one); it consists of some operations (QueryAttribute, QueryInformation) that take about 0.45ms, while the actual read takes about 0.15ms. So getting rid of those operations would also increase reading speed for small portions.

It would help if you shared your remote config and the actual mount command you are using.

full command: "C:\rclone.exe" mount :webdav: P: --read-only --volname "DATA_LOCAL" --webdav-vendor nextcloud --webdav-url "http://fs.domain.local:9999/store/data" --vfs-cache-mode full --vfs-cache-max-age 240h --file-perms 0777 --dir-cache-time 120s --network-mode

Here is another monitoring log taken from a regular Samba share, just to show how some applications read data:

You can see that it reads some portions several times, and the portions can be as small as 4 bytes.

I guess the slowdowns are caused by permission checking on each read from the cache, here

and it results in the following Query calls:

It is hard to say how long the first QueryInformationVolume took, but even the second two Queries took 65+63=128 microseconds, which is more than the 95 microseconds of the actual read.

Surely missing permissions on the VFS cache file, or accidental deletion of the file, are exceptional cases and should not be checked for on every read.

Fascinating analysis - for sure there is always potential to improve VFS performance.

Not sure though how many people use rclone mount for high-performance applications like yours.

Let's see if anybody is interested in picking it up...

I will move this thread to Features as it is not a problem as such but interesting idea about improvements.

On a side note, I see that you are using --network-mode. Maybe it is worth trying without it? I'm not a Windows expert, but it might affect the OS caching mechanism.

I do not see a difference without --network-mode, only the files in the cache get additionally populated at the end.

Got some more info:

  1. The extra Query* operations do not occur when I read files in a simple test Go program (I even copied the OpenFile function from rclone/lib and called it with the same parameters as vfscache does); they appear only under rclone.
  2. I see that the test program reads 5 bytes from a file in 255 microseconds, but the extra calls and the read itself take only about 20 microseconds (4.6+5.6+2.6+7.4), so optimizing those calls may not change things much.

So maybe communication with WinFsp is the weakest part, and read-ahead optimization should be implemented at that level.

I made a test reading a single file in portions of different sizes:

Read 29339600 bytes in 0.025776599999999983 seconds - by 4096 bytes, local
Read 29339600 bytes in 0.15013739999999998 seconds - by 4096 bytes, winfsp-passthrough-fuse3-x64.exe
Read 29339600 bytes in 0.28682240000000003 seconds - by 4096 bytes, samba share
Read 29339600 bytes in 0.31780699999999995 seconds - by 4096 bytes, rclone
Read 29339600 bytes in 8.413499999999999 seconds - by 8 bytes, local
Read 29339600 bytes in 10.6984064 seconds - by 8 bytes, samba share
Read 29339600 bytes in 68.97246200000001 seconds - by 8 bytes, winfsp-passthrough-fuse3-x64.exe
Read 29339600 bytes in 152.11466099999998 seconds - by 8 bytes, rclone

I conclude that reading big files in small chunks is very inefficient in general, but the Samba client appears to do read-ahead for sequential reading (and there are resources on the web that state this). WinFsp introduces a significant penalty on reads, and rclone doubles it.

And the code is as follows:

import time

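# Uncomment the open() call for the location under test (local disk, Samba
# share, WinFsp passthrough or rclone mount); buffering=0 disables Python's
# own buffering. The read size below (4096 here, 8 bytes in other runs)
# sets the chunk size being measured.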
# f = open('z:\\BaseColor.tex', 'rb', buffering=0)
f = open('C:\\BaseColor.tex', 'rb', buffering=0)
# f = open('x:\\BaseColor.tex', 'rb', buffering=0)
# f = open('\\\\fs1\\share\\BaseColor.tex', 'rb', buffering=0)
s = time.perf_counter()
l = 0
while True:
    d = f.read(4096)
    if not d:
        break
    l += len(d)
print(f'Read {l} bytes in {time.perf_counter() - s} seconds')
f.close()

I have also tried returning a zero-filled buffer instead of downloading data and reading the locally cached data; in the 8-byte case I got 125 seconds instead of 150. So I have decided that implementing memory caching at the rclone level will not change things significantly :(

Try --buffer-size 0

That has been reported to improve the latency on Windows.

Your results seem to say that the application reading 8 bytes results in a call from the OS to only read 8 bytes.

As I understand rclone logs best - could you confirm that with rclone logs taken with -vv? If I've missed that above, my apologies - I find the screenshots rather hard to read.

What I would expect to happen is that the OS asks for much larger chunks than 8 bytes and buffers them in memory.

Here is what happens under Linux when I read a single byte from a file - the OS asks for 16k:

2023/07/22 08:53:56 DEBUG : 10M(0xc001195280): _readAt: size=16384, off=0

I would expect Windows to be doing something similar and if it isn't we need to tweak it until it does!

That wouldn't be so much of a problem if we weren't doing 8-byte reads and were doing 16k reads instead, which I think is the underlying cause of the problem.

That is puzzling!

I think you are right. But I think we need to encourage the OS to do it somehow. Maybe there is a WinFSP option which would help...

I just had a look through the code and I think we removed the option by accident in this commit

Before this commit we used to supply -o max_readahead=131072, but that seems to have been dropped.

Can you try adding -o max_readahead=131072 to your command line and see if that helps?

@albertony am I reading that change correctly? Did we used to supply max_readahead to Windows and now we aren't?

It seems so. I don't remember any possible reason. I guess testing with it explicitly set would give the ultimate answer.

I can see the reads going through on Windows 1:1 without buffering. I tried adding -o max_readahead=131072 attr_timeout=1 -o atomic_o_trunc, which is what went missing, but this made no difference at all :( So it looks like this option wasn't doing anything. I think we should probably put them back though @albertony - the attr_timeout and atomic_o_trunc are important.

It looks like the magic option is -o FileInfoTimeout=-1 which I got from this thread: How to increase the read size in windows? · Issue #307 · winfsp/winfsp · GitHub - can you give that a go @kalemas?

I'm not entirely sure what this option does but it seems to

  1. enable reading the file in chunks of 8k
  2. cache the file contents for a very short while (until it is closed)

This is the WinFSP source code which defines the options.

I haven't actually managed to find any docs on them!

Here is my test program which I ran on a 100M file with various sizes of reads and I looked at the -vv debug to see the read sizes there.
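Roughly speaking it is just a sequential read loop with a configurable block size - a simplified Go sketch of that kind of test (not the exact program) is:

// readtest reads a file sequentially in fixed-size blocks and reports the
// elapsed time, so the block size can be compared with the _readAt sizes
// shown in rclone's -vv log.
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"strconv"
	"time"
)

func main() {
	if len(os.Args) != 3 {
		log.Fatalf("usage: %s <file> <blocksize>", os.Args[0])
	}
	blockSize, err := strconv.Atoi(os.Args[2])
	if err != nil {
		log.Fatal(err)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, blockSize)
	total := 0
	start := time.Now()
	for {
		n, err := f.Read(buf)
		total += n
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
	}
	fmt.Printf("Read %d bytes in %v with %d byte reads\n", total, time.Since(start), blockSize)
}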

Or maybe you found some docs saying they don't work on Windows, @albertony?

That could be the case. I will look into it (but currently on vacation, not much at computer these days).

Here is the log with all options listed above. I see no difference :^( log.txt (3.4 MB)

I'm going to open a new issue on WinFsp. There is also a note on FUSE here, which I read as saying that read-ahead may not be exposed to the FUSE emulation layer of WinFsp.

Read-ahead is not available with the WinFsp passthrough test program either.

Sorry, there was a mistake in my WinFsp GitHub ticket (a double assignment, which I had checked before with no luck); with the correct -o FileInfoTimeout=-1 I got read-ahead! Speed in my cases is similar to the Samba share now.
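For reference, this is the mount command from above with the option added:

"C:\rclone.exe" mount :webdav: P: --read-only --volname "DATA_LOCAL" --webdav-vendor nextcloud --webdav-url "http://fs.domain.local:9999/store/data" --vfs-cache-mode full --vfs-cache-max-age 240h --file-perms 0777 --dir-cache-time 120s --network-mode -o FileInfoTimeout=-1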

So the problem looks to be resolved. Thank you!

Excellent - that is good news.

I think -o FileInfoTimeout=-1 should probably be the default, as I don't see why you wouldn't want readahead. We turn it on for Linux and macOS.

It said in the issue linked above that other values such as 1,000 should give readahead but I tried 1000, 1000000 and 1000000000 and none of them gave any readahead.

I shall put this on my list of things to ask Bill about when he is back at his desk.

@albertony thanks for looking at this. I've made a note that this needs looking at but it can wait - no problems.

Does this mean that for mounts under Windows we should add -o FileInfoTimeout=-1 to get the best performance?