Testing for new vfs cache mode features

Yes, that's my understanding as well, but when a file is evicted from the cache, eviction is based on the oldest created file, not the oldest chunk, and it will remove the entire file, not just some of its chunks. So if a file is opened for read more than once, the eviction time will be based on the initial open of the file, regardless of whether those earlier chunks are still being accessed. At least that's my understanding from @ncw's comments in this thread.


The issue (as I understand it) is that it's not storing individual chunks, but individual files. It therefore has no per-chunk metadata (and it would be overly expensive to maintain such metadata), so the only evictable object is the file, and the only real metadata it has to work with is the file system's last access time.

It would have to store the files as individual chunks to enable per-chunk eviction (personally I think that's the way it should go, but I don't have time to even look at how to do that ATM, so beggars can't be choosers).

Thanks for the log :slight_smile: I managed to work out what was happening from the traceback.

It looks like this will only be triggered when a file is modified while it is being uploaded.

I think I've fixed it here (but I haven't managed to reproduce locally) - can you try?

https://beta.rclone.org/branch/v1.52.1-082-g017d7a53-vfs-beta/ (uploaded in 15-30 mins)

The files are evicted based on last access time not creation time. So the first file to be evicted will be the one which hasn't been accessed for the longest time.

The whole file (all the chunks in the sparse file) will be evicted at that point.

We do have a record of exactly which chunks are in the file, so it would be possible to evict chunks of the file based on their access time. However, that is a lot more complex and I don't want to go there unless absolutely necessary!
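
To illustrate the whole-file approach (this is just a sketch, not rclone's actual code; the directory path, size limit and function names are made up, and last access time is approximated with the modification time):

package main

import (
	"log"
	"os"
	"path/filepath"
	"sort"
	"time"
)

type cachedFile struct {
	path  string
	size  int64
	atime time.Time // stand-in for last access time; the real cache tracks this itself
}

// evictUntilUnder removes whole cached files, least recently accessed first,
// until the total size of everything under cacheDir is at most maxSize.
func evictUntilUnder(cacheDir string, maxSize int64) error {
	var files []cachedFile
	var total int64
	err := filepath.Walk(cacheDir, func(p string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		files = append(files, cachedFile{p, info.Size(), info.ModTime()})
		total += info.Size()
		return nil
	})
	if err != nil {
		return err
	}
	// Oldest access first.
	sort.Slice(files, func(i, j int) bool { return files[i].atime.Before(files[j].atime) })
	for _, f := range files {
		if total <= maxSize {
			break
		}
		// The whole sparse file (all of its chunks) goes at once.
		if err := os.Remove(f.path); err != nil {
			return err
		}
		total -= f.size
	}
	return nil
}

func main() {
	// Hypothetical cache directory and a 10 GiB limit, purely for illustration.
	if err := evictUntilUnder("/tmp/rclone-cache-sketch", 10<<30); err != nil {
		log.Fatal(err)
	}
}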

Gave this a try again, and I'm still seeing the issue. I have however simplified the test case and found a few more details on when exactly this seems to error.

The command I've run is as follows:

cp testfile.mp4 /data/rclone-beta/testdir/ && sleep 8 && file /data/rclone-beta/testdir/testfile.mp4 && sleep 1 && file /data/rclone-beta/testdir/testfile.mp4

Sleeping for 8 seconds puts me at the point where the file is uploading; the first invocation of file then interrupts the upload, but it is still able to read the file. Waiting a second and attempting to read the file again is where it deadlocks.

Log: https://gist.github.com/dotsam/3f43c64e8486ec8be0f67035ad87d90d/raw/36fa7ccad4c73507420e723074ce8802cee897e4/vfsdebug.log

Thanks for your repro - very useful!

I've managed to fix this twice!

  • fix the deadlock (again!)
  • stop rclone cancelling the uploads if you don't modify the file

This patch also

  • cancels uploads earlier if we do modify the file (before we close the file)
  • fixes a double upload bug that appeared once I'd fixed the deadlock!
  • allows ReadAt and WriteAt to run concurrently with themselves, which should speed things up for some workloads (see the sketch below)
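
A sketch of the kind of locking pattern that makes the last point possible (not rclone's actual code; the type and field names are invented for illustration). The data I/O goes straight to positional ReadAt/WriteAt calls on the backing file, which are safe to issue concurrently, and only the shared metadata is protected with a read/write mutex:

package vfssketch

import (
	"os"
	"sync"
)

type cacheItem struct {
	mu      sync.RWMutex    // guards present only, never held across I/O
	fd      *os.File        // sparse backing file in the cache
	present map[int64]int64 // offset -> length of ranges already downloaded
}

// ReadAt: many of these can run at once; we only take a read lock to
// consult the metadata, then do the positional read with no lock held.
func (c *cacheItem) ReadAt(p []byte, off int64) (int, error) {
	c.mu.RLock()
	_ = c.present // real code would check the range is present and fetch it if not
	c.mu.RUnlock()
	return c.fd.ReadAt(p, off)
}

// WriteAt: the positional write is also done without the lock; only the
// brief metadata update takes the exclusive lock.
func (c *cacheItem) WriteAt(p []byte, off int64) (int, error) {
	n, err := c.fd.WriteAt(p, off)
	c.mu.Lock()
	c.present[off] = int64(n)
	c.mu.Unlock()
	return n, err
}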

Testing appreciated :slight_smile:

https://beta.rclone.org/branch/v1.52.1-097-g8578d214-vfs-beta/ (uploaded in 15-30 mins)


Build for some platforms seems to be broken again.

I broke some stuff! I'll fix it tomorrow morning (it is late here!)


Just for info: Windows binaries are still missing in the latest vfs beta :slight_smile:

Playing with the betas. So far no issues with daily use on 1G fibre with local server. Very much looking forward to seeing this in mainline betas :+1:

As a tangential aside: When I see people's mounts I am often astonished at how different their flags are (while understanding that different HW, BW and use-cases have different demands).

Even more amazing is how resilient rclone is to give good results with such variations in setups. As an example, if you compare rootax's and dotsam's posted mounts:

Same flags
--use-mmap
--fast-list

Same flags, different value rootax
--poll-interval 60s
--vfs-cache-mode full
--vfs-cache-max-size 200G
--drive-disable-http2=true
--attr-timeout 8700h
--dir-cache-time 2h
--transfers 8

Same flags, different value dotsam
--poll-interval 24h
--vfs-cache-mode writes
--vfs-cache-max-size 10G
--drive-disable-http2=false
--attr-timeout 0
--dir-cache-time 48h
--transfers 6

Unique flags rootax
--cache-dir
--stats 10s
--buffer-size 16M
--vfs-cache-poll-interval 30s
--async-read=true
--vfs-read-wait 5ms
--vfs-write-wait 10s
--vfs-read-ahead 16M
--vfs-read-chunk-size 16M
--vfs-read-chunk-size-limit 1G
--local-no-check-updated
--drive-chunk-size=256M
--multi-thread-streams=4
--multi-thread-cutoff 250M
--vfs-case-insensitive

Unique flags dotsam
--allow-other
--umask 002
--vfs-cache-max-age 24h
--rc
--log-level INFO
--gid 1001
--uid 1000
--drive-skip-gdocs
--drive-use-trash=false

Animosity
--allow-other
--dir-cache-time 1000h
--log-level INFO
--poll-interval 15s
--umask 002
--user-agent animosityapp
--rc
--rc-addr :5572
--vfs-read-chunk-size 32M

No judgment intended. Simply interesting.
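
Purely as an illustration of how some of these pieces fit together (the remote name, mount point and values here are placeholders, not a recommendation), a mount combining a few of the common flags above might look like:

rclone mount remote: /mnt/remote \
  --allow-other \
  --umask 002 \
  --vfs-cache-mode full \
  --vfs-cache-max-size 100G \
  --vfs-cache-max-age 24h \
  --dir-cache-time 48h \
  --poll-interval 60s \
  --buffer-size 16M \
  --use-mmap \
  --log-level INFO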

Still fixing stuff!

If the binaries are missing for an architecture it usually means the tests didn't pass for that architecture or the CI blew up. In this case the tests cratered really badly :frowning:

Great :slight_smile: Thank you for testing!

That reminds me, --vfs-cache-mode full will work best with some buffer but not too much. 16M is probably about right. I think it probably doesn't work very well at all with 0 buffer.

I've set --buffer-size 128M for the past few days of testing. It works well.
But when I change my emby setup I would prefer a much bigger buffer, between 512M and 1G. Has anyone tested a buffer this large?


If you are using the new --vfs-cache-mode full I don't think you'll need to set it that big unless you are really paranoid about dropouts. 512M is a lot of streaming time. Note also that rclone may open multiple read points in the file so it can (temporarily) exceed the buffer size.
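
To put a rough number on that (my own back-of-the-envelope arithmetic, assuming a fairly typical 10 Mbit/s video stream): 512M is about 4096 megabits, so roughly 4096 / 10 ≈ 410 seconds, or nearly 7 minutes of playback held in the buffer; lower-bitrate files would buffer proportionally more time.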

Could the size of the buffer be dynamic, depending on the size of the file being opened? In many cases that could help avoid unnecessary downloads of large amounts of data for files with a small bitrate, which are usually much smaller than those with a large bitrate.

The buffer dynamically extends in 1MB chunks, so it will only be as big as needed, up to --buffer-size.


With a buffer at 16M I see downloads of around 2 GB/sec, so I think it's enough, indeed.

Here is a new beta - built and ready for testing! (The Windows 64-bit binary is still building, as the CI failed to link the code for some random reason known only to itself, so I had to run it again!)

https://beta.rclone.org/branch/v1.52.1-102-ge3c216e9-vfs-beta/

It has the same changes as the previous one but runs through the continuous integration tests now!

Debugging this highly concurrent code is really difficult! I spent most of the day and some of yesterday tracking down the final problem, staring at the screen feeling really stupid, before I remembered I could use git bisect and managed to fix the problem in about 10 minutes :frowning: I've now put so much debug in that it will probably collapse under its own weight forming a small debug black hole.
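
For anyone who hasn't tried it, a typical bisect session looks roughly like this (the version tag is just an example):

git bisect start
git bisect bad                # the commit you are on shows the bug
git bisect good v1.52.0       # the last version known to work
# git checks out a commit roughly halfway between; build it, test it, then report:
git bisect good               # or: git bisect bad
# repeat until git names the first bad commit, then clean up with:
git bisect reset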

Things I have on my todo list before merging

  • write the final batch of tests for the writeback code
  • rework the "Fingerprint" code which detects whether files have changed remotely
  • work out what happens with --buffer-size 0
  • write lots of docs
  • Make sure Attr doesn't block when uploading files

Things to think about next

  • If we download the file straight through, check the hash then, or maybe check once the file is complete
  • expiring open files so we can always stay below the cache size limit
  • using the concurrent readers in --vfs-cache-mode off/writes, which will improve latency greatly

Did I forget anything?


Some stats about cache hits? :)

EDIT: seems the Windows 64-bit build failed again, with more than 30 minutes between the last good compiled binary and now...

Thanks Nick! This has solved my test case, but I'll keep putting it through its paces. Caching is just corner-cases all the way down!

Here is a visual showing the impact of switching to the new VFS cache on my Plex environment. It is a huge improvement, and so far I have not hit any API issues. Using the latest build 1.52.1-102 on Ubuntu 18.04.
