Feature request regarding cache - more control than simple LIFO

ctlaltdefeat · May 13, 2019, 3:05pm

Use case: I'd like to cache important/high-priority chunks of files, such as those storing metadata, without them being evicted by an unrelated read that happens to push the cache to its size limit.

I've been trying to think about how this can be implemented.
One way is to add a flag to rclone's read operations (on cache remotes) that specifies whether this read is allowed to modify the cache at all; if the flag is set to not allow modification, the read can only read from the cache. A "serve" or "mount" would have to specify this flag globally, because its read operations are generally handled externally. The advantage of this method is that it seems to be, on the face of it, relatively simple conceptually and perhaps not difficult to implement.
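
To make the first idea concrete, here is a minimal sketch of a chunk cache whose reads can be told not to modify the cache. It is a toy model, not rclone's cache backend: the class `ChunkCache`, the `modify_cache` parameter, and the `fetch` callback are all hypothetical names chosen for illustration, and eviction is simply oldest-inserted-first.

```python
from collections import OrderedDict


class ChunkCache:
    """Toy chunk cache (hypothetical); evicts the oldest-inserted chunk when full."""

    def __init__(self, capacity, fetch):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity      # maximum number of chunks held
        self.fetch = fetch            # callable: key -> bytes (stands in for a remote read)
        self.chunks = OrderedDict()   # key -> bytes, oldest insertion first

    def read(self, key, modify_cache=True):
        if key in self.chunks:
            return self.chunks[key]   # cache hit: served locally, cache unchanged
        data = self.fetch(key)        # cache miss: go to the remote
        if modify_cache:
            if len(self.chunks) >= self.capacity:
                self.chunks.popitem(last=False)  # evict the oldest chunk
            self.chunks[key] = data
        return data                   # with modify_cache=False the data passes straight through
```

With `modify_cache=False`, a read can still be served from the cache, but a miss is fetched and returned without storing anything, so streaming a large file would leave previously cached chunks in place.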

Another way, which in some sense generalizes the first method above, is to specify a priority level on read operations (on cache remotes), such that the chunks read by an operation can only enter the cache if it's not full or if there are lower-or-equal-priority chunks that can be removed to make way. This would require storing a priority level with each cached chunk. Again, "serve" and "mount" would have to set the priority globally. The advantage here is flexibility and control; the disadvantage is that it certainly adds complexity. The current implementation of cache (to my understanding) corresponds to this scenario with one possible priority level, while the method above is roughly, though not exactly, like having 2 possible priority levels.
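
A rough sketch of the priority rule described above, under the same caveats: `PriorityChunkCache` and `offer` are made-up names, and this models only the admission and eviction decision, not how rclone stores chunks. A new chunk is admitted if there is free space, or if the lowest-priority (and, among ties, oldest) cached chunk has a priority lower than or equal to the new one.

```python
class PriorityChunkCache:
    """Toy priority-aware cache (hypothetical); higher number = more important."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = {}   # key -> (priority, insertion sequence number, data)
        self.seq = 0

    def offer(self, key, data, priority):
        """Try to admit a chunk; return True if it is cached afterwards."""
        if key in self.entries:
            old_priority, seq, _ = self.entries[key]
            # Re-reading a chunk never lowers its priority.
            self.entries[key] = (max(old_priority, priority), seq, data)
            return True
        if len(self.entries) >= self.capacity:
            # Candidate victim: lowest priority first, oldest first among ties.
            victim = min(self.entries, key=lambda k: self.entries[k][:2])
            if self.entries[victim][0] > priority:
                return False          # everything cached outranks the new chunk
            del self.entries[victim]
        self.seq += 1
        self.entries[key] = (priority, self.seq, data)
        return True
```

With a single priority level this degenerates to plain oldest-first eviction, which is the sense in which the current behaviour would be a special case.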

I'm interested to hear your thoughts. I believe this would certainly be useful!

calisro · May 13, 2019, 5:04pm

I've wanted this before. I've thought about using a filter to determine what files get kept and not aged.

ncw · May 13, 2019, 5:43pm

Some sort of cache aging which rates bigger files as more likely candidates might do it.

So maybe put stuff into a small bucket and large bucket, and age the large bucket first when trying to make cache space.
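
One possible reading of the two-bucket idea, as a sketch only: entries are split by a size threshold, and when space is needed the large bucket is aged (oldest first) before the small bucket is touched. The names `TwoBucketCache`, `max_bytes`, and `small_limit` are assumptions made for this example.

```python
from collections import deque


class TwoBucketCache:
    """Toy size-bucketed cache (hypothetical); tracks names and sizes only."""

    def __init__(self, max_bytes, small_limit):
        self.max_bytes = max_bytes      # total space budget
        self.small_limit = small_limit  # entries up to this size go in the small bucket
        self.small = deque()            # (name, size), oldest at the left
        self.large = deque()
        self.used = 0

    def add(self, name, size):
        """Add an entry and return the names evicted to make room."""
        bucket = self.small if size <= self.small_limit else self.large
        bucket.append((name, size))
        self.used += size
        evicted = []
        while self.used > self.max_bytes and (self.large or self.small):
            # Age the large bucket first; fall back to the small one only when it is empty.
            victims = self.large if self.large else self.small
            victim, victim_size = victims.popleft()
            self.used -= victim_size
            evicted.append(victim)
        return evicted
```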

ctlaltdefeat · May 13, 2019, 6:08pm

I'm not sure the size of a file has any correlation with how important it is to keep in the cache, at least not in a typical use case.

I'm imagining a sort of scenario that is probably very common among rclone users: having a whole bunch of fairly large media files in a remote. You'd like certain queries about each and every one of your media files to return very quickly, such as which audio/video/sub tracks it contains, what resolution it's in, etc. (results of ffprobe and/or mediainfo), and you wouldn't want this cached info to get overwritten by streaming a certain media file.

calisro · May 13, 2019, 6:19pm

Same in one of my use cases. For example, I'd like to keep all jpg/nfo files on local disk and all larger files on the remote. In my use case, something simple like --vfscache-include-filter=**/*.{nfo|jpg} (or similar) would be easy.
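
As an illustration of the filter idea, here is a small sketch that decides which cached paths are exempt from aging. It does not use any rclone flag or API (the flag above does not exist today); `KEEP_PATTERNS`, `is_pinned`, and `pick_victims` are hypothetical names, and matching is done with Python's `fnmatch` rather than rclone's filter syntax.

```python
from fnmatch import fnmatchcase

# Assumed patterns for files that should never be aged out of the cache.
KEEP_PATTERNS = ["*.nfo", "*.jpg"]


def is_pinned(path, patterns=KEEP_PATTERNS):
    """Return True if the path matches any keep pattern (case-insensitive)."""
    name = path.lower()
    return any(fnmatchcase(name, pattern) for pattern in patterns)


def pick_victims(cached_paths, patterns=KEEP_PATTERNS):
    """Return the cached paths that are eligible for eviction, in their original order."""
    return [p for p in cached_paths if not is_pinned(p, patterns)]
```

For example, `pick_victims(["a/x.nfo", "a/x.mkv", "b/y.jpg"])` leaves only the `.mkv` entry as an eviction candidate.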

ncw · May 13, 2019, 7:20pm

That sounds plausible!

Unfortunately I don't really have a maintainer for the cache backend at the moment. remusb wrote it but I haven't heard from him in a while.

ctlaltdefeat · May 13, 2019, 8:26pm

Unfortunately the file filter doesn't work in the case I described above.

calisro · May 13, 2019, 8:34pm

How do you select what needs to be kept?

ctlaltdefeat · May 13, 2019, 9:12pm

As described, by adding a flag to operations that describes whether the operation is allowed to modify the cache or not.

So I can cache .nfo and .jpg files by doing something like

rclone cat foo.jpg

but then make sure I'm not modifying the cache when streaming a large file by doing

rclone cat large.mkv --dont-modify-cache

However, I can also do more complicated scenarios where I want certain chunks of files saved by doing

ffprobe large.mkv

on a mounted remote to save that info into cache, but then

rclone cat large.mkv --dont-modify-cache

to stream the file without modifying the cache.

Animosity022 · May 13, 2019, 9:20pm

Do you normally see people play the same things over and over? In my Plex setup, it's very rare for anyone to play the same thing that often (Minus GoT lately) to warrant keeping it local.

An ffprobe takes a few API hits and transfers very little data, and it runs in 1-2 seconds at most, so I don't see the case for keeping copies of 40-50GB MKVs.

My only use case for a Cache would be to help with buffering, but I normally don't see that either.

ctlaltdefeat · May 13, 2019, 9:28pm

To clarify, keeping entire files local is exactly what I'm trying to avoid! I am trying to cache just the chunks necessary for the ffprobe command for each media file.
This can already be done by deleting the cache and running a script that runs ffprobe on each media file, but then, upon playing a file, all of that gets evicted to make room for the content of the media file.
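
A warm-up script of that kind might look roughly like the following sketch. It assumes the remote is mounted locally and that `ffprobe` is on the PATH; the extension list and the function names `find_media` and `warm_cache` are assumptions made for this example.

```python
import subprocess
from pathlib import Path

# Assumed set of media extensions; adjust to taste.
MEDIA_EXTS = {".mkv", ".mp4", ".avi"}

# ffprobe invocation that prints container and stream info and hides other log output.
DEFAULT_PROBE = ("ffprobe", "-v", "error", "-show_format", "-show_streams")


def find_media(root, exts=MEDIA_EXTS):
    """Return all media files under root, sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in exts
    )


def warm_cache(root, probe_cmd=DEFAULT_PROBE):
    """Probe every media file under root; return the paths whose probe failed."""
    failed = []
    for path in find_media(root):
        result = subprocess.run(
            [*probe_cmd, str(path)],
            stdout=subprocess.DEVNULL,   # only the reads matter here, not the output
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failed.append(path)
    return failed
```

Running `warm_cache("/path/to/mount")` reads just the parts of each file that ffprobe needs, which is what pulls those chunks into the cache.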

As for the amount of time an ffprobe takes remotely, I suppose that if you have low ping to the servers of your remote it could take 1-2 seconds, but in my case it's more like 5-10 seconds. That can be somewhat annoying, as quite a few media organization systems run such commands (as do I, sometimes) to figure out metadata about the media without playing it.

Animosity022 · May 13, 2019, 9:30pm

Plex/Emby usually do that only once though, so it's not really a huge reason to keep those pieces local imo.

What's the backend cloud source? 5-10 seconds seems quite long for ffprobe.

ctlaltdefeat · May 13, 2019, 9:37pm

I'm using Google Drive, have around 80ms ping and 50 Mbps download from their servers, and it can take 5 seconds. Most media players that I'm familiar with run an equivalent to ffprobe before playback even begins, so that's a 5 second delay for any user before playback can start.

Anyway, I think we're being somewhat sidetracked. The aim of the cache, as far as I understand it, is to allow common queries on the remote to be completed quickly by storing some of the data locally. I think it's fair to say that the current method, while certainly useful in some cases, is rather simplistic: a simple LIFO when the cache is full doesn't always capture how important a chunk is. It would be useful to give users more control over what, how, and when things get cached.

Animosity022 · May 13, 2019, 11:22pm

You might want to start a different thread, as that seems a bit much.

Neither Plex nor Emby probes the media after it's added to the library, assuming it was analyzed on add, which is the default for both of them. After a file has been added, they just check the size/timestamp, validate that it's the same, and do their thing.

[felix@gemini Avengers Age of Ultron (2015)]$ time mediainfo Avengers\ Age\ of\ Ultron\ \(2015\).mkv | tail -1


real	0m1.538s
user	0m0.091s
sys	0m0.021s

and ffprobe is super fast.

time ffprobe Avengers\ Age\ of\ Ultron\ \(2015\).mkv | tail -1

real	0m0.819s
user	0m0.071s
sys	0m0.009s

ctlaltdefeat · May 13, 2019, 11:56pm

That's right, Plex and Emby probe it only once. I'm mainly concerned about the delay before playback starts. Regarding the 1 second you get rather than my >5: if I run ffprobe on a server with a low ping to Google's servers and a fast connection, then I also get around 1 second. So there's no doubt that under certain scenarios it can be manageable.
But under normal conditions, at home, it's over 5 seconds. This is despite the connection being comfortably fast enough to stream everything once playback has started.
