A short primer on pre-caching (or pre-listing) technique

thestigma · December 1, 2019, 4:16pm

@Monarch_de_Boulogne requested that I share some of the information I gave him via PM so that it might benefit other users. At some point I will try to aggregate more of this information and practical guides into some sort of megathread and/or wiki article, but for now I will just leave this here for those who are interested.

If anyone has follow-up questions on this topic I will also take the time to reply here. Maybe I can use this as a reference in the future for others who are curious about the topic...

On another topic

Thank you for the kind words - and it will be my pleasure to elaborate.

First of all - what is precaching or "pre-listing" as would be more accurate...?
It is a method by which we ask the cloud-drive at startup to tell us about all the files on the disk up-front, rather than asking "what is in this directory?" each time we open a new one while we navigate around in the hierarchy.

Why is this useful?
Asking the server to list a folder takes time, because there is latency over the internet plus some latency for the server to perform the request and send the results back. This is why navigating a cloud-drive feels sluggish and it takes half a second or a full second to open folders. This is also why searching a cloud-drive takes ages. Pre-fetching this information to local RAM, however, makes these tasks instant or near-instant. Opening folders will happen as fast as on an SSD, and searching through several TB of data may only take a few seconds rather than 15-20 minutes (actually much faster than the best SSD).

How is it done?
First we need to run the rclone Remote Control (RC), either as a standalone process - rclone rcd - or in addition to another process (typically an rclone mount command) by using the flag --rc .
We then ask the remote-control to fetch us a full list of all files with vfs/refresh . Doing it this way rather than through the mount itself (which you also could do) will allow us to make use of --fast-list . Without going into too much technical detail - this is a recursive list. Basically, instead of asking for the contents of each folder one-at-a-time we ask the server to give us everything all at once. This is as much as 15x faster than the normal method. For reference, it takes me less than a minute to pre-list 4TB of data in a complicated and messy hierarchy.
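As a rough illustration of the second step, here is a minimal Python sketch that builds the remote-control call. It assumes a mount is already running with --rc , and it passes recursive=true to vfs/refresh so the whole tree is listed; that parameter is not mentioned above and is taken from the rclone rc documentation as I understand it, so check it against your version. The helper names refresh_command and run_refresh are made up for this example.

```python
import shlex
import subprocess


def refresh_command(user=None, password=None, url=None):
    """Build the `rclone rc vfs/refresh` call that asks for the whole tree."""
    # recursive=true is an assumption taken from the rclone rc docs.
    cmd = ["rclone", "rc", "vfs/refresh", "recursive=true"]
    if url:
        cmd += ["--url", url]
    if user:
        cmd += ["--user", user]
    if password:
        cmd += ["--pass", password]
    return cmd


def run_refresh(**kwargs):
    """Run the refresh and return whatever rclone prints.

    Blocks until the listing is finished and needs a mount started with --rc.
    """
    completed = subprocess.run(
        refresh_command(**kwargs), capture_output=True, text=True, check=True
    )
    return completed.stdout


# Only show the command here; call run_refresh() once the mount is up.
print(shlex.join(refresh_command()))
```

Calling run_refresh(user="...", password="...") would be the equivalent for a remote control that was started with --rc-user and --rc-pass .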

What are the requirements, restrictions and downsides?

For a pre-listing to be practical, you usually want to run it on a backend (cloud provider) that supports polling . Polling is rclone's term for the feature often referred to as " changenotify ". This feature is pretty important because without it your pre-listing will only be a still-image of what the drive looked like at the time you ran it - which is only useful up until the point where the contents have changed (after that you won't see the new stuff). With polling/changenotify however, rclone will get a message saying "Hey, I just got a new folder with these files" - and rclone will go "Cool, I will add this info to my local cache so I am up-to-date". Polling happens by default at 1min intervals (can be set lower too), so this basically gives you all the performance benefits with none of the downsides.
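To make the "still-image" point concrete, here is a toy Python model of a cached listing. This is purely illustrative and is not how rclone stores its cache internally; ListingCache , prelist and apply_change are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ListingCache:
    """Toy stand-in for a cached directory listing (not rclone's real structure)."""

    dirs: dict = field(default_factory=dict)

    def prelist(self, snapshot):
        # One-time full listing: copy what the remote looked like at startup.
        self.dirs = {path: set(names) for path, names in snapshot.items()}

    def apply_change(self, directory, added=(), removed=()):
        # A change notification only touches the directory that changed.
        entries = self.dirs.setdefault(directory, set())
        entries.update(added)
        entries.difference_update(removed)


remote = {"": {"photos"}, "photos": {"a.jpg"}}
cache = ListingCache()
cache.prelist(remote)

# Someone else uploads b.jpg. Without polling, the cache never hears about it.
remote["photos"].add("b.jpg")
stale = "b.jpg" not in cache.dirs["photos"]

# With polling, the notification is applied and the cache is current again.
cache.apply_change("photos", added={"b.jpg"})
fresh = "b.jpg" in cache.dirs["photos"]

print(stale, fresh)
```

The point of the model is only that a small update replaces a full re-listing, which is why polling lets the cache live for a long time.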

Not all backends support --fast-list either (or recursive list as it may be called). But if a backend doesn't support this then it almost certainly doesn't support polling either... If you tell me your provider I can simply check this for you. (There is a feature-list of providers somewhere in the rclone.org documentation.)

While it is not technically a requirement, you probably want to run the Remote Control (RC) so we don't have to list 15x slower. This only takes like 5MB and is pretty easy to set up, so it is not a problem.

It won't work instantly if you just started the drive. It takes some time (as I've already indicated) to finish the listing. However, you can use the drive normally while this happens. You just won't have the speed benefits until it is complete. Not much of an issue. The benefits will stay as long as the drive remains mounted.

It takes some RAM to store this listing information. For reference it takes a little over 200MB of RAM for me to pre-list about 90,000 files. The RAM requirements will scale roughly linearly with the number of files, which you can easily check via rclone size remotename:
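As a back-of-the-envelope sketch of that linear scaling, the snippet below extrapolates from the one data point above (about 200MB for about 90,000 files, so roughly 2.2KB per file). It assumes the relationship really is linear and that your files resemble mine, so treat the result as a ballpark only; estimate_ram_mb is a made-up helper name.

```python
REFERENCE_FILES = 90_000   # file count from the data point above
REFERENCE_RAM_MB = 200     # approximate RAM used for that listing


def estimate_ram_mb(file_count):
    """Linear extrapolation of listing RAM from the single reference point."""
    return file_count * REFERENCE_RAM_MB / REFERENCE_FILES


# Example: a remote with half a million files.
print(round(estimate_ram_mb(500_000)))  # about 1111 (MB)
```

The file count to plug in is the number of objects reported by rclone size remotename: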

When it comes to searching, pre-listing only helps as long as you only search for basic attributes (the ones that the provider stores). This usually means: size, name, last-modified and created-time. If you search for extended metadata, for example thumbnails for pictures, the running time of a movie or the artist in an MP3, rclone will have to actually check the file for that info because it is not data it got in the listing. This takes much, much longer. You would be well advised to disable the displaying of such information by default in whatever program you use to view your files. For example, in Windows Explorer you would want to "optimize folders for documents" so that it only shows basic attributes by default. You can always re-enable it for specific folders if you need to (at the cost of that folder being slower to open and search).

Lastly, an important note. I do not recommend using pre-listing in one very specific scenario unless you are aware of the potential risks:
If your drive is accessed by multiple users who write files to it
and if one of the people uploading files does so via the mount
and if these files get uploaded to the same place with the same filenames
then under very specific circumstances it could potentially corrupt a file.
Specifically, what would need to happen is that someone changes the size of a file, then the person uploading via the mount tries to change that same file before the change is caught by polling, and it must be one of the specific operations that rclone cannot detect by itself.

So this is a very specific scenario that will only apply to a few people and has a very low risk of ever happening - but it is something you need to be aware of and understand, because I do not want to be responsible for you losing any data due to me not informing you - even if it is very unlikely to happen. If you only read data, or you don't upload via the mount, or you only upload to the drive from one place at a time, or you don't upload to the same area/filenames at the same time - then there will be no risk. This isn't something to be afraid of - just be conscious of it - and if you have a scenario that you are not sure is safe, then just ask.

Class over! (this is what happens when you ask me to elaborate on something lol, so sorry...)
Questions?

thestigma · December 1, 2019, 4:24pm

One thing I forgot to mention above is that these two flags will also be required for the mount command:

--attr-timeout 8760h
--dir-cache-time 8760h

Usually these values are quite low, invalidating the cache after a short time. They are low by default because not all backends have polling, but on a backend with polling we can keep the cache data indefinitely and rely on updates to keep the information fresh instead of re-gathering the full listing every time we need it.
The specific numbers here are just an example - I've set them to a year to effectively "disable" them. Without setting these timers to a high value it won't make much sense to pre-list, as we would just evict most of that information from the cache very soon thereafter (1 second default for attr and I think 5 min default for dir).

attr refers to caching file-attributes (size, modtime, name and createdtime for most backends).
dir refers to caching the directory information/structure.
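For anyone who wants to see the pieces together, here is a small Python sketch that assembles a mount command with both timers set to the one-year example value, and confirms that 8760h is indeed 365 days. The remote name and mount point are placeholders, and mount_command is a hypothetical helper written for this post, not part of rclone.

```python
HOURS_PER_DAY = 24
ONE_YEAR_HOURS = 8760


def mount_command(remote, mountpoint, cache_hours=ONE_YEAR_HOURS):
    """Build an rclone mount call with long-lived attr and dir caches."""
    ttl = f"{cache_hours}h"
    return [
        "rclone", "mount", remote, mountpoint,
        "--rc",                    # run the remote control alongside the mount
        "--fast-list",             # lets the recursive refresh list in bulk
        "--attr-timeout", ttl,     # keep file attributes cached
        "--dir-cache-time", ttl,   # keep the directory structure cached
    ]


print(ONE_YEAR_HOURS // HOURS_PER_DAY)  # 365
print(" ".join(mount_command("remotename:", "Y:")))
```

Any sufficiently large value would do the same job; a year is just an easy way of saying "never expire while the mount is up".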

@Monarch_de_Boulogne This information is relevant to you as I've forgotten to tell you this before, sorry

whiteloader · December 9, 2019, 7:09pm

I am trying to follow your instructions, please correct me if something is wrong or not ideal:

Firstly, start the mount with the RC:
(NB: Windows Powershell)

.\rclone mount cryptomator: Y: -vv --rc --attr-timeout 8760h --dir-cache-time 8760h --rc-user=time --rc-pass=xxx --rc-web-gui --fast-list

then, open a new shell and trigger the remote control:

.\rclone rc vfs/refresh --user time --pass xxx

is that more or less the idea?

How do I know it's working? I can't really tell unfortunately.

thestigma · December 9, 2019, 9:43pm

I think the answer from the RC may be silent unless you add -v to the remote control line. But then it should answer "OK" when the operation is complete.
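If you want to check the answer from a script, something like the Python sketch below could work. It assumes the remote control replies with JSON shaped like {"result": {"": "OK"}} (one entry per refreshed path), which is my reading of the rclone rc documentation rather than something confirmed in this thread, so verify it against the output of your own version. refresh_succeeded is a made-up helper name.

```python
import json


def refresh_succeeded(rc_output):
    """Return True if every refreshed path in the reply reports "OK".

    Assumes the reply is JSON of the form {"result": {"<path>": "<status>"}}.
    """
    result = json.loads(rc_output).get("result", {})
    return bool(result) and all(status == "OK" for status in result.values())


# Sample reply in the assumed shape, not captured from a real run.
sample = '{"result": {"": "OK"}}'
print(refresh_succeeded(sample))
```

Another practical check is simply to open a deep folder or run a search on the mount after the call returns and see whether it feels instant.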

I would recommend you just wait and get those scripts I promised you here:

That should be ready-made to work in a fairly intuitive way.
I know I am late on this, but I have not forgotten. I've had to deal with some health issues the last few days. Starting to feel a little better now though. I will make an effort to get it done tomorrow - and if for some reason I am unable to do the things I wanted I'll just dump you the scripts I use as-is. They work fine already but I wanted to clean up some rough edges so they are easier to use on any system - not just my specific setup.

whiteloader · December 9, 2019, 9:50pm

no worries, it's not urgent, i just got curious
Please don't force it in regards to your health – take your time and get well soon!

I agree it's better to have the "nice" version, even if it means waiting some extra days or weeks. I know all of you guys are doing it in your spare time and I appreciate it.