Slow Library Scans - Remote Mount

Hi Everyone

I have a two-server setup for Plex: a Feeder Machine that uploads content to a GDrive, and a Media Machine that has that drive mounted and reads from it.

The problem with the above is Plex scans take 2+ hours.

I can't implement dir-cache-time on the Media Machine, because it then won't immediately detect any changes made by the Feeder Machine.

Is it possible to implement dir-cache-time but have any changes in the directory structure detected by the Rclone mount and updated in the dir-cache?

Example

Loaded into Media Machine cache:
Show/Season/Episode 1 Old
Show/Season/Episode 2 Old
Show/Season/Episode 3 Old

File Updated on Gdrive by Feeder Machine
Show/Season/Episode 3 Old >> Show/Season/Episode 3 New

Media Machine Cache is updated with only the detected change
Show/Season/Episode 1 Old
Show/Season/Episode 2 Old
Show/Season/Episode 3 New

This way the Plex scans would still be read from memory and be a lot faster?

dir-cache-time doesn't impact polling on another machine. The polling interval is 1 minute by default so any changes are detected by polling.

You should still run a long dir-cache-time.

You could even move the polling interval down to pick up changes quicker:

  --poll-interval duration                 Time to wait between polling for changes. Must be smaller than dir-cache-time. Only on supported remotes. Set to 0 to disable. (default 1m0s)
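Put together on the mount side, that might look like the following sketch (the remote name, mount point, and exact values are illustrative assumptions, not settings quoted from this thread):

```shell
# Sketch: keep directory listings cached for a long time,
# but poll Gdrive for changes every minute.
# "gdrive:" and /mnt/media are placeholder names.
rclone mount gdrive: /mnt/media \
  --dir-cache-time 96h \
  --poll-interval 1m \
  --daemon
```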

A 2+ hour Plex scan means things probably aren't analyzed yet, and you need to let the first scan run so it can analyze all the files and get all the metadata.

Firstly thanks for assisting so quickly!

So just to make sure I'm explaining myself correctly and that I understand.

Machine 1 = Processes files and uploads them to GDrive
Machine 2 = Has an Rclone mount of the above GDrive and Plex streams from it

If I implement dir-cache-time of 96hrs on Machine 2 doesn't that mean that the mount on Machine2 won't detect any new uploads or file updates by Machine 1 for 96 hours?

What I want is for Machine 2 to store the directory in memory as a cache (for faster scans) but update the cache regularly with any changes detected on the mount- made by Machine 1

Nope. It's impacted by polling time only. Changes are detected within 1 minute by default.

You can remove the number of machines from the question. The way it works is that it polls for changes. If changes are detected, it invalidates the cache and does a new request for that particular file. It's only grabbing the metadata information from the directory structure.

If something is taking 2+ hours on a scan, it's not a cache issue. It is analyzing new files, and that needs to finish. Once the initial analysis is complete, it should be quick (a minute or two).

Nah, I've checked: all my files are analyzed (I ran a bash script to verify) and I've completed multiple scans.

When I turn on dir-cache-time, the scan goes from hours to minutes.

I was just concerned about the cache not showing any new files added to the cloud folders for 96 hours.

If polling checks the cloud folders and updates the cache every minute - then that solves my problem!

Remove Plex from the equation and run a timed find.

I just did a test mount and a fresh find.

There is some dependency on the number of files/folders you have and how it's set up, but 2 hours is way too long for a fresh directory listing.

I usually prime my directory cache with an rc command after I mount. This basically does a find, but it can use fast-list automatically and completes much faster.

/usr/bin/rclone rc vfs/refresh recursive=true
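For that rc call to reach the mount, the mount has to be started with the remote control API enabled; a minimal sketch (remote name and paths are placeholder assumptions):

```shell
# Start the mount with the remote-control API enabled
# ("gdrive:" and /mnt/media are placeholders)
rclone mount gdrive: /mnt/media --rc --dir-cache-time 96h --daemon

# Prime the directory cache; _async=true returns immediately while
# the recursive listing keeps running in the background
rclone rc vfs/refresh recursive=true _async=true
```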

I have this as number of directories/files:

felix@gemini:/GD$ find . -type d | wc -l
3492
felix@gemini:/GD$ find . -type f | wc -l
30619

I wonder if you hit the situation where the cache time expires (the default is 5 minutes) and it loops a bit. I haven't tested that in quite some time since my library has grown.

What's the size of your library and number of directories/files?

My fresh find took about 13 minutes to finish:

felix@gemini:/Test$ time find . | wc -l
34111

real	13m24.212s
user	0m0.117s
sys	0m0.154s

I'm in the middle of a scan at the moment- so I'll have to wait for it to finish before running find.

I had the same problem with Emby. Scans were taking hours. Implemented dir-cache-time of 96hrs and the scan dropped down to 3 minutes!

If dir-cache-time still polls the cloud directory and updates the cache directory every minute, then that should solve my problem.

But out of curiosity, I'll run a timed find in the morning and see what result I get.

dir-cache-time doesn't poll anything. It's the amount of time the directory and file structure stays in memory before being invalidated.

So if you list a directory, it checks the cache, if invalid, it does an API call to get a new listing.

Polling is what checks the remote for changes. If it detects changes, it invalidates the directory, which, when you next list that directory, causes a new API call to get a fresh listing.

And the polling will update a specific file or directory reference in the cached directory listing whenever it detects a change? And not wait 96 hours to update the cache listing?

Yes, you do not wait 96 hours, as polling picks up the change.

Think of it like this:

--dir-cache-time is the maximum age (timeout) for the information. After this it is forced to refresh from the cloud every time you want to read that metadata.

--poll-interval is the interval at which the mount asks "Hey Gdrive, did anything change on your side since the last time I checked at ((time))?" If Gdrive knows about changes since then, it will send a list of those changes and the mount will integrate them into the cache to make it up to date again. This is much more efficient than re-listing everything, of course, although the polling request itself does use an API call each time it checks.

The reason you have two flags that do similar things is that not all cloud systems support polling. Those have to rely only on --dir-cache-time. Gdrive does support polling, so it can use a very large (practically infinite) --dir-cache-time and get its updates through polling instead.

TLDR:
set --dir-cache-time to a very large time
set --poll-interval to the maximum time you want to wait before detecting changes that were made by third-parties

You may also consider using a high --attr-timeout. While dir-cache-time caches the folders, --attr-timeout caches all the file attributes like size, modtime, etc., and these are often even slower to re-list.

However, if you do, you must be aware that this can carry a risk of data corruption in very specific instances. Specifically: if A lists some files, then B changes the size of a file, and then, before the polling interval picks up on that change, A tries to modify that same file, corruption can potentially occur.

So if you have a multi-user environment where files are frequently edited by multiple users, then this may not be a good idea - at least not unless you do some kind of extra revision backup with --backup-dir or similar.

But if you're dealing with only one or a couple of uploaders that don't usually modify the same files, and you run a low polling interval (as low as 10s may be workable, as that's about 1% of your API quota), then the risk is quite low.

Lastly, if one of your two systems only reads from the drive, then it is perfectly safe. Corruption can only happen with at least two independent sources modifying files (sizes specifically). If that's your case, use a high --attr-timeout freely and enjoy a mount that feels almost as snappy as your regular hard drive :smiley:

Excellent thank you both for the detailed replies.

The second mount only reads - so based on the advice, I should do the following on the read-only mount (Machine 2)

--dir-cache-time 96h
--poll-interval 1m
--attr-timeout (what would be considered a "high" value?)

Just on --attr-timeout: sometimes the only change the uploading machine (Machine 1) makes to an existing file is the file size - I need the remote mount on Machine 2 to pick this up in polling and update the cache. Will a high --attr-timeout impact the ability to detect a file size change?

You can set it to the same as the dir cache time. I set both to a year. Not that this matters much, since the cache resets on unmount anyway. It just needs to be large enough not to matter in practical use :slight_smile:

Your setup is correct for the case of one machine only reading, yes. You can use all three flags.

Polling will pick up any change in metadata, including size, so this will be fine. When only reading, rclone should be able to detect a wrong size in most cases anyway, even if that happens, so I'd consider it a non-issue.
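So the read-only mount on Machine 2 might end up looking something like this sketch (remote name, mount point, and values are illustrative assumptions; 8760h is simply "a year"):

```shell
# Read-only mount with long caches and 1-minute polling
# ("gdrive:" and /mnt/media are placeholder names)
rclone mount gdrive: /mnt/media \
  --read-only \
  --dir-cache-time 8760h \
  --attr-timeout 8760h \
  --poll-interval 1m \
  --daemon
```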

Lastly, I'd like to mention that it is pretty easy to set up a "cache warmup" script that automatically does a full listing right after the mount starts. That way your cache is already fully loaded and ready to give snappy responses to all later requests. Without this, your cache slowly accumulates on demand. That works fine too, but your first listing won't be as fast as it could have been had it already been in cache. I find it most useful for manual browsing, as you get frustrated real quick if navigation isn't snappy and responsive.

If you want more details on this, leave a reply and I'll get back to you tomorrow when I'm at a proper keyboard. It involves enabling RC and sending a fast-list cache update command to the mount from a script tied to whatever mounts your drive. I believe Animosity uses this as well (me on Windows, him on Linux). Depending on which you need, you can probably get some ready-made examples to use or modify to your needs.

Perfect thanks again that's excellent advice and I'm sure it'll come in handy for others.

I have seen Animosity's RC script on his GitHub; I'll grab it and use it, as I'm on Linux also.

Thanks for the advice, truly appreciate it! :+1:

Glad to help. Just be aware that when it comes to Plex, this isn't a golden ticket. A lot of what Plex scans for is probably going to need more info than the basic metadata that the cloud stores (name, size, modtime, created, etc.). It probably looks for things like length, codec info, and much more, and in those cases it's going to have to read the start of those files even if you have them cached... at least it can list them faster. It's generally recommended to be conservative with Plex's advanced scan settings, as the worst of them (like deep analysis, I hear) will literally read every file in your library in full... which you probably don't want happening over an internet connection automatically.

See Animosity's posts about recommended settings; I think he has info about Plex there. I don't use Plex myself, so I can't help much with those details.

Thanks, yeah, the initial scan of a new file takes a little longer as it does some file analysis...

I'm just trying to find something that helps speed up the secondary scans that seem to go super slow without dir-cache-time.

My scan time goes from 4 hours to 5 minutes with a long cache time set.