Help setting up rclone gdrive gcache gcrypt with seedbox

No need to be sorry! I really appreciate the detailed answer. :slightly_smiling_face:

I think I came a lot closer to understanding it now. I actually have two very different use cases, and I realized I will need to make two completely different mounts due to the nature of the different access patterns.

For the sake of the argument let's focus on the second one for now.

First use case – Plex: I think for this I will most likely need the VFS mount? However, the native Plex integration of the cache backend also sounds interesting (it seems to have a higher start delay though).
Are there any up-to-date community approved best practices for that?

Second use case – Torrent: My plan is to long-term seed some less frequently requested Linux distributions. Based on your explanation I think the cache backend would make the most sense here, but I look forward to your feedback.

I want to store my less popular torrent content on an encrypted Google Drive and seed it long term. Some of the torrents may have traffic every day, while others won't be accessed for a week, a month or even longer. Judging from your answer I would assume that each requested piece [using VFS cache?] makes an API call to Google and I would quickly get banned. I am actually doing this at the moment on my local computer with Google File Stream and get 403-banned by Google multiple times per week.

What I came up with was using the cache backend and setting a very high limit for the cache size, like:

RCLONE_CACHE_CHUNK_TOTAL_SIZE = 1500G
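(which, as far as I understand, is just the environment-variable form of --cache-chunk-total-size 1500G)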

My assumption is that if any part of a file (any piece) was requested by a peer, it is extremely likely that more or even all other parts will be needed in the very near future. Let's say all the Linux ISOs that I seed make up 10 TB, and I dedicate 1.5 TB as "cache"; it would mean that my seedbox always keeps the most relevant content cached and fetches the unpopular content only if needed, while discarding the local copy of the files that were not used for the longest time. The really hot content would always be cached.

This way I would create very few API calls to Google (and also very little traffic), but only if the whole file is downloaded as soon as the first piece is requested, and not all the pieces individually. Or at least the requests should be way bigger than the average piece size, so that multiple pieces get downloaded at once.

Can I control the minimum "request" size? So let's say that the server downloads at least 16 MB, even if just 1 MB of the file is requested by the torrent application? (Is that what RCLONE_CACHE_CHUNK_SIZE does?)

Is this chunk size always aligned at the start of the file?

To help me understand could you please use this example to show me what's going on?

Let's assume that I set the chunk size to 8 MB and now the torrent client tries to read the part between 14–15MB of an uncached file. Will rclone download (and store in cache?) the part

A) 14–15 MB (only the requested part)
B) 8–16 MB (the full chunk that contains the requested part, aligned to the start of the file)
C) 14–22 MB (an 8 MB chunk whose starting point corresponds to the start of the requested position in the file)
D) The whole file?
E) Something else? :grinning:

(as counted from the start of the file)

I guess the cache backend would do B or C, while the VFS cache would do A?

How could I tweak it?

I believe I should go for a large chunk size to reduce overhead in the cache db? What do you think?

Thank you, I am highly looking forward to your answer!

Yes, Plex isn't directly cloud-aware (like pretty much all programs), and a mount will be the only way it can interact with files on the cloud.

I can't say I know of any good Plex guides for the cache specifically (there probably is one, but I am not omniscient). The cache setup is pretty straightforward however. I think the main benefit of running a cache with Plex is that it would help alleviate some of the problems associated with aggressive scanning for metadata - but I would suggest that this is better dealt with by turning those scans off and rather doing them manually once in a while if needed. Animosity is a guru here on Plex/Linux and he ultimately decided not to use the cache-backend at all.

He has a thread here detailing his setup:


While not all of the info there applies to you (his setup is a bit more advanced than most), this thread also contains a lot of good info on Plex use with rclone in general and best practices around Plex settings and how to make Plex behave well on rclone.
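Just to give you a rough idea of the shape of such a mount (treat this as a sketch rather than a recipe - the remote name gcrypt: and the mount point are placeholders, and the values are only common starting points to tune):

rclone mount gcrypt: /mnt/media --vfs-cache-mode writes --vfs-read-chunk-size 64M --vfs-read-chunk-size-limit 2G --buffer-size 64M --dir-cache-time 72h --poll-interval 15s

The main points for Plex being a long dir-cache-time with polling (so listings stay cheap but remote changes still get picked up) and ranged reads (so a scan only pulls the data it actually asks for instead of whole files).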

I would definitely use a cache to keep some of the hot data local. Either type would help with that.
I think I would personally go with the VFS cache here to avoid over-fetching data when you get a request for just a small bit of data (as is not uncommon on a torrent). If you just make sure the VFS cache is properly set up to retain data up to a certain size I think that should do the trick.
But yea, the cache-backend would also work here. I just don't think the forced chunk-prefetch approach is ideal for this situation.

It's not quite as bad as you imagine.
When a torrent-piece request comes in, rclone will open a read-segment of the file (by default this is 128M, but it can be configured otherwise). It can then read any data in that segment and seek within it without re-opening the file (re-opening being the biggest limitation).
I have never come near to stressing the API while seeding, although my seeding is probably very modest compared to your needs. You have 1,000 calls per 100 seconds to use, so this is fairly substantial, especially as peers will often request a range of pieces that fall within the same segment and can be fetched as one operation.
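For reference, the flags that control that read-segment behaviour on a VFS mount are the following (values shown are just examples, not recommendations):

--vfs-read-chunk-size 128M (the size of the first ranged request when a file is opened)
--vfs-read-chunk-size-limit 1G (each following request doubles in size up to this cap)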

Let me be clear that there is no such thing as an "API ban". The worst that can happen is that you max out your quota for the 100 seconds. Rclone will automatically keep track of this and throttle down a little bit if it needs to in order to keep you within the API limit. You will never get locked out of the API entirely - or at least I have never seen this happen. You can run at the 10 calls/second average 24/7 without hitting the absolute max daily calls (you would need multiple concurrent users on the same key for that to happen). That's close to a million calls a day.
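If you ever want to be extra conservative you can also cap rclone's request rate yourself with the core throttling flags (the values here just mirror that 10/second average):

--tpslimit 10
--tpslimit-burst 10

Not normally needed since rclone backs off on its own, but the option is there.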

But sure, at some point you will probably end up API-limited if you are seeding hundreds of torrents to hundreds of peers at the same time. It is hard for me to gauge at what point this would happen because I've just never tried serving that kind of volume, but as long as you keep the hottest files in cache (i.e. the "fresh this week" stuff), I think the API can probably handle a pretty decent volume of the older stuff that is just requested occasionally. Ultimately this is something you just have to test and get a feel for. I would be very interested to know your results though.

You can keep an eye on API-use here and know exactly what kind of load the API is seeing:
https://console.developers.google.com

Sure, the larger the cache the better as you will need to fetch less remotely.
If you wanted to use the VFS-cache instead you'd use this to achieve much the same result:
--vfs-cache-max-age 332880h
--vfs-cache-max-size 1500G
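Putting it together, such a torrent mount could look something along these lines (remote name and mount point are placeholders - adjust values to taste):

rclone mount gcrypt: ~/seed --vfs-cache-mode writes --vfs-cache-max-size 1500G --vfs-cache-max-age 332880h --dir-cache-time 96h --buffer-size 64M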

I think I've mentioned this before though.
The reason I don't think the cache-backend is ideal for torrents is that I believe it will fetch a minimum of 1 whole chunk, and if you have more than 1 worker thread it will fetch that many chunks. That would be a lot of inefficiency if it's just a small request. It will also be harder on the API as I think those chunks will need separate calls.
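(For reference, the cache-backend knobs in question are these - flag forms shown, with purely illustrative values:)

--cache-chunk-size 32M
--cache-workers 4
--cache-chunk-total-size 1500G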

This is true. I think it would actually be harsher on the API though as I said.
Using the VFS approach you'd keep the hot files in cache too. It would just not re-cache something once it has been evicted. However, for torrents especially - once a file is no longer "hot" and recent, it rarely has a resurgence of popularity, so I don't see this as a problem.

With chunking I don't think you can download several at once. They are going to be fetched by cache-backend one-by-one I'm pretty sure. You could use very large chunks, but that would add a lot of delay before responding to any request - and you'd get massive overshoot if only a small bit of data was requested.

Correct. With the cache backend you download 1 full chunk minimum. Note that it will actually fetch several chunks to start with if you have more than 1 worker (--cache-workers). If you had 8 you'd get the data requested + the next 7, for example. The cache backend was designed for media streaming I think, so that kind of sequential prefetching makes sense there, but not so much for torrents perhaps.
Normally rclone would just fetch as much data as whatever the application requested (which in a torrent client I guess would probably be a set of pieces that may or may not be contiguous).

I hope we can get some more robust and efficient general caching in the VFS eventually - but these things take time :slight_smile:

No, not as far as I know. If you requested data from the middle of the file it would grab the chunk that data fell under. At least that is my best educated guess.
EDIT: reading your example below then yes it would be "aligned from the start" as you say.

Pretty sure the answer is B.
The only chunk that will have an odd size is the last one at the end of the file. It is made this way so that you can download chunks here and there - and later on they can fit together without overlap.
As mentioned, with multiple worker threads (say 8) you may also, in addition to chunk #2, get chunks #3-9.
And if you set it to use only 1 worker then it would only be fetching chunks one at a time, which also isn't ideal here. So this is one of a couple of reasons I don't think the cache-backend is ideal for this use-case.
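To put numbers on your example: with an 8 MB chunk size the chunk boundaries sit at 0, 8, 16, 24 MB and so on, so a read at offset 14 MB falls in the chunk covering 8–16 MB, and that whole chunk is what gets fetched and cached. With 8 workers you would additionally prefetch the next seven chunks, i.e. roughly 16–72 MB on top of that.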

Chunks should be large enough to have time to TCP-ramp up to a decent speed, so I would hesitate to use less than 32M (or even 64M), but of course that comes with the cost of over-fetching more data.
I don't think you need to worry about the database efficiency. It's local so it will easily handle it.

TLDR on this whole thing: I think I would reconsider using the cache-backend for the torrent use.

  • Heavier on API calls
  • Overfetching of data
  • Problem with multiple worker threads fetching multiple chunks we may not want (unfortunately not something that was left as a setting).
  • The only big benefit is that it can re-cache old and less popular data (which seems like it wouldn't be that important, or would even be inefficient, when you look at how torrent popularity typically trends over time).
  • It was intended for media streaming (a large buffer for fairly predictable sequential reads), and you are really asking it to do the polar opposite job here.

While the VFS-cache might be slightly "simpler" in some aspects, I think it will do a better job here honestly - and with the benefit that it will improve and gain more flexibility as development proceeds in later versions.

But the choice is up to you - and I will do my best to assist either way you want to proceed :smiley:

First of all, thank you again for the detailed explanation!!! I think I finally get it now, how the cache backend works :slight_smile:

The reason I am so afraid of bans is that it has happened several times on my local computer, though the setup with Google File Stream differs very much from the rclone setup. I frequently get the following error:

{
  "error": {
    "errors": [
      {
        "domain": "usageLimits",
        "reason": "quotaExceeded",
        "message": "The download quota for this file has been exceeded",
        "locationType": "other",
        "location": "quota.download"
      }
    ],
    "code": 403,
    "message": "The download quota for this file has been exceeded"
  }
}

After this happens, my Google Drive is pretty much unusable and I can't access any file for about 24 hours. It actually happened again just today -.-

Anyway, I see why you recommend the VFS cache. I would like to understand the VFS cache a bit better: as far as I understand, it will only read/request what is actually needed. So in our example, A) 14–15 MB and not more?

Does it mean it will actually write every piece into the cache that it has served? If it's not working with chunks, I imagine it would be quite difficult to store all these little pieces in the cache separately.
So if I seed a torrent with 6000x1MB pieces, will it actually download all the 6000 pieces separately (potentially opening the file fewer than 6000 times) and then write all the 6000 pieces into the cache separately?

Honestly I can't think of a way of doing this efficiently, but I am curious to find out :slight_smile:

That's a quota error, not a ban, as the quota resets every 24 hours. You can download 10TB per day. Depending on how Google File Stream handles partial files, you could eat that up quickly, but that's a question for Google File Stream.

If you see that error in rclone only, it means one of 2 things:

  • You are running an ancient version of rclone
  • You actually downloaded 10TB in a single day

Based on the error message, is it clear to you that I am running into the 10 TB limit? Do we know this for certain?

Possibly Google File Stream is requesting the whole file each time a piece is requested? That would of course add up pretty quickly... and it would explain a lot.

This error will be due to one of 2 things:

  • You have exceeded the 10TB download limit for the day
    or
  • You have had an excessive amount of requests for a specific file in a short amount of time.
    I have never seen the latter in action myself, but I've read about it - and I believe the main function of this is to stop people from using Gdrive as a general file-server. I know very little about what triggers it, but I've heard it's something like "a few hundred downloads in a couple of hours".
    I am unsure if this would only quota-limit this one file or the whole drive. I would imagine the former.
    This is usually most common if a file is set to be publicly shared via a link, and that link becomes very popular.

In either of these two cases - this will usually happen due to badly configured software. Remember that most software assumes that it's talking to a hard drive, and thus that scanning and reading data is very easy. A cloud drive has inherently different performance characteristics that we need to account for.

For example, Plex does a lot of scanning that fetches more metadata than just basic attributes. It may scan the file to generate previews, performance graphs and all sorts of stuff. This will at least necessitate opening the file and reading some of its data. Sometimes it will require reading ALL data of ALL files in the library - and if you have a multi-TB library that is going to massively tax the system and create problems... In this case, the best fix is to limit such advanced scanning procedures and only do them manually during downtime on occasion.
You can get a good idea of what is happening via the Google API control panel:
https://console.developers.google.com

If this shows that you've had way, way more requests and file downloads than you expect, then you likely have misbehaving software working on the mounted drive, and this is something you ultimately have to fix in that software. Rclone only serves requests; it can't control them.

403 unfortunately refers to a lot of different quota/throttling things, so we have to speculate based on circumstances + the added message.

It should not be related to the API quota as that has different text associated with the error.
Having excluded the API, I am fairly sure it is one of the two problems I outlined above.

Thank you very much for this analysis! I believe you are right about this.

I have another question about the vfs cache. In the manual it says:

--vfs-cache-mode writes

In this mode files opened for read only are still read directly from the remote, write only and read/write files are buffered to disk first.

This mode should support all normal file system operations.

If an upload fails it will be retried up to --low-level-retries times.

  1. What is the default for --low-level-retries?

  2. What happens if I copy something into the remote (actually copying into the cache, which then starts to upload) and I quit rclone before the upload finishes? Will it continue the upload after a restart? If not, how would I recover from this situation?

  3. As far as I understand, the VFS cache will only read/request what is actually needed. So in our example, A) 14–15 MB and not more? (I assume we are talking about the "writes" mode here?)
    Does it mean it will actually write every piece into the cache that it has served? If it's not working with chunks, I imagine it would be quite difficult to store all these little pieces in the cache separately.
    So if I seed a torrent with 6000x1MB pieces, will it actually download all the 6000 pieces separately (potentially opening the file fewer than 6000 times) and then write all the 6000 pieces into the cache separately?
    Honestly I can't think of a way of doing this efficiently, but I am curious to find out :slight_smile:

I hope you are not tired of my questions yet :slight_smile: Looking forward to hearing from you again!

Answer is 10.
Under normal circumstances it is extremely unlikely to need more than 1-3 tries to get something done. If it does then this usually indicates a serious problem.
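For reference, the flag itself is --low-level-retries, so --low-level-retries 20 would double the number of attempts if you ever felt the need (normally you shouldn't).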

Sadly it will not recover right now, as rclone currently does not keep a persistent list of what is due for upload. I absolutely agree this is not great - as files could be lost in the cache if rclone shuts down unexpectedly in the middle of a file-move from the OS.
I became aware of this issue quite quickly myself and discussed it with Nick. The plan is for the VFS to have persistent tracking (so that uploads can be re-started and nothing is evicted from cache until processed). I am not aware of any specific timetable for this and many other planned VFS improvements, but it is at least something that is planned for. I think you should find an issue about this on github (by NCW I think) if you want to go upvote it or comment to encourage this as a priority.

I think you are mixing up a few different things here, so let me lay out the basics for the VFS cache:

For reads - it will request to open a read-segment (default 128M), but will only transfer exactly the number of bytes that were requested. As the VFS cache does not perform read-caching currently, this data is not saved anywhere (although your torrent client might keep some data in its own cache if it has that feature).

For writes (using "writes" mode) - all writes will first be written to the cache, then uploaded. This operates on whole files only, so a small change to a file will require a full re-upload. This is not really possible to "fix" anyway, as almost no backend supports partial file-writes, so it's the best rclone can do.
For a torrent specifically - pieces are not files. If you download 6000 pieces that represent 3 files then 3 files will get written to cache and uploaded.

A side note on torrents when it comes to cache and upload:
I highly, highly recommend using a "temporary download folder" option in your torrent client, or some sort of script trigger to move finished torrent files. Writing torrents directly to the cloud drive (effectively the cache) tends to cause issues, because torrent clients are notorious for opening files again and again for each write - and leaving the files in a half-finished state in the meantime.

The problem is that rclone has no clue whether a file is really done or not. It assumes anything that goes into the cache is ready to go - and will upload it. Then 2 minutes later the torrent client writes more - meaning you need to re-upload again... and this can repeat ad nauseam. Or even worse - if you keep low timers on the write cache it might need to transfer the half-finished file back from the cloud because the torrent client requested it. It is a mess best avoided...

This is an issue that is not easily fixed 100% cleanly from rclone's side, so it's something you just kind of have to be aware of. It generally applies to applications that create temporary workfiles. Torrents are a very common example, but it may also apply to things like some video-rendering programs. The ideal solution is always (if possible) to let these programs have a local temporary-files directory to do their messy business in before presenting the finished file to rclone for storage. Often there is an option for this in the program. When there is not - some simple scripting can be used to fix the problem.
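As a rough sketch of what such a script could look like (everything here is a placeholder - the path your client passes in, the directory names, the mount location - and note that most clients also have a built-in "move completed downloads to..." setting that achieves the same thing with no scripting at all):

#!/bin/sh
# Hypothetical "torrent finished" hook.
# The client downloads into a local scratch directory; once a torrent
# completes, the finished payload is moved onto the rclone mount, where
# the VFS write cache picks it up and uploads it in the background.
SRC="$1"                       # completed download path, passed in by the client
DEST="$HOME/mount/seeds"       # a directory on the rclone mount
mkdir -p "$DEST"
mv -v "$SRC" "$DEST/"

You would then point the client at the new location so it keeps seeding from the mount.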

That said - I have discussed this issue with Nick and we've come up with some ideas on how to mitigate most if not all of these issues by just having a delayed-upload function. I.e. the VFS not uploading files until they have gone "cold" and haven't been changed for a while. While this is not perfect, it should prevent nearly all cases of the problem in practice, and the few that slip through the net won't break anything. It will just cause a little inefficiency once in a blue moon.

And no, I never find intelligent questions in an honest search for knowledge "annoying". As long as you don't use me to ask for things you could easily check yourself, then ask away.

Thanks again :slight_smile:

I am glad to hear that is something you are working on!

Thank you for the clarification. I have actually installed the Windows version of rclone myself now, and can confirm this behavior. However, if this is the case I don't see how I would use it for seeding to keep my hot torrents cached? They would never be cached/re-cached unless I write to them?

Okay got it. I assume that means that a file has to be fully downloaded to write to any part of it?

100% agree this makes sense

That would be great. I believe this is actually implemented in the cache remote, isn't it?

One more thing: while playing around I noticed a bug. I moved one video file to the (VFS write cache enabled) crypt/gdrive volume and it started uploading. Then I moved a second file there and the first disappeared from the mount (but both were still uploading).
Only after the upload finished did the first file reappear.

I appreciate your attitude and patience! Of course I'll always make an honest effort to find the answer myself first if possible :slight_smile:

Well that is correct - but every torrent is going to get downloaded at the start. This adds them to the cache. They will then not get evicted from the cache until (1) the cache reaches its size limit and (2) that particular torrent's files are the least recently accessed among those in the cache. Assuming that the cache has a decent size, that should do a good job of keeping recent and hot torrents in cache.

Of course the cache works on a file-by-file basis, but you get the point.
They will never get re-cached like this though. Not unless you changed the files and re-wrote them - which you typically won't be doing with torrents. It's not the perfect solution, but I suspect it will work fairly well anyway. A good enough solution for now - and in later versions we should get more functionality to play with here under the same system.

No, quite the opposite. The data has to go somewhere (or else you'd have to keep a whole torrent in RAM, which wouldn't be practical). What happens is that the file is written piece by piece. Each piece is not saved as a separate file though, but written into different areas of the same file on disk. You can think of it as placing down puzzle pieces one by one - and when all the pieces are put in the right place they make a whole file. Before it is completed the file will just be a broken mess.

Correct. The cache backend has a delayed-upload feature, and it would accomplish some of the same goals. In my testing I found both the temp-upload and cache-writes functions to be quite buggy though, so I am very hesitant to use them. Just to mention a few issues - the temp-upload sometimes puts files in the wrong place, and cache-writes makes the database go nuts and write an inordinate amount of data (re-writing the entire database multiple times a second - continuously). It is unlikely that most of these problems will get fixed now, since the main author went MIA a while ago.

Probably just a minor listing-cache thing. It might be solved as easily as just F5'ing the folder, or waiting a little bit. If not, then something in your listing-cache timers is not set right.

The way this works is that as soon as a file enters the cache it is considered by the mount to be "on the mount" and will show up there as if it were part of the cloud drive. The upload then happens in the background and the cached file can not be removed until it has reached the cloud (except if the program was unexpectedly ended, as we talked about earlier). This should ensure that the file remains accessible the whole way through, with a seamless hand-over between the systems. Very useful since it can mask the actual upload time, and in theory you shouldn't have to think about anything else after it enters the cache - it should be immediately usable as if it were already done uploading.

Hi there,

I finally got my seedbox ready, and after some weeks of playing around I wanted to report back with my findings. I tested both the cache backend and the VFS cache, and my verdict is:

I still believe the cache backend would fit my occasional seeding needs perfectly. However, the performance is incredibly bad unfortunately. It's exhibiting a huge memory leak as well as very high CPU utilization. That's why I am using the VFS write cache (so not really caching actually, more like direct reading). I have considered using the VFS full cache, which would be closer to what I want, but from my understanding it would not deliver a single byte until the whole file is downloaded (correct me if I am wrong), which would presumably cause a lot of timeouts with large files in rtorrent.

About the cache backend:
I believe the bad performance I have seen is most likely due to a bug in the implementation, but it seems the problem is more or less known, yet has not been solved. Unfortunately I am stuck with rclone 1.47, because it's not my server, but I have not seen anything significant in the changelog in this regard since then (if I am wrong I would have a strong argument to ask the server owner for an upgrade).

More specifically:

If I don't use --cache-chunk-no-memory the performance is actually great, but the mount crashes in a matter of minutes of use (e.g. by repeatedly accessing the same file or different files), because it consumes huge amounts of memory (>20GB), so I have no other choice than to use it, which significantly degrades performance.

I tried with and without --buffer-size=0M, but without any significant difference. For example I use:

rclone mount crypt: ~/SB -vv --fast-list --cache-tmp-upload-path ~/T --cache-chunk-clean-interval 15m --buffer-size=0M

chunk_size = 64M
info_age = 24h
chunk_age = 9999h (seems to be deprecated)
chunk_total_size = 500G

it works, but it is slow and very CPU-hungry. It still eats a significant amount of RAM, but not as bad as without --cache-chunk-no-memory

my current favorite mount is this one:

rclone mount crypt2: ~/SB2 -vv --fast-list --buffer-size=64M --vfs-cache-mode writes --dir-cache-time 96h --vfs-read-chunk-size 64M --vfs-read-chunk-size-limit 512M --vfs-cache-max-age 96h

As I said, I would very much prefer the behavior of the cache backend (read caching), but for now that's my best shot.

Correct. I would not recommend it except in very special situations. It is terrible for most use-cases.

Yes, I know of several bugs myself. The bad news is that the cache backend is kind of dead. The author disappeared and it has not had a maintainer for a long time now. NCW is more focused on extending the VFS to have similar functionality rather than trying to fix someone else's code (which I totally agree is the right choice). Besides, the VFS cache could ultimately do the job much better due to tighter integration.

So right now we are in the awkward situation of having to choose between no read-caching until that is implemented in the VFS, or using an old and flawed read-cache module with known bugs that are somewhat unlikely to get fixed anytime soon (unless some hero comes swooping in to take over the project - which sometimes happens). Sorry - but at least it's on the to-do list and it will be solved at some point :slight_smile:
TLDR: A newer version won't help for the cache backend. It is the same as in 1.47. Lots and lots of other good changes though, so...

--buffer-size=0 should be used when using the cache-backend. Since buffer-size only applies to the rclone core, it will only be buffering from the cache (which is local, so there is very little point). Usually, with no cache-backend, it would be buffering between rclone and the cloud (which is super helpful), but the cache-backend is sitting between these two now, so the only "buffering" that will happen is entirely decided by the chunk size you use. Setting it to 0 at least does not waste memory doing nothing, but other than that I don't think it will do much.

Your config looks fair enough. I agree that the VFS write cache is the best (if imperfect) solution so far. The cache-backend had too many issues for me to use it personally, so I use the same. I will keep reminding Nick not to forget about read-cache functionality once in a while, but he has a lot of stuff already on his plate so he has to prioritize :slight_smile:

Actually... come to think of it, it may be theoretically possible to do something smarter now with the new multiwrite union (in testing). If we can selectively upload files based on last-access we can at least weed out the least used files from a local pseudocache.

I will do some fiddling with this as I test out the multiwrite union and see if it perhaps has some merit. Not as good as a built-in true read-cache, but it may be a step better at least...

That is correct. It is also something I'm working on at the moment as part of general VFS improvements.

Does that imply VFS read-caching is on the way? ... excited

Step 1 of cache+vfs unification I decided will be read caching. This will include

  • partial caching of files when downloading
  • delayed upload of completed files

Step 2 is metadata caching which will be a bit more work as I want to add serialization methods to all the backends.