New Feature: vfs-read-chunk-size

B4dM4n · May 20, 2018, 11:28am

Recently a new feature got merged into rclone beta: the --vfs-read-chunk-size flag for the mount and cmount commands.

Most cloud providers probably limit the daily download volume and count the requested bytes towards it. The problem is that when a file is opened for reading, rclone requests the whole file from the remote. This adds the whole file size to the download volume, even when only a few bytes are read.
Additionally, every seek call requests the file from the new offset until the end, and all of these requests add up in the download volume.

This is where --vfs-read-chunk-size helps to reduce the download volume. When --vfs-read-chunk-size 128M is given, rclone will only request 128 MB from the remote at a time. Once a block has been read to the end, the next one is requested.

There is also a companion flag, --vfs-read-chunk-size-limit, which controls whether the chunk size can grow when multiple blocks are read in a row.
When --vfs-read-chunk-size-limit is greater than --vfs-read-chunk-size, the chunk size is doubled after every chunk read until the limit is reached. A seek call resets the chunk size to the initial value. --vfs-read-chunk-size-limit off lets the chunk size grow indefinitely.
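The growth rule described above can be sketched in a few lines of Python. This is only an illustration of the description in this post, not rclone's actual code; the function name `chunk_sizes` and its parameters are made up for the example, and `limit=None` stands in for `--vfs-read-chunk-size-limit off`.

```python
MiB = 1024 * 1024


def chunk_sizes(initial, limit, reads):
    """Return the chunk sizes used for `reads` consecutive chunk reads.

    Hypothetical helper, not part of rclone.
    limit=None models "--vfs-read-chunk-size-limit off" (unbounded growth).
    A limit that is not greater than `initial` means the size never grows.
    """
    sizes = []
    size = initial
    for _ in range(reads):
        sizes.append(size)
        if limit is None:
            size *= 2                      # no upper bound
        elif limit > initial:
            size = min(size * 2, limit)    # double, capped at the limit
    return sizes


# Flags from this post: --vfs-read-chunk-size 128M --vfs-read-chunk-size-limit 2G
print([s // MiB for s in chunk_sizes(128 * MiB, 2048 * MiB, 7)])
# [128, 256, 512, 1024, 2048, 2048, 2048]
```

A seek would correspond to starting the sequence again from the initial size.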

Using the following mount command, I'm able to run Plex on a Google Drive remote without cache and without getting banned.

rclone mount \
  --dir-cache-time 48h \
  --buffer-size 64M \
  --vfs-read-chunk-size 128M \
  --vfs-read-chunk-size-limit 2G \
  gsuite-crypt: /mnt/gsuite

I hope this clarifies the new feature and helps people to avoid bans in the future.

Edit:
Here are some additional notes about the usage of this feature.

Using chunked reading only makes sense when used without a cache remote, as the cache itself uses chunks to retrieve and store data.

By using the term "requested bytes", I tried to make clear that, for HTTP based remotes, there is a distinction between the bytes requested in a request and the bytes actually downloaded. The "requested bytes" will probably be used for the quota calculation by most providers. At least Google Drive is using this value.

There is no additional caching involved when using these flags, neither on disk nor in memory. They only influence how rclone requests data from the remote. For HTTP based remotes, this means using HTTP range requests.
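To make the difference between an open-ended request and a chunked request concrete, here is a small sketch of the standard HTTP `Range` header syntax (end offsets are inclusive). The helper names are hypothetical, and the exact headers rclone sends for a given backend may differ; this only illustrates the idea.

```python
MiB = 1024 * 1024
GiB = 1024 * MiB


def chunk_range_header(offset, chunk_size, file_size):
    """Range header value for one bounded chunk starting at `offset`."""
    last = min(offset + chunk_size, file_size) - 1  # Range end is inclusive
    return f"bytes={offset}-{last}"


def open_ended_range_header(offset):
    """Range header value asking for everything from `offset` to the end."""
    return f"bytes={offset}-"


print(chunk_range_header(0, 128 * MiB, 10 * GiB))  # bytes=0-134217727
print(open_ended_range_header(0))                  # bytes=0-
```

With the first form, the "requested bytes" of a single request can never exceed the chunk size; with the second, they cover the rest of the file.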

B4dM4n:

It is a tradeoff between “increased number of API calls” and “wasted download quota if closed early”. This all only makes sense for non cached mounts. As explained in a previous post, some workloads can produce large amounts of “wasted download quota if closed early” on a non cached mount.

To reduce the “increased number of API calls” overhead there is the second flag, --vfs-read-chunk-size-limit, which lets the requested HTTP range grow exponentially.

If you set --vfs-read-chunk-size 1G --vfs-read-chunk-size-limit 50G and read a 10 GB file from your mount, there will only be 4 requests: 0-1GB, 1GB-3GB, 3GB-7GB and 7GB-end.
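The four requests in this example can be reproduced with a short simulation of a sequential read. This is a sketch of the behaviour described in this thread, not rclone's implementation; `sequential_ranges` is a name made up for the illustration, and the sizes are given in whole GB to keep the output readable.

```python
def sequential_ranges(file_size, initial, limit):
    """Return the (start, end) ranges requested for one sequential read.

    Hypothetical helper: the chunk size starts at `initial` and doubles
    after every chunk, capped at `limit`. `end` is exclusive.
    """
    ranges = []
    offset, size = 0, initial
    while offset < file_size:
        end = min(offset + size, file_size)  # last chunk stops at end of file
        ranges.append((offset, end))
        offset = end
        size = min(size * 2, limit)
    return ranges


# 10 GB file, --vfs-read-chunk-size 1G --vfs-read-chunk-size-limit 50G
print(sequential_ranges(10, 1, 50))
# [(0, 1), (1, 3), (3, 7), (7, 10)]
```

With a fixed chunk size of 1 GB (limit equal to the initial size), the same read would need 10 requests instead of 4.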

The numbers 128M and 2G from my first post are only simple guesses, which may need adjustment. They heavily depend on the use case and the daily limits of the remote. I will probably try 64M and 8G at some time for my Google Drive, but since the current values seem to work I don’t see a need for a change.

Animosity022 · May 20, 2018, 11:54am

Rclone doesn’t request the whole file for reading, only the parts the operating system asks for. It depends on what the operating system asks for.

The cache specifically helps with reducing API calls.

What happens when you scan the plex library? What happens when Radarr moves a file? Sonarr? Those are going to slam the API without having the cache.

B4dM4n · May 20, 2018, 12:47pm

Indeed, the cache backend will only request chunks of the configured size. But otherwise rclone will always request the whole file. The operating system has no way to tell rclone which part is needed when a file gets opened.
vfs-read-chunk-size works around this limitation by only requesting parts and seamlessly combining multiple parts during the read.

My whole Plex library is scanned daily, without heavy tasks like chapter thumbnails or loudness analysis.
Without vfs-read-chunk-size I ran into 24h bans when scanning too many new files at once.

I use Radarr and Sonarr only for scanning, to get an overview of missing episodes, and not for automatic downloads and sorting.
For sorting I use custom scripts.

Looking at the Google API console, I'm using around 150k requests per day, which is well inside the 1 billion limit.

Animosity022 · May 20, 2018, 1:10pm

You have something misconfigured if you were getting bans with the cache as I have a little over 40TB, scan at will and barely get 150k in a week.

I’m not following your second part as there are many examples of requesting part of a file.

Run a mediainfo on a large file. It doesn’t grab 50GB to do a mediainfo and it just grabs part of the file.

-rw-rw-r-- 1 felix felix 50369164567 May 12 01:27 Black.Panther.2018.mkv
time mediainfo  Black.Panther.2018.mkv
real    0m5.769s
user    0m0.120s
sys     0m0.010s

Perhaps worded poorly as mediainfo requests part of the file from the operating system and not the entire thing.

Can you share your config that was not working previously and creating bans?

B4dM4n · May 20, 2018, 2:09pm

I never used cache. My mount point was always a gdrive > crypt remote. My drive will soon reach 200TB, everything included. So there are a few more files to be scanned.

You are right, rclone will only download the bytes that mediainfo actually reads. But the call that rclone sends to the remote will request the whole file, starting at byte 0 until the end. This "requested bytes" number is counted towards the daily quota, not the "downloaded bytes" (at least for Google Drive).

My config is quite simple:

[gsuite]
type = drive
client_id = my.client_id.apps.googleusercontent.com
client_secret = my_client_secret
token = {...}

[gsuite-crypt]
type = crypt
remote = gsuite:data
filename_encryption = standard
password = password
password2 = password2

When running a Plex scan using this config against, let's say, 1TB of new files, this would result in a 24h ban.
Plex will open every file, read some data and seek to multiple positions in the file to collect the metadata.
As described above, every open and seek will count towards the "requested bytes" limit (which seems to be 10TB for Google Drive). When Plex seeks 10 times per file, 1TB of new files will exceed the 10TB limit.

Here is an example strace of mediainfo where this behavior can be seen.
I annotated the calls that trigger an rclone request with the byte range requested. Running mediainfo on this 77 GB file caused 3 opens and 13 seeks, in total 16 requests sent by rclone. Summing up the bytes requested, this would add 1080.09 GB to the "requested bytes" limit.

Running the same command with --vfs-read-chunk-size 128M would only add 2 GB to this limit.
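The accounting behind these two numbers can be sketched as follows. The offsets below are hypothetical placeholders (3 opens at offset 0 and 13 evenly spread seeks), not the ones from the actual strace, so the unchunked total will not match the 1080.09 GB figure. The sketch also assumes that only a small amount of data is read after each open or seek, so no second chunk is ever requested.

```python
MiB = 1024 * 1024
GiB = 1024 * MiB


def requested_bytes(file_size, offsets, chunk_size=None):
    """Sum the bytes requested for one request per offset.

    Hypothetical helper. chunk_size=None models an unchunked mount, where
    every open/seek requests from the offset to the end of the file.
    Otherwise each request is capped at `chunk_size`.
    """
    total = 0
    for offset in offsets:
        remaining = file_size - offset
        total += remaining if chunk_size is None else min(chunk_size, remaining)
    return total


file_size = 77 * GiB
# Hypothetical access pattern: 3 opens at offset 0 plus 13 evenly spread seeks.
offsets = [0, 0, 0] + [i * file_size // 14 for i in range(1, 14)]

chunked = requested_bytes(file_size, offsets, 128 * MiB)
unchunked = requested_bytes(file_size, offsets)
print(len(offsets), chunked / GiB)  # 16 2.0
print(unchunked > chunked)          # True
```

Sixteen requests of at most 128 MiB each give the 2 GB upper bound, regardless of where in the file the seeks land.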

I hope I could explain the difference between "downloaded bytes" and "requested bytes".

neik · May 20, 2018, 2:14pm

Would you mind sharing your sorting scripts / method?

@vfs-read: If I got it right, it does also cache the read parts of the file. So, if the file would be read by two clients the first one would load and cache it locally while the second one would be served by the local cache, right?

gforce · May 20, 2018, 2:17pm

What does your script look like? It would help a lot of us.

B4dM4n · May 20, 2018, 2:37pm

This is not so easy. They are a combination of bash scripts, Go tools and Chrome Extensions only written for my needs. I cannot use Sonarr or Radarr because German content is not that easy to find in an automated way (at least I have trouble finding it).

I will add some of my scripts that are more universal to the rclone wiki in the near future.

--vfs-read-chunk-size will not cache or share any data between multiple readers. It works fully transparently in the vfs layer without altering any cache behavior. It only concatenates the chunks virtually, so the vfs layer thinks it's reading from a regular file.
--buffer-size will still work as intended as the top level buffer, providing data when a new chunk is opened.

Using --vfs-read-chunk-size in combination with a cache remote will not provide any benefit, as the cache already works on a chunk basis.

Animosity022 · May 20, 2018, 2:44pm

Well of course you got a ban as you aren’t using the cache feature in your config.

I’ve scanned my entire library in less than 48 hours and at:

I’ve never ever seen the 10TB limit thing in a day. With more files, it begs to ask, why aren’t you using the cache feature?

Animosity022 · May 20, 2018, 2:46pm

Where do you see your “requested bytes” for the day?

gforce · May 20, 2018, 2:50pm

@B4dM4n

This looks promising. I added the flags you posted to my traditional rclone service and am testing them for https://github.com/Admin9705/PlexGuide.com-The-Awesome-Plex-Server

If it gets past show 300, then potential, scans to 1900; then set I’ll report how it works. If works, then make a teamdrive script same way, then unionfs it and deploy our ST2 transfer script to bypass the 750GB upload limit (which works).

B4dM4n · May 20, 2018, 3:04pm

This seems to be a not so official limit. It gets mentioned in this post or on reddit.

For a long time, my Plex was running on a server with little disk space at Scaleway.
The bandwidth was great, but the Plex library used most of the disk space. So I decided to work on chunked reading to reduce the risk of bans.

As it is unofficial, there is no way to see this.

Animosity022 · May 20, 2018, 3:07pm

If I’m following your logic, why not use cache because of bans?

The chunked would reduce the 10TB limit thing.

Animosity022 · May 20, 2018, 3:08pm

And back to my other question, if I’m running a mediainfo on a 55GB file, you are saying it would take 55GB on ‘requested bytes’, but there is no way to see this anywhere so how can you tell?

B4dM4n · May 20, 2018, 3:28pm

The primary reason was not having enough disk space when the cache backend was new.
The chunk behavior in the cache backend is also fairly new. At the time this got implemented, I started working on the chunked reading for the vfs layer.

Without cache or chunked reading, it would probably add around 800 GB to this limit. mediainfo opens the file multiple times and uses seek to navigate around the file at least 10 times. Each of these will add nearly the whole file size to the limit.

Animosity022 · May 20, 2018, 3:45pm

I’m just not seeing it. I’ve scanned multiple times from a fresh start and I can get through about 20TB in a day using both plex and emby at the same time. Emby does an ffprobe and plex does its own analysis, which would by your count have me at 100s of TB in a day.

The cache hits those items you’ve stated so you should take a look.

B4dM4n · May 20, 2018, 4:02pm

These numbers only apply to an uncached mount, where every open and seek results in a new request to the remote.

I simply don’t want to use cache, as it would not provide any benefit to me, except the chunk handling.
Most of the files will probably only be read once or twice, besides indexing, so why store them locally?

Animosity022 · May 20, 2018, 4:10pm

I would rather have a switch like plexdrive to just use memory and store nothing locally, but I get around that by using a small /dev/shm instead.

It would provide a reduction in API hits, as shown with our screenshots.

How long does a plex full scan of 200TB take (assuming all the files are already analyzed)?

B4dM4n · May 20, 2018, 4:41pm

I have no issue with the high API usage. It’s still only a fraction of the daily limit.

Most of these requests are drive.files.list calls (70%), to keep the metadata cache fresh.
With a fresh metadata cache the Plex scan only takes 10-15 min.

Listing all files on a cold cache takes much longer. Depending on other tasks, around 60 minutes normally.

neik · May 20, 2018, 8:16pm

Yeah, I know that problem. I also need German content and have exactly the same problem.

When you publish them, a forum post would be great!

Thanks for the clarification! Just one more question regarding the limit flag: Those 2G you have in your mount command are they stored locally (storage) or in the memory? Tbh, that part I haven't figured out yet.

Anyways, this feature might help me with the issue over here (Handling open files).
I'm gonna give it a try.