Sudden drop in download bandwidth during transfer

seuffert · October 18, 2018, 9:43am

I am experiencing a rather strange phenomenon lately.

When i access a file trough a rclone mount the transfer starts very fast but at some point drops dramatically to ~5mbit/s and stays there for all subsequent chunks the ChunkedReader loads.

I tested it with rclone v1.44, v1.43.1 & v1.42. The result is not always 100% reproducible but will always occur eventually. One interesting thing i observed is, that if i read the file at full speed (into /dev/null) the speed will not decrease at all until the transfer is complete.
I tested it with "pv -L 1500K /mount/rclone/test.file >/dev/null" to mimic reading at lower speeds and i was able to reproduce the sudden drop in download speed.

This is a graph of 3 downloads of the same file with 3 rclone versions showing the described problem.

The server running rclone is connected with 1 gigabit (unthrottled) and has an excellent peering to the google servers. The issue started to manifest a few days ago. I tried using the gdrive v2 api which unfortunately did not help in this case.

    rclone mount gdrive: /mnt/gdrive \
            -vv --stats 30s --read-only --allow-other --rc --umask 0222 --attr-timeout 1s \
            --dir-cache-time 672h --vfs-cache-max-age 675h \
            --vfs-read-chunk-size 8M \
            --buffer-size 0 \
            --vfs-read-chunk-size-limit 8M \
            --tpslimit 2 \
            --tpslimit-burst 6

(After many tests i intentionally added the chunk size limit of 8M to limit the effect on the daily gdrive download limit.)

Do you guys have any ideas or input on this issue? The debug log does not show anything unusual except it reads at a lower speed (no 403s or other errors)

Animosity022 · October 18, 2018, 11:17am

Can you check the things around rclone? I copied a few files and can’t reproduce the issue. Forcing it down to 8M chunks would generate a lot of API traffic for a file copy so I wonder of that has something to do with it.

Is it your own server? A VPS through a provider?

I let a few copies run for 10 minutes or so and do not see the issue as it stays consistent at very good speeds.

felix@gemini:/gmedia/Radarr_Movies/Ready Player One (2018)$ rsync --progress Ready\ Player\ One\ \(2018\).mkv  /data
Ready Player One (2018).mkv
 16,445,636,608  23%   91.15MB/s

seuffert · October 18, 2018, 11:28am

API traffic with 8M chunks is really not an issue and stays well within limits even with 10 simultaneous transfers (checked in api console). It doesn’t really matter for this issue if i choose 8M,16M or higher values (i tested with several values). I just didn’t want to run into any issues with the daily/hourly download limits on gdrive during testing (initial values were up to 512M chunk size)

The most curious thing is , as mentioned earlier, if i run a transfer at full speed (like your rsync copy to /dev/null ~30MByte/s with 8M chunks) it will never run into this issue, only if the file is read at a lower pace.

The test system is a well groomed root server.

Animosity022 · October 18, 2018, 11:31am

Are you able to check with iotop and see if something is IO bound for whatever reason? It feels like something else on the system is a bottleneck, but not sure what if the /dev/null copy is working that would rule out CPU/Memory/Network for me unless I’m missing something.

The only thing left would be Disk or something specific with rlcone itself.

seuffert · October 18, 2018, 11:39am

I can rule out an IO bottleneck as well as CPU limitations. The system is barely doing anything during the transfer. Network is also tested and working up to sustained 900mbit/s. None of the transfers saves anything to the disk.

Since this started “out of the blue” and is also present in earlier rclone versions, it is more likely the sender is doing “something” - in this case gdrive. I just haven’t found out HOW or WHY…

Animosity022 · October 18, 2018, 11:44am

So the only thing I can think to try is maybe forcing it to another endpoint near you to see if that changes anything.

If you check the resolution and enter in a hosts file entry for:

host www.googleapis.com
www.googleapis.com is an alias for googleapis.l.google.com.
googleapis.l.google.com has address 172.217.10.42
googleapis.l.google.com has address 172.217.11.10

Mine grabs the 10.42 first but you could pick from one of the other 10 or so IPs it resolves to and try to force it to a particular end point and see if anything changes.

seuffert · October 18, 2018, 11:49am

good thinking - i will try that later , thx

ncw · October 18, 2018, 12:11pm

You could also try

  --drive-v2-download-min-size SizeSuffix   If Object's are greater, use drive v2 API to download. (default off)

Which may or may not make a difference!

seuffert · October 18, 2018, 12:14pm

thx for the idea Nick. i already tried this in my initial testing.

I tried using the gdrive v2 api which unfortunately did not help in this case.

seuffert · October 19, 2018, 7:34am

i tested it with several specific google api endpoints one after another , specified in /etc/hosts . (i had to compile my own rclone version to be able to use export GODEBUG=netdns=cgo )

no luck here is another example graph of what is happening. you can clearly see the limiting at the end to almost exactly 5mbit/s. Again, if i load the file at full speed, the limiting never kicks in…

@ncw is there any possibility that some bandwidth rate limiter part of rclone is going haywire under special circumstances? Debug log shows nothing out of the ordinary

Joost1991 · October 19, 2018, 9:25am

So, I myself had some trouble lately with my network speeds really slowing down on 1gbit connection.

I manage to solve this by downgrading the kernel from Linux 4.15.0-36-generic x86_64 to Linux 4.15.0-34-generic x86_64.

Maybe this can help you as well.

Edit: more info on this, here

seuffert · October 19, 2018, 9:47am

thank you for this info! the server is indeed currently running "4.15.0-36-generic"
i will test a downgrade or upgrade to proposed 4.15.0-37 wich apparently already has the fix

Animosity022 · October 19, 2018, 11:11am

For what it is worth, I’m using:

Linux gemini 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u6 (2018-10-08) x86_64 GNU/Linux

As I use Debian over Ubuntu.

seuffert · October 19, 2018, 12:35pm

I upgraded the Linux kernel to bionic-proposed 4.15.0-38-generic and tested the same configuration with several different files at random speeds, just as before.

I would never have guessed this could be a kernel bug… but the results do speak for themselves. Not a single bandwidth drop in roughly 1 hour. It is too early to say this issue is completely gone , but it looks very promising.

Thanks everybody for your time and valuable input!

Animosity022 · October 19, 2018, 12:43pm

That’s crazy, but an awesome find overall.

Another reason to stick with Debian

neik · October 21, 2018, 8:35am

Good thing I haven’t upgraded to .36 yet, I guess.
Is that newer kernel version already available in the official repository?

What software do you use for those fancy graphs?

Animosity022 · October 21, 2018, 11:47am

Oh those aren’t fancy graphs. That looks to be simple ones
If you want graphs, check out netdata as I store all that in influx and use grafana to visualize the data.

seuffert · October 21, 2018, 6:00pm

Kernel is in official proposed .

Graphs are made with collectd+cgp (collectd graph panel) ,very lightweight solution

seuffert · October 21, 2018, 6:02pm

Status update to this issue: no more problems since upgrading the Linux Kernel. Just wow…

neik · October 30, 2018, 12:40pm

I went from .33 kernel to .38 and didn’t have the speed issues but a different one.

I download stuff using JDownloader and then extract it directly to the mount. From there I move the files to a specific movies/shows folder so FileBot can do its magic.

Unfortunately, since the kernel update there seems to be something wrong, whenever I move the files into the “sort-folder” some of them are missing.

My mount is rclone_1.42 mount --allow-non-empty --allow-other --dir-cache-time 8760h --poll-interval 0 --vfs-cache-mode writes --vfs-cache-max-age 0m --vfs-read-chunk-size 64M --vfs-read-chunk-size-limit 2G --drive-use-trash=false upload: /media/cry

Rclone version is 1.42 stable. I moved to 1.44 after the kernel update and thought this issue would be related to the new rclone version thus went back to 1.42 but the problem persists.

Anyone having the same issue?

I will move to 1.44 and activate debug log, let’s see if we can narrow this down.

Any hint is appreciated!