Constantly high IOWAIT (add log)

Please excuse any mistakes as English is my second language :wink:

To simplify testing I removed the cache backend.
gdrive -> crypt

Tested by copying one file from the mount to local disk with cp.

What is the problem you are having with rclone?

Constantly high IOWAIT

What is your rclone version (output from rclone version)

v1.51

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Ubuntu 18.04.3 LTS / 64 Bit

Which cloud storage system are you using? (eg Google Drive)

Google Drive

The command you were trying to run (eg rclone copy /tmp remote:tmp)

/usr/bin/screen -dmS gc /usr/bin/rclone mount gdrive_crypt: /mnt/gc \
  --config /mnt/hdd/rclone/rclone.conf \
  --use-mmap --fast-list --vfs-cache-mode writes \
  --cache-dir /mnt/hdd/tmp/gc/cache \
  --allow-other --umask 0 --tpslimit 10 --tpslimit-burst 20 \
  --rc --rc-user xxxx --rc-pass xxxx -vv --log-file=/mnt/hdd/tmp/rclone.log

A log from the command with the -vv flag (eg output from rclone -vv copy /tmp remote:tmp)

http://www.mediafire.com/file/2ilu9k2nu9m9rzp/rclone_1_51.log/file
http://www.mediafire.com/file/wxbyldxn065v3vt/rclone_1_50.2.log/file

Lines like the following have been added to the log:
2020/02/03 23:58:31 DEBUG : media/드라마/스토브리그.E01.191213.1080p-NEXT.mp4: waiting for in-sequence read to 1008861184 for 5ms

Netdata graphs when testing:

v1.51 [graph image]

v1.50.2 [graph image]

Thank you.

Yeah, that seems to be the IOWAIT as it's waiting for data and aborting:

2020/02/03 23:54:40 DEBUG : media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4: waiting for in-sequence read to 393216 for 5ms
2020/02/03 23:54:40 DEBUG : media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4: aborting in-sequence read wait, off=393216
2020/02/03 23:54:40 DEBUG : media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4: ChunkedReader.Read at 520192 length 524288 chunkOffset 0 chunkSize 134217728
2020/02/03 23:54:40 DEBUG : &{media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4 (r)}: >Read: read=131072, err=<nil>
2020/02/03 23:54:40 DEBUG : &{media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4 (r)}: Read: len=131072, offset=655360
2020/02/03 23:54:40 DEBUG : media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4: waiting for in-sequence read to 655360 for 5ms
2020/02/03 23:54:40 DEBUG : &{media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4 (r)}: Read: len=131072, offset=262144
2020/02/03 23:54:40 DEBUG : &{media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4 (r)}: >Read: read=131072, err=<nil>
2020/02/03 23:54:40 DEBUG : media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4: waiting for in-sequence read to 655360 for 5ms
2020/02/03 23:54:40 DEBUG : &{media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4 (r)}: Read: len=131072, offset=524288
2020/02/03 23:54:40 DEBUG : media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4: waiting for in-sequence read to 524288 for 5ms
2020/02/03 23:54:40 DEBUG : media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4: aborting in-sequence read wait, off=655360
2020/02/03 23:54:40 DEBUG : media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4: aborting in-sequence read wait, off=524288
2020/02/03 23:54:40 DEBUG : media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4: ChunkedReader.Read at 1044480 length 1048576 chunkOffset 0 chunkSize 134217728
2020/02/03 23:54:40 DEBUG : &{media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4 (r)}: >Read: read=131072, err=<nil>
2020/02/03 23:54:40 DEBUG : &{media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4 (r)}: Read: len=131072, offset=917504
2020/02/03 23:54:40 DEBUG : media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4: waiting for in-sequence read to 917504 for 5ms
2020/02/03 23:54:40 DEBUG : &{media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4 (r)}: >Read: read=131072, err=<nil>
2020/02/03 23:54:40 DEBUG : &{media/드라마/스토브리그 (2019)/스토브리그.E01.191213.1080p-NEXT.mp4 (r)}: >Read: read=131072, err=<nil>

@ncw - what do you think is causing that?

This is the new code which makes the FUSE mount run asynchronously. This is great for performance normally; however, it means rclone sometimes gets the reads out of order, so it has to wait for the in-sequence read.
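
A quick way to see how often this is happening on a given mount is to count the two messages in a -vv log, something like this (the log path is the one from the mount command in the first post):

grep -c 'waiting for in-sequence read' /mnt/hdd/tmp/rclone.log    # out-of-order reads that had to wait
grep -c 'aborting in-sequence read wait' /mnt/hdd/tmp/rclone.log  # waits where 5ms was not enough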

This problem has two possible solutions:

  1. disable the async mount option - this will revert to 1.50 behaviour
  2. increase the wait time to > 5ms

Option 1 isn't ideal, since the async code is what fixes the slow interactions with the Linux 5.4 kernel. Option 2 is bad because it adds that much delay every time rclone does a seek.

I can make 1 & 2 configurable easily enough, but I think what might be needed is adaptive control. Maybe rclone should measure the interval between out of order requests...

I'll post something with some more controls for you to try in a bit, then we can decide what goes in the point release.

Here is an rclone with 2 new flags for rclone mount

--async-read=false will revert to 1.50 behaviour. This is half the speed on my machine though!

--vfs-read-wait 10ms will set the wait time.

Can you

  1. verify that adding --async-read=false fixes the problem
  2. Experiment with --vfs-read-wait 10ms (without the above flag) and find the smallest wait time that means you don't get the "aborting in-sequence read wait" messages

You might find the number you need is only a bit bigger than 5ms, which is the current default. I found in my testing with Drive that 8ms was enough to get rid of all the aborts.

I've been testing with

rclone mount --async-read=true --vfs-read-wait 8ms -vv TestDrive: /mnt/tmp/ 2>&1 | grep 'abort.*in-sequence'

And

dd if=/mnt/tmp/1GB-zeros of=/dev/null bs=2M

But some real world tests would be great.
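
If it helps, a rough loop like this (an untested sketch, reusing the TestDrive: remote, /mnt/tmp/ mount point and 1GB-zeros test file from the commands above) should sweep a few wait values and count the aborts for each:

for w in 5ms 8ms 10ms 15ms 20ms; do
    rclone mount --async-read=true --vfs-read-wait $w -vv TestDrive: /mnt/tmp/ > mount-$w.log 2>&1 &
    sleep 5                                   # give the mount time to come up
    dd if=/mnt/tmp/1GB-zeros of=/dev/null bs=2M 2>/dev/null
    fusermount -u /mnt/tmp                    # unmount, then wait for rclone to exit
    wait
    echo "$w => $(grep -c 'aborting in-sequence' mount-$w.log) aborts"
done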

https://beta.rclone.org/branch/v1.51.0-009-g6601e0de-fix-iowait-beta/ (uploaded in 15-30 mins)

If this goes well then what I'll do is issue a point release with a higher wait.

I found it to be very random with --vfs-read-wait. I tried 5-15ms; sometimes it would be fine for a while and then get the aborting message. 12ms worked for me consistently.
--async-read=true was definitely faster though. I haven't looked at I/O as this was merely a speed/error test.

Is this the root cause of the constantly high IOWAIT? During my brief test of 1.51, IOWAIT bogged down my system so much that things became slow and I downgraded back to 1.50. So with 1.51, what speed is being improved? Is it only the speed of rclone copy? Because rclone mount was slow due to the IOWAIT.

For the test below I removed the crypt backend; there was no difference with or without it.

Executed: dd if=/mnt/tmp/1g_zero of=/dev/null bs=2M

Increasing the wait time decreases the error output frequency.

Wait time => error output count (somewhat variable)
5ms  => 394
10ms => 63
20ms => 11
50ms => 4

However, IOWAIT does not decrease any further above 10ms. It is also still high compared to the --async-read=false option.

What kernel are you guys running?

felix@gemini:~$ uname -a
Linux gemini 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

Linux ns3085189 5.3.0-28-generic #30~18.04.1-Ubuntu SMP Fri Jan 17 06:14:09 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Linux 4.15.0-42-generic #45-Ubuntu SMP

Linux 2.6.32-042stab142.1 #1 SMP

Linux appledev 2.6.32-48-pve #1 SMP

@bhpark84
Looks like even setting --async-read=false shows an increase in system I/O compared to 1.50.2. Any chance you could compare the two?

root@localhost:~# uname -a
Linux localhost 4.15.0-70-generic #79-Ubuntu SMP Tue Nov 12 10:36:11 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

1.51 (--async-read=false) & 1.50.2

This server is a low-end VPS, so the numbers can look high.

I'm not experiencing any high IOWAIT. Just posting for info.

Linux - 4.15.0-76-generic #86-Ubuntu

Upload debug trace/profile

This was my first time using the debug tools, so please let me know if something is wrong.

Download

Test method

mount  
./rclone mount --async-read=true --vfs-read-wait 10ms -vv gdrive_media: /mnt/tmp/ --rc 2>&1 | grep 'abort.*in-sequence' 

transfer
dd if=/mnt/tmp/1g_zero of=/dev/null bs=2M

get profiles (during the transfer)
wget http://localhost:5572/debug/pprof/trace?seconds=5
wget http://localhost:5572/debug/pprof/profile
wget http://localhost:5572/debug/pprof/goroutine?debug=1
wget http://localhost:5572/debug/pprof/block
wget http://localhost:5572/debug/pprof/mutex
wget http://localhost:5572/debug/pprof/heap
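
In case it's useful, the downloaded files can then be inspected with the standard Go tooling, something like this (wget keeps the query string in the file name, and ./rclone is the binary used for the mount):

go tool trace 'trace?seconds=5'        # execution trace, opens in a browser
go tool pprof ./rclone profile         # CPU profile; try 'top' or 'web' at the prompt
go tool pprof ./rclone heap            # heap profile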

thanks

Added a GitHub issue for tracking purposes: https://github.com/rclone/rclone/issues/3937

I'm attempting to find a work-around for the excess seeking which kills performance.

I'll note that increased IOWAIT is to be expected when using async-read.

The kernel issues two or more read requests at once, but rclone can only satisfy one of them at a time as it is reading sequentially from the file, so the other, by definition, has to wait. I don't think the IOWAIT should affect performance at all, but if rclone has to seek (when you see an "aborting in-sequence read wait" message), that really kills performance. That is what I'm working on fixing!
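
A crude way to feel the difference on a mount is to compare a purely sequential read with a read pattern that forces seeks; the file name and block offsets below are just examples:

dd if=/mnt/tmp/1GB-zeros of=/dev/null bs=2M                  # sequential, reads arrive in order
for off in 0 200 50 400 100; do                              # jumping around forces rclone to seek
    dd if=/mnt/tmp/1GB-zeros of=/dev/null bs=2M count=1 skip=$off 2>/dev/null
done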

I get the increase in IOWAIT, but what I was seeing was a 20-30% increase even on an idle / very low activity mount. I'll see if I can run some tests later.

I'm thinking about merging this to a point release.

I'd like to increase the default wait time from 5ms. I propose 20ms, but maybe that is too high or too low - thoughts?

Why is the --async-read flag unknown in my rclone binary, even though I'm using the latest beta?