Rclone and linux readahead

rclone uses fadvise to configure readahead. The problem is that linux readahead is limited to 128K by default.

see: /sys/block/<your_dev>/queue/read_ahead_kb

128K is a tiny amount for fast connections, especially if one is doing many threads of transfer to max out one's connection.

This will cause a huge amount of thrashing as it is essentially constantly reading and seeking (on spinning disks).

I'm currently experimenting with setting it to 10MB to see if it causes less thrashing.
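For reference, the per-device values can also be read programmatically; here's a small Go sketch (the sysfs layout is standard, but the device names are machine-specific):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readaheadKB returns the configured readahead (in KB) for every
// block device exposed under /sys/block.
func readaheadKB() map[string]string {
	out := map[string]string{}
	paths, _ := filepath.Glob("/sys/block/*/queue/read_ahead_kb")
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		// path is /sys/block/<dev>/queue/read_ahead_kb
		dev := filepath.Base(filepath.Dir(filepath.Dir(p)))
		out[dev] = strings.TrimSpace(string(data))
	}
	return out
}

func main() {
	for dev, kb := range readaheadKB() {
		fmt.Printf("%s: %s KB\n", dev, kb)
	}
	// To raise a device to 10MB (as root), write "10240" to
	// /sys/block/<your_dev>/queue/read_ahead_kb.
}
```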

Personally, I wish this were settable at the file descriptor level rather than at the device level, but it's not. For my purposes, though, my data device is mostly meant for long streaming reads and writes rather than random access, so this is probably ok.


Interesting, I didn't know you could configure the readahead in /sys

Does it make much difference? How would you measure the improvement?

I think it's more complicated than just that setting. I'm trying to understand this better; it does seem somewhat configurable, but I don't have a good handle on it yet.

It really seems that the linux default readahead is not aggressive enough for a high-bandwidth world; 128KB of readahead is nothing, and even 1MB is nothing.

I do anywhere from 10-60 parallel transfers to max out my bandwidth, and that really kills the ability to do anything else on the spinning disks (well, RAID5, which I think makes it somewhat worse due to read/write amplification, especially for metadata modifications).

See this issue:

my issue is reading many streams not writing many streams.

I'm wondering if leveraging sys_readahead() might be of use


i.e. provide an alternative to the fadvise wrapper that already exists in local, but using readahead to continually keep a readahead buffer.

So here, call readahead to get the next x MB of data in the cache?

don't think one wants to do it on every read, that would effectively be the same thrashing. I'd want to experiment: say every 25MB, extend it by 50MB (or something along those lines)

Will you have a go?

if I get a chance :slight_smile: will see as time permits.

found some free time at GopherCon Israel to try and knock something out in between sessions

something along the lines of this (extract the current Read() to a local_other.go and this would be local_linux.go):

the consts are just for proof of concept, would want a way to specify them at the command line

also added a readAheadTill member variable to the localOpenFile struct, unsure if there's a better way to do it.

Basically would want to get this to work as-is and see if it makes a difference.

// +build linux

package local

import (
	"io"
	"syscall"

	"github.com/pkg/errors"

	"github.com/rclone/rclone/fs"
	"github.com/rclone/rclone/fs/fserrors"
)

const (
	// proof-of-concept values - these would want command line flags
	READAHEAD   = 50 * 1024 * 1024 // how far ahead to ask the kernel to read
	REREAD      = 25 * 1024 * 1024 // top up when within this many bytes of the mark
	doReadahead = true
)

// Read bytes from the object - see io.Reader
func (file *localOpenFile) Read(p []byte) (n int, err error) {
	if doReadahead {
		curPos, err := file.fd.Seek(0, io.SeekCurrent)
		if err == nil {
			if file.readAheadTill == 0 || (file.readAheadTill-curPos) < REREAD {
				r0, _, errno := syscall.Syscall(syscall.SYS_READAHEAD, file.fd.Fd(), uintptr(file.readAheadTill), READAHEAD)
				if r0 != 0 {
					fs.Logf(file.o, "failed to execute sys_readahead: %v", errno)
				} else {
					file.readAheadTill += READAHEAD
				}
			}
		} else {
			fs.Logf(file.o, "Couldn't get current seek point, skipping readahead")
		}
	}

	if !file.o.fs.opt.NoCheckUpdated {
		// Check if file has the same size and modTime
		fi, err := file.fd.Stat()
		if err != nil {
			return 0, errors.Wrap(err, "can't read status of source file while transferring")
		}
		if file.o.size != fi.Size() {
			return 0, fserrors.NoLowLevelRetryError(errors.Errorf("can't copy - source file is being updated (size changed from %d to %d)", file.o.size, fi.Size()))
		}
		if !file.o.modTime.Equal(fi.ModTime()) {
			return 0, fserrors.NoLowLevelRetryError(errors.Errorf("can't copy - source file is being updated (mod time changed from %v to %v)", file.o.modTime, fi.ModTime()))
		}
	}

	n, err = file.in.Read(p)
	if n > 0 {
		// Hash routines never return an error
		_, _ = file.hash.Write(p[:n])
	}
	return n, err
}

and made an untested PR of the above code (the CI will be the first test of it)

That looks about right!

Does it speed things up is the important question?

I'd factor a new function say readAhead into a readahead_linux.go and make an empty one in readahead_other.go.

I note you could count the bytes read, which would save calling Seek and hence potentially save a syscall.

can't experiment right now, my home server is in the middle of an 80 thread upload that brings it to its knees (at least when trying to do anything else on that disk); will build and try this with my next upload. need a way to effectively measure io thrashing besides "feel"


for reference, this is my system under its current heavy load (i.e. lots of reading via rclone and other streaming writes trying to happen in parallel)

hopefully provides a good baseline to compare against once I run my test version

02/03/20 18:59:03
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
               314.00  151.00     33.4M     18.7M     0.00     0.00   0.0%   0.0%  235.69  810.44 254.10   109.0k   126.8k   2.15 100.0%

02/03/20 18:59:04
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
               442.00  235.00     47.3M     29.1M     0.00     4.00   0.0%   1.7%  426.04  777.00 267.26   109.7k   126.6k   1.48 100.0%

02/03/20 18:59:05
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
               464.00  147.00     55.1M     16.0M     0.00    28.00   0.0%  16.0%  195.67  599.29 257.23   121.6k   111.8k   1.64 100.0%

02/03/20 18:59:06
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
               278.00  151.00     34.4M     18.2M     0.00     0.00   0.0%   0.0%  380.00 1018.38 263.02   126.6k   123.7k   2.33 100.0%

02/03/20 18:59:07
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
               626.00  256.00     68.2M     31.5M     0.00     0.00   0.0%   0.0%  248.36  669.61 270.72   111.5k   126.0k   1.13 100.0%

good news: my PR doesn't break rclone. I can't tell if it improved the situation though :confused: it doesn't seem to have made things worse, but I'm only doing a 30 thread transfer right now, so it's hard to tell whether the machine is behaving better or not


I wonder if there is a more scientific way of measuring it?

there probably is, the issue is that there are multiple variables

  1. is sys_readahead actually doing what one expects?

  2. what size should the buffer one creates be?

  3. does it not make a huge difference because we end up with a thundering herd - N threads all calling sys_readahead one right after the other - and hence always competing for the same finite iop resources?

I should probably write a small go program that reads a few multi-GB files.

file 1 is read normally and benchmarked as such
file 2 is called with sys_readahead on its entire size and then read normally immediately afterwards; the whole process is benchmarked
file 3 is called with sys_readahead, but a significant amount of time passes (perhaps >= file 1's time, to allow sys_readahead to do everything) before the file is read and timed (and only the read is timed).

at least then I'd have an idea if I was calling sys_readahead correctly.

I think your idea of writing a test program is a good one.

I don't know whether sys_readahead is going to help as I believe the linux kernel has pretty good heuristics for detecting streaming reads (can't remember where I read that).

I don't believe it actually creates the buffer does it, it just requests that the file should be read into cache if possible?

it doesn't create a buffer, but it does tell the kernel how much to read in. I was referring to that as a buffer.

Basically my issue is optimizing iop usage, as spinning disks have a really limited number of iops. I can upload at 30-40MB/s, and my disk can easily sustain a single 30-40MB/s stream with plenty of iops left for other operations. But like right now, where I'm doing 30 transfers, it just kills the disk's ability to do anything else efficiently.

I'm trying to figure out a way to optimize this. I might go back to the idea I had (at least experimenting wise).

kick off N (say 30) goroutines that each read from a unique channel. Have another goroutine that has access to all N channels and the files associated with them: it reads each file in large chunks and writes to its channel, iterating through every file chunk by chunk. This limits it to a single read at a time (as opposed to N threads doing disk io). Want to see how fast it completes vs N threads all doing their own reading (faster/slower...)

Rclone makes some attempt to do this on its own with the async buffer. This reads ahead in 1MB chunks and its purpose was to improve transfers on HDDs specifically. This may be interfering with your tests!