rclone uses fadvise to configure readahead. The problem is that Linux readahead is limited to 128K by default.
see: /sys/block/<your_dev>/queue/read_ahead_kb
128K is a tiny amount for fast connections, especially if you're running many parallel transfers to max out your connection.
On spinning disks this causes a huge amount of thrashing, since the drive is essentially constantly alternating between reads and seeks.
I'm currently experimenting with setting it to 10MB to see if it causes less thrashing.
Personally, I wish this were settable at the file descriptor level rather than the device level, but it's not. For my purposes it's probably OK, since my data device is mostly used for long streaming reads and writes rather than random access.
I think it's more complicated than just that one setting. I'm trying to understand this better; it does seem somewhat configurable, but I don't have a good handle on it yet.
It really seems that Linux's default readahead is not aggressive enough for a high-bandwidth world. 128KB of readahead is nothing; even 1MB is nothing.
I run anywhere from 10-60 parallel transfers to max out my bandwidth, and that really kills the ability to do anything else on the spinning disk (well, RAID5, which I think makes it somewhat worse due to read/write amplification, especially for metadata modifications).
I don't think you want to issue it on every read; that would effectively be the same thrashing. I'd want to experiment with something like: every 25MB read, extend the readahead window by another 50MB (or something along those lines).
I can't experiment right now; my home server is in the middle of an 80-thread upload that brings it to its knees (at least for trying to do anything else on that disk). I'll build and try this with my next upload. I also need a way to effectively measure I/O thrashing besides "feel".
for reference, this is my system under its current heavy load (i.e. lots of reading via rclone while other streaming writes try to happen in parallel).
hopefully this provides a good baseline to compare against once I run my test version.
good news: my PR doesn't break rclone. I can't tell whether it improved the situation; it doesn't seem to have made things worse, but it's hard to tell. I'm only doing a 30-thread transfer right now, and I can't tell if the machine is behaving better or not.
there probably is; the issue is that there are multiple variables:
is sys_readahead actually doing what one expects?
what size should the buffer one requests be?
does it not make a huge difference because we end up with a thundering herd of threads that all call sys_readahead one right after the other, and hence are always competing for the same finite IOP resources?
I should probably write a small Go program that reads a few multi-GB files:
file 1 is read normally and benchmarked as such
file 2 is called with sys_readahead over its entire size and then read normally immediately afterwards. the whole process is benchmarked.
file 3 is called with sys_readahead, then a significant amount of time passes (perhaps >= file 1's time, to allow sys_readahead to do everything), and then the file is read and timed (and only that read is timed).
at least then I'd have an idea if I was calling sys_readahead correctly.
I think your idea of writing a test program is a good one.
I don't know whether sys_readahead is going to help as I believe the linux kernel has pretty good heuristics for detecting streaming reads (can't remember where I read that).
I don't believe it actually creates a buffer, does it? It just requests that the file be read into the page cache if possible?
it doesn't create a buffer, but it does tell the kernel how much to read in; I was referring to that as the buffer.
Basically my issue is optimizing IOP usage, since spinning disks have a really limited number of IOPS. I can upload at 30-40MB/s, and my disk can easily feed a single stream at 30-40MB/s with plenty of IOPS left for other operations. But right now, with 30 parallel transfers running, the disk can't do anything else efficiently.
I'm trying to figure out a way to optimize this. I might go back to the idea I had (at least experimenting wise).
kick off N (say 30) goroutines that each read from their own channel. Have one more goroutine with access to all N channels and their associated files; it reads each file in large chunks and writes to the corresponding channel, cycling through every file chunk by chunk. This limits the disk to a single read at a time (as opposed to N threads doing disk I/O). I want to see how fast it completes vs. N threads all reading (faster/slower...).
Rclone makes some attempt to do this on its own with the async buffer. This reads ahead in 1MB chunks and its purpose was to improve transfers on HDDs specifically. This may be interfering with your tests!