Thanks for the suggestion. I checked that too; it seems correct and reflects any changes to the max pages setting appropriately.
Is there any way to get debug statements in mergerfs that show the read & write sizes?
I just want to be sure that this is something related to the FUSE implementation and not something in my system that's causing it. That will give me a specific area to debug further.
-d will run it in the foreground and spit out debug details. It's the standard libfuse debug info. I've been meaning to rewrite it to make it more useful, but it should give you the details you want.
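For example, an invocation might look like this (the branch paths and mountpoint below are placeholders, not from this setup):

```shell
# -f keeps mergerfs in the foreground; -d enables libfuse debug output,
# which prints each FUSE request, including read & write sizes.
# /mnt/disk1, /mnt/disk2, and /mnt/pool are illustrative paths only.
mergerfs -f -d /mnt/disk1:/mnt/disk2 /mnt/pool
```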
cp issues writes of 128k whereas rsync issues writes of 256k. Could this be a limitation of the program used rather than the FUSE library itself? I was able to replicate the same behavior on both the rclone & mergerfs mounts.
Even though rsync has double the write size, it takes twice as long as cp. I observed this only with the mergerfs mount, not with the rclone mount. I am not sure how this is happening...
mergerfs version: 2.28.3
FUSE library version: 2.9.7-mergerfs_2.29.0
fusermount version: 2.9.7
using FUSE kernel interface version 7.29
Commands for cp & rsync:
darthshadow@server:~/max-pages$ time rsync 1G.img test-mount/
real 0m2.468s
user 0m2.938s
sys 0m0.649s
darthshadow@server:~/max-pages$ time cp 1G.img test-mount/
real 0m0.962s
user 0m0.009s
sys 0m0.369s
Have you straced rsync/cp to see what sizes they are using per write? It's best to check using dd and explicitly setting obs to the size you want per write call. 128K has traditionally been about the sweet spot for copying; with FUSE, given the increased latency, that's less true.
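A minimal sketch of that approach (the paths, count, and sizes below are placeholders, not measurements from this thread):

```shell
# bs sets both the read and write block size, so each write() call is
# exactly 128 KiB; use obs alone if you only want to control the output side.
# /tmp/dd-test stands in for a file on the FUSE mount; point it at the
# mount (e.g. test-mount/out.img) for a real measurement.
dd if=/dev/zero of=/tmp/dd-test bs=128K count=512 conv=fsync

# To confirm what cp/rsync actually issue per call, trace their write() syscalls:
#   strace -f -e trace=write cp 1G.img test-mount/
rm /tmp/dd-test
```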
mergerfs doesn't (currently) support FUSE writeback caching. You'd have to use the master branch for that and enable it. That would make the kernel batch up to 1MB worth of writes before sending them to mergerfs. Details are in the docs.
BTW... if you have caching enabled (which you do, since you aren't disabling it via cache.files=off or direct_io=true), then it is very likely you're getting a getxattr request after every write, which will seriously harm performance. Unfortunately, there isn't a good way to handle it; the kernel doesn't yet cache the results. mergerfs has security_capability=false, which can short-circuit the lookup so it doesn't go to the underlying filesystem, but it only helps so much. The best option is xattr=nosys, but that turns off xattrs altogether. There's no in-between right now.
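To make the trade-offs concrete, the options mentioned above could be applied like this (a sketch only; branch paths and mountpoint are placeholders, and which option is appropriate depends on your workload):

```shell
# Disable page caching entirely, which also avoids the getxattr-per-write
# pattern, at the cost of losing the kernel page cache:
mergerfs -o cache.files=off /mnt/disk1:/mnt/disk2 /mnt/pool

# Keep caching but short-circuit security.capability lookups so they
# don't reach the underlying filesystem (helps, but only so much):
mergerfs -o security_capability=false /mnt/disk1:/mnt/disk2 /mnt/pool

# Disable xattrs altogether (fastest for this pattern, but xattr-based
# features such as runtime config stop working):
mergerfs -o xattr=nosys /mnt/disk1:/mnt/disk2 /mnt/pool
```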
Thanks for the suggestions. strace did reveal that cp & rsync were sending the 128k & 256k writes respectively. rsync also had some blocking on select, which could explain the longer copy time despite the bigger write size.
dd is sending the expected read and write requests based on the specified params, so it is ideal for further testing.
Unfortunately, I didn't notice any significant difference in write speeds with & without max-pages on either rclone or mergerfs. However, mergerfs was showing almost 2x the speed of rclone (and could probably go even faster, since my drive's write throughput was maxed out at those speeds).
Reads seemed similar with & without max-pages in mergerfs, but this is probably because my drive's read throughput is maxed out at those speeds.
rclone was slightly slower than mergerfs but still had a 3x-4x throughput increase compared to without max-pages.
Unfortunately, the side effect right now is that turning xattrs off means no runtime config and the loss of certain other features in mergerfs. My roadmap includes finding alternative ways to offer the same features, given the general impact xattrs can have.
I was able to try the experiments on a significantly faster disk, which shouldn't have the throughput bottlenecks, and the results appear similar:
MergerFS:
Read & Write Speeds of ~ 800 MB/s - 1 GB/s with fuse_msg_size set to 32. ~ 100-200 MB/s improvement after setting it to 256.
RClone:
Read Speeds of 150-200 MB/s (with max-pages set to 32 or without it) and an increase to 400-500 MB/s (with max-pages set to 256). Write Speeds of 400-500 MB/s, both with & without max-pages.
@ncw I think we can open this for further testing once you merge the latest changes too for the builds.
PS: Is the 2x or greater difference between rclone & mergerfs (for both reads & writes) simply due to the fuse libraries or can something be done to improve the performance of rclone?
Good question... It is almost certainly due to excess data copying. A bit of careful profiling might reveal the problem! Go has excellent profiling tools.
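As a possible starting point (the package path and benchmark name below are illustrative placeholders, not actual rclone targets), Go's built-in pprof tooling can capture a CPU profile from a benchmark run:

```shell
# Run a benchmark with CPU profiling enabled; -run=NONE skips the unit tests.
# ./vfs and BenchmarkWrite are made-up names for illustration.
go test -run=NONE -bench=BenchmarkWrite -cpuprofile=cpu.prof ./vfs
# Inspect the profile; 'top' lists the hottest functions, 'web' renders a graph.
go tool pprof cpu.prof
```

Keep in mind that a CPU profile mostly surfaces hot spots; latency-bound IO problems may be easier to see with execution tracing (`go tool trace`) instead.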
Sounds good, I will spend some time over the next few days familiarizing myself with those and seeing if there are any obvious bottlenecks. Any tips or guides you can recommend to get started?
In the meanwhile, this looks like a good enough performance boost to get started with for general testing by a few more folks.
I'd keep an eye out for cold spots. When dealing with IO, the problems often aren't things typical profiling will find; it's often cold spots and latency. I've been meaning to do cold-spot profiling of mergerfs for a while but haven't gotten around to it, so unfortunately I can't offer any practical suggestions.