Max Pages on Linux with rClone/MergerFS

Thanks for the suggestion. I checked that too; it looks correct and reflects any changes to the max pages setting appropriately.

Is there any way to get debug statements in mergerfs that show the read & write sizes?

I just want to be sure that this is something related to the FUSE implementation and not something in my system that's causing it. That will give me a specific area to debug further.

-d will run it in the foreground and spit out debug details. It's the standard libfuse debug info. I've been meaning to rewrite it to make it more useful, but it should give you the details you want.


Two observations that seem questionable to me:

  • cp issues writes of 128k whereas rsync issues writes of 256k. Could this be a limitation of the programs used rather than the fuse library itself? I was able to replicate the same behavior on both the rclone & mergerfs mounts.
  • Even though rsync uses double the write size, it takes roughly double the time that cp does. This was observed only with the mergerfs mount and not with the rclone mount. I am not sure how this is happening...

Mergerfs Mount Command: mergerfs -d -o async_read=true -o fsname=test-mount /home/darthshadow/max-pages/source=RW /home/darthshadow/max-pages/test-mount

Mergerfs Version:

mergerfs version: 2.28.3
FUSE library version: 2.9.7-mergerfs_2.29.0
fusermount version: 2.9.7
using FUSE kernel interface version 7.29

Commands for cp & rsync:

darthshadow@server:~/max-pages$ time rsync 1G.img test-mount/

real    0m2.468s
user    0m2.938s
sys     0m0.649s
darthshadow@server:~/max-pages$ time cp 1G.img test-mount/

real    0m0.962s
user    0m0.009s
sys     0m0.369s

Have you straced rsync/cp to see what sizes they are using to write? Best to check using dd and explicitly setting obs to the size you want per write call. 128K has traditionally been about the sweet spot for copying; with FUSE, given the increased latency, that is less true.
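For example, something along these lines (a rough sketch; the paths mirror your setup and the sizes are just illustrative):

# confirm the write sizes cp/rsync actually issue
strace -f -e trace=write -o cp.trace cp 1G.img test-mount/
strace -f -e trace=write -o rsync.trace rsync 1G.img test-mount/

# dd lets you pin the write size per call via obs, e.g. 128K vs 1M output blocks
dd if=1G.img of=test-mount/1G.img ibs=1M obs=128K
dd if=1G.img of=test-mount/1G.img ibs=1M obs=1M

Grepping the strace output for write( calls will show the exact buffer size each program uses.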

mergerfs doesn't (currently) support FUSE writeback caching. You'd have to use the master branch for that and enable it. That would make the kernel batch up to 1MB worth of writes and then send it to mergerfs. Details are in the docs.
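If you do test the master branch, enabling it should look roughly like the following; treat the option names (cache.writeback, cache.files) as a sketch and double-check them against the docs there, since writeback caching needs page caching left on:

mergerfs -o cache.files=full,cache.writeback=true,fsname=test-mount /home/darthshadow/max-pages/source=RW /home/darthshadow/max-pages/test-mount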

BTW... if you have caching enabled (which you do, since you aren't disabling it with cache.files=off or direct_io=true) then it is very likely you're getting a getxattr request after every write, which will seriously harm performance. Unfortunately, there isn't a good way to handle it. The kernel doesn't yet cache the results. mergerfs has security_capability=false which can short circuit that lookup so it doesn't go to the underlying filesystem, but it only helps so much. The best is xattr=nosys, but that turns off xattrs altogether. There's no in-between right now.
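For reference, those trade-offs map onto mount options roughly like this (a sketch using the options named above; pick one depending on whether you still need xattrs):

# only short-circuit the security.capability lookups
mergerfs -o security_capability=false,fsname=test-mount /home/darthshadow/max-pages/source=RW /home/darthshadow/max-pages/test-mount

# or disable xattr support entirely (bigger win, but loses xattr-based features)
mergerfs -o xattr=nosys,fsname=test-mount /home/darthshadow/max-pages/source=RW /home/darthshadow/max-pages/test-mount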

Thanks for the suggestions. strace did reveal that cp & rsync were sending the 128k & 256k writes. rsync also showed some blocking in select() which could explain the increased copy time even with the bigger write size.

dd is sending the expected read and write requests based on the params specified, so it is ideal for further testing now.

Unfortunately, I didn't notice any significant difference in the write speeds with & without max-pages on either rclone or mergerfs. However, mergerfs was showing almost 2x the speed of rclone (and could probably go even faster, since my drive's write throughput maxed out at those speeds).

Reads seemed to be similar with & without max-pages in mergerfs, but this is probably because my drive's read throughput is maxed out at those speeds.
rclone was slightly slower than mergerfs but still had a 3x-4x throughput increase compared to without max-pages.


What does ls -ld /mnt/mount look like? Does it look the same as when you use rclone mount?

Ah, that is broken... I'll investigate.

Rclone doesn't currently support xattrs so that might be a good option.

Unfortunately, the side effect right now is that turning xattrs off means no runtime config and loss of certain other features in mergerfs. My roadmap includes finding alternative ways to offer the same features given the general impact xattrs can have.

I was able to run the experiments on a significantly faster disk which shouldn't have the throughput bottlenecks, and the results appear to be similar:

MergerFS:

Read & Write Speeds of ~ 800 MB/s - 1 GB/s with fuse_msg_size set to 32. ~ 100-200 MB/s improvement after setting it to 256.

RClone:

Read Speeds of 150-200 MB/s (with max-pages set to 32 or without it) and an increase to 400-500 MB/s (with max-pages set to 256). Write Speeds of 400-500 MB/s, both with & without max-pages.

@ncw I think we can open this up for further testing once you merge the latest changes into the builds too.


PS: Is the 2x or greater difference between rclone & mergerfs (for both reads & writes) simply due to the fuse libraries or can something be done to improve the performance of rclone?

Good question... It is almost certainly due to excess data copying. A bit of careful profiling might reveal the problem! Go has excellent profiling tools.

Sounds good, I will spend some time the next few days to familiarize myself with those and see if there are any obvious bottlenecks. Any tips or guides you can recommend to get started?

In the meantime, this looks like a good enough performance boost to get started with for general testing by a few more folks.

Check out this bit of the rclone docs: https://rclone.org/rc/#debugging-rclone-with-pprof - that shows how to profile most things and there are some links elsewhere!
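For a quick start, the flow described there looks roughly like this (remote: and /mnt/point are placeholders; the rc server defaults to localhost:5572):

rclone mount remote: /mnt/point --rc &
# 30-second CPU profile while the copy test runs
go tool pprof -seconds 30 http://localhost:5572/debug/pprof/profile
# heap profile
go tool pprof http://localhost:5572/debug/pprof/heap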


I'd keep an eye out for cold spots. When dealing with IO, the problems often aren't things typical profiling will find; it's often cold spots and latency. I've been meaning to do cold spot profiling of mergerfs for a while but haven't gotten around to it, so unfortunately I can't offer you any practical suggestions.

Flamegraphs:

Read: https://drive.google.com/file/d/1m5nHayy07mXkJY_X14H3NQW10qtu0702/view
Write: https://drive.google.com/file/d/1GGix_KLzPRs4egM34T4M0xdvjpmk20Ep/view

Looks like ~75% of the time is spent calculating the MD5 Hash for writes and ~50% of the time for reads.

Ha! Try your tests with --ignore-checksum to stop rclone checking hashes.

Nice graphs! How did you make those - with something like this?

docker run uber/go-torch -u http://<host ip>:8080/debug/pprof -p -t=30 > torch.svg

Maybe I should put instructions into the rclone debugging section on how to do it.

Didn't change anything. Still the same results.

Yes.

I can add those as part of the PR, not a problem.

Sorry, try --no-checksum instead.

Not sure why there are two flags doing nearly the same thing... --no-checksum is a VFS flag

This seems to have helped with the read speeds and they are the same as the speeds with mergerfs now.

However, writes are unaffected, with the majority of the time still being spent in the same md5 block.

You might need to use both of those flags --ignore-checksum and --no-checksum

Which VFS cache mode are you using?

Uploads use the Rcat primitive when not using --vfs-cache-mode writes and use the Copy primitive when using it.

It might be the Rcat primitive is ignoring --ignore-checksum...
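For the next round of tests, a combined invocation might look like this (remote: and /mnt/point are placeholders; just a sketch pulling together the flags mentioned above):

rclone mount remote: /mnt/point --vfs-cache-mode writes --ignore-checksum --no-checksum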