hey! been using rclone to sync stuff to a WebDAV server, which mostly works great. Love how flexible this thing is!
but I've noticed that it keeps opening a new HTTP session / TCP connection for every request, which causes a bit of a slowdown... and it looks like opening new connections doesn't parallelize, so for really small files you get the same transfer speed regardless of how many threads you use (--transfers 1 --checkers 1 runs at the same speed as --transfers 8 --checkers 8, for instance).
the same thing happens if you run rclone on both ends, so I'll provide an example with that.
am I right in guessing that Connection: keep-alive / session reuse is not supported over WebDAV? or did I perhaps forget an important flag somewhere?
Rclone should be using persistent HTTP connections in the client at least. Not so sure about the server, but I think it should be too.
Looking at your wireshark pic I see the same port number repeated, so there is some connection re-use, but not much - I agree. Rclone sets the number of idle connections to keep to --checkers + --transfers + 1 I think, so with --transfers 1 --checkers 1 there should only be 3 persistent connections.
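In net/http terms that's roughly this (an illustration only, not rclone's exact code):

```go
// Illustration only, not rclone's exact code: in Go's net/http the size of
// the keep-alive ("idle") connection pool is a Transport setting, so
// --checkers + --transfers + 1 persistent connections looks roughly like this.
package main

import (
	"fmt"
	"net/http"
)

func newClient(checkers, transfers int) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			// idle keep-alive connections kept per host for re-use
			MaxIdleConnsPerHost: checkers + transfers + 1,
		},
	}
}

func main() {
	client := newClient(1, 1) // --checkers 1 --transfers 1 => 3 idle connections
	fmt.Println(client.Transport.(*http.Transport).MaxIdleConnsPerHost)
}
```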
I think this is more likely to be a server problem than a client problem.
When I try a test with a nextcloud instance rclone opens --transfers connections and keeps re-using them.
When I try a test with the rclone serve webdav I see exactly what you see.
I checked the server code - we set the idle timeout to 60 seconds, so the connections should hang around that long for re-use.
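The setting I mean is the server's IdleTimeout - the sketch below is just an illustration of that knob, not the actual `rclone serve webdav` code:

```go
// Illustration of the setting being described, not the actual
// `rclone serve webdav` code: idle keep-alive connections are kept open
// for IdleTimeout before the server closes them.
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:        ":8080",
		Handler:     http.NotFoundHandler(), // stand-in for the real WebDAV handler
		IdleTimeout: 60 * time.Second,       // connections may idle this long for re-use
	}
	log.Fatal(srv.ListenAndServe())
}
```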
It looks like the server closes the connections after every two transactions. Or at least the connections get closed...
Interestingly when I ran the client with -vv --dump bodies (which will tend to serialize the connections) it re-uses the connection perfectly.
Can you use your wireshark skills to find out whether the client or the server initiates the close of the TCP connection? That would be useful information.
The curious thing is I first noticed this behavior when running rclone against my own webdav server, which I think is fairly compliant -- at least davfs2 did everything over one connection when I used that instead of rclone... but maybe davfs2 is just very forgiving
I recall checking which side initiated the connection shutdown, and I'm 99% sure it was rclone-client -- but I'll double-check that as soon as I'm home. I have not tried running davfs2 against rclone-server, so I want to try that as well (unless you beat me to it!)
alright, so while running an rclone sync against my own server, I noticed that rclone-client will disconnect after receiving a response to a MKCOL or PUT if that response has a non-zero Content-Length. The RFC is a bit ambiguous on whether this is permitted, so I've gone ahead and removed the response bodies from my server. One problem down :>
but, looks like that's not all -- rclone-client may randomly panic-close a connection when it receives a 207 Multi-Status response -- it sends the server a TCP packet with the RST flag (the FIN flag would have been a normal shutdown request). I can't tell why; the response headers and body are identical to all the other ones, and sometimes it happens in bursts for a handful of files.
there are some more peculiarities to that last issue:

- no warning or anything when it happens, not even in -vv
- it happens much more often over HTTPS than over HTTP
- like you mentioned, it goes away if you add --dump bodies to the client
- and it also goes away if you simulate a 10ms latency to the network: `sudo tc qdisc replace dev lo root netem delay 10ms` (remove it with `sudo tc qdisc delete dev lo root`)
could be a race perhaps? I've done all the tests with --transfers 1 --checkers 1 for simplicity, but it behaves similarly with other values too
regarding davfs2 - it successfully does all uploads (PUT, UNLOCK) over a single session when copying files to an rclone webdav server. There's a batch of initial calls (HEAD, MKCOL, LOCK) which are done on separate connections, but it behaves the same for all webdav servers... maybe LOCK is an expensive operation on some servers and that's why they did it that way? Just guessing...
If you read this bit of the Go docs, that will start to make sense:

> If the returned error is nil, the Response will contain a non-nil Body which the user is expected to close. If the Body is not both read to EOF and closed, the Client's underlying RoundTripper (typically Transport) may not be able to re-use a persistent TCP connection to the server for a subsequent "keep-alive" request.
So if rclone is sent stuff in a body that it doesn't read then the connection can't be re-used.
Luckily this is all abstracted through an internal library and adding a bit of draining the bodies there appears to have fixed the problems with the rclone server at least.
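For reference, the pattern is roughly this - a sketch of the general net/http technique, not the actual change in rclone's internal library:

```go
// Sketch of the general "drain then close" pattern -- not the actual change
// in rclone's internal library. Reading the body to EOF lets the Transport
// return the connection to its keep-alive pool instead of closing it.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func drainAndClose(resp *http.Response) {
	if resp == nil || resp.Body == nil {
		return
	}
	_, _ = io.Copy(io.Discard, resp.Body) // consume any unread body
	_ = resp.Body.Close()
}

func main() {
	resp, err := http.Get("https://example.com/") // placeholder URL, just to demo the pattern
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	drainAndClose(resp) // connection can now be re-used for the next request
}
```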
This is one bit of the puzzle I'm not sure about. I think the http library can drain http connections on its own (but I can't find that bit of code). Maybe it only does that if they have been idle for a while.
The random disconnects would happen even when the server replied without a body (content-length 0), but the beta you linked seems to have fixed those as well, so everything regarding connection reuse looks solid now :>
However, I am surprised to say there was no performance gain. There might be some hints in this wireshark screenshot, which was taken with rclone on both ends and the client running with --transfers 8 --checkers 8:

- when rclone-client receives a server response, it seems to wait for "exactly" 0.01 seconds before sending the next request
- it doesn't parallelize over multiple connections even with --transfers 8; instead it multiplexes all the requests onto one TCP connection, running them all in series
I want to see if I can figure out what's causing the 0.01 sec delay, but I don't think I'll have a good shot at the multithreading...
Also let me know if you prefer to handle the remaining issues in a different thread or place :>
I've merged the keepalive fix to master now which means it will be in the latest beta in 15-30 minutes and released in v1.63
If you want to make a PR to make minsleep configurable, you could copy these ones from the google drive backend:
```
--drive-pacer-burst int            Number of API calls to allow without sleeping (default 100)
--drive-pacer-min-sleep Duration   Minimum time to sleep between API calls (default 100ms)
```
They don't need to be configurable in most backends but webdav has a lot of different providers.
Burst doesn't seem to be part of the default pacer however -- would it be alright to only expose min-sleep as a setting? or would it be preferable to add the burst parameter to the default pacer, so that other backends can use it as well?
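For my own understanding, here's a plain-Go sketch of what I take min-sleep and burst to mean -- this is not rclone's lib/pacer, and the names are made up; it's just the semantics of the two flags as I read them:

```go
// Hand-rolled illustration of min-sleep / burst semantics -- not rclone's
// lib/pacer: a call normally waits until minSleep has passed since the
// previous call, but up to `burst` calls may go through without sleeping.
package main

import (
	"fmt"
	"sync"
	"time"
)

type simplePacer struct {
	mu       sync.Mutex
	minSleep time.Duration
	tokens   int // remaining "no sleep" calls, starts at the burst size
	last     time.Time
}

func newSimplePacer(minSleep time.Duration, burst int) *simplePacer {
	return &simplePacer{minSleep: minSleep, tokens: burst}
}

// Wait blocks until the next call is allowed to go out.
func (p *simplePacer) Wait() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.tokens > 0 {
		p.tokens-- // still within the burst allowance, no sleep
	} else if gap := time.Since(p.last); gap < p.minSleep {
		time.Sleep(p.minSleep - gap) // enforce the minimum spacing
	}
	p.last = time.Now()
}

func main() {
	p := newSimplePacer(100*time.Millisecond, 3) // like min-sleep 100ms, burst 3
	for i := 0; i < 6; i++ {
		p.Wait()
		fmt.Println("request", i, "at", time.Now().Format("15:04:05.000"))
	}
}
```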