Rclone shows incorrect amount of data transfer

What is the problem you are having with rclone?

Syncing from one WebDAV remote to another. There is only 900 GB of data, but it already shows that it has synced 1.6 TB. My WebDAV and the Hetzner Storage Box control panel only show 600 GB of usage.

The displayed average transfer speed (10 MB/s) is plausible, and at that rate it would indeed have moved about 1.7 TB by now. What is it copying???

What is your rclone version (output from rclone version)

1.51.0

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Debian 9, 64 bits

Which cloud storage system are you using? (eg Google Drive)

WebDAV (backed by Nextcloud)

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone sync "remote:" "dest:" -P -v --checkers 8 --transfers 8

A log from the command with the -vv flag (eg output from rclone -vv copy /tmp remote:tmp)

Transferred: 1.626T / 1.666 TBytes, 98%, 10.694 MBytes/s, ETA 1h4m33s
Errors: 166 (retrying may help)
Transferred: 55370 / 65386, 85%
Elapsed time: 44h17m29.1s
Transferring:

I started it again with -vv and got a bit more detail. Looks like it's trying the same files over and over?

2020-02-05 12:33:28 DEBUG : pacer: Reducing sleep to 632.8125ms
2020-02-05 12:33:28 DEBUG : OSes/Windows/Windows 10.iso: Received error: cURL error 28: Operation timed out after 30000 milliseconds with 0 out of -1 bytes received (see http://curl.haxx.se/libcurl/c/libcurl-errors.html): Sabre\DAV\Exception: 500 Internal Server Error - low level retry 7/10
2020-02-05 12:33:28 DEBUG : pacer: Reducing sleep to 474.609375ms
2020-02-05 12:33:28 DEBUG : Videos/G0018115.avi: Received error: cURL error 28: Operation timed out after 30000 milliseconds with 0 out of -1 bytes received (see http://curl.haxx.se/libcurl/c/libcurl-errors.html): Sabre\DAV\Exception: 500 Internal Server Error - low level retry 7/10
2020-02-05 12:33:29 DEBUG : pacer: Reducing sleep to 355.957031ms
2020-02-05 12:33:29 DEBUG : Videos/G0018115-001.avi: Received error: cURL error 28: Operation timed out after 30003 milliseconds with 0 out of -1 bytes received (see http://curl.haxx.se/libcurl/c/libcurl-errors.html): Sabre\DAV\Exception: 500 Internal Server Error - low level retry 7/10
2020-02-05 12:33:29 DEBUG : pacer: Reducing sleep to 266.967773ms
2020-02-05 12:33:29 DEBUG : OSes/Windows/en_windows_server_2016_x64_dvd_9718492.iso: Received error: cURL error 28: Operation timed out after 30002 milliseconds with 0 out of -1 bytes received (see http://curl.haxx.se/libcurl/c/libcurl-errors.html): Sabre\DAV\Exception: 500 Internal Server Error - low level retry 7/10
2020-02-05 12:33:30 DEBUG : pacer: Reducing sleep to 200.225829ms
2020-02-05 12:33:30 DEBUG : OSes/Linux/ubuntu-19.10-desktop-amd64.iso: Received error: cURL error 28: Operation timed out after 30000 milliseconds with 0 bytes received (see http://curl.haxx.se/libcurl/c/libcurl-errors.html): Sabre\DAV\Exception: 500 Internal Server Error - low level retry 7/10
2020-02-05 12:33:30 DEBUG : pacer: Reducing sleep to 150.169371ms
2020-02-05 12:33:30 DEBUG : pacer: Reducing sleep to 112.627028ms
2020-02-05 12:33:31 DEBUG : pacer: Reducing sleep to 84.470271ms
2020-02-05 12:33:35 DEBUG : pacer: Reducing sleep to 63.352703ms
2020-02-05 12:33:36 DEBUG : pacer: Reducing sleep to 47.514527ms
2020-02-05 12:33:36 DEBUG : pacer: Reducing sleep to 35.635895ms
2020-02-05 12:33:36 DEBUG : pacer: Reducing sleep to 26.726921ms
2020-02-05 12:33:37 DEBUG : pacer: Reducing sleep to 20.04519ms

More details. I watched a bit more closely, and it seems that once a transfer gets to 99% there is a big delay (copying the file from /tmp, maybe?) and the displayed speed slowly decays and becomes incorrect (perhaps it averages over the last X seconds even though the transfer has actually stalled).

It would also appear that rclone stops responding on ALL curl threads, because it will trigger a timeout error for all of them simultaneously:

rclone will then do a "low level retry", and it always ends up with the same result: one of the large files takes too long to finalize and causes a timeout on all active threads. I believe more threads make this behavior even worse.

I found out that Nextcloud has a hard-coded 30-second timeout (for some silly reason); working around it required changing a line in the code according to this post

Once I made this change Nextcloud stopped hanging up after the delay and it allowed all the files to continue correctly.

Is there some improvement rclone can make to not stop responding for so long as it finalizes a file - especially on all of the other threads that are not finishing?

As you have correctly diagnosed, the inflated transfer size here is due to errors.
Rclone counts every file it ended up deciding to transmit - including anything it needed to re-transmit.
So when errors occur, the reported total can be larger than the actual data size. Usually the difference is unnoticeable - but in this case the errors look fairly serious.
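To illustrate the accounting (this is a sketch, not rclone's actual code): if a file is re-sent after a low-level retry, each attempt's bytes land in the running total, so the "Transferred" figure drifts above the real data size.

```python
# Sketch (not rclone's implementation): why the "Transferred" byte count
# can exceed the real data size when retries re-send whole files.
def total_transferred(files, retries):
    """files: {name: size_in_bytes}; retries: {name: number_of_re-sends}.
    Every attempt's bytes are added to the running total, so a file that
    is retried is counted more than once."""
    total = 0
    for name, size in files.items():
        attempts = 1 + retries.get(name, 0)
        total += size * attempts
    return total

files = {"a.iso": 4_000_000_000, "b.avi": 1_000_000_000}
# a.iso was retried twice, so its 4 GB is counted three times:
print(total_transferred(files, {"a.iso": 2}))  # 13000000000
```

With 166 errors across many multi-gigabyte files, an extra 700 GB or so in the counter is entirely plausible.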

You can use --timeout 1h (or some other duration like 10m) to control when rclone times out connections on its end. However, this will not control anything on the server side (which seems to be the problem here). If an option to control that exists, it will usually be a setting in the remote or other interface that you use - and it looks like you found the right one in this case. --timeout can definitely be useful if you move huge files across mounts with write-cache etc., so that rclone does not give up waiting for the file to arrive.
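For reference, a sketch of the original command with a longer client-side timeout added (the remote names are placeholders from this thread; adjust durations to taste):

```shell
# Same sync as above, but with a 1 hour I/O timeout on rclone's side.
# --contimeout separately controls the connection-establishment timeout.
rclone sync "remote:" "dest:" -P -v --checkers 8 --transfers 8 \
  --timeout 1h --contimeout 60s
```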

I don't think it would really be possible for rclone to override that from its core functions, since what was going wrong was happening in a different layer of the system that isn't directly under its purview.

This delay is caused by nextcloud "finalising" the file. As you've discovered you can tweak the nextcloud timeouts to fix it. There isn't a lot rclone can do here unfortunately.

The error message ("Received error: cURL error 28: Operation timed out after 30000 milliseconds with 0 bytes received") you see is from Nextcloud, not rclone. Rclone doesn't use libcurl.

Great!

I think this is entirely a problem with Nextcloud, unfortunately. Until Nextcloud returns 200 OK to the upload, rclone doesn't know whether it has been successful or not, so it can't move on to the next one.

Just to clarify - I updated the timeout on the remote server that I am copying files from - not on the local server that we are copying files to. So I do not think it is the local server having a timeout issue while finalizing a file - I think it is the remote server that is forcibly closing the connection because it has not heard anything from us (while we are finalizing the file).

Considering this, I can only think of two possibilities:
1 - my local Nextcloud pauses all WebDAV copying while it is finalizing a file on any of the threads, and thus rclone waits on it, and then the remote server hangs up on us.
2 - rclone pauses all WebDAV threads while waiting for confirmation that a single file has been copied, and thus the remote server hangs up on us.

On the target (local) server I see a number of php-fpm processes roughly equivalent to the transfers and checkers. This leads me to believe that it is not Nextcloud that is hanging all threads (processes) while one finishes - but rather rclone - since I only see a single rclone process.

I may be very wrong in my assumptions and I mean no offense - I just want to understand better and help improve rclone for future users. If in the end it is caused by Nextcloud's WebDAV implementation I would not be surprised.

Ah, OK!

Rclone won't close the incoming stream until the outgoing stream has finished.

I don't think 2) should be happening unless there is a misplaced lock in rclone. The WebDAV backend doesn't need any locking, so I don't think that's the case.

Not sure why Nextcloud would do 1), again unless there is some central locking on something.

That is what I'd expect.

Rclone is running threads (or more accurately goroutines) internally, there will be one corresponding to each of those php-fpm processes. You can use the debug facilities if you want to see what goroutines are running.
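For example (a sketch, assuming the remote control API's default listen address of localhost:5572 - see the rc docs for details):

```shell
# Enable the remote control API, then fetch a dump of running goroutines
# from the pprof endpoint it exposes.
rclone sync "remote:" "dest:" --rc &
curl 'http://localhost:5572/debug/pprof/goroutine?debug=1'
```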

No offense taken! I'd like to work out what is going on here too. I don't think we've got to the bottom of it yet!

I'll just note that timeouts while finalizing big files are a recurring theme for rclone across various providers - however it is normally the destination of the file that times out.

I changed it to copy 1 file at a time and the problem went away. I think I found the cause of this, and perhaps something that was making my other reported issues even worse.

My server arrangement is a server with a 40 GB drive (30 GB free) running Nextcloud, but using external storage. At first I tried S3 external storage, but that had issues with modification dates. Now I'm using a Hetzner Storage Box, which supports either WebDAV or CIFS. So the file transfer really looks like this:

Local Storage -> Remote Nextcloud (WebDAV) -> rclone -> Local Nextcloud (WebDAV) -> External Storage (WebDAV/CIFS)

The /tmp directory grows with files from php-fpm as they are being downloaded by rclone. With the files listed above, that's already 22 GB. When I was monitoring a single file being copied, I noticed that the /tmp file was sometimes 40% bigger than the file it was copying. It could be because it adds some extra buffer space, or because it is a bit slow to clear out old entries, or perhaps there is a secondary temporary file in use by Nextcloud or rclone that further eats up space.

So whatever the cause is, it seems like it is having a storage "race condition" when it gets too many big files at one time.

I have a suggestion for rclone that would help, and not just in my case. Is it possible for rclone to "blend" the files it is sending? When I have 4 (or 8, 12, etc.) threads copying files, it will get stuck on 4 large files and max out my 10 MB/s network usage. This makes all the files take longer to finish and spend more time in the /tmp directory. Also, when it gets to a folder with thousands of small files, the speed plummets due to per-file overhead. Wouldn't it be better to balance it so that a few threads maximize bandwidth use (~90%?) while the other threads take care of all the small files, instead of saving them until the end?

I think this would improve performance by reducing the strain on the /tmp directory and increasing the overall throughput of the sync.

This is very similar to a suggestion made by @thestigma

I've made part of the machinery for this with an --order-by flag. This needs a bit of co-operation with the upload threads to make sure some only pick biggest and some pick smallest.

Sounds like a good suggestion :wink:

Maybe it could be simplified to overload the existing --transfers to accept a list of characters?

Something like:
--transfers SSss
would mean 2 threads picking large files first (SS) and 2 threads picking small files first (ss).

Some set flags like size, modified, created, name with both an upper and lower case to indicate descending or ascending sorting (SsMmCcNn) could make it pretty easy for the user to specify how many threads of each type they want.

I was thinking of something like --order-by size,mixed,50 which means order by size but with 50% of the threads on large and 50% on small or something like that with 50% being the default if not specified.
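The "mixed" idea above could be sketched like this (hypothetical, not rclone's implementation - the function name and split logic are mine): sort the queue by size once, then have one fraction of the transfer threads drain it from the large end while the rest drain it from the small end.

```python
# Sketch of a "mixed" size ordering: some transfer threads take the
# largest remaining file, the rest take the smallest.
from collections import deque

def assign(files, transfers=4, big_fraction=0.5):
    """files: {name: size}. Returns one pick-list per thread: the first
    big_fraction of threads drain the large end of the size-sorted
    queue, the remaining threads drain the small end."""
    q = deque(sorted(files.items(), key=lambda kv: kv[1]))  # ascending size
    n_big = max(1, int(transfers * big_fraction))
    lanes = [[] for _ in range(transfers)]
    i = 0
    while q:
        if i % transfers < n_big:
            lanes[i % transfers].append(q.pop())      # largest-first thread
        else:
            lanes[i % transfers].append(q.popleft())  # smallest-first thread
        i += 1
    return lanes

sizes = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}
print(assign(sizes, transfers=2))
# one lane takes f, e, d (big end); the other takes a, b, c (small end)
```

With transfers=2 and the default 50% split, one thread works through the big files while the other clears out the small ones, instead of all threads piling onto the large files first.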