Rclone crashes macOS on local transfers

What happens if you try --checkers=1 --transfers=1 --check-first? That also removes concurrent checks and transfers.
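That is, something along these lines (SOURCE and DEST standing in for your own paths, plus whatever other flags you normally use):

rclone sync SOURCE DEST --checkers=1 --transfers=1 --check-first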

How much data is typically needed to make this happen? (number of folders, files, average file size, total size)

I am collecting info to possibly reproduce this using rclone test makefiles.
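For reference, I would generate the test data with something like this (the path, counts and sizes are placeholders until you share yours; please double check the flag names against rclone test makefiles --help):

rclone test makefiles --files 100000 --files-per-directory 100 --min-file-size 1k --max-file-size 10M /path/to/testdata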

I think we need to find and fix the root cause; it seems to be some kind of race condition. We may just be seeing the tip of the iceberg.

Can you confirm that the issue doesn't exist if using this tip:

Thanks, very useful information.

Did you test and see the issue in the immediately following release, that is v1.60.0?

How do I get the logs? -vv just dead-ends mid-flight, so there's no useful info there. Is there something I can export from the console in macOS that would be useful?

You can add --log-file=rclonelog.txt to write the log to a file that can be examined after the crash.
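For example, keeping -vv and writing everything to a file (SOURCE and DEST being placeholders for your usual paths):

rclone sync SOURCE DEST -vv --log-file=rclonelog.txt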

I read the logs you have posted so far and did some searching on the error messages.

Most of what I see points towards an issue with Mac hardware/software or cabling to the disk/network triggered by the high I/O load generated by rclone.

The only thing indicating an issue with rclone/cgofuse/macfuse is @vicmarto experiencing no issues when going back to v1.59.2, but this hasn't been confirmed by anyone else yet.

I therefore suggest you try to verify that there are no issues when using v1.59.2 and that the issues reappear when using v1.60.0. That will allow us to narrow down the rclone changes that may have introduced the issue.

It would also help troubleshooting if you could check

  • if adding --checkers=1 --transfers=1 --check-first has any stabilizing effect.
  • if adding --bwlimit=10M (or similar value significantly lower than normal speed) has any stabilizing effect.
  • if adding both of the above has any stabilizing effect (see the example commands below).
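For example, with SOURCE and DEST as placeholders for your own paths, and keeping any other flags you normally use:

rclone sync SOURCE DEST --checkers=1 --transfers=1 --check-first
rclone sync SOURCE DEST --bwlimit=10M
rclone sync SOURCE DEST --checkers=1 --transfers=1 --check-first --bwlimit=10M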

Today I did some testing with the 11 latest versions of rclone:

  • The source is from a macOS 13.1 Ventura.
  • The destination is on the local network, via SFTP, to a TrueNAS machine with ZFS.

And... the pattern on all of them is the same:

  1. Lots of activity for about 70 seconds (with CPU usage close to 200% on both the source and destination machines).
  2. An abrupt stop (apparently on the source machine?) until CPU usage drops to 1%. The target machine drops to 50%.
    The resulting log is the same size on all 11 versions of rclone.

This time, unlike before, version 1.59.2 also failed to complete the process!
I'm not quite sure why... The only major difference is that before I rebooted both machines, and this time I did not.

Here are the logs (modified to preserve privacy).

Hi vicmarto,

I will look into the logs; in the meantime, can you test these flags (on the latest version):

@vicmarto In your earlier post you wrote:

however, the command you are executing:

2023/01/23 20:03:41 DEBUG : SOME_FILE_IN_SOURCE :Version "v1.61.1" starting with parameters ["/usr/local/bin/rclone" "sync" "--copy-links" "--create-empty-src-dirs" "--track-renames" "--local-no-check-updated" "--progress" "-vv" "--log-file=/Users/vicmarto/rclone_1.61.1.log" "--exclude" "SOME_PATH" "/Volumes/Users/vicmarto" "nas:/mnt/zbackup/Users/vicmarto"]

has this target: "nas:/mnt/zbackup/Users/vicmarto" which seems to be an SFTP backend. Could you please post the redacted output from rclone config show nas: to make it clear exactly what your target is?

Is /mnt/zbackup/Users/vicmarto a local disk on the SFTP server or something mounted? What? How?

EDIT: I now see you posted this:

So you are referring to local network transfers, not local machine/disk transfers like the original post:

@Ole:

rclone.conf is:

[nas]
type = sftp
host = 10.0.1.150
user = vicmarto
key_file = /Users/vicmarto/.ssh/id_rsa_nas
key_file_pass = EDITED
md5sum_command = md5 -r
sha1sum_command = sha1 -r
shell_type = unix

That's true: the transfer is to an SFTP backend.

Thanks @vicmarto !

I am looking at the log from rclone 1.61.1, which starts at 20:03:41 and ends abruptly after 67 minutes at 21:10:35. At what time do the 70 seconds where CPU rises to 200% start?

Were there any ongoing transfers at that point, or just checks (with checksum calculations)? (I cannot tell from the log.)

Yes, I was forced to cancel the 11 transfers due to their extreme slowness.

The issues seem to appear when the rclone checkers perform a series of md5 calculations to see if files have changed. Perhaps (some) Macs can be overloaded by doing many concurrent md5 calculations.

Are you able to observe something similar when executing this command:

rclone check "/Volumes/Users/vicmarto" "/Volumes/Users/vicmarto"  "--exclude" "SOME_PATH" "--progress" "-vv" "--log-file=/Users/vicmarto/rclone_check.log"

It will stress the Mac by calculating and comparing checksums for all the files in /Volumes/Users/vicmarto. It is intentional that I compare the files with themselves, that will eliminate the network and SFTP server - and increase load.

I ran the same test again with v1.61.1 after upgrading to macOS 13.2 Ventura. Here is the log (modified).

In summary, the sequence would be:

  • Contrary to @wdp, I have no hangs.
  • At startup, rclone seems to behave as expected (CPU usage at 200%, transfer at 200 MB/s).
  • About a minute later: the transfer rate gradually drops to 0 MB/s, and CPU usage suddenly drops to 1%.
  • The transfer becomes so slow that it never ends.
  • I manually cancel the transfer.

I have the feeling that the problems start in the md5 calculation phase (but this is not certain).

Thanks @Ole for your help.

That "rclone check" test seems to be fine. I ran it for about an hour with a constant 200% CPU usage, as expected.

Thanks @vicmarto both tests are very useful.

I am not sure you see the same issue as the others. What you see could also be related to a networking issue, so @wdp and @Marcelloh please chime in if you have supporting or different observations.

Now, let's try narrowing down the possibilities again:

What happens if you start your command with --check-first which will make rclone complete all the checks before starting the transfers?

If --check-first makes it

  • succeed without issues, then test using --check-first --checkers=12 --transfers=12 to increase concurrency.
  • fail during the checking, then test using --check-first --checkers=1 to reduce concurrency during checking.
  • fail during the transfers, then test using --check-first --transfers=1 to reduce concurrency during transfers (example commands for each case are sketched below).
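Something like this, with "..." standing for the flags and paths from your existing command:

rclone sync ... --check-first --checkers=12 --transfers=12
rclone sync ... --check-first --checkers=1
rclone sync ... --check-first --transfers=1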

Unfortunately, I don't see any major change using '--check-first': at some point the CPU usage of rclone drops to 1% and the transfer process takes forever...

Right now we are not trying to solve the issue, we are trying to locate the cause of the issue to be able to (hopefully) solve it.

Does it happen while checking or transferring?

Hi @Ole, I'll try it another time and report back.

Something that happens often, and that I would like to report, is what seems to be difficulty completing files once they reach 100%:

rclone calculates and compares the checksums after the transfer, which can give this picture if one (or both) of the remotes takes some time to calculate them.

You could try adding --ignore-checksum to prove/disprove this. Please be aware of this in the docs:

You should only use it if ... you are sure you might want to transfer potentially corrupted data.
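That is, simply appended to the command you are already running (the "..." stands for your existing flags and paths):

rclone sync ... --ignore-checksum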

You could also try executing 4 concurrent "md5 -r" commands on your SFTP server on the above files, to see how fast your SFTP server can do it. And similarly on your Mac.
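For example, in bash or zsh on the Mac, with FILE1..FILE4 as placeholders for the four files from the log (on the TrueNAS side the equivalent command is probably md5 or md5sum rather than md5 -r):

time ( md5 -r FILE1 & md5 -r FILE2 & md5 -r FILE3 & md5 -r FILE4 & wait )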

Maybe the problem is in the SFTP server and rclone is working fine?!

The Mac can calculate the four md5 checksums in 70 seconds. And the server needs 4 minutes and 20 seconds...

Yes, that is one out of several open possibilities. We need more tests/observations to say for sure.

Can you tell if the manual md5 calculations on the server are limited by CPU, disk or something else? Which is (almost) at 100% usage?

Perfect, some things to pay attention to during the test:

First of all, is it the checking or the transfers that are slow? Does the speed vary during each of the phases? If so, please note the times of good and bad speed, so we can compare them to the activities we see in the debug log.

Secondly, can you identify the limiting factor during each of the two phases? Is it Mac CPU, Mac disk, network speed, server CPU, server disk, or something else?