How do I get the logs? -vv just dead ends in mid flight, so there's no useful info there. Is there something I can export from the console in macOS that would be useful?
You can add --log-file=rclonelog.txt to write the log to a file that can be examined after the crash.
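As a minimal sketch of what that could look like: the `sync` subcommand, the source path and the `nas:` destination below are placeholders I made up, so substitute the command you actually run. The snippet only prints the command so you can review it first.

```shell
# Placeholders (assumptions): replace with your own source and destination.
SRC="/path/to/source"
DST="nas:/path/to/destination"

# Timestamped name so a new run does not append to an older log.
LOG="rclonelog-$(date +%Y%m%d-%H%M%S).txt"

# Print the command for review; paste it into the terminal to run it.
# -vv turns on debug output, --log-file writes it to $LOG instead of the terminal.
echo rclone sync "$SRC" "$DST" -vv "--log-file=$LOG"
```

The log file is written as rclone goes, so whatever was logged before the crash should still be there afterwards.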
I read the logs you have posted so far and did some searching on the error messages.
Most of what I see points towards an issue with Mac hardware/software or cabling to the disk/network triggered by the high I/O load generated by rclone.
The only thing indicating an issue with rclone/cgofuse/macfuse is @vicmarto experiencing no issues when going back to v1.59.2, but this hasn't been confirmed by anyone else yet.
I therefore suggest you verify that the issue does not occur with v1.59.2 and reappears with v1.60.0. That will allow us to narrow down which rclone changes may have introduced the issue.
It would also help troubleshooting if you could check
if adding --checkers=1 --transfers=1 --check-first has any stabilizing effect.
if adding --bwlimit=10M (or similar value significantly lower than normal speed) has any stabilizing effect.
if adding both of the above has any stabilizing effect.
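The three checks above could be scripted roughly like this, one log file per variant so the runs can be compared afterwards. This is only a sketch: `sync`, the paths and the `run_variant` helper are my own placeholders, not anything from your setup.

```shell
# Placeholders (assumptions): replace with your own source and destination.
SRC="/path/to/source"
DST="nas:/path/to/destination"
# DRY_RUN=1 only prints each command; set DRY_RUN=0 to actually run rclone.
DRY_RUN="${DRY_RUN:-1}"

# run_variant NAME FLAGS... runs one test with its own log file.
run_variant() {
  name="$1"
  shift
  log="rclonelog-$name.txt"
  echo "+ rclone sync $SRC $DST -vv --log-file=$log $*"
  if [ "$DRY_RUN" != 1 ]; then
    rclone sync "$SRC" "$DST" -vv "--log-file=$log" "$@"
  fi
}

run_variant low-concurrency --checkers=1 --transfers=1 --check-first
run_variant bwlimit --bwlimit=10M
run_variant both --checkers=1 --transfers=1 --check-first --bwlimit=10M
```

Pick a `--bwlimit` value that is clearly below the speed you normally see; 10M is just the example value from above.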
Today I did some testing with the 11 latest versions of rclone:
The source is from a macOS 13.1 Ventura.
The destination is local, via sftp, to a TrueNAS machine with zfs.
And... the pattern on all of them is the same:
Lots of activity for about 70 seconds (with CPU usage close to 200% on both the source and destination machines).
A sudden stop (apparently on the source machine?), where CPU usage drops to 1%. The target machine drops to 50%.
The resulting log has the same size on all 11 versions of rclone.
This time, unlike before, version 1.59.2 also failed to complete the process!
I'm not quite sure why... The only major difference is that I rebooted both machines before and now I did not.
has this target: "nas:/mnt/zbackup/Users/vicmarto" which seems to be an SFTP backend. Could you please post the redacted output from rclone config show nas: to make it clear exactly what your target is?
Is /mnt/zbackup/Users/vicmarto a local disk on the SFTP server or something mounted? What? How?
EDIT: I now see you posted this:
So you are referring to local network transfers, not local machine/disk transfers like the original post:
I am looking at the log from rclone 1.61.1, which starts at 20:03:41 and ends abruptly after 67 minutes at 21:10:35. At what time do the 70 seconds start where CPU rises to 200%?
Were there any ongoing transfers at that point or just checks (with checksum calculations)? (I cannot see in the log)
The issues seem to appear when the rclone checkers perform a series of md5 calculations to see if files have changed. Perhaps (some) Macs can be overloaded by doing many concurrent md5 calculations.
Are you able to observe something similar when executing this command:
It will stress the Mac by calculating and comparing checksums for all the files in /Volumes/Users/vicmarto. It is intentional that I compare the files with themselves; that eliminates the network and the SFTP server, and increases the load.
I am not sure you see the same issue as the others. What you see could also be related to a networking issue, so @wdp and @Marcelloh please chime in if you have supporting or different observations.
Now, let's try narrowing down the possibilities again:
What happens if you start your command with --check-first which will make rclone complete all the checks before starting the transfers?
If --check-first makes it
succeed without issues, then test using --check-first --checkers=12 --transfers=12 to increase concurrency.
fail during the checking, then test using --check-first --checkers=1 to reduce concurrency during checking.
fail during the transfers, then test using --check-first --transfers=1 to reduce concurrency during transfers.
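To keep the three branches straight while testing, they can be written down as a small lookup. The `next_flags` helper and its outcome names are made up for illustration; it only prints the flags to try next.

```shell
# next_flags OUTCOME prints the flags to test next, given how the
# --check-first run went. Outcome names are hypothetical labels.
next_flags() {
  case "$1" in
    success)        echo "--check-first --checkers=12 --transfers=12" ;;
    fail-checking)  echo "--check-first --checkers=1" ;;
    fail-transfers) echo "--check-first --transfers=1" ;;
    *) echo "unknown outcome: $1" >&2; return 1 ;;
  esac
}

next_flags fail-checking
```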
Unfortunately, I don't see any major change using '--check-first': at some point the CPU usage of rclone drops to 1% and the transfer process takes forever...
rclone calculates and compares the checksums after transfer, which can give this picture if one (or both) of the remotes take some time to calculate them.
You could try adding --ignore-checksum to prove/disprove this. Please be aware of this in the docs:
You should only use it if ... you are sure you might want to transfer potentially corrupted data.
You could also try executing 4 concurrent "md5 -r" commands on your SFTP server on the above files, to see how fast your SFTP server can do it. And similarly on your Mac.
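A rough way to time this, as a sketch: the `time_md5` helper below is my own, and it assumes `find` and `xargs` support `-print0`, `-0` and `-P` (true for the GNU and BSD/macOS versions). It falls back to `md5sum` where `md5` is not installed.

```shell
# Pick the local MD5 tool: macOS and FreeBSD ship `md5`, most Linux systems ship `md5sum`.
if command -v md5 >/dev/null 2>&1; then
  HASH_CMD="md5 -r"
else
  HASH_CMD="md5sum"
fi

# time_md5 DIR [JOBS]: hash every file under DIR with JOBS parallel
# processes (default 4), then print the file count and elapsed seconds.
time_md5() {
  dir="$1"
  jobs="${2:-4}"
  start=$(date +%s)
  # $HASH_CMD is left unquoted on purpose so "md5 -r" splits into command and flag.
  count=$(find "$dir" -type f -print0 | xargs -0 -n 1 -P "$jobs" $HASH_CMD | wc -l)
  end=$(date +%s)
  echo "files=$((count)) jobs=$jobs seconds=$((end - start))"
}
```

Run it as `time_md5 /path/to/files 4` on the server and again on the Mac, then compare the seconds. Watching CPU and disk usage while it runs should also show which of the two is closer to its limit.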
Yes, that is one out of several open possibilities. We need more tests/observations to say for sure.
Can you tell if the manual md5 calculations on the server are limited by CPU, disk or something else? Which one is (almost) at 100% usage?
Perfect, some things to pay attention to during the test:
First of all, is it the checking or the transfers that are slow? Does the speed vary during each of the phases? If so, please note the times when the speed is good and when it is bad, so we can compare them to the activities we see in the debug log.
Secondly, can you identify the limiting factor during each of the two phases? Is it Mac CPU, Mac disk, network speed, server CPU, server disk, or something else?