Macosx: network down when using rclone

Hi all,

I’m using a MacBook Pro on wired Ethernet (not Wi-Fi), with jumbo frames enabled and a very fast internet connection.
When I download with the default options (I mostly use Amazon Drive for my pictures, so the individual files are not very large), I observe the following sequence:

  1. spindump complains about rclone.
  2. The Ethernet card loses a packet, or something like that (I think this may be due to spindump freezing rclone for too long).
  3. The network goes down, including the local network (e.g. I have NFS/Samba folders mounted).
  4. rclone doesn’t report any error; the average speed just drops to zero.
  5. After a couple of minutes or so, the network comes back and rclone resumes downloading at regular speed.
  6. Go back to step 3.

Console log:

03/12/16 07:15:38,000 kernel[0]: process rclone[4667] caught causing excessive wakeups. Observed wakeups rate (per sec): 1249; Maximum permitted wakeups rate (per sec): 150; Observation period: 300 seconds; Task lifetime number of wakeups: 45002
03/12/16 07:15:38,692 com.apple.xpc.launchd[1]: (com.apple.ReportCrash.Root[4685]) Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash.DirectoryService
03/12/16 07:15:39,615 spindump[2075]: Saved wakeups_resource.diag report for rclone version ??? (???) to /Library/Logs/DiagnosticReports/rclone_2016-12-03-071539_chaosengine.wakeups_resource.diag
03/12/16 07:16:33,000 kernel[0]: vm_page_find_contiguous(num=644,low=1048575): found 0 pages at 0xffffffff000…scanned 681149 pages… yielded 675 times… dumped run 0 times… stole 0 pages… stole 0 compressed pages… wired count is 188383
03/12/16 07:16:33,000 kernel[0]: vm_page_find_contiguous: zone_gc called… wired count is 150785
03/12/16 07:16:35,000 kernel[0]: AppleYukon2: 00000000,000003e8 sk98osx sky2 - - sk98osx_sky2::replaceOrCopyPacket timedout
03/12/16 07:17:09,954 networkd[194]: -[NETAWDManager reportStats:metricID:] AWDServerConnection newMetricContainerWithIdentifier failed for metric 2686985, server 0x7ff99ac495a0, not reporting:
<AWDLibnetcoreStatsReport: 0x7ff99ac54a80> {
mbufStatisticsReport = {
mbuf16KBTotal = 1365;
mbuf256BTotal = 3024;
mbuf2KBTotal = 340;
mbuf4KBTotal = 872;
mbufDrainCount = 0;
mbufMemReleased = 0;
sockAtMBLimit = 0;
sockMBcnt = 37651;
};
networkdStatisticsReport = {
fallbackConnectionCount = 0;
totalConnectionCount = 1219;
totalSuccessfulConnectionCount = 642;
};
reportReason = 1;
}

The spindump report for rclone is not informative, since it’s obviously not a debug build.
This problem is definitely connected with rclone: when I download large files via a browser, I don’t see any such effect. I also suspect that it’s related to the download speed/bandwidth: it apparently doesn’t happen when I connect to the router via Wi-Fi, and it happens less frequently if I disable jumbo frames.

I’m raising the problem here because I had no problem with rclone 1.32, so I cannot rule out a bug in rclone. That said, I don’t know exactly what else changed in the meantime (I may have installed Mac OS X updates… Apple tends to break things silently all the time…)

Hmm, interesting!

rclone is very efficient at using the network - I’ve had lots of reports of it breaking people’s routers etc., so I wonder if a similar effect is going on here.

What if you try setting a --bwlimit - does that make a difference?

You could also try setting --checkers 1 and --transfers 1 to see if that makes a difference.
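
A minimal sketch of how those options might be combined on one command line. The remote name `acd:Pictures`, the destination folder, and the `1M` limit are placeholders rather than values from this thread, and the `run` helper is a hypothetical wrapper that only prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical remote and destination; substitute your own.
REMOTE="${REMOTE:-acd:Pictures}"
DEST="${DEST:-$HOME/Pictures}"

# Dry run by default: print the command instead of executing it.
# Set DRY_RUN=0 to actually run rclone.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1 is the bandwidth cap passed to --bwlimit (1M is an arbitrary starting point).
throttled_copy() {
    run rclone copy --bwlimit "$1" --checkers 1 --transfers 1 "$REMOTE" "$DEST"
}

throttled_copy 1M
```

If the network stays up with these settings, raising the limit or the number of transfers one step at a time should show roughly where the trouble starts.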

If any of those help, then I think the most likely cause is an Ethernet driver update. I’ve seen quite a lot of Ethernet drivers over the years which work fine for light use, but go wrong when you stress them.

It’s definitely related to jumbo frames: if I disable them, the problem becomes much much harder to reproduce.
Is it possible that in rclone there’s some piece of code that goes crazy when the frame size is larger than usual?
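
For anyone wanting to repeat the jumbo-frame comparison, here is a rough sketch of checking and changing the MTU with the macOS `networksetup` tool. The device name `en0` is an assumption (list yours with `networksetup -listallhardwareports`), 1500 is the standard Ethernet MTU, and the `run` helper is a hypothetical wrapper that only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Assumed device name; check yours with: networksetup -listallhardwareports
IFACE="${IFACE:-en0}"

# Dry run by default: print the commands instead of executing them.
# Set DRY_RUN=0 to actually run them (changing the MTU needs admin rights).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Show the MTU currently configured on the interface.
show_mtu() {
    run networksetup -getMTU "$IFACE"
}

# Set the MTU; $1 is the new value (1500 disables jumbo frames).
set_mtu() {
    run sudo networksetup -setMTU "$IFACE" "$1"
}

show_mtu
set_mtu 1500
```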

rclone only deals with TCP - the OS should completely hide the lower layers, including details such as frame size.

So I think it is unlikely, but not absolutely impossible!