Rclone sync from local to GoogleDrive hanging at 100%

Just did, and unfortunately it doesn't work: with GOTRACEBACK=crash, the exit code from the QUIT-killed process is still 2 :expressionless: and it generated no core file either :frowning: So I guess the golang's documentation I quoted is either out-of-date, not-yet-implemented, or just plain wrong :unamused: The only thing that changed is that the traceback output is way larger than without it.

In case you are curious, here's my test and its complete output: https://termbin.com/jrjk

Cheers,
-- Durval.

I have a feeling that it is impossible to exit with a return code > 128 unless the signal actually killed the process. It's been a decade since I last looked at Advanced Programming in the Unix Environment (the Bible for this kind of thing) but that is my dim recollection!

Hi @ncw,

Yep, the goode olde Stevens book (along with his "TCP/IP Illustrated") is a timeless classic.

Sure thing. In fact, that's in the documentation for the exit syscall: "The value status & 0xFF is returned to the parent process as the process's exit status". As status is a signed int, this means it can only go from -127 to +127.

That's exactly what I meant, and the Golang documentation I mentioned above explicitly stated it would do with "GOTRACEBACK=crash", (re)quoting with emphasis: GOTRACEBACK=crash is like β€œsystem” but crashes in an operating system-specific manner instead of exiting. For example, on Unix systems, the crash raises SIGABRT to trigger a core dump.

So it seems Golang (or at least the runtime version used to compile the rclone binary I downloaded from rclone.org) doesn't follow its own documentation :frowning:

Cheers,
-- Durval.

I tried this on a very simple go program

package main

func main() {
	var p *int
	*p = 0			// crash crash crash
}

Which I compiled with

go build nilpointer.go 

Running it

$ ./nilpointer ; echo $?
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x45da42]

goroutine 1 [running]:
main.main()
	/home/ncw/Go/nilpointer.go:5 +0x2
2

and

$ GOTRACEBACK=crash ./nilpointer ; echo $?
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x45da42]

goroutine 1 [running]:
[snip traceback]
Aborted (core dumped)
134

So it looks like it does work for this use case...

So now with this program

package main

import "time"

func main() {
	time.Sleep(1*time.Hour)
}

And I see this when killing it with SIGQUIT

$ GOTRACEBACK=crash ./wait ; echo $?
SIGQUIT: quit
PC=0x45d7e0 m=0 sigcode=0

goroutine 0 [idle]:
[snip traceback]
Aborted (core dumped)
134

So it does work here!

So I guess rclone's signal handling is interfering somehow

However if I try it with rclone it works...

GOTRACEBACK=crash rclone size ~ ; echo $?
[snip]
Aborted (core dumped)
134

Puzzled!

1 Like

Hey Nick, thanks for taking the time to test this stuff, and to share the results with us!

You tested that on Linux, right? On this machine, it's Mac OSX (long story, but no choice of mine), so perhaps it's some Mac-aquisme?

I will try cross-compiling your very clear "sleep" program above and run it on the Mac, and see whether it shows the same behavior -- if it does, I think we can absolve rclone and blame Mac OSX (or at least Golang's OSX runtime).

I will let you know as soon as I do it.

Cheers,
-- Durval.

Yes I did.

I guess that is possible! Signals are one of the least portable aspects of the unix world...

:slight_smile:

1 Like

Hello @ncw,

Finally got the time to test the above on my OSX machine... and guess what? It produced a traceback, BUT still finished with exit code = 2 instead of > 128:

GOTRACEBACK=crash ./testtraceback ; echo $?
SIGQUIT: quit
[...]
2

So it's confirmed, Golang executable behavior re:SIGQUIT is definitely different in OSX.

For the record, I generated my "testtraceback" executable with the exact same code for the wait program you pasted above, by cross-compiling it from Linux running go version go1.12.2 linux/amd64 using the following command:

env GOOS=darwin GOARCH=amd64 go build testtraceback.go

And then transfer the resulting binary to my OSX machine with scp.

I gather you are using a much more recent Golang version to compile rclone v1.52.3 for OSX, correct? So, the problem definitely isn't limited to a specific Golang version.

Cheers,
-- Durval.

Hmm...

I don't know whether this is a bug in the Go runtime - did you find a doc where the exit code was specified? I suspect that the macOS runtime might be interfering somehow.

We could potentially report this as a Go issue.

Yes I'm using the latest stable release go1.15.x

If we wanted to report a bug then trying it with a later Go version would be necessary.

I'm not sure it is worth reporting as a bug though...

Yep, as I reported previously:

(and "raising SIGABRT to trigger a core dump" would certainly -- in any UNIX-like system -- cause the process to exit with code > 128)

I think the best way to further debug this would be to write a C program and compile it with Apple's MacOSX native compiler, to see whether the same thing happens with C code; if it doesn't (ie, if C code exits properly with code > 128 after trapping SIGQUIT, printing something and then SIGQUITting itself), then IMHO it's a bug that the Golang people should fix.

I don't have a OSX development machine, only this production server, and my policy is to not install anything in a production server that is not strictly needed for its productiion tasks (Apple's C compiler/development environment certainly isn't).

When I have more time (ha ha) perhaps I will bring up a VM running OSX here in my laptop, and then do this test. For the moment, let us (meaning me and everyone who comes a-googling) just be aware that one can't trust the Golang documentation on SIGQUIT regarding the exit code for the process.

Cheers,
-- Durval.

1 Like

I think there might be some wiggle room in the "crashes in an operating system-specific manner" part but I take your point :slight_smile:

Hello @ncw,

Agreed! But the part about "on Unix systems" raising "SIGABRT to trigger a core dump" IMO eliminates all wiggle room :wink:

Cheers,
-- Durval.

1 Like

Hello everyone,

Just keeping everyone posted: my backup procedure, after upgrading to rclone 1.52.3, went through a whole month without hanging up -- so I 'm considering this solved, and have already closed the issue on github with an appropriate, more detailed comment.

Thanks to @ncw and to everyone who chimed in!

1 Like

Hooray!

Thanks for the update :smiley:

Folks, one more update (may or may not be related): yesterday, using another GSuite account, and being accessed from another machine, another (this one, encrypted) GDrive large-ish remote (about 1TB total data, disposed over a few hundreds of files) being accessed through rclone mountby a md5sum -c, locked up hard; the md5sum -c stopped processing data and showed nothing on strace, and on ps auwx showed the dreaded D (uniterruptible sleep) state. Had to kill the rclone mount process for the md5sum to terminate.

This was with rclone 1.50.x; upgrading to 1.53.1, remounting the remote and rerunning md5sum -c fixed it, no more hangs (so far).

So perhaps this was also indirectly fixed by @ncw's latest changes; anyway, if anyone experiences hangs on processes accessing files through rclone mount, my suggestion is to upgrade and try again.

Cheers,
-- Durval.

1 Like