Rclone sync from local to Google Drive hanging at 100%

Hello Nick,

Here it is: https://github.com/rclone/rclone/issues/4532

I think I see why you think so: it's because that message is logged immediately before the last chunk is transferred, right? In that case, I'm not so sure: the problem happened again in tonight's backup, and the only such message was very, very far from the last chunk transfer (actually, closer to the middle of the upload); please see here for the complete -vv log of this second occurrence: https://termbin.com/ziez

But if you think it would help, I can try removing the --bwlimit option and see if the problem disappears (I'd have to wait at least a few days to be reasonably sure, as the bug is intermittent).
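For the record, the comparison would be just this (paths, remote name and the limit value below are placeholders, not my actual settings):

```bash
# Current invocation, with a bandwidth limit (example value).
rclone sync -v --bwlimit 1M --log-file=/var/log/rclone-backup.log \
    /backups/borg-repo gdrive:backups/borg-repo

# Test invocation: identical, but with the --bwlimit option removed.
rclone sync -v --log-file=/var/log/rclone-backup.log \
    /backups/borg-repo gdrive:backups/borg-repo
```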

Please let me know how you wish me to proceed.

Cheers,
-- Durval.

If it is easy to rule out then it is probably worth doing.

I won't be able to look at this in detail until next week.

Hello Nick,

Sure thing, a single edit on a couple of scripts; just did it, will keep this topic posted.

No hurry, please take your time :+1:

Hello Nick,

Last night's backup ran without any --bwlimit options, and even so the problem manifested itself; please see https://termbin.com/gm1d

(BTW, I'm impressed by how collaborative this previously intermittent bug is -- now that we have -vv on, it shows up for work every single night. Perhaps we can award it a medal or something -- posthumously, of course :slight_smile: )

Seriously now, the only constant I can see across all occurrences is that the file first gets modified (always growing) while the hanging rclone is running (as duly recorded by a "source file is being updated" message), and then the last chunk upload message falls short of the new size (this time, the file grew from 1728315189 to 1729363713 bytes, and the last chunk upload ended at 1728053248 + 261941 = 1728315189 -- coincidentally or not, exactly the "old" size, i.e. the size the file had when rclone started running).

So, it might be a red herring -- but if so, it's a very reddish-herringy one :wink:

I guess we have enough data for debugging this already -- so I'm reverting from -vv back to -v in order to save a few GBs every night on my logs. If more data is needed, please let me know and I will provide it ASAP.

Cheers, and thanks again,
-- Durval.

He he! Perhaps we'll mount and stuff the bug in the trophy room :wink:

That is very suggestive... and should be reasonably easy to try to reproduce.

I had a go at reproducing it - I didn't succeed. But one thing I noticed is that you are using v1.50.2, which behaves quite differently to the current v1.52.3 - can you give that a go?

Hi Nick,

Yeah LOL :+1:

:expressionless: Here, it keeps repeating every night. OTOH, here it runs in a loop: when my script starts borg, it simultaneously (in a background job) runs a loop calling rclone sync to copy the modified borg data "on the go" to GDrive. It's only after dozens of runs that rclone finally hangs, so that could be why you could not reproduce it yet (you probably ran it just once or a few times).
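A rough, simplified sketch of that setup, in case it helps reproduction (names, paths and the remote are placeholders, not my real ones -- the actual scripts have more plumbing):

```bash
#!/usr/bin/env bash
# borg runs in the background while the script keeps re-syncing the repo
# to GDrive, so rclone routinely sees the source files grow mid-upload.
REPO=/backups/borg-repo
REMOTE=gdrive:backups/borg-repo

borg create "$REPO::backup-$(date +%F)" /data &   # borg keeps appending to the repo
BORG_PID=$!

# Keep syncing while borg is still running.
while kill -0 "$BORG_PID" 2>/dev/null; do
    rclone sync -v "$REPO" "$REMOTE"
    sleep 60
done

# One final sync after borg has finished, so the repo is fully copied.
rclone sync -v "$REPO" "$REMOTE"
```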

Sure thing, Nick! The only reason I haven't upgraded so far is that I reviewed the changelog and saw nothing that would explicitly have fixed it (or even something similar).

I have just upgraded to 1.52.3 (from the brew repo, which I use here). Right now the rclone-in-a-loop script is running, so the next rclone sync executed by the backup loop will be using the new version. If the bug happens again, I will be able to send you a -v log as soon as later today (if you need a -vv log, it will have to wait until tomorrow night / Tuesday morning, because the script that runs the rclone loop is already running and I can't change it on the fly -- we would have to wait for the next backup).

Will keep you posted!

Thanks again,
-- Durval.

BTW, would you also need a new SIGQUIT backtrace, with v1.52.3 this time? If yes, please let me know and I will provide it.

If it locks up, a SIGQUIT would be helpful.

There is a definite change to this area in the beta so it might be worth trying that too if this fails.

Sure thing. Can't guarantee it for today (with -v) but you can count on it for tomorrow along with the -vv output.

Humrmrmrm... The dirtree is 15TB size-wise, and it took me almost 2 months to upload to GDrive. Do you think the beta has any chance of screwing things up there? If yes, I'd rather wait for the beta to become an official release...

Cheers,
-- Durval.

Hi Nick,

The run using v1.52.3 showed totally different behavior: rclone complained a lot more (many times in a single run, as opposed to just once with the previous version) that the source file was being updated, and then restarted the transfer -- with the result that, when the process was killed by the script so it could be run one final time (after borg had finished updating the repo), it was still at 6%.

So, the "hang at 100%" issue did not manifest itself with the new version -- but basically because this version did not manage to hit 100% (due to the above behavior) before being automatically killed.

I will let the new version run for the next executions -- with -vv and SIGQUIT -- and we'll see how it behaves. When/if I manage to get it to hang again, I will post in this thread with the logs and traceback.

Cheers,
-- Durval.

This is the reason for the change in behaviour

The beta has this new flag in, which is probably exactly what you want

In particular, if you use the beta with the --local-no-updated flag, rclone will make a best-effort attempt to upload your file. Provided the file was only appended to and not modified internally, rclone will successfully upload files that are being modified with this flag.
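For example, something along these lines (the paths and remote name are placeholders, and the exact flag spelling is worth double-checking against `rclone help flags` on the installed version):

```bash
# Sync a borg repo that is still being appended to; the flag tells the
# local backend to make a best-effort upload instead of erroring out when
# it notices the source file growing. Flag name as quoted above -- verify
# it against your rclone version.
rclone sync -v --local-no-updated /backups/borg-repo gdrive:backups/borg-repo
```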

All the integration tests for drive and local are passing and I'm expecting to release 1.53 shortly so I think chances of it messing up are very low :crossed_fingers:

Hello Nick,

Thanks for your detailed and attentive response. I think I understand what you mean:

(a) It's possible/probable that the new "file updated" logic in 1.52.3+ has indirectly solved my problem, i.e. stopped rclone from hanging in my particular situation;

(b) By upgrading to the current beta (or to the released version that will follow it), I can then use the --local-no-updated flag to avoid these "file updated" errors/retries altogether (as long as the files being modified are only being appended to, which I do think is the case here).

Anyway, I have now modified my backup script to run rclone with -vv and to kill the final one in the loop (which will be the hung one, if it has hung) with SIGQUIT, in order to capture the traceback.
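A simplified sketch of that final-run handling (paths, remote and the timeout are placeholders):

```bash
# Run the last rclone in the background, wait a while, and if it is still
# alive (i.e. presumably hung at 100%), SIGQUIT it so the Go runtime
# prints a goroutine traceback. The traceback is written to stderr, so
# capture stderr to a file too (it may not end up in --log-file).
rclone sync -vv --log-file=/var/log/rclone-final.log \
    /backups/borg-repo gdrive:backups/borg-repo 2>/var/log/rclone-final.stderr &
RCLONE_PID=$!

sleep 3600                       # give it ample time to finish or hang
if kill -0 "$RCLONE_PID" 2>/dev/null; then
    kill -QUIT "$RCLONE_PID"     # dump the traceback
fi
wait "$RCLONE_PID"
echo "rclone exited with code $?"
```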

If the hanging-at-100% issue happens again, I will let you know and provide you with the log containing the -vv output and the traceback (and then probably try upgrading to the beta and using the --local-no-updated flag to see whether it solves the problem).

If the problem doesn't happen again for the next 20 backups or so, I think we can consider it solved (directly or indirectly) by the upgrade to version 1.52.3, and I will report back here and close the issue on GitHub.

Cheers,
-- Durval.

I think your summary is good. Look forward to hearing about your test results :smiley:

Hello @ncw,

Had the first run yesterday night with the new 1.52.3 version and a full -vv log, and it did not hang. I went through the log and compared it to the one from the previous 1.50.2 version, and here's what stood out:

v1.50.2:

2020/08/26 03:13:53 DEBUG : borg/198296598382e4b9ac3d494d5a6b94f043a1accf9baa042f4db37929e8fae6f3/files: Reopening on read failure after 4190208 bytes: retry 1/1000: can't copy - source file is being updated (size changed from 1728315189 to 1729363713)

v1.52.3:

2020/09/01 10:03:08 ERROR : Attempt 1/3 failed with 1 errors and: Post "https://www.googleapis.com/upload/drive/v3/files/1ViQh68V_sJ72EQtDCc1Wld7QwUzSZdcD?alt=json&fields=id%2Cname%2Csize%2Cmd5Checksum%2Ctrashed%2CmodifiedTime%2CcreatedTime%2CmimeType%2Cparents%2CwebViewLink%2CshortcutDetails&setModifiedDate=true&supportsAllDrives=true&uploadType=resumable&upload_id=ABg5-Ux3afyXbnhalkJf-OXH8edPtYQ7avjSAiPvv98wg-k2w4JNXbsf_JvOupJZfh_E_RTYLI4Les_SuYcla0FmOyM": can't copy - source file is being updated (size changed from 6053959917 to 6055008438)

So, it seems to me that the kind of error rclone generates on detecting the updated file has changed from a "low-level" error (which in my case is retried up to 1000 times) to a "normal" error (which is retried just 3 times). I think this explains why there were many more "complaints" from the new rclone in the plain -v log I analyzed before.
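For context, both retry counts are tunable from the command line; a quick illustration with placeholder paths (whether the 1/1000 reopen count in the v1.50.2 log is governed by --low-level-retries I'm not 100% sure, but the flags themselves and their defaults -- 3 and 10 respectively -- are documented):

```bash
# --retries           : how many times the whole sync is re-attempted after
#                       a "normal" error (default 3, hence "Attempt 1/3")
# --low-level-retries : how many times an individual low-level operation is
#                       retried (default 10)
rclone sync -vv --retries 3 --low-level-retries 1000 \
    /backups/borg-repo gdrive:backups/borg-repo
```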

I also see that rclone now re-checks all files before retrying (which I think is the default behavior for "normal" errors), and perhaps this is why it has stopped hanging: the logic has changed so much that the previous bug just doesn't happen anymore.

Anyway, I will keep it running with -vv, killing the last one in the loop with SIGQUIT, for at least a full month -- if no hangs happen, I will report back here early next month and then close the GitHub issue.

Thanks again,
-- Durval.

Thanks for your comprehensive report - let us know what happens.

Yes, that is what the change did - it made the error into a non-retryable error so it would bubble up.

On the rescan of files it will pick up the new size, so it should be transferred properly.

:slight_smile:

I'll just note that v1.53 has now been released with the --local-no-updated flag if you wanted to try that.

Thanks for the heads-up re: v1.53, Nick -- I will certainly have a look and then upgrade in the near future (too much on my plate right now).

I'm having another issue now: after I changed my controlling script to end rclone with SIGQUIT instead of SIGTERM, rclone now exits with return code 2 instead of 143 (i.e. it no longer reports being killed by a signal). This is breaking a test in another script that tries to determine whether rclone was killed (in which case that other script should also terminate) or whether it finished for some other reason (in which case it simply runs rclone again).
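For reference, the test in that other script is essentially this (simplified, placeholder paths):

```bash
# Run one sync, then decide: exit codes above 128 conventionally mean
# "killed by signal (code - 128)" -- e.g. 143 = 128 + 15 (SIGTERM).
# With SIGQUIT, rclone now exits with 2, so this check no longer fires.
rclone sync -v /backups/borg-repo gdrive:backups/borg-repo
rc=$?

if (( rc > 128 )); then
    echo "rclone killed by signal $(( rc - 128 )); stopping the loop"
    exit 0
else
    echo "rclone exited with code $rc; running it again"
fi
```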

Shouldn't rclone exit with a code greater than 128 when killed by SIGQUIT (for example, by disabling the signal handler and then re-raising SIGQUIT on itself), or at least exit with a distinctive code that can be checked for specifically? I mean, 2 looks awfully "normal" for a SIGQUIT ending...

Cheers,
-- Durval.

SIGQUIT is handled by the Go runtime, not by rclone. So I guess it is catching it, printing the backtrace, then doing a normal error exit. It probably isn't possible to fix that, actually, as you'll only get the >128 exit codes if the process actually died because of the signal. Maybe!

I'd change it back to SIGTERM as I think :crossed_fingers: we've got to the bottom of things!

Thanks for the info, Nick. I would never have guessed SIGQUIT was handled that way (traceback and exit) by the Go runtime, as this is totally out of line with the normal *ix way of doing things (which is dumping core and exiting with a code >= 128).

Anyway, it could at least have used a more distinctive exit code instead of 2...

But anyway, I just implemented a workaround: the controlling script now continues to kill rclone with SIGQUIT, and then immediately kills the other, 'rclone-on-a-loop' script with SIGTERM (that way, that script no longer needs the kill exit code from rclone in order to finish).

I agree with you that we probably got to the bottom of things, but better not to tempt Murphy's Demon with the possibility of the issue having just become less probable/repeatable, and then happening without our being able to capture its backtrace... :slight_smile:

Anyway, we're now at D-25 days and counting for me to consider this fixed. Let's hope the issue is really fixed -- or that, if not, it shows up before then :slight_smile:

Cheers,
-- Durval.

Some googling returned this:
https://golang.org/pkg/runtime/

The GOTRACEBACK variable controls the amount of output generated when a Go program fails due to an unrecovered panic or an unexpected runtime condition. By default, a failure prints a stack trace for the current goroutine, eliding functions internal to the run-time system, and then exits with exit code 2.

So it seems Golang puts SIGQUIT in the "unexpected runtime condition" category (not what I would have done, or expected).

But, further on:

GOTRACEBACK=system is like “all” but adds stack frames for run-time functions and shows goroutines created internally by the run-time. GOTRACEBACK=crash is like “system” but crashes in an operating system-specific manner instead of exiting. For example, on Unix systems, the crash raises SIGABRT to trigger a core dump.

Heh! So it seems that I can get what I wish just by setting GOTRACEBACK=crash.
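Something along these lines should confirm it (placeholder paths; just a quick sanity test):

```bash
# With GOTRACEBACK=crash, a SIGQUIT should make the Go runtime dump the
# (more detailed) traceback and then abort via SIGABRT, so the shell sees
# a signal-style exit code (>128) instead of 2.
GOTRACEBACK=crash rclone sync -v /backups/borg-repo gdrive:backups/borg-repo &
PID=$!
sleep 10
kill -QUIT "$PID"
wait "$PID"
echo "rclone exit code: $?"   # expecting 134 (128 + SIGABRT) rather than 2
```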

Would that be a problem for a possible rclone crash analysis? Please note that the information dumped by the backtrace would also be changed, as stated above.

Cheers,
-- Durval.

I think that would be fine. The extra stuff in the backtrace is no problem.

Did you try a test killing rclone with QUIT with this env var?