Rclone sync from local to GoogleDrive hanging at 100%

What is the problem you are having with rclone?

rclone sync from local to GoogleDrive hanging at 100%: everything runs perfectly at the beginning of the sync, but sometimes, when it reaches 100%, it just hangs there for hours and hours. The only way out is killing the process and then running it again with exactly the same command; so far the second execution has always finished OK (it's a very intermittent error).

What is your rclone version (output from rclone version)

rclone v1.50.2

  • os/arch: darwin/amd64
  • go version: go1.13.4

Which OS you are using and how many bits (eg Windows 7, 64 bit)

macOS High Sierra

Which cloud storage system are you using? (eg Google Drive)

Google Drive, with a legit (ie, >5 users) GSuite Unlimited account

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone -v --checkers=12 --transfers=8 --multi-thread-cutoff=1k \
  --multi-thread-streams=8  --low-level-retries=1000 --user-agent REDACTED \
  --rc --bwlimit "09:00,4m 21:00,off sat-09:01,off  sun-09:01,off" \
  --fast-list sync /REDACTED/REDACTED/REDACTED GDRIVE_REMOTE:REMOTEDIR

The rclone config contents with secrets removed.

[GDRIVE_REMOTE]
type = drive
client_id = REDACTED.apps.googleusercontent.com
client_secret = REDACTED
service_account_file = 
token = {"access_token":"REDACTED","token_type":"Bearer","refresh_token":"1/REDACTED","expiry":"2020-08-20T11:00:27.698728-03:00"}
root_folder_id = REDACTED

A log from the command with the -vv flag

Unfortunately I don't have a -vv log, as the problem is intermittent and not easily reproducible, and I don't have the free disk space to run it with -vv all the time. I do have it running with -v; here's the current log:

2020/08/20 03:21:37 NOTICE: Serving remote control on http://127.0.0.1:5572/
2020/08/20 03:22:37 NOTICE: Scheduled bandwidth change. Bandwidth limits disabled
2020/08/20 03:22:38 INFO  : 
Transferred:             0 / 0 Bytes, -, 0 Bytes/s, ETA -
Errors:                 0
Checks:                 0 / 0, -
Transferred:            0 / 0, -
Elapsed time:          0s

2020/08/20 03:23:07 INFO  : hints.34766: Copied (new)
2020/08/20 03:23:18 INFO  : Google drive root 'REDACTED': Waiting for checks to finish
2020/08/20 03:23:32 INFO  : integrity.34766: Copied (new)
2020/08/20 03:23:32 INFO  : Google drive root 'REDACTED': Waiting for transfers to finish
2020/08/20 03:23:38 INFO  : 
Transferred:      163.411k / 991.613 MBytes, 0%, 4.315 kBytes/s, ETA 2d17h20m17s
Errors:                 0
Checks:             30814 / 30814, 100%
Transferred:            2 / 6, 33%
Elapsed time:       37.8s
Transferring:
 *                                   index.34766:  0% /628.873M, 0/s, -
 *                                 data/34/34764:  0% /362.580M, 0/s, -
 *                                 data/34/34765:100% /17, 0/s, 0s
 *                                 data/34/34766:100% /17, 0/s, 0s

2020/08/20 03:24:37 INFO  : data/34/34766: Copied (new)
2020/08/20 03:24:38 INFO  : 
Transferred:      269.371M / 991.613 MBytes, 27%, 2.753 MBytes/s, ETA 4m22s
Errors:                 0
Checks:             30814 / 30814, 100%
Transferred:            3 / 6, 50%
Elapsed time:     1m37.8s
Transferring:
 *                                   index.34766: 13% /628.873M, 2.430M/s, 3m42s
 *                                 data/34/34764: 50% /362.580M, 3.018M/s, 59s
 *                                 data/34/34765:200% /17, 0/s, -

2020/08/20 03:25:21 INFO  : data/34/34765: Copied (new)
2020/08/20 03:25:38 INFO  : 
Transferred:      608.027M / 991.613 MBytes, 61%, 3.852 MBytes/s, ETA 1m39s
Errors:                 0
Checks:             30814 / 30814, 100%
Transferred:            4 / 6, 67%
Elapsed time:     2m37.8s
Transferring:
 *                                   index.34766: 41% /628.873M, 2.894M/s, 2m7s
 *                                 data/34/34764: 96% /362.580M, 2.817M/s, 4s

2020/08/20 03:26:38 INFO  : 
Transferred:      925.017M / 991.613 MBytes, 93%, 4.246 MBytes/s, ETA 15s
Errors:                 0
Checks:             30814 / 30814, 100%
Transferred:            4 / 6, 67%
Elapsed time:     3m37.8s
Transferring:
 *                                   index.34766: 89% /628.873M, 5.222M/s, 12s
 *                                 data/34/34764:100% /362.580M, 86.782k/s, 0s

2020/08/20 03:27:38 INFO  : 
Transferred:      991.613M / 991.613 MBytes, 100%, 3.569 MBytes/s, ETA 0s
Errors:                 0
Checks:             30814 / 30814, 100%
Transferred:            4 / 6, 67%
Elapsed time:     4m37.8s
Transferring:
 *                                   index.34766:100% /628.873M, 264.989k/s, 0s
 *                                 data/34/34764:100% /362.580M, 1.806k/s, 0s

2020/08/20 03:28:38 INFO  : 
Transferred:      991.613M / 991.613 MBytes, 100%, 2.935 MBytes/s, ETA 0s
Errors:                 0
Checks:             30814 / 30814, 100%
Transferred:            4 / 6, 67%
Elapsed time:     5m37.8s
Transferring:
 *                                   index.34766:100% /628.873M, 5.514k/s, 0s
 *                                 data/34/34764:100% /362.580M, 38/s, 0s

2020/08/20 03:29:38 INFO  : 
Transferred:      991.613M / 991.613 MBytes, 100%, 2.492 MBytes/s, ETA 0s
Errors:                 0
Checks:             30814 / 30814, 100%
Transferred:            4 / 6, 67%
Elapsed time:     6m37.8s
Transferring:
 *                                   index.34766:100% /628.873M, 117/s, 0s
 *                                 data/34/34764:100% /362.580M, 0/s, 0s

[This same block then repeats ad nauseam, without any intervening errors/warnings/messages, until the process is killed]

If it helps, the data being synced is a borgbackup repo tree with ~15TB total (only 1GB or so of it is currently modified each day).

I have one of those processes hung right now; if there's anything I can try and get out of it (i.e., a SIGQUIT core file, dtruss output, or whatever), please let me know. I can wait a few more hours before I'm forced to kill and restart it. Unfortunately, my use case does not allow me to run it with -vv all the time, as it would generate a ton of output and I don't have the disk space to store it all (free space is really tight on this machine).

Another idea would be to implement (if it's not implemented already) some special way to turn -vv on for an already-running rclone process; say, via SIGUSR1 or by creating a file somewhere.

hello,
too soon to call this a bug?


Can you kill -QUIT the rclone process? This will generate a backtrace - can you post that here? You can attach a .log file if it is too big! That should show exactly what is going on.
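For example, assuming there is only one rclone process running (the goroutine dump goes to rclone's stderr, so it will end up wherever your log goes):

# Send SIGQUIT to the running rclone; the Go runtime responds by dumping
# a backtrace of all goroutines and then exiting.
kill -QUIT "$(pgrep -x rclone)"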


https://rclone.org/rc/#options-set
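In other words, something like this should flip the debug level on the process you already have running with --rc (untested sketch, based on the examples on that page, assuming the default rc address of 127.0.0.1:5572 shown in your log):

# Raise the log level of the running rclone to DEBUG (equivalent to -vv);
# the linked page also shows a numeric form for LogLevel.
rclone rc options/set --json '{"main": {"LogLevel": "DEBUG"}}'
# ...and put it back to INFO (-v) once the hang has been captured.
rclone rc options/set --json '{"main": {"LogLevel": "INFO"}}'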


Right away, Nick! Here it is: https://termbin.com/whfq

(Sorry for not pasting directly and/or attaching it here, but that would be way more difficult in my current circumstances.)

If you need anything else, please let me know.

Well it isn't a deadlock which is always my first suspicion...

This goroutine is waiting for the transfers to finish:

1: semacquire [545 minutes]
    sync       sema.go:56            runtime_Semacquire(*uint32(#65))
    sync       waitgroup.go:130      (*WaitGroup).Wait(*WaitGroup(#64))
    sync       sync.go:330           (*syncCopyMove).stopTransfers(#63)
    sync       sync.go:689           (*syncCopyMove).run(#63, #36, #19)
    sync       sync.go:889           runSyncCopyMove(#18, #36, #19, #82, #20, #48, #35, #12, #43)
    sync       sync.go:894           Sync(...)
    sync       sync.go:53            glob..func1.1(#73, #13)
    cmd        cmd.go:241            Run(#56, #24, #88)
    sync       sync.go:51            glob..func1(#24, #58, 0x2, 0xe)
    cobra      command.go:830        (*Command).execute(#24, #77, 0xe, 0x18, #24, #77)
    cobra      command.go:914        (*Command).ExecuteC(#23, #33, #25, #37)
    cobra      command.go:864        (*Command).Execute(...)
    cmd        cmd.go:506            Main()
    main       rclone.go:14          main()

And there are 2 transfers which haven't finished

2: chan receive [0~2 minutes] [Created by sync.(*syncCopyMove).startTransfers @ sync.go:322]
    pacer      pacer.go:159          (*Pacer).beginCall(#40)
    pacer      pacer.go:196          (*Pacer).call(#40, *, 0x3e8, 0x50, #9)
    pacer      pacer.go:216          (*Pacer).Call(#40, *, 0, 0)
    drive      upload.go:211         (*resumableUpload).Upload(*, #17, #54, *, 0x13b, *)
    drive      upload.go:111         (*Fs).Upload(#82, #17, #54, #15, *, *, #11, 0x18, 0, 0, ...)
    drive      drive.go:1793         (*Fs).PutUnchecked(#82, #17, #54, #15, *, #28, *, *, 0x1, 0x1, ...)
    drive      drive.go:1725         (*Fs).Put(#82, #17, #54, #15, *, #28, *, *, 0x1, 0x1, ...)
    operations operations.go:387     Copy(#17, #54, #19, #82, 0, 0, *, *, #21, *, ...)
    sync       sync.go:297           (*syncCopyMove).pairCopyOrMove(#63, #17, #54, #53, #19, #82, #64)

These are each waiting for a pacer token...

Here is the goroutine with the token, so the token hasn't leaked:

1: sleep [Created by pacer.(*Pacer).beginCall @ pacer.go:166]
    runtime    proc.go:310           goparkunlock(...)
    time       time.go:105           Sleep(16974333130)
    pacer      pacer.go:167          (*Pacer).beginCall.func1(#40, #34)

That sleep time is 16.9s, which usually indicates something has gone quite wrong and rclone needed to retry a lot - maybe the 750 GB/day upload threshold has been breached, or some other rate limit at Google.

Rclone follows the Google spec for backoff on errors, and the peak sleep time is 16s, which takes ages to get stuff done!
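Roughly, the retry schedule has this shape (just an illustration, not rclone's actual code):

# Exponential backoff, capped at 16s, with some random jitter added on
# top - which matches the ~16.9s sleep seen in the trace above.
wait=1
for retry in 1 2 3 4 5 6; do
    echo "retry $retry: wait ${wait}s + jitter"
    wait=$(( wait * 2 ))
    [ "$wait" -gt 16 ] && wait=16
done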

If it was a deadlock I could have fixed it with the traceback, but I'm going to need a -vv log of it going wrong I'm afraid!

Hello @ncw,

Thanks for the quick and detailed analysis.

I will try and get you a -vv log the next time things go south.

Would a partial log from the moment I see a transfer stuck at 100% be enough (using the rc trick @darthShadow mentioned)? Or should I try to bring up a syslog server on some other machine with enough disk space, as @asdffdsa suggested, so as to get the -vv log from the beginning?

Cheers,

-- Durval.

The bit just before it gets stuck would be most interesting... I expect you are going to see lots of low level retries at that point.

Ok, I will try with -vv going the syslog way.
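Probably something along these lines (just a sketch; 'loghost' below is a placeholder for the machine that has the free disk space):

# Option 1: send rclone's log output to the local syslog daemon, which can
# then be configured to forward to the remote machine:
rclone sync -vv --syslog --syslog-facility LOCAL0 \
  /REDACTED/REDACTED/REDACTED GDRIVE_REMOTE:REMOTEDIR

# Option 2: skip syslog entirely and stream the compressed debug log over ssh:
rclone sync -vv /REDACTED/REDACTED/REDACTED GDRIVE_REMOTE:REMOTEDIR \
  2>&1 | ssh user@loghost 'gzip >> rclone-vv.log.gz'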

Will let you know as soon as I have it.

Cheers,
-- Durval.

Just keeping everyone posted: the first night's backup ran with -vv and of course (Murphy's Law) the issue did not happen... what did happen was the backup log growing from ~580KB to ~1.8GB(!) :frowning:

Anyway, I'm going to keep -vv on for as long as I can. I just hope the issue shows up sometime in the next few days...

Hello @ncw,

I just reviewed my logs in detail, and I'm quite sure the 750GB limit has not been breached: in the 24h period in which the issue happened, less than 20GB was uploaded on this account.

Of course, it could be some other limit. The backup script runs rclone sync from local to gdrive in a loop, to copy any and all new/modified files from the borg repo while borg create is running (roughly as sketched below). As the repo has about 32K files (and borg takes about 6.5h to run), perhaps I'm reaching some kind of "transactions per day" limit (even using --fast-list, which AFAIK reduces the number of transactions tremendously).
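Schematically, the script does something like this (heavily simplified; paths are redacted/made up, not the real ones):

#!/bin/sh
REPO=/REDACTED/REDACTED/REDACTED

# Start the borg backup in the background (takes ~6.5h for this repo).
borg create "$REPO::{now}" /data/to/backup &
borg_pid=$!

# While borg is still writing, keep shipping new/changed repo files up.
while kill -0 "$borg_pid" 2>/dev/null; do
    rclone sync -v --fast-list "$REPO" GDRIVE_REMOTE:REMOTEDIR
    sleep 300
done

# One final pass after borg finishes, to pick up whatever is left.
rclone sync -v --fast-list "$REPO" GDRIVE_REMOTE:REMOTEDIR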

Is anyone aware of any such limit?

Thanks in advance,
-- Durval.

you could enable compression on the folder that the log file is in.


Thanks for the suggestion, but it's being logged on a Unix system via syslog (there's no way I could keep it on the Mac, even with compression).

Trying to answer that myself, I searched and found this: https://stackoverflow.com/questions/10311969/what-is-the-limit-on-google-drive-api-usage

Most relevant part:

To view your allowed quota please create a project in the Google APIs Console. In the "Service" tab, the default quota allowed for each service is indicated.
Currently for the Drive API it reads "Courtesy limit: 1,000,000,000 queries/day". It's a per app quota.

I've checked that screen for the account and project used by the hanging rclone and it says the same ("1,000,000,000 queries/day"), but it also shows a graph with the daily usage for the last ~30 days, and my maximum usage was less than 80,000 queries.

So, it doesn't seem to be that.

Is anyone aware of any other possible limits or quotas?

Staring at the Google Console, I had a sudden inspiration and went on to check for API errors (in a screen I remembered seeing in the past). Here's what I found:

So, there was a 2.69% error rate on the same day (Aug 20) that my rclone process hung; curiously, all the errors were in the drive.about.get API method, which I would not have expected to be called intensively by rclone (and all the other methods show zero errors, including drive.files.update).

I then downloaded the CSV to be able to see more detail.

All the errors were between 04:33:21 and 05:02:09 UTC (i.e., 01:33:21 and 02:02:09 in my timezone, GMT-3), while the issue here started after 03:26:38 GMT-3 (which is when the files hit 100% transferred but did not finish, hanging there instead); in other words, the errors stopped more than an hour before the hang began.

So, I don't think this hanging issue can be explained by anything causing API errors. And any quotas or limits would be expected to be recorded as errors, no?

So, that's what I have so far. The next "daily" backup will run sometime during Sunday, with Friday's data (long story, don't ask); let's see whether the issue happens then, as rclone will be running with -vv.

Cheers,
-- Durval.

drive.about.get is called by the mount to work out the disk usage. It shouldn't be called more often than once per minute.

Google definitely don't publish all of the quotas so I wouldn't be surprised if you are hitting an undocumented one.
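(Incidentally, the same About call is what backs the rclone about command, so you can exercise it by hand if that ever helps narrow things down:)

# Show the Drive storage quota/usage for the remote; on the drive backend
# this goes through the same drive.about.get API method.
rclone about GDRIVE_REMOTE: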


Hello @ncw,

By "the mount", do you refer to a running rclone mount? There's no rclone mount of this remote at all, just rclone sync.

Yeah, just my bad luck to run into it before anybody else :expressionless:

Anyway, it is my good luck that rclone has you as author/developer, thanks again for all the great assistance @ncw!

I think it might be called once per run to work out mappings to and from Google Doc mime types.


Hi @ncw,

Good news: my latest backup just manifested the stuck-at-100% behavior, and I managed to obtain a complete, start-to-finish -vv log; please see it here: https://termbin.com/cpus

In this log, I see that a single 302MB file was found to be modified, and instead of uploading it in less than 1 minute as it should on my 50Mbps link, rclone sync just sat there for almost 14 minutes, all the while reporting the transfer to be at 100% and without transferring any more data, until I could not wait anymore and had to kill it (tight backup window and all that). As in the past occurrences, after killing the hung process I manually ran a new rclone sync using the exact same command and options, and it finished without any issues.

We can also see in the log that absolutely nothing seems to be happening at the moment the transfer hangs: no Google Drive API errors, no retries, no new chunks being uploaded (in fact, there are no new chunks to upload: the last one was logged as being sent at 2020/08/23 16:18:28), no anything.

What I can see that could (perhaps) explain the issue is a message at 2020/08/23 16:17:30 reporting that "source file is being updated"; this is expected (the directory is being copied 'live' while it's being altered by borg-backup -- so the sync-up-to-the-cloud happens in parallel with the update, and after borg-backup finishes, a final rclone sync will copy any remaining changed files).

Anyway, the strange thing (and perhaps this is the bug) is that the "source file is being updated" message reports the file size as having changed "from 317522895 to 318571459", while the final chunk upload message reports "310378496 length 71443992", which add up to 381822488 -- which is not the new size reported by the "source file is being updated" message. So, perhaps the hang is being caused by some code in rclone expecting the transfer to reach the 318571459 mark, while the actual upload code is working with this other (381822488?) size?

OTOH (and that's why I wrote "perhaps" above), the hung rclone sync that I reported in the OP of this topic was run after borg-backup had already finished -- so no 'changed file' could have happened there. So perhaps we are seeing two different issues?

Nick, please let me know if you need anything else to run this down -- ditto if you want me to open a GitHub issue for it.

Thanks again, and cheers,
-- Durval.

Great, bug captured!

This is a definite possibility. I'll have to work through the code and see whether that is possible.

There was also this in the log, just before the last part:

2020/08/23 16:18:27 NOTICE: Scheduled bandwidth change. Bandwidth limits disabled

Which could be relevant also. Could that have been the cause of the previous problem?
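(If you want to rule that in or out by hand, the bandwidth limit can also be toggled on the running process via the rc, something like:)

# Set and then clear the bandwidth limit on the rclone started with --rc
# (the rates here are only examples).
rclone rc core/bwlimit rate=4M
rclone rc core/bwlimit rate=off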

I think an issue would be a good idea. I'm not at home right now so can't investigate further but I don't want to forget about this!