1.49.4 / Plex / Internal Errors on Google Drive

Animosity022 · September 30, 2019, 5:52pm

What is the problem you are having with rclone?

Since my upgrade last night, I started to notice getting a lot of INTERNAL errors and being unable to download things if I spawned off a lot of scans. It gets to the point where the system gets stuck on a file and won't release it and the only option is to reboot.

I rolled back to 1.49.3 and am not seeing the issue.

What is your rclone version (output from `rclone version`)

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Ubuntu 18.04

Which cloud storage system are you using? (eg Google Drive)

Google Drive

The command you were trying to run (eg `rclone copy /tmp remote:tmp`)

 /usr/bin/rclone mount gcrypt: /GD --allow-other --dir-cache-time 100h --log-level INFO --log-file /opt/rclone/logs/rclone.log --timeout 1h --umask 002 --rc --rc-addr 127.0.0.1:557

A log from the command with the `-vv` flag (eg output from `rclone -vv copy /tmp remote:tmp`)

2019/09/30 13:36:34 INFO  : Google drive root 'crypt': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 1923; INTERNAL_ERROR
2019/09/30 13:42:04 INFO  : Google drive root 'crypt': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 1929; INTERNAL_ERROR

I can pretty much reproduce it as I tried on Fedora 30 as well and the problem is there as well. Once I get my system back in order, I can probably run some debug logs.

Animosity022 · September 30, 2019, 6:08pm

Those errors started for me spot on when I upgraded to 1.49.4.

2019/09/29 17:30:23 NOTICE: Serving remote control on http://127.0.0.1:5572/
2019/09/29 17:40:44 INFO  : Google drive root '': Change notify listener failure: Get https://www.googleapis.com/drive/v3/changes?alt=json&fields=nextPageToken%2CnewStartPageToken%2Cchanges%28fileId%2Cfile%28name%2Cparents%2CmimeType%29%29&includeItemsFromAllDrives=true&pageSize=1000&pageToken=114487&prettyPrint=false&supportsAllDrives=true&teamDriveId=0AHLaKhsyfucUUk9PVA: stream error: stream ID 2115; INTERNAL_ERROR
2019/09/29 17:47:12 NOTICE: Serving remote control on http://127.0.0.1:5572/
2019/09/29 18:30:28 INFO  : Google drive root '': Change notify listener failure: Get https://www.googleapis.com/drive/v3/changes?alt=json&fields=nextPageToken%2CnewStartPageToken%2Cchanges%28fileId%2Cfile%28name%2Cparents%2CmimeType%29%29&includeItemsFromAllDrives=true&pageSize=1000&pageToken=114487&prettyPrint=false&supportsAllDrives=true&teamDriveId=0AHLaKhsyfucUUk9PVA: stream error: stream ID 4423; INTERNAL_ERROR
2019/09/29 18:35:58 INFO  : Google drive root '': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 4425; INTERNAL_ERROR
2019/09/29 18:41:28 INFO  : Google drive root '': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 4427; INTERNAL_ERROR
2019/09/29 18:46:58 INFO  : Google drive root '': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 4429; INTERNAL_ERROR
2019/09/29 18:52:28 INFO  : Google drive root '': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 4431; INTERNAL_ERROR
2019/09/29 19:06:43 INFO  : Google drive root '': Change notify listener failure: Get https://www.googleapis.com/drive/v3/changes?alt=json&fields=nextPageToken%2CnewStartPageToken%2Cchanges%28fileId%2Cfile%28name%2Cparents%2CmimeType%29%29&includeItemsFromAllDrives=true&pageSize=1000&pageToken=114487&prettyPrint=false&supportsAllDrives=true&teamDriveId=0AHLaKhsyfucUUk9PVA: stream error: stream ID 1607; INTERNAL_ERROR
2019/09/29 19:12:13 INFO  : Google drive root '': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 1609; INTERNAL_ERROR
2019/09/29 19:17:43 INFO  : Google drive root '': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 1611; INTERNAL_ERROR
2019/09/29 19:23:13 INFO  : Google drive root '': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 1613; INTERNAL_ERROR
2019/09/29 19:28:43 INFO  : Google drive root '': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 1615; INTERNAL_ERROR
2019/09/29 19:28:58 Fatal error: failed to umount FUSE fs: exit status 1: fusermount: entry for /GD not found in /etc/mtab
2019/09/29 19:29:27 NOTICE: Serving remote control on http://127.0.0.1:5572/
2019/09/29 19:56:05 INFO  : Google drive root '': Change notify listener failure: Get https://www.googleapis.com/drive/v3/changes?alt=json&fields=nextPageToken%2CnewStartPageToken%2Cchanges%28fileId%2Cfile%28name%2Cparents%2CmimeType%29%29&includeItemsFromAllDrives=true&pageSize=1000&pageToken=114489&prettyPrint=false&supportsAllDrives=true&teamDriveId=0AHLaKhsyfucUUk9PVA: stream error: stream ID 2117; INTERNAL_ERROR
2019/09/29 20:06:27 INFO  : Google drive root '': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 2123; INTERNAL_ERROR
2019/09/29 20:11:57 INFO  : Google drive root '': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 2133; INTERN:

That's pretty crazy. I'm down at 1.49.3 and restored back my Plex and the things that weren't working as scanning now without any issue.

thestigma · September 30, 2019, 8:20pm

I'll try to reproduce.
What sort of load on the remote do you feel is triggering this? A lot of listings? As you know, I don't use Plex.

Animosity022 · September 30, 2019, 8:32pm

Didn't seem much load. I noticed that the directory cache would just go away too. I have to build something a little more isolated and I can do some testing.

Didn't seem like much load caused it. Plex is pretty bad in terms of wanting to open 20-30 files at a time though.

ncw · September 30, 2019, 8:42pm

There aren't any changes in the drive backend code or the mount or VFS...

One thing that did happen was a new point release of go though. Can you check go versions with the working and not working versions? (With rclone version)

Those are http2 errors by the look of it which again points at the go standard library...

Can you repro the problems with the beta?

thestigma · September 30, 2019, 8:42pm

this error should show under regular --verbose right?
I'm running a test-mount in the background and doing some various stuff with it while keeping an eye out for that error.

I haven't noticed anything strange on 1.49.4 yet, but my drives haven't been in normal use either since I'm building and testing a new suite of maintenance scripts.

ncw · September 30, 2019, 8:45pm

I just checked .. those errors are from the server not from the library.

Animosity022 · September 30, 2019, 8:51pm

I'm going to do some testing and see if I can create it without breaking my main streaming system as that makes the wife unhappy.

felix@gemini:/tmp$ rclone version
rclone v1.49.3
- os/arch: linux/amd64
- go version: go1.12.9

Works fine. I rolled back to this and basically dropped a library of ~2K movies and it's in the process of adding them back in. No issues, no locks, nada.

With 1.49.4, I was getting hung processes, which were stuck on IO (reading files from the mount)

Sep 30 13:27:01 gemini kernel: [ 9909.237230] INFO: task Plex Media Scan:9381 blocked for more than 120 seconds.
Sep 30 13:27:01 gemini kernel: [ 9909.237299]       Not tainted 5.0.0-29-generic #31~18.04.1-Ubuntu
Sep 30 13:27:01 gemini kernel: [ 9909.237356] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 30 13:27:01 gemini kernel: [ 9909.237428] Plex Media Scan D    0  9381   7352 0x00000004

Once that happens, I couldn't terminate anything as I could only reboot. I had the issue both on Fedora 30 and Ubuntu 18.04.

This version is the one that blew up:

felix@gemini:~/test/rclone-v1.49.4-linux-amd64$ ./rclone version
rclone v1.49.4
- os/arch: linux/amd64
- go version: go1.13.1

I need to setup a small compute VM so I can do some additional testing.

VBB · September 30, 2019, 9:06pm

I had the same issue last night after updating to the latest available rclone beta. Started the mount around midnight, fired off a Plex scan and went to bed. This morning I noticed that Plex was still scanning, so I checked the rclone mount, and it had the same errors as Animosity's. This all started around 2am in my case, so about 1.5 hours into my scan.

Funny thing is, I had not only updated my version of rlcone last night, as I usually do every few days, but I had also changed some mount settings to reflect Animosity's recent changes. So, before coming upon this post, I assumed it was the newly added "--poll-interval 15s" causing this. BTW Ani, why did you add that instead of the default 1m?

Animosity022 · September 30, 2019, 9:10pm

I keep a good log of all my changes and once I noticed the issue, I assumed it was a setting, rolled almost all of those back to defaults. Assumed it was OS. Reinstalled a Fedora 30 and an Ubuntu 18.04.

Once I realized it was exactly the time the upgrade, I rolled back rclone on the same Ubuntu 18.04 that was hung on a Movie scan, and it's working now. So I'm certain the .3 -> .4 was the culprit.

I added the 15 seconds just because I can as I'm the only consumer of my key and it helps reduce some lag on uploads by noticed and we have plenty of API calls to spare. I was going to make it 10 probably

VBB · September 30, 2019, 9:17pm

You need to announce these changes more prominently to your followers. I always have to dig deep into your GitHub to see all the things you changed, and then I still have to come back here to ask why

Jokes aside, I noticed in your mount above you omitted most of what you had added/changed recently. I don't see the poll interval or the attr timeout change, also looks like you got rid of your recent buffer change from 1G to 128M altogether. Comment?

Sorry for getting off-topic. Just curious if and what could have caused these errors, since Nick said there were no changes that would have affected anything.

VBB · September 30, 2019, 9:26pm

BTW, this was the first time for me in almost two years that I got this error, which I assumed was caused by hitting the API limit.

I haven't rolled back rclone yet, so I'm currently uploading with 1.49.4-112, and for some reason my upload speed (limited to 70M by using --bwlimit) is cut in half today. Only getting about 35M while uploading 87 large files simultaneously. Normally, I'd hit the limit within seconds and stay there til the last few files. Just thought I'd mention that in case it has anything to do with the above.

thestigma · September 30, 2019, 9:43pm

Keeping a drive going here under heavy load (rescanning a bunch of torrents + uploading a large amount of data). I'm not really able to easily test massive multithreading over the mount as I don't really use any software that does that via the OS mount.

Nothing yet.
But come to think of it I don't think I ever used straight 1.49.4 - I believe I upgraded straight from some version of 1.49.3 to what was the latest beta at the time:

rclone v1.49.4-107-g74f63008-beta

os/arch: windows/amd64
go version: go1.13.1

showkubii · October 1, 2019, 4:19am

xpenology

I have same problem on rclone v1.49.4 mount problem.

I downgrade rclone to v1.49.3 mount working.

ncw · October 1, 2019, 7:52am

Here are the changes from 1.49.3 to 1.49.4.

Perhaps the major change is the change from go1.12 to go1.13.

I think the only conceivable bit of code which could have broken it is the accounting.

$ git diff --stat v1.49.3..v1.49.4  | grep \.go
 backend/ftp/ftp.go                                 |  10 +-
 backend/s3/s3.go                                   | 110 +++++----
 backend/s3/s3_test.go                              |   5 +-
 cmd/rcd/rcd.go                                     |  12 +-
 fs/accounting/accounting.go                        |  11 +-
 fs/accounting/accounting_test.go                   |  33 ++-
 fs/accounting/transfer.go                          |   3 +
 fs/version.go                                      |   2 +-
 lib/oauthutil/oauthutil.go                         |  29 ++-

$ rclone-v1.49.3 version
rclone v1.49.3
- os/arch: linux/amd64
- go version: go1.12.9

$ rclone-v1.49.4 version
rclone v1.49.4
- os/arch: linux/amd64
- go version: go1.13.1

Things to try to narrow this down

disable http2. The error message is from http2 so let's try disabling it

export GODEBUG=http2client=0  # disable HTTP/2 client support

1.49.4 compiled with go1.12

https://pub.rclone.org/rclone-v1.49.4-go1.12.6.zip

1.49.3 compiled with go1.13

https://pub.rclone.org/rclone-v1.49.3-go1.13.zip

If any of those tests are successful then that will give us something to dig on.

Ideally I'd like something I can reproduce so which doesn't involve plex if anyone has any ideas!

Animosity022 · October 1, 2019, 10:04am

I've been trying to reproduce and my Google Compute VM seemed to work ok in my short tests.

The only thing that Plex does is that it basically does a lot of mediainfo/ffprobe on files at the same time.

I'm trying to get a way to produce the issue without killing my home system

thestigma · October 1, 2019, 3:28pm

I managed to reproduce the issue on my own setup now:

2019/10/01 17:05:49 ERROR : IO error: open file failed: Get https://www.googleapis.com/drive/v3/files/1okKnNST30tKMZhmuA2g2CF_2QT0TuKlP?alt=media: stream error: stream ID 693; INTERNAL_ERROR
2019/10/01 17:05:49 INFO : Google drive root 'Crypt1': Failed to get StartPageToken: Get https://www.googleapis.com/drive/v3/changes/startPageToken?alt=json&prettyPrint=false&supportsAllDrives=true: stream error: stream ID 695; INTERNAL_ERROR

I ran a consistent heavy load, as described earlier, overnight. It ran fine for many hours before I went to bed - and as far as I see from the log (casually scanning it, not super thorough) it didn't manifest the problem until late in the afternoon next day. When it did though it stalled pretty bad for a long time. I ended up restarting the mount eventually.

I guess the question is:
Is this a temporary server-side problem at Google or an rclone bug in the new version?
I guess both are possible.

@ncw Please suggest how to proceed in terms of testing. What would be most useful to you? (include spesific details of logging-level if relevant).

You don't suspect any risk of data-damage do you? I assume you just meant you don't want to compromise the reliability of your main system while you try to trigger the error - but better safe than sorry, so I might as well ask

(Wife aggro is definitely a valid reason )

Animosity022 · October 1, 2019, 3:52pm

I reproduced the bug constantly for ~2 days and it immediately went away when I downgraded versions. It could be a coincidence but man, that's some bad luck if it is.

Since I've rescanned a full library a few times and 0 issues. I've never seen that error before dating back to June in my logs.

djsam · October 1, 2019, 4:00pm

Same here using two servers
It went away after downgrading

thestigma · October 1, 2019, 4:05pm

I agree that it seems like that based on your experience, but I think we don't really have the statistical foundation to call it conclusive. If it's a Google problem it could come and go "randomly" and that makes it very hard to verify short of testing a LOT and having all the results be consistent with each-other. We've seen more than a few times suspicions of new bug pop up that with some time vanish into thin air (indicating a service error was the reason).

I definitely want to get to the bottom of this though. Let's assume it's a bug until we can reasonably verify otherwise.

I am running similar testing again now to see if I can get the same problem and note any pattern.