Hitting 403 user rate limit, but why? Gdrive 100G blob limit ban?

Hey all,

Recently ran into this issue when running sync between 2 of my completely unrelated Gdrives:
2019-08-02 19:22:51 DEBUG : pacer: low level retry 2/10 (error googleapi: Error 403: User rate limit exceeded., userRateLimitExceeded)

Command:
rclone sync TD2: TD3: -P --fast-list --track-renames -v
(this is with server_side_across_configs = true enabled; it had been working perfectly up until this point)

I'm also certain I didn't hit the 750GB/day limit, as I was aware of it. Besides, wouldn't that produce a different error? It also hit 2 different drives at about the same time, even though I had barely used one of them for days.

Also, I am certain this is not API rate limiting, because after extensive testing I start getting rate-limited either immediately or almost immediately even with extremely low --transfers and --checkers, --tpslimit 1, and even after forcing the drive pacer lower (not sure if that is technically the same thing as tpslimit). Google API metrics also indicate that I am nowhere close to maxing out: I barely spike to 4 requests a second at worst when not using an extreme tpslimit, and even less when I do. These drives now seem almost completely locked down. I can still get listings from them, and even transfer a file or two on occasion if I leave it running, but it's operating at something like 1 request per 10 seconds... unusable.
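For reference, the throttled tests were variations roughly along these lines (the exact values varied between runs, and the pacer flag below is how I'd approximate "forcing the drive pacer lower"):

rclone sync TD2: TD3: -P --fast-list --track-renames -v --transfers 1 --checkers 1 --tpslimit 1 --drive-pacer-min-sleep 1000ms

Even throttled this hard, the 403s still start almost immediately.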

And yes, I am using my own OAuth client - a separate one for each of the drives I use, so there is no shared quota. Nor have the drives in question been used by anything other than the sync operation.

So after much more research I came across this, which I now strongly suspect is the cause of the problem. Supposedly it is a quote from a Google rep at some point (taken from an old and very long related thread):

Also, in June 2017 a quota for creating blobs of 100Gb/day was established. It’s possible to create files of bigger size, but after this quota is exceeded all subsequent blob creation operations will fail.

After reading this I of course checked whether I had any files above 100GB, and indeed I did on the affected drives - 2 in fact (but they were almost certainly transferring at the same time, which probably explains it). It also makes sense to me that the problem started at, or shortly after, the point these got transferred. So if this hypothesis is correct, >100G files are basically poison to Gdrives and will render them useless for the next 24h(?). That seems like something to avoid at all costs, for obvious reasons. A very curious limit given the upload max is 750G - but who am I to question the infinite wisdom of Google...
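(If anyone wants to check their own remotes for files like this, a size-filtered listing should do it - something along the lines of:

rclone ls TD2: --min-size 100G

using rclone's standard --min-size filter.)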

This leaves me to a few questions:

    • Can anyone else please confirm if this is an actual limitation?
    • Assuming that it is, should this not be mentioned somewhere in the Drive backend documentation? It seems to me it would make sense to also mention the 750G upload limit (I think I have even seen that officially stated as a limit at this point, besides it being experimentally verified by several users). I understand the hesitation to put information in the docs that isn't hard-verified, or to add a limit in rclone itself, but it seems like very useful information for users to have, because this is one of those things you otherwise need to do a lot of searching to find out about. I don't expect the average user would know how to debug this problem themselves.
    • Shouldn't we have some mechanism to either block or deal with >100G files, to prevent this from happening and causing grief to users who don't understand why things suddenly stop working? The ideal solution would obviously be to split the files somehow, but as a stopgap measure even throwing an error seems appropriate. Even if you are aware of this limitation, it is easy to trip over by mistake when handling large folder structures.

@ncw Would appreciate your thoughts on this, and tell me if you want an issue for it.

Config details below for completeness, even though I think it's not relevant:

[TD2]
type = drive
client_id = [REDACTED]
client_secret = [REDACTED]
scope = drive
token = [REDACTED]
team_drive = [REDACTED]
upload_cutoff = 256M
chunk_size = 256M
server_side_across_configs = true

[TD3]
type = drive
client_id = [REDACTED]
client_secret = [REDACTED]
scope = drive
token = [REDACTED]
team_drive = [REDACTED]
upload_cutoff = 256M
chunk_size = 256M
server_side_across_configs = true

Is that your only log line with that error? Using the defaults, you probably hit more than 10 transactions a second, and that's a normal error to see when you do.

Did the rest of the log continue or did it error out?

No, I actually have a full log here; I just shortened it because it didn't seem relevant - it's just the same problem repeating forever. I will post the full log below (this particular attempt never even got past listing the drive, although that does seem to eventually happen if you leave it running a really long time).

I am 99.9% sure this is not an API limit, as explained in the post.

These and other tests did not error out; I stopped them manually once it was obvious they weren't working. I did leave other tests running far longer than this to be 100% sure it wasn't all inside the same 100-second interval. This happens no matter what - even if I waited 10 minutes between attempts.

Log:
https://pastebin.com/4hcyVJv0

It's relevant as it shows what the error could be. A single 403 rate limit is not an issue and that's all you posted.

Sharing the whole log, you can see you got 10 in a row and transferred nothing, which is a bit different.

Is there a reason you are using a beta build?

No reason other than to have the latest fixes. It's no special branch, just the "latest beta" main build.
I did try 2 versions as part of my initial troubleshooting, but they were both betas - I went from 1.48-82 to 1.48-103. No difference at all.

I must note that this server-side copying had been working fine over several days (not continuously, as I'm aware of the 750G limit) without any issue. The problem suddenly started after these large files were transferred.

I could try 1.48 stable if you feel that's relevant.

Right now my immediate plan is to get these files off my drives and try again tomorrow once the "ban" expires. Since I won't be able to easily test more after that, I will give 1.48 stable a go before I remove them.
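Probably just something simple like

rclone move TD2: /path/to/local/storage --min-size 100G -P -v

to pull anything over the threshold back down locally (the local path is just a placeholder) - assuming the drive will even let me at its current crawl.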

The log has 0 files checked or transferred though.

The beta is unreleased, so normally the stable version would be the best to use, as a beta is, well, a beta version and not considered production-ready.

I'd probably check the API console and see what errors you are actually getting and on what as that would provide a direction.

Of course. Never had issues with the betas, but I understand it's something we need to eliminate.

Just ran a re-test with 1.48 stable. Identical results.
Log below:
https://pastebin.com/pCnda1Qu

(I know this is TD1 and TD2 rather than TD2 and TD3. Both TD2 and TD3 got hit with the same problem and act identically; I just no longer had the files on TD2 to test with.)

I have had listings complete, and even a few files transfer, while in this state. The drives don't seem to be completely locked down (for one thing, I did eventually manage to remove the files), they just seem absurdly rate-limited - something like a request every 10 seconds, if even that much. It's very strange.

I'm not sure how to see the errors broken down by type...

But basically the graphs for the tests I showed you peak at 4 req/sec, with the average close to 2 req/sec. As soon as the operation starts, the error rate quickly ramps up towards almost 100% the longer I leave it running.

EDIT: I figured it out.
It's all drive.files.list and drive.files.copy errors.

The distribution between the two varies, which makes sense, as the listing only happens at the start, so the longer I leave it running the higher the percentage of copy errors.
I'm not trying to perform any other commands in this case (except maybe it would run into a couple of moves, but I probably never got that far).

Hmm. Copy errors would make me think upload limit for the day, but if you aren't getting any listings at all and those are erroring out too, that seems strange.

Are you able to just rclone ls on the source ?

Have you taken note of the 100G blob limit I talked about in the original post? I don't know if you maybe skimmed it (I know it's long and I can be overly verbose) and missed that part. That seems like the obvious culprit to me as of now. Are you aware of this at all?

I can
rclone ls TD2:
for example, yes.

It is however much slower than would be normal, and in the debug spam I see a lot of rate limit errors - but it does complete.

As for daily upload limits, I have been aware of them and have been trying to avoid them. I could maybe have been careless and let TD2 hit the quota (though I doubt it), but absolutely not TD3 - I had barely transferred 150-ish GB when this hit that drive as well. No way that hit quota. I can't show any hard logs of this obviously, so you will have to take my word for it =P

I have the luxury of having access to a few unrelated drives. If need be, tomorrow I will sacrifice one to a potential 24h ban by uploading a single file above 100G to an otherwise empty and unused drive, just to see if I can confirm my 100G blob hypothesis beyond doubt.
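The test itself would be something simple like creating a dummy file just over the limit and copying it up (the remote name here is just a placeholder):

truncate -s 101G dummy.bin
rclone copy dummy.bin TDTEST: -P -v

and then watching whether that drive starts throwing the same endless 403s afterwards.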

It's getting late though, so I will leave that for the morrow so that I have all day to log the problem - should it turn out to be repeatable.

I guess we could stick some info in the drive Limitations section. A link to an official document would be great, but if we can't get one then, you are right, this will help people.

Fancy sending a PR with a patch to the docs? drive.md is the source of all the drive docs, and that is the only file which needs patching.

You can block them with --max-size 100G in your syncs.
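For example, something like this (using the remotes from your config above):

rclone sync TD2: TD3: --max-size 100G -P --fast-list --track-renames -v

--max-size is a filter, so anything over the limit is simply excluded from the transfer.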

If we block them though, people won't be able to upload 100G files at all.

Perhaps a warning message?

100G is a really big file! Is it a VM image?

It's a piece of old raw footage - not something really meant to be permanent, but in my effort to clean up and reorganize my data I kind of just wanted to throw these files on the Gdrive for now and transcode them to something more reasonable later. It's not a typical file, no.

Will a VFS mount actually refuse to take in such a file with --max-size 100G, or will rclone just refuse to sync it afterwards?

If we have to abide by a 100G limit for individual files, that's not really much of an issue for me personally - as long as I know about it and/or rclone can handle it or protect against it being triggered accidentally. Either a warning message or an optional flag to allow files above 100G seems appropriate. Eventually some feature to automagically split them would be nice, but that's something we can probably worry about later, if we ever get code to do transparent in-line archiving for small-file efficiency too.

So is this something you knew about, ncw, or is it news to you as well?
If needed I can run a test to jam an otherwise unused drive with a >100G file and confirm beyond doubt that this is in fact what we are seeing. If you want me to run any particular logging (beyond what I logged earlier in the thread) as it triggers, please let me know ASAP - because I really don't want to have to do this more than once =P If it is a limitation on Google's side then I'm not sure how useful the logs would be - but you are the best person to know that. Just say the word :slight_smile:

Yes, if you want me to add a few words to the docs I would be happy to do that.
I'm not at all sure how the GitHub workflow for a "patch" works, but I'm sure I will figure it out, and I might as well learn.

Ah, I didn't realise you were uploading via rclone mount. --max-size will work well with sync/copy, but with mount it will just stop those files appearing in listings - it won't stop them being uploaded.

No it is news to me!

I'm not sure I need any logs.

I'd be willing to put a warning in - that would be very easy. If you make the issue - I'll do the code!

A pull request is most convenient for me, but if you want to write stuff and paste it here, that works for me too. Writing the stuff is the hard part!

In terms of protecting the mount:

Perhaps throw an I/O error back to the OS unless the user has enabled an optional "allow huge files" flag.
Basically the same mechanism that prevents me from uploading files whose paths would be too long to work with encryption. rclone could print an explanatory error as that happens, so it's at least easy to debug.

Ultimately I trust you to know best here.

Before I make an issue I will:

  • Confirm my original sync operation works again after the >100G files were removed (once the ban ends)
  • Try to specifically trigger the issue on an otherwise unused drive, to be 100% sure this is what is happening (will not be logging this, as per your comments - keeping it simple)

Posting experiment results here when I can...

Actually there is a problem here... The mount doesn't know the length of the file until the user has uploaded it - that is the way filing systems work. So if not using --vfs-cache-mode it will be uploaded immediately. If using --vfs-cache-mode writes, it will get written to disk first, and at that point the backend could throw a warning about 100G files.
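So something like

rclone mount TD2: /mnt/td2 --vfs-cache-mode writes

(the mount point is just an example) would at least let the backend see the full size before the upload starts, whereas with the default cache mode the data is streamed straight through as it is written.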