OneDrive: rclone getting throttled and pacer sleep time skyrockets to >5m

What is the problem you are having with rclone?

Performing a sync operation with many small files (in my test, around 10k files for about 200MB total, which is actually not that many) randomly causes rclone to be throttled, with the pacer unpredictably selecting a huge sleep time that hangs the operation for several minutes. In the log below you'll notice the pacer first selects a 5-minute sleep time, then very slowly decreases it to an acceptable rate. In this scenario a synchronization can appear dead for up to 20 minutes (without timing out). Killing the operation and restarting it right away will sometimes resume with an acceptable transaction rate, but more often rclone will hang in the same way. I also tried setting the user agent with --user-agent "ISV|rclone.org|rclone/v1.56.2", without improvement.

What is your expected behavior?

rclone should either behave in a way that gets it throttled less, or the pacer should not select these (nonsensically?) huge sleep times.

What is your rclone version (output from rclone version)

1.56.2

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Ubuntu 18.04.3, 64 bit

Which cloud storage system are you using? (eg Google Drive)

OneDrive for Business (Microsoft 365 Business Basic)

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone sync -vv "--exclude=cache/" "--exclude=backup/" /srv/vtiger/ onedriveitr:/BackupRUR/vtiger/

The rclone config contents with secrets removed.

[onedriveitr]
type = onedrive
token = [..]
drive_id = [...]
drive_type = business

A log from the command with the -vv flag

2021-10-02 18:55:38 DEBUG : logs/migration.log: Unchanged skipping
2021-10-02 18:55:38 DEBUG : logs/platform.log: Size and modification time the same (differ by -80.409291ms, within tolerance 1s)
2021-10-02 18:55:38 DEBUG : logs/security.log: Size and modification time the same (differ by -76.409309ms, within tolerance 1s)
2021-10-02 18:55:38 DEBUG : logs/sqltime.log: Size and modification time the same (differ by -80.409291ms, within tolerance 1s)
2021-10-02 18:55:38 DEBUG : logs/.htaccess: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 18:55:38 DEBUG : logs/platform.log: Unchanged skipping
2021-10-02 18:55:38 DEBUG : logs/security.log: Unchanged skipping
2021-10-02 18:55:38 DEBUG : logs/sqltime.log: Unchanged skipping
2021-10-02 18:55:38 DEBUG : logs/.htaccess: Unchanged skipping
2021-10-02 18:55:38 DEBUG : pacer: Reducing sleep to 18.3515625s
2021-10-02 18:55:38 DEBUG : languages/WSAPP.php: Size and modification time the same (differ by -156.786508ms, within tolerance 1s)
2021-10-02 18:55:38 DEBUG : languages/WSAPP.php: Unchanged skipping
2021-10-02 18:55:39 DEBUG : pacer: Reducing sleep to 13.763671875s
2021-10-02 18:55:39 DEBUG : pacer: Reducing sleep to 10.322753906s
2021-10-02 18:55:39 DEBUG : kcfinder/browse.php: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 18:55:39 DEBUG : kcfinder/browse.php: Unchanged skipping
2021-10-02 18:55:39 DEBUG : kcfinder/css.php: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 18:55:39 DEBUG : kcfinder/css.php: Unchanged skipping
2021-10-02 18:55:39 DEBUG : kcfinder/js_localize.php: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 18:55:39 DEBUG : kcfinder/js_localize.php: Unchanged skipping
2021-10-02 18:55:39 DEBUG : kcfinder/config.php: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 18:55:39 DEBUG : kcfinder/config.php: Unchanged skipping
2021-10-02 18:55:39 DEBUG : kcfinder/upload.php: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 18:55:39 DEBUG : kcfinder/upload.php: Unchanged skipping
2021-10-02 18:55:39 DEBUG : pacer: Reducing sleep to 7.742065429s
2021-10-02 18:56:31 DEBUG : pacer: Reducing sleep to 5.806549071s
2021-10-02 18:56:38 DEBUG : pacer: Reducing sleep to 4.354911803s
2021-10-02 18:56:44 DEBUG : pacer: Reducing sleep to 3.266183852s
2021-10-02 18:56:48 DEBUG : Too many requests. Trying again in 322 seconds.
2021-10-02 18:56:48 DEBUG : pacer: low level retry 1/10 (error activityLimitReached: throttledRequest: The request has been throttled)
2021-10-02 18:56:48 DEBUG : pacer: Rate limited, increasing sleep to 5m22s
2021-10-02 18:56:52 DEBUG : pacer: Reducing sleep to 4m1.5s
2021-10-02 18:56:52 DEBUG : resources/app.js: Size and modification time the same (differ by -327.730187ms, within tolerance 1s)
2021-10-02 18:56:52 DEBUG : resources/app.js: Unchanged skipping
2021-10-02 18:56:52 DEBUG : resources/Connector.js: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 18:56:52 DEBUG : resources/Connector.js: Unchanged skipping
2021-10-02 18:56:52 DEBUG : resources/helper.js: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 18:56:52 DEBUG : resources/helper.js: Unchanged skipping
2021-10-02 18:56:52 DEBUG : resources/jquery.additions.js: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 18:56:52 DEBUG : resources/jquery.additions.js: Unchanged skipping
2021-10-02 18:56:52 DEBUG : resources/styles.css: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 18:56:52 DEBUG : resources/ProgressIndicator.js: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 18:56:52 DEBUG : resources/ProgressIndicator.js: Unchanged skipping
2021-10-02 18:56:52 DEBUG : resources/styles.css: Unchanged skipping
2021-10-02 19:02:14 DEBUG : pacer: Reducing sleep to 3m1.125s
2021-10-02 19:06:15 DEBUG : pacer: Reducing sleep to 2m15.84375s
2021-10-02 19:09:16 DEBUG : pacer: Reducing sleep to 1m41.8828125s
2021-10-02 19:09:16 DEBUG : schema/DatabaseSchema.xml: Size and modification time the same (differ by -331.730158ms, within tolerance 1s)
2021-10-02 19:09:16 DEBUG : schema/DatabaseSchema.xml: Unchanged skipping
2021-10-02 19:11:32 DEBUG : pacer: Reducing sleep to 1m16.412109375s
2021-10-02 19:11:32 DEBUG : cron/language/phpmailer.lang-en.php: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 19:11:32 DEBUG : cron/language/phpmailer.lang-en.php: Unchanged skipping
2021-10-02 19:13:14 DEBUG : pacer: Reducing sleep to 57.309082031s
2021-10-02 19:13:14 DEBUG : includes/exceptions/AppException.php: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 19:13:14 DEBUG : includes/exceptions/AppException.php: Unchanged skipping
2021-10-02 19:14:31 DEBUG : pacer: Reducing sleep to 42.981811523s
2021-10-02 19:14:31 DEBUG : include/database/PearDatabase.php: Size and modification time the same (differ by -601.793556ms, within tolerance 1s)
2021-10-02 19:14:31 DEBUG : include/database/PearDatabase.php: Unchanged skipping
2021-10-02 19:14:31 DEBUG : include/database/Postgres8.php: Size and modification time the same (differ by 0s, within tolerance 1s)
2021-10-02 19:14:31 DEBUG : include/database/Postgres8.php: Unchanged skipping
2021-10-02 19:15:28 DEBUG : pacer: Reducing sleep to 32.236358642s
Transferred:              0 / 0 Byte, -, 0 Byte/s, ETA -
Checks:                95 / 95, 100%
Elapsed time:     20m26.3s

I think that's a known issue as well:

hello and welcome to the forum,

as per the documentation,
https://rclone.org/onedrive/#getting-your-own-client-id-and-key
"The default Client ID and Key is shared by all rclone users when performing requests."
"If you are having problems with them (E.g., seeing a lot of throttling), you can get your own Client ID"

Hi Francesco,

The sleep times you see are due to rclone respecting the Retry-After time returned in the HTTP 429 (Too many requests) responses from OneDrive. There is no certain way to avoid them; they depend on many parameters, including the current load on the OneDrive servers hosting your data.

You can see it being received and handled in this line of your log:

2021-10-02 18:56:48 DEBUG : Too many requests. Trying again in 322 seconds.

You can however find some tips and experiences to reduce the throttling in this thread:

And read more about OneDrive throttling in this article:

On a side note, to illustrate the sporadic nature of the OneDrive throttling: I am looking at ways to improve the pacer, and my current problem is that I am unable to provoke the HTTP 429 responses, despite very heavy load (disregarding all of the recommendations in the above thread).

Thinking about it, perhaps I can use your situation to provoke the HTTP 429’s.

What are the characteristics of your data? Is it something that I can recreate - e.g. a program or git repository I can download?

What other activities did you perform on your OneDrive account in the 24 hour period before performing the test?

hi,

i have no issues triggering the HTTP/2.0 429 Too Many Requests, even on a sunday morning.

  • use personal not business
  • no client id
  • rclone sync --transfers=100 --checkers=100 of 1,000,000 local files, each of size 1 byte.
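The workload above can be generated locally with a sketch like this (paths and the remote name are examples, not from the posts above; scale the count up toward 1,000,000 to match the test described):

```shell
# Rough reproduction sketch: generate many 1-byte files, then sync them
# with very high concurrency to provoke HTTP 429 throttling.
rm -rf /tmp/throttle-test
mkdir -p /tmp/throttle-test
for i in $(seq 1 1000); do        # use a much larger count to reproduce
  printf 'x' > "/tmp/throttle-test/file_$i"
done
# rclone sync /tmp/throttle-test remote:throttle-test --transfers=100 --checkers=100 -vv
```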

i also get HTTP/2.0 503 Service Unavailable with a decode error.

2021/10/03 11:33:47 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2021/10/03 11:33:47 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2021/10/03 11:33:47 DEBUG : HTTP RESPONSE (req 0xc01c2c6300)
2021/10/03 11:33:47 DEBUG : HTTP/2.0 503 Service Unavailable
Content-Length: 974
Cache-Control: no-store
Content-Type: text/html
Date: Sun, 03 Oct 2021 15:33:46 GMT
X-Azure-Externalerror: 0x80072efe,OriginConnectionAborted
X-Msedge-Ref: Ref A: DE518F3B63BD40199EFA7A26C6632BC8 Ref B: EWR30EDGE0917 Ref C: 2021-10-03T15:33:28Z

2021/10/03 11:33:47 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2021/10/03 11:33:47 DEBUG : Couldn't decode error response: invalid character '<' looking for beginning of value
2021/10/03 11:33:47 DEBUG : pacer: low level retry 1/10 (error 503 Service Unavailable: )
2021/10/03 11:33:47 DEBUG : pacer: Rate limited, increasing sleep to 20ms
2021/10/03 11:33:47 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Thanks @asdffdsa Good to know, I will try that.

I gave up after 280,000 files totaling 13GB without seeing a single 429; it seems I should have had the persistence to quadruple that.

I also see the occasional 503s; they happen somewhat randomly and cause no real issues in my tests.

The decode error is due to OneDrive returning <nil> instead of a JSON value. It doesn't seem to be a problem, because the JSON part of the response is ignored later based on the 503 response code.

Thank you for the prompt reply. I first saw it happen while doing an upload test with an entire old copy of the vtiger crm php site that you can download from the following link:
https://sourceforge.net/projects/vtigercrm/files/vtiger%20CRM%206.1.0/Core%20Product/vtigercrm6.1.0.tar.gz/download

Afterwards, I noticed it when doing backups of GitLab installation data, which includes a lot of git repositories. Basically, bare workset data or application assets composed of many small files seem to trigger the throttling very easily in my scenario.

In the 24 hours before I began testing rclone uploads I was doing nothing special: I was testing an alternative onedrive client[1], downloading/uploading a small amount of data from the same account.

I will attempt to retry the test tomorrow following your suggestions and see if it changes anything.

[1] GitHub - abraunegg/onedrive: #1 Free OneDrive Client for Linux

Thanks, good info!

I am curious to see the location of your OneDrive datacentre. Can you perform this command:

 rclone lsd onedriveitr: -vv --dump headers

And then just post the first line from the log looking like this:

X-Ms-Ags-Diagnostic: {"ServerInfo":{"DataCenter":"West Europe","Slice":"E","Ring":"5","ScaleUnit":"002","RoleInstance":"AM2PEPF00007A48"}}
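If the full debug log is noisy, that header can be pulled out with a sketch like this (the sample line is the one quoted above; the log file name is an example):

```shell
# Write a sample debug-log line (taken from this thread) and extract the
# datacentre diagnostic header from it. In practice you would capture the
# log with something like:
#   rclone lsd onedriveitr: -vv --dump headers 2> rclone.log
cat > rclone.log <<'EOF'
X-Ms-Ags-Diagnostic: {"ServerInfo":{"DataCenter":"West Europe","Slice":"E","Ring":"5","ScaleUnit":"002","RoleInstance":"AM2PEPF00007A48"}}
EOF
grep -m1 'X-Ms-Ags-Diagnostic' rclone.log
```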

A quick peek at the code of the abraunegg onedrive client indicates that it handles 429 (and 503) somewhat similarly to rclone, and it seems to report them in the log.

If you see the 429 (Too many requests) responses in rclone again tomorrow, then it would be interesting to know whether they are also reported by the abraunegg client, and whether they result in similar slowdowns.

Sure, I have two, I will post both in order of appearance:

X-Ms-Ags-Diagnostic: {"ServerInfo":{"DataCenter":"North Europe","Slice":"E","Ring":"4","ScaleUnit":"003","RoleInstance":"DB1PEPF00005962"}}
X-Ms-Ags-Diagnostic: {"ServerInfo":{"DataCenter":"North Europe","Slice":"E","Ring":"4","ScaleUnit":"002","RoleInstance":"DU2PEPF000070BF"}}

Still related to the topic: I'm trying to resync a location which has not changed, both locally and remotely, and I have a question. Obviously in this scenario no real data transfer should occur; only metadata with checksums/timestamps/sizes should be exchanged. Still, in my tests I'm easily throttled as well. Observing the verbose log, it seems rclone is querying the remote state of each file one by one. Is this interpretation correct, or does rclone attempt to group the queries of several files and reduce API calls? Sorry if I'm making wild guesses/wrong interpretations.

I don't know the abraunegg client, but it may have scanned your entire OneDrive and repeated that at fixed intervals. If you have many files in your OneDrive, then that may have triggered the throttling (or brought it close to the limit before you tested rclone). I therefore suggest you unmount it during the cool-down (the next 24 hours) before trying rclone.

I am pretty sure it only lists the folders, but that may also be very demanding if you have many folders. Each folder requires a request. This can sometimes be optimized with --fast-list. But what you are seeing right now can't be explained by any of this. I would like to get the abraunegg client out of the equation and your OneDrive cooled down (min. 24 hours) to get a clear picture.

Doesn't rclone do the equivalent of an MD5 for each file in question somehow? With S3, which I know you aren't using, I think it gets this from metadata somehow.

Forgetting RClone, I have gotten bitten multiple times with throttling by Google so agree on removing as many differences as possible.

This is probably a really bad idea, but should we give rclone users the ability to specify the longest sleep time they want when being throttled? Obviously with strong warnings that setting this too low could cause an apparent DoS attack and might get the user's Google account locked and/or terminated. Just don't allow a time less than some predefined minimum you all choose.

I'd probably keep the thread to OneDrive rather than trying to mix things up as OneDrive's throttling is a bit nastier.

You can already do that with Google Drive; there is an exponential back-off algorithm (I'm sure that exists for other providers as well, but I only use Google myself). It's best to just leave the defaults as they are; everything you've stated is already taken care of.

You don't get your account removed for hitting the API too hard; you just get 403s and are rate limited / asked to back off.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

First, sorry for the very late reply. I had to request re-opening of the post to be able to reply here, since it was automatically closed, but I really wanted to post my final feedback and keep it close to the original report. To summarize: I had OneDrive sync operations that triggered throttling very easily, and the amount of throttling was unsustainable (>5 minutes). To add one more detail that maybe was not well specified: I'm using a OneDrive for Business account, and the company uses some SharePoint features, such as a SharePoint drive/link added to the root of "My Files" as a shortcut, though I'm not currently writing to/reading from this folder. Still, looking at other feedback, it seems throttling is often linked to the use of SharePoint features. Following the suggestions of commenters and the rclone guides I:

  • Set up a client ID and key correctly, as explained in the guide;
  • Attempted different user-agent combinations, such as ISV vs NONISV and various AppNames, as explained by Microsoft. Still, Microsoft doesn't explain very clearly what these attributes should be.

After all these attempts and an upgrade to the latest 1.57 version of rclone, I was still experiencing severe throttling, confirming that the issue is reproducible in my scenario and not related to the use of other unofficial clients. Fortunately, I stumbled upon this post, which basically suggests reducing the number of concurrent transactions and their overall rate. In particular I tried various combinations of --checkers. I reduced it to 4, then 2, and was always able to reproduce the throttling. Finally I set it to 1 and the throttling stopped arising. The bandwidth performance is greatly reduced, but at least the connection doesn't unpredictably hang anymore.
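For reference, the invocation that finally ran without throttling would look roughly like this (paths, excludes and the remote name are taken from the original report; only --checkers=1 is what I actually verified):

```shell
# Single checker: no concurrent metadata queries, trading speed for
# freedom from OneDrive throttling (sketch based on the original command).
rclone sync /srv/vtiger/ onedriveitr:/BackupRUR/vtiger/ \
  --exclude=cache/ --exclude=backup/ \
  --checkers=1 -vv
```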

In conclusion, in my scenario (which may be a combination of the type of account, the exact infrastructure and kind of cloud service used by the company, the type of data being synced, the use of a particular European regional server, and other variables) any use of concurrent/parallel transactions, as decided by the --checkers option in rclone, is strictly forbidden, no matter what else I do to better decorate the connection. I still have the feeling something is wrong in rclone with regard to the number of transactions generated for OneDrive (which seems huge and not optimal when synchronizing a huge number of files and folders), but I'm satisfied enough with the result achieved. I would love to hear if there are updates with regard to better handling of throttling/reduction of the number of transactions in rclone with OneDrive, even if I can't promise I will be able to test them promptly, similarly to the delay in posting my feedback with regard to the synchronization task I was implementing.


Thank you, your observations are very much appreciated and very helpful :blush:

I have since been able to provoke a similar behavior on my OneDrive Personal Premium and can confirm that I also didn’t see an effect of personal Client ID or user agent.

It is my observation that the message “429/Too many requests” really means “too many requests waiting in queue at the server” and the solution is to gradually reduce the number of parallel connections/requests used by rclone until it fits the current wishes of the server (think a dynamic reduction of --checkers and --transfers).

I typically see a stabilization when connections have been reduced to 12, which corresponds to the current defaults (--checkers=8 --transfers=4). Your observations clearly show that there isn't just a single optimum, and sometimes the connections/checkers need to be reduced to the minimum (and stay there). This is very good to know!
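Until such a dynamic reduction exists inside rclone, it can be approximated by hand; a rough sketch (the do_sync stub below is hypothetical and stands in for the real rclone invocation, which depends on your remote):

```shell
# Illustrative step-down loop, not an rclone feature: rerun the sync with
# fewer checkers whenever throttling shows up in the output. Replace the
# stub with a real call such as:
#   rclone sync /srv/vtiger/ onedriveitr:/BackupRUR/vtiger/ --checkers="$1" -vv 2>&1
do_sync() {
  # Stub for illustration: pretend anything above 1 checker gets throttled.
  if [ "$1" -gt 1 ]; then echo 'Too many requests'; else echo 'ok'; fi
}

stable=0
for c in 8 4 2 1; do
  if do_sync "$c" | grep -q 'Too many requests'; then
    echo "throttled at --checkers=$c, stepping down"
  else
    stable=$c
    break
  fi
done
echo "stable at --checkers=$stable"
```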

I have an rclone prototype that can do most of this and some other pacing optimizations, but it still needs some work to be ready for beta testing. Timeframe: I am struggling to find time, just like you.

I know that it sometimes feels slow, and is painfully slow in your situation, but are you able to pinpoint why you have a feeling that rclone performs too many requests/transactions?

Have you
• tried --fast-list?
• traced the rclone communication with --dump bodies?
• experienced a sync tool that did the same, but faster?

The important thing to understand about OneDrive, Google Drive and similar services is that they are priced per size only, and therefore come with limitations on the speed and number of requests you can make within an hour/day (that is, fair-usage limitations enforced by some kind of rate limiting or throttling).

If you want unlimited speed and up/download capacity, then you typically get to pay per request (one way or the other) - think Microsoft Azure or similar.

did a quick test with wasabi.
rclone sync d:\files\10000x20MB wasabi01:backupdirtest/10000x20MB --checkers=256 --transfers=256

Transferred:   	  200.005 MiB / 200.005 MiB, 100%, 40.002 MiB/s, ETA 0s
Transferred:        10000 / 10000, 100%
Elapsed time:         5.9s

This is just a feeling, because you explained to me that rclone has (really?) no better way to sync the whole tree than querying recursively (and concurrently) for folder contents. Is there really no single API to download the whole index starting from a location, with the server optimizing the transfer as much as possible (file names are highly compressible) and free of throttling by definition, because the server also decides the paging strategy? Sure, this single API call can't be parallelized, but it seems a better strategy to me for this kind of unidirectional synchronization. When the index is fully downloaded, you can then do any number of needed transfers concurrently. This is just speculation of mine: you (the rclone devs) may already have decided this is not a good strategy because of how modern clouds work. But as a matter of fact, I notice throttling also when syncing unchanged trees, where the amount of actual file data transferred is (and will be) 0 bytes.

Have you
• tried --fast-list?

This was one of the first things I tried and it didn't seem to help. Sorry, I haven't tried it more recently.

• traced the rclone communication with --dump bodies?

Sorry, no. Should I?

• experienced a sync tool that did the same, but faster?

Syncing some large whole-tree accounts from scratch with the official OneDrive client at the company doesn't seem to hang the way rclone did for me, but I can't inspect it the same way, so I can't say for sure.

This may help. Dynamically (and gradually) reducing the concurrent transfers to as low as 1 (no concurrency) as a result of being throttled may be a key requirement for surviving some aggressive OneDrive throttling policies.