Using --files-from with Drive hammers the API


#1

Since upgrading to rclone 1.46 my automated jobs have all gotten stuck with endless 403 errors when trying to upload to Drive. The API console consistently shows queries per 100 seconds approaching 1000 from a single instance of rclone trying to upload a few hundred files to a very large Drive.

I dug into the problem and found that the jobs that hung all used the --files-from switch.

Steps to reproduce:

/mnt/media/Plex contains 646 files of varying size, from a few KB to several GB.
~/dl/filelist is a list of those files
Drive is a crypt drive backed by a Google Drive
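
For anyone reproducing this, a list like ~/dl/filelist could be generated roughly as follows. This is only a sketch: `make_filelist` is a hypothetical helper, and it assumes the list should hold one path per line, relative to the source root that is later passed to rclone.

```shell
# Hypothetical helper: print every regular file under a source root,
# one per line, as a path relative to that root.
make_filelist() {
  # Run in a subshell so the caller's working directory is untouched.
  # find prints ./path; sed strips the leading "./"; sort keeps runs stable.
  (cd "$1" && find . -type f | sed 's|^\./||' | sort)
}

# Example (not executed here):
#   make_filelist /mnt/media/Plex > ~/dl/filelist
```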

rclone copy -vvP --files-from ~/dl/filelist /mnt/media/Plex/ Drive:

Result: rclone stalls forever at 0 kb/s throwing constant 403 rate limit errors and eventually hits the daily cap leading to a 24 hour ban. It never even checks any files.

Transferred:             0 / 0 Bytes, -, 0 Bytes/s, ETA -
Errors:                 0
Checks:                 0 / 0, -
Transferred:            0 / 0, -

Pointing rclone directly at the local directory:

rclone copy -vvP /mnt/media/Plex/ Drive:

Result: rclone immediately starts checking files, copies everything normally with very few 403 errors.

I have reproduced this bug on two different machines, both running rclone 1.46 release and 16.04.1-Ubuntu.

Related post.


#2

I just checked and the bug goes back to v1.45.
I had to roll back to v1.44 to get --files-from working properly with Drive.

Apparently I had skipped over the 1.45 update!


#3

I suspect this is the problem, since it was released in v1.45

I’m not sure why it would give you a problem though…

How many files do you typically have in ~/dl/filelist?

Can you show a log with -vv of the problem?


#4

I pasted some representative logs in another thread.

The number of files varies considerably. Sometimes it is hundreds or thousands of tiny files with some big ones mixed in, other times it is fewer than a dozen, mostly large files. It seems like the issue scales with the number of files because I only noticed it when I found a background job that had hung since November (!), which prompted me to update rclone and try to clear out the backlog, which had grown to thousands of files.

It appears that rclone takes some steps before it even gets to checking files, and those steps cause a huge spike in API calls; it then spends most of its time being throttled. I have left it running for upwards of 9 hours with no change. The log looks pretty much exactly like what I pasted in the other thread, over and over and over, though sometimes a few files might actually transfer before it freezes again.

It feels to me like --files-from is causing it to loop over each file and perform some kind of API-heavy operation that does not happen otherwise.

Adding --tpslimit has no effect, but using --no-traverse results in the checking / transfers starting, but with extremely low (single kb/s) upload speeds.

Switching between --files-from and just pointing it right at the directory is like magic; the latter fires up and starts checking files immediately and begins uploading right away.


#5

It says in the code comments

// If --files-from is set then a DirTree will be constructed with just
// those files in and then walked with WalkR

Does that mean that it maps the local path from each item in --files-from on to a remote directory and then lists the contents of that directory recursively?

It is not uncommon that the list of input files spans hundreds of different sub-directories. As I understand it, from reading about --fast-list, rclone can list the entire directory structure with relatively few API calls.

Could there be a situation in which it requires far more API calls to list each of hundreds of sub-directories than just to get a list of the entire remote?


#6

What --files-from does is find each file individually. This will typically involve getting the parent directories in the directory cache first, but more or less it will take slightly more than 1 API call per file.

If you’ve got lots of files in the --files-from list then these API calls can add up.

So to answer your question: yes, it is more than likely that at some point --files-from will be slower than doing a directory list. In v1.44 a directory list is exactly what it does, but in v1.45 I switched to finding the files individually, as this is faster in the common case of copying just a few files into a very large set of files.

It might be that there should be a flag to control that behavior or a heuristic.

At the point that --files-from becomes slower than just doing a straight copy, you should probably just do a straight copy.

From my testing, Google really doesn’t like the transactions rclone is doing for --files-from, I don’t know why.


#7

I can definitely attest that Google does not like whatever rclone is doing!

But what you describe is, indeed, exactly my problem; I build large lists of files and then use separate calls to rclone to send the files to various endpoints. I have no problem with SSH endpoints, only Drive. The problem only cropped up because a particularly large set of files showed up, which caused a back-log, only exacerbating the problem.

The reason that I have to do it this way is that I send the list of files to a remote server that needs to know exactly what rclone has synced and that is the easiest way to ensure that a mistake isn’t made when a new file pops into the source directory during one of the copy operations.

I realize that I am probably an edge case, but it would be great to have a switch to have --files-from operate exactly as it does when I point rclone directly at the source directory with --exclude filters.
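
One possible way to approximate that on an existing release is to turn the list into include filter rules and pass them with --include-from, so that rclone filters a directory listing instead of looking up each file. This is a sketch under assumptions, not a confirmed fix: `filelist_to_include` is a hypothetical helper, and it assumes that filter rules anchored with a leading / and with glob characters backslash-escaped match exactly the listed paths.

```shell
# Hypothetical helper: convert a --files-from style list (paths relative to
# the source root) into anchored include patterns, one per line.
filelist_to_include() {
  # 1st expression: backslash-escape glob metacharacters so each line is
  #                 treated as a literal path rather than a pattern.
  # 2nd expression: ensure exactly one leading "/" to anchor at the root.
  sed -e 's/[][*?{}\\]/\\&/g' -e 's|^/*|/|' "$1"
}

# Example (not executed here; assumes the same paths as in post #1):
#   filelist_to_include ~/dl/filelist > ~/dl/includelist
#   rclone copy -vvP --include-from ~/dl/includelist /mnt/media/Plex/ Drive:
```

Lines that begin with # or ; may be read as comments in a filter file, so a list containing such names would need extra care.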


#8

I made a test with a flag called --files-from-traverse which when set will do the directory traversal like before. I tried it with drive and it worked well!

https://beta.rclone.org/branch/v1.46.0-019-gf97cbd3b-fix-files-from-traverse-beta/ (uploaded in 15-30 mins)

Can you let me know if it works for you? And maybe a better suggestion for the name other than --files-from-traverse :wink:


#9

Wow, that was quick!

I installed it on the server. I’ll keep an eye on it and let you know how it works.


#10

It seems to be working exactly as 1.44 did — transfers started up right away using --files-from, --files-from-traverse and --fast-list together.

Perhaps the switch should reflect the use case. For instance, rclone could automatically use the old traverse method when the list of inputs is greater than 10, with a switch to change that threshold, like --files-from-threshold. That way you could set a reasonable cut-off in a script and get the best of both worlds: fast transfers when the list is small, and not choking the API when the list is large.
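
Until something like that exists inside rclone, the same cut-off could be approximated in the calling script. A rough sketch: `files_from_flags` is a hypothetical helper, --files-from-traverse is the flag from the test beta above (not a released option), and the default of 10 is just the example number from this post.

```shell
# Hypothetical helper: print the extra flag to use for a given list file.
# Prints nothing when the list is at or below the threshold.
files_from_flags() {
  ff_list=$1
  ff_threshold=${2:-10}
  # Count non-empty lines, i.e. the number of files in the list.
  # grep exits non-zero when the count is 0, hence the "|| true".
  ff_n=$(grep -c . "$ff_list" || true)
  if [ "$ff_n" -gt "$ff_threshold" ]; then
    printf '%s\n' "--files-from-traverse"
  fi
}

# Example (not executed here; the substitution is left unquoted on purpose
# so that an empty result adds no argument):
#   rclone copy --files-from ~/dl/filelist $(files_from_flags ~/dl/filelist) \
#     /mnt/media/Plex/ Drive:
```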


#11

Great :slight_smile:

That is a reasonable idea!

It might be possible to do it automatically…

Rclone could work out the number of distinct directories referenced in the --files-from list (call it D) and the number of files (call it N).

For google drive:
Expected API calls without traversing = N + P
Expected API calls with traversing = D + P

Where P is the number of API calls it takes to find the IDs of the parents…

So if D < N, i.e. some files are in the same directory, then traverse.

By this logic rclone should always traverse (except for N==D in which case the API calls are the same, so might as well traverse there too), which I don’t think is correct! Perhaps rclone should use a heuristic like if D < 0.5*N then traverse.

This doesn’t take into account the data transfer times: a directory listing is a bigger transfer than a lookup of just one file.

I think the Google Drive case is pathological for some reason though: the only way of finding an individual object is by effectively doing a query on a directory listing for the object name. Google will see that as no different to doing a directory listing, so it is rate limiting the directory listings.
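
To get a feel for N and D on a real list, they can be counted outside rclone. A minimal sketch: `count_files_and_dirs` is a hypothetical helper (not part of rclone), and it assumes one path per line with / as the separator.

```shell
# Hypothetical helper: print "N D" for a list file, where N is the number
# of non-blank lines and D is the number of distinct parent directories.
count_files_and_dirs() {
  awk 'NF {
         n++
         d = $0
         # Strip the final "/name" component; a path with no "/" lives in
         # the root of the transfer, recorded here as ".".
         if (sub("/[^/]*$", "", d) == 0) d = "."
         dirs[d] = 1
       }
       END {
         c = 0
         for (k in dirs) c++
         printf "%d %d\n", n, c
       }' "$1"
}

# Example (not executed here):
#   count_files_and_dirs ~/dl/filelist
```

Under the estimate above, the two numbers compare N + P calls without traversing against D + P calls with traversing, so N - D is roughly the number of calls traversing would save.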


#12

That would, of course, be awesome, but it would still be nice to have a manual override because Google seems to make changes without warning that completely break things that had been working without a hiccup for months.

If I understand the logic correctly, that would make sense: it is not so much about the absolute number of API calls as about doing things quickly with relatively small numbers of files, while ensuring that they actually get done when the set of files is large enough to go over the API cap. In other words, balancing speed and reliability.

For my use-case data transfer times are not particularly important, but reliability is. This whole issue cropped up because rclone stalled in a cron job in the background, leading to a backlog of files in the local cache, which further increased the number of files that needed to be uploaded, compounding the problem.

The cron job will send me a push notification if rclone exits with an error, but the particular way that it failed was tricky since rclone never hit any timeout limit on a connection. It literally sat in the background for two months, transferring no data, because my script is not set up to worry about how long it runs. It seems like rclone should also be able to figure out when to exit with an error once it gets into an unwinnable fight with the API limiter.
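
In the meantime, the calling script could impose its own deadline so that a stalled run turns into a non-zero exit the cron job can report. A minimal sketch, assuming GNU coreutils `timeout` is installed; `run_with_deadline`, the 6h limit and `notify` are hypothetical and not part of rclone.

```shell
# Hypothetical wrapper: run a command with a wall-clock limit.
# Returns 124 if the limit was reached (the coreutils `timeout` convention),
# otherwise the exit status of the command itself.
run_with_deadline() {
  rwd_limit=$1
  shift
  timeout "$rwd_limit" "$@"
}

# Example (not executed here; `notify` stands in for whatever sends the
# push notification):
#   run_with_deadline 6h rclone copy --files-from ~/dl/filelist \
#     /mnt/media/Plex/ Drive: || notify "rclone failed or timed out"
```

This does not make rclone itself notice the throttling; it only bounds how long a stuck job can sit unnoticed.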