Since upgrading to rclone 1.46, my automated jobs have all been getting stuck in endless 403 errors when trying to upload to Drive. The API console consistently shows close to 1,000 queries per 100 seconds from a single instance of rclone trying to upload a few hundred files to a very large Drive.
I dug into the problem and found that the jobs that hung all used the --files-from switch.
Steps to reproduce:
/mnt/media/Plex contains 646 files of varying size, from a few KB to several GB.
~/dl/filelist is a list of those files
Drive is a crypt drive backed by a Google Drive
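The command is along these lines (the crypt remote name "gcrypt:" and the destination path are placeholders for my real ones):

rclone copy --files-from ~/dl/filelist /mnt/media/Plex gcrypt:Plex -v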
Result: rclone stalls indefinitely at 0 kB/s, throwing constant 403 rate-limit errors, and eventually hits the daily cap, leading to a 24-hour ban. It never even gets as far as checking any files.
I pasted some representative logs in another thread.
The number of files varies considerably. Sometimes it is hundreds or thousands of tiny files with some big ones mixed in; other times it is fewer than a dozen, mostly large files. The issue seems to scale with the number of files: I only noticed it after finding a background job that had been hung since November (!), which prompted me to update rclone and try to clear out the backlog, which by then had grown to thousands of files.
What appears to be happening is that rclone takes some steps, before it even gets to checking files, that cause a huge spike in API calls, and it then spends most of its time being throttled. I have left it running for upwards of 9 hours with no change. The log looks almost exactly like what I pasted in the other thread, over and over and over, though sometimes a few files actually transfer before it freezes again.
It feels to me like --files-from causes rclone to loop over each file and perform some kind of API-heavy operation that does not happen otherwise.
Adding --tpslimit has no effect. Using --no-traverse does get the checking and transfers started, but with extremely low (single-digit kB/s) upload speeds.
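Concretely, the variants I tried look roughly like this (same placeholder remote; the --tpslimit value is just an example):

rclone copy --files-from ~/dl/filelist /mnt/media/Plex gcrypt:Plex --tpslimit 5 -v
rclone copy --files-from ~/dl/filelist /mnt/media/Plex gcrypt:Plex --no-traverse -v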
Switching between --files-from and just pointing it right at the directory is like magic; the latter fires up and starts checking files immediately and begins uploading right away.
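For comparison, the direct form that works is simply something like this (the --exclude pattern is only an illustration of the filters I actually use):

rclone copy /mnt/media/Plex gcrypt:Plex --exclude "*.partial" -v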
Poking around in the source, I found this comment:
// If --files-from is set then a DirTree will be constructed with just
// those files in and then walked with WalkR
Does that mean that it maps the local path from each item in --files-from onto a remote directory and then lists the contents of that directory recursively?
It is not uncommon that the list of input files spans hundreds of different sub-directories. As I understand it, from reading about --fast-list, rclone can list the entire directory structure with relatively few API calls.
Could there be a situation in which it requires far more API calls to list each of hundreds of sub-directories than just to get a list of the entire remote?
What --files-from does is find each file individually. This will typically involve getting the parent directories into the directory cache first, but more or less it will take slightly more than 1 API call per file.
If you’ve got lots of files in the --files-from list then these API calls can add up.
So to answer your question, yes, it is more than likely that at some point --files-from will be slower than doing a directory list. In v1.44 that is exactly what it did, but in 1.45 I switched to finding the files individually, as that is faster in the common case of copying just a few files out of a very large collection of files.
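To put rough numbers on it, assuming the 646 files from the report above each need about one info query plus a few parent lookups:

Expected API calls ≈ 646 + parent lookups, before any data is transferred

and the retries after each 403 multiply that, which lines up with the near-1,000 queries per 100 seconds showing in the API console.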
It might be that there should be a flag or a heuristic to control that behaviour.
At the point where --files-from becomes slower than just doing a straight copy, you should probably just do a straight copy.
From my testing, Google really doesn't like the transactions rclone does for --files-from; I don't know why.
I can definitely attest that Google does not like whatever rclone is doing!
But what you describe is, indeed, exactly my problem: I build large lists of files and then use separate calls to rclone to send the files to various endpoints. I have no problem with SSH endpoints, only Drive. The problem only cropped up because a particularly large set of files showed up, which created a backlog and made things worse.
The reason I have to do it this way is that I send the list of files to a remote server that needs to know exactly what rclone has synced, and a fixed list is the easiest way to ensure that a mistake isn't made when a new file pops into the source directory during one of the copy operations.
I realize that I am probably an edge case, but it would be great to have a switch that makes --files-from operate exactly as it does when I point rclone directly at the source directory with --exclude filters.
I made a test version with a flag called --files-from-traverse which, when set, does the directory traversal as before. I tried it with Drive and it worked well!
It seems to be working exactly as 1.44 did — transfers started up right away using --files-from, --files-from-traverse and --fast-list together.
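For the record, the combination I tested looks roughly like this (remote name and paths are placeholders, and --files-from-traverse only exists in the test build):

rclone copy --files-from ~/dl/filelist /mnt/media/Plex gcrypt:Plex --files-from-traverse --fast-list -v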
Perhaps the switch should reflect the usage case. For instance, rclone could automatically use the old traverse method when the list of inputs is greater than 10, and then you could have a switch to change that threshold, like --files-from-threshold. That way you could set a reasonable cut-off in a script and get the best of both worlds: fast transfers when the list is small, and not choking the API when the list is large.
Rclone could work out the number of distinct directories referenced in the --files-from list - call it D - along with the number of files, N.
For Google Drive:
Expected API calls without traversing = N + P
Expected API calls with traversing = D + P
Where P is the number of API calls it takes to find the IDs of the parents…
So if D < N, i.e. some files share a directory, then traverse.
By this logic rclone should always traverse (except for N==D, in which case the API calls are the same, so you might as well traverse there too), which I don't think is correct! Perhaps rclone should use a heuristic like: traverse only if D < 0.5*N.
This doesn't take the data transfer times into account - directory listings are bigger than a lookup for a single file.
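To illustrate with made-up numbers - say N = 600 files spread across D = 50 directories:

Expected API calls without traversing ≈ 600 + P
Expected API calls with traversing ≈ 50 + P

so traversing wins comfortably there, and the two only meet when every file sits in its own directory; it is the larger listing payloads that tip the balance back the other way for small jobs.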
I think the Google Drive case is pathological for some reason though - the only way of finding an individual object is effectively to run a query against a directory listing for the object name. Google presumably sees that as no different from doing a directory listing, so it rate-limits it the same way.
That would, of course, be awesome, but it would still be nice to have a manual override because Google seems to make changes without warning that completely break things that had been working without a hiccup for months.
If I understand the logic correctly, that makes sense: it is not so much about the absolute number of API calls as about doing things quickly when the set of files is relatively small, while making sure the job actually completes when the set is large enough to go over the API cap - balancing speed against reliability.
For my use case, data transfer times are not particularly important, but reliability is. This whole issue cropped up because rclone stalled in a cron job in the background, leading to a backlog of files in the local cache, which further increased the number of files that needed to be uploaded, compounding the problem.
The cron job will send me a push notification if rclone exits with an error, but the particular way it failed was tricky, since rclone never hit any timeout limit on a connection. It literally sat in the background for two months transferring no data, because my script is not set up to worry about how long it runs. It seems like rclone should also be able to work out that it is in an unwinnable fight with the API limiter and exit with an error.
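In the meantime I can at least guard the cron job with a hard time limit, assuming GNU coreutils timeout is available; notify-push here is a stand-in for my real notification hook:

timeout 6h rclone copy --files-from ~/dl/filelist /mnt/media/Plex gcrypt:Plex -v
if [ $? -ne 0 ]; then
    notify-push "rclone failed or ran past the 6 hour limit"
fi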
You’ve been using --fast-list with the --files-from-traverse flag?
I've been thinking about this further. I wonder whether it might be best for the backend to give a hint as to whether traversing or not is preferred with a --files-from list.
I could perhaps run a test with the backends to discover it.
For instance, with the S3 backend I'm pretty sure not traversing is always quicker…
I've fixed this here - testing appreciated. I changed tack and made the default behaviour the same as before, and you'll need the --no-traverse flag to enable the new behaviour. So in your case you don't want the --no-traverse flag.
filter: Make --files-from traverse as before unless --no-traverse is set
In c5ac96e9e7 we made --files-from only read the objects specified and
not scan directories.
This caused problems with Google drive (very very slow) and B2
(excessive API consumption) so it was decided to make the old
behaviour (traversing the directories) the default with --files-from
and use the existing --no-traverse flag (which has exactly the right
semantics) to enable the new non scanning behaviour.
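So with that change, the plain form goes back to traversing by default, e.g. (placeholder remote name):

rclone copy --files-from ~/dl/filelist /mnt/media/Plex gcrypt:Plex -v

and adding --no-traverse opts back into the per-file lookups for the cases where that is quicker.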