[help] Big cloud storage, sync to local in parts

What is the problem you are having with rclone?

Hi,

I have about 2M files on my Google Drive spread across multiple folders (mostly .pickle files from a Python script) that I need to get out of Google Drive, since keeping them there was hindering the usability of the code.

I'm currently using something like:
/usr/bin/rclone copy --ignore-existing --verbose --transfers 50 --checkers 15 --contimeout 120s --timeout 500s --retries 30 --low-level-retries 10 --stats 1s filestreamrclone_drive:<> /home/<>/<>
Courtesy of rclone-browser. However, there are many folders, since the Python venv was also there. I was only able to get rclone working with my Google Drive yesterday, because I had been assuming that Google Cloud Storage was the equivalent of File Stream.

What I would like to know is whether there is a way for me to get a full or partial file list from the remote and then feed it to rclone, so that it does not have to scan all the files and folders on the remote to find files to transfer. Basically, so that I can "resume" from a previous state instead of having to start all over again.

The remote is currently static: no files are being added, and likely none will be.
I have my own client_id and client_secret.

Thanks in advance!
Cheers,
Carlos

What is your rclone version (output from rclone version)

rclone v1.48.0

Which OS you are using and how many bits (eg Windows 7, 64 bit)

ElementaryOS, 64-bit

  • os/arch: linux/amd64
  • go version: go1.12.6

Which cloud storage system are you using? (eg Google Drive)

Google Drive Business

If you've got enough memory, add --fast-list to the transfer - this will cause it to work much faster!

You can list the files with rclone lsf -R --files-only filestreamrclone_drive: and you can feed them back into rclone with --files-from.
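
Something like this, as an untested sketch (files.txt is just an example name, and <> stands for the same placeholders as in your original command):

rclone lsf -R --files-only filestreamrclone_drive:<> > files.txt
rclone copy --ignore-existing --files-from files.txt filestreamrclone_drive:<> /home/<>/<>

The paths in files.txt are relative to whatever you listed, so use the same source path in both commands.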

However, you'll still be doing listings unless you add the --no-traverse flag, and Google Drive really hates that for some reason - it will go very, very slowly with it.

So I think your best bet is --fast-list.
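
In other words, your original command with --fast-list added (a sketch reusing your own flags and placeholder paths):

rclone copy --fast-list --ignore-existing --verbose --transfers 50 --checkers 15 --contimeout 120s --timeout 500s --retries 30 --low-level-retries 10 --stats 1s filestreamrclone_drive:<> /home/<>/<>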

Nick, I've been meaning to ask you about this anyway and I think it is relevant here.
Does --fast-list still count as the same number of API calls, except it just gets packaged up into one transaction? Also - are you sure --fast-list obeys the drive-pacer? It seems to me like it does not.

I've been noticing that as a Gdrive grows with more files, it seems to have a greater chance of the --fast-list stalling completely, so the operation never finishes. And when the --fast-list does complete, it always seems to be followed by a lot of 403 rate-limit errors. The Google API site's traffic graph (although it is not super accurate in terms of timing) also seems to spike drastically during a --fast-list, as if there are a bunch of API calls happening all at once.

As I approach 40K objects, I feel like --fast-list is becoming too unreliable for me to use in automatic scripts that can't afford the risk of stalling. With 2 million files (I thought the default max was 400K), this problem (if it is real, as I suspect) would be massively exacerbated.

If I am onto something here, a solution might be to make --fast-list work in a "chunked" fashion, with some reasonable limit on how much it does at once rather than literally listing the entire drive in one go. It would still be way faster than a normal listing, I imagine. An even better solution would be to make it obey the drive pacer and not package together more API calls than it knows it is allowed to use.

Maybe I am way off on this one, but this unexpected behavior has been bugging me for a while.

No, I don't think so. It will do significantly fewer API calls than a normal listing, normally by a factor of 10 or so.

I checked the code - it definitely uses the pacer. If you collect logs with -vv --dump headers, you should be able to see whether it is working or not. Adding --log-format time,microseconds will help with exact timings.

Doing

rclone size drive: -vv --fast-list --dump headers --log-format time,microseconds 2>&1 | tee /tmp/drive.log
grep REQUEST /tmp/drive.log | less

There seem to be about 10 transactions per second.

Interesting... Do you know why --fast-list is stalling? Is it just lots of retries? You could try changing --drive-pacer-min-sleep or setting --tpslimit.
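
For example, on a listing like the one above (the values here are only illustrative starting points, not tested recommendations):

rclone size drive: --fast-list --tpslimit 5
rclone size drive: --fast-list --drive-pacer-min-sleep 200ms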

Fast list does have some internal chunking which isn't exposed as config, but you could twiddle with those values in the source.

It might be that those numbers need tuning, as Google's rate limits have changed since @B4dM4n wrote the original code.

The amount of parallel work is controlled by --checkers, so you could try decreasing that and see what difference it makes.
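
For example (4 is just an illustrative value; the default is 8):

rclone size drive: --fast-list --checkers 4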

If you asked me to guess, I'd say Google have implemented some sort of rate limiting for list commands when you issue too many of them in a row...

I don't think either disabling burst or significantly lowering tpslimit ended up helping for me, but I need to re-run some proper tests on this and give you some hard info to work with. It's just going to take some time to collect examples, as I only see stalls maybe 5% of the time.

I will contact you about this later. For now I don't want to derail this thread further with this tangent 🙂
