Fast-List overwriting and endless sync with gsuite/drive

#1

I’ve got a folder with a substantial number of files (upwards of 2.8M, approx 150GB).
It has two subdirectories, one with approx 1.8M files and the rest in the other. Inside each of those there are approx 13000 folders.
In order to sync it in manageable chunks (each team drive caps out at approx 390k files) I’ve split the sync into sections using --include, each going to a separate team drive, capping out at around 2000 folders in each sync operation.

I’m running into 3 issues with this. The only one that I think is a bug is #2; the rest are configuration related (e.g. there’s probably a better-suited command for what I’m doing).

  1. A standard sync transfers properly but does not actually finish. I suspect that it’s scanning the remaining 1.7M files in the same subdirectory instead of just the 94k I asked it to. The initial scan only took 15 minutes, but it’s now past 2 hours with no change. Is there a better way to make it only look at a section of folders than using --include?

Transferred: 21.081M / 21.081 MBytes, 100%, 6.436 kBytes/s, ETA 0s
Errors: 0
Checks: 93774 / 93774, 100%
Transferred: 248 / 248, 100%
Elapsed time: 2h6m23.1s

  2. Using --fast-list vastly improves the time it takes to do the initial directory scan of the 94k files (3 minutes instead of 15), but once it’s done doing so, it just starts uploading, even if the file was already on the drive.
    It doesn’t “check” any of the files against each other before doing so.

  3. Performance. I’m getting nowhere close to the 2-3 transfers per second I would expect to cap out at. Bandwidth-wise I’m limited to around 120K/s peak, so I limited each sync operation to 30K to make it somewhat manageable. Any time a large file hits the queue it obviously slows down the transfers per second until it completes, but the vast majority of these files are tiny, under 50 bytes.

The best setup I’ve gotten so far is with 3 sync operations running at once: the first one for a folder with almost exclusively 500K+ files at the default 4 transfers at a time, the other two on the smaller files with 20 transfers each. They go in bursts: I’ll see all 20 files in the queue at 0%, then 60 seconds later all 20 will swap to 100% at once and finish within 1-2 seconds, then the other sync operation will do the same thing. Overall I was able to transfer 400k files in just over a week, which only works out to 0.6 files per second, nowhere close to the expected rate limit on google drive’s end. Lowering the active transfer amount or running only a single sync operation without --bwlimit at a time was proportionally slower. I am not getting any rate limit or throttle warnings currently, so I suspect this is purely on my end, not google’s. Is my overall bandwidth too low to make hitting 3-4tps feasible?
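As a rough check of those figures, here is a small shell sketch. It assumes exactly 7 days (the post says "just over a week") and, for the burst rate, a cycle of roughly 61.5 seconds per burst of 20 files (60 seconds of waiting plus the 1-2 seconds to finish). Both inputs are approximations, not measured values.

```shell
# Overall rate: 400k files in roughly one week (assumed to be exactly 7 days).
files=400000
seconds=$((7 * 24 * 60 * 60))
overall=$(awk -v f="$files" -v s="$seconds" 'BEGIN { printf "%.2f", f / s }')

# Burst rate for one 20-transfer sync: 20 files per ~61.5 s (assumed cycle time).
burst=$(awk 'BEGIN { printf "%.2f", 20 / 61.5 }')

echo "overall: $overall files/s"
echo "one 20-transfer sync: $burst files/s"
```

With these inputs the overall figure comes out slightly above the 0.6 files per second quoted, and two such burst-limited syncs would land in the same range.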

Here’s an example of the command I’m using. For problem #2 the only change is adding the --fast-list to the end.

rclone sync E:\MainDirectory\Subdirectory1 Drive1_14-16k:Subdirectory1 --include Q1[45][0-9][0-9][0-9]/** --bwlimit 30k -P --transfers 20

The intent of the --include line is that I want it to upload everything from Q14000 to Q15999 in this sync. I have another one for Q16000 to Q17999 and so on. Due to 3rd party/proprietary software I can’t easily mess with the existing folder structure to split them out at this time.
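To illustrate which folder names that pattern is meant to select, here is a small sketch that uses the shell's own case globbing as a stand-in. It assumes the bracket classes behave the same way in rclone's filter syntax for a pattern this simple; the function name in_batch is hypothetical.

```shell
# in_batch NAME: succeed if NAME falls in the Q14000-Q15999 batch.
# Shell case globbing stands in for rclone's filter matching here
# (an assumption for this simple pattern).
in_batch() {
    case "$1" in
        Q1[45][0-9][0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

for name in Q13999 Q14000 Q15999 Q16000; do
    if in_batch "$name"; then
        echo "$name: in this batch"
    else
        echo "$name: not in this batch"
    fi
done
```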

#2

Did you use --fast-list with this? --fast-list does a complete scan then does the filtering afterward which might explain these results.

Or not rooting your --include might explain them too - see later.

rclone should work identically with or without --fast-list. I suspect this might be caused by duplicate directories (as in two directories with the same name in the same folder). I suggest you run rclone dedupe to see if that is the case.

Are you using your own client_id? If not then getting your own will help with the transfer speed.

You want to root your --include with a leading /, assuming those folders are in the root; otherwise rclone will be looking for folders called QXXXX in any folder (including inside all the other Q folders).
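A sketch of the rooted form of the command, reusing the flags from the first post. The sync_batch wrapper and the RCLONE override variable are hypothetical conveniences so the command can be previewed without running it; the only actual change is the leading / in the --include pattern.

```shell
# Set RCLONE=echo to print the command instead of running it.
RCLONE="${RCLONE:-rclone}"

# sync_batch SRC DST PATTERN: sync only the top-level folders matching PATTERN.
# The leading / anchors PATTERN at the root of the transfer, so it no longer
# matches same-named folders nested deeper in the tree.
sync_batch() {
    "$RCLONE" sync "$1" "$2" --include "/$3/**" --bwlimit 30k -P --transfers 20
}

# Example (not run here):
# sync_batch 'E:\MainDirectory\Subdirectory1' 'Drive1_14-16k:Subdirectory1' 'Q1[45][0-9][0-9][0-9]'
```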

#3

Adding the leading / doesn’t seem to have had any effect; same delay afterwards. Makes sense for it to be there though.

Running a dedupe now; it’s just been sitting at a blank cursor for 15 mins. I’ll let it run overnight and see what happens.
I don’t see any duplicate folders in either the google drive web UI or in Drive stream. Even with the leading /, only the --fast-list version tries to upload files in sections I’ve fully uploaded.

I tested the same 94k folder as previously. It “checked” 1125 files and then started re-uploading. If I remove the --fast-list entry it checks the full 94000, then just counts up the timer at 100% completed.

I hadn’t tried my own ID; I hadn’t considered it because I’m not getting any rate-limit errors at all. Will try tomorrow.

#4

Well, the dedupe finished overnight with no status/msg of any kind. I gather that means it didn’t find any duplicates.

I’ve got my own ID in there as well, with no noticeable change in performance. At this point I’m pretty sure it’s just my bandwidth limitations.

There is however a bit of a difference with --fast-list now. On the exact same command I now have it check 90774 files and then try to upload 3248. Is it possible this is related to timestamp accuracy somehow? Does the recursive list return one less decimal place or something like that?
If that’s the case I’ll just re-upload those last few and use --fast-list for everything from now on. The performance benefit seems to be worth the time to re-upload this once.

I went and double checked some of the files that it’s trying to re-upload and they are definitely already there.

#5

--fast-list does the same thing when it reaches the end. It just keeps counting up at 100% with an ETA of 0s.

Transferred: 102.101M / 102.101 MBytes, 100%, 18.268 kBytes/s, ETA 0s
Errors: 0
Checks: 32683 / 32683, 100%
Transferred: 1156 / 1156, 100%
Elapsed time: 1h35m23.2s

And re-running that exact same command after it finished, it’s already trying to re-upload 802 files.

Transferred: 1.428M / 79.695 MBytes, 2%, 13.698 kBytes/s, ETA 1h37m30s
Errors: 0
Checks: 33037 / 33037, 100%
Transferred: 7 / 802, 1%
Elapsed time: 1m46.7s

#6

Did you run it with -v? It should have printed stuff about duplicate directories if that was the problem.

I’d like to see a log of what you are doing with -vv with and without --fast-list - can you post them somewhere? Or alternatively email them to me nick@craig-wood.com - put a link to this page in please.

#7

Edit - Dedupe with -v finished with only a single entry “Google drive root ‘Directory1’: Looking for duplicates using interactive mode.”

I’ll see if I can figure out a way to capture the output of sync. In about a second I had several thousand lines go by with -vv.
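One way to capture that output is rclone's --log-file flag, which writes the log to a file instead of the terminal. A minimal sketch follows; the run_logged wrapper, the RCLONE override variable, and the log file names are hypothetical, and SRC and DST stand for the paths used in the sync command.

```shell
# Set RCLONE=echo to print the command instead of running it.
RCLONE="${RCLONE:-rclone}"

# run_logged LOGFILE ARGS...: run rclone with debug logging (-vv) sent to LOGFILE.
run_logged() {
    logfile=$1
    shift
    "$RCLONE" "$@" -vv --log-file="$logfile"
}

# Example (not run here): one log without --fast-list and one with it.
# run_logged sync-plain.log sync SRC DST --include '/Q1[45][0-9][0-9][0-9]/**'
# run_logged sync-fast.log  sync SRC DST --include '/Q1[45][0-9][0-9][0-9]/**' --fast-list
```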

#8

Alright, logged and emailed.

#9

Thanks for the logs.

Here are some things I observed…

In each run the total of checks + transferred is 33839 so at least that is consistent.
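That total can be checked against the two progress summaries posted in #5, assuming the logged runs are the same ones summarized there (checks plus transferred for each run):

```shell
# Run 1 (first --fast-list sync in #5): 32683 checks, 1156 transferred.
run1=$((32683 + 1156))
# Run 2 (immediate re-run in #5): 33037 checks, 802 transferred.
run2=$((33037 + 802))

echo "run 1: $run1, run 2: $run2"
```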

What it looks like is that --fast-list is missing some of the files. We did find a bug which could cause this which was fixed in 1.46 (released on Saturday) so it would be worth trying that - see the latest release.

I think creating all the empty directories each time is probably this bug: https://github.com/ncw/rclone/issues/2869 - rclone is creating all the directories that have excluded files in them. Does that look correct?

#10

The directories being created at the end are ones I would expect to be there (they’re empty directories inside the ones I match using --include). I wonder if maybe google drive isn’t reporting them in the list because they are empty? I’d be okay with them not being copied if that’s what it comes down to. (Whether by default or specified option). With that in mind, it might be related, but it’s not creating any directories it’s not supposed to (I’m also not currently using --exclude)

I’ll give that update a try and see what it does.

#11

Looks better with 1.46 but still re-uploading some.

14xxx-15999 worked fine twice in a row, so I tested the next batch in order. On 16000-17999 it re-uploaded 373, then 175 and 175 on each successive attempt (the same files on the last attempts too, which was odd, most of which were from J16610). So it’s possible something about those specific files is causing it.

#12

OK…

I think it is probably doing more work than it needs to.

The directories should be reported regardless of whether they are empty or not.

I just noticed you are using a Team drive. In my testing when I originally did Team drive support, I noticed exactly this sort of problem. I put it down to eventual consistency on team drives - the uploads taking a while to appear in the listings. Do you think this is the problem now?

We can test this though…

If you run

rclone lsf -R --include '...'  remote:...  | sort > list1

with your --include list, then try that with --fast-list to a different file. Try it a couple of times with --fast-list. I suspect you are going to see different results.
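A sketch of that comparison, assuming the two listings have already been saved to sorted files (for example list1 from the plain run and list2 from a --fast-list run; the name list2 and the helper name missing_from are hypothetical). It uses comm to print entries present in the first listing but absent from the second.

```shell
# missing_from FULL PARTIAL: print lines that appear in FULL but not in PARTIAL.
# Both files must be sorted the same way (e.g. with LC_ALL=C sort).
missing_from() {
    LC_ALL=C comm -23 "$1" "$2"
}

# Example (not run here), after producing list1 without --fast-list and
# list2 with it:
# missing_from list1 list2
```

Any output would be a file or directory that the --fast-list listing did not return at the time it was taken.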

If you’ve got a specific directory which always shows the problem then can you email me the results of rclone lsf -R --fast-list remote:dir -vv --dump bodies which will probably be quite big!

#13

It’s gotta be that file delay that’s causing it.

I re-ran both of the ones that acted up yesterday and they’re 100% now without any re-uploading.

So I guess with --fast-list enabled I just need to compensate for that by not running it back to back (shouldn’t be a problem once my initial upload is completed and I switch it to a scheduled service).

I suppose team drives are caching their full lists for some period of time (I would guess as much as an hour).

I guess that means the only thing that’s really an issue is that it’s trying to re-create the empty directories at the end. If they already exist on the drive the process is quick, but if not, it appears to be doing nothing on the progress display while it does so (around 1s per folder in my situation).

#14

Strange that it should be the --fast-list that gets cached and not the other list.

The directories it creates - are they in the root of the --include?

Any directories that rclone lists should be in the directory cache already so mkdir should be instant.

Ah, I wonder if --fast-list isn’t adding them to the directory cache.

Do you see this long pause creating directories if you don’t use --fast-list?

#15

“The directories it creates - are they in the root of the --include?”

Subdirectories of the ones matched with --include, e.g.

J16123\folder

“Do you see this long pause creating directories if you don’t use --fast-list?”

The pause only appears the first time these directories are actually created. The second time through it’s able to do 1000+ in under a second. (although the log still says creating)

Edit - to clarify, no. It seems to be the same duration with and without --fast-list, purely dependent on whether it’s the first run or not.

#16

Ah OK, so it sounds like it is working as intended, creating the empty directories on the first run only.

This is perhaps a bit misleading and I think it could probably be fixed.

I made an issue about it: https://github.com/ncw/rclone/issues/2977

#17

Sounds good. I would propose the following for the inverse scenario then (what brought this up in the first place)

Adding an entry to -P to show folders that still need to be created or counting folders in the same tally that does files (So it doesn’t look like it stalled out at 100% progress, when it’s still working)

Also maybe a warning when using --fast-list with team drives.
Something like this?

Warning: Rclone has detected that you’re using --fast-list with sync/copy/move to or from a Google Team Drive remote. Due to the way Team Drives currently cache the full directory listing, there can be situations where Rclone is unable to see recently created files on the drive, which can cause the command to miss or re-copy files. Running the command without --fast-list should allow Rclone to properly detect all files on the remote. Waiting a currently unknown amount of time after a file has been uploaded (estimated to be less than an hour) should also be effective.

#18

Maybe I could count these as Other or something like that… Normally rclone creates directories as part of the sync. Maybe doing that would be best rather than waiting until the end.

Fancy sending a PR for that? The only file that needs patching is drive.md?

#19

Yup, I can do that. Will take a read over it and likely submit later on this week.

#20

For the benefit of anyone else who comes across this.

The behavior of Team Drives and --fast-list appears to be worse than I initially thought.

I had to move 200k files between two team drives and adjust my scripts to match. It’s been over 2 days now and there are still files not showing up when using --fast-list on the new location. It seems to be indexing files at approx 1500-2000/hour. I don’t know if this speed is per account or per drive. I’ve got 8 such drives all currently in this state. I suspect there is a background operation in google drive that can only process the files so fast before adding them to the main list.
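For a sense of scale, here is a rough estimate of how long 200k files would take to appear at that pace. It assumes the 1500-2000 files/hour rate is steady and applies to this one batch on its own, which the observations above do not establish; the helper name index_hours is hypothetical.

```shell
# index_hours FILES RATE: hours needed to index FILES at RATE files per hour.
index_hours() {
    awk -v f="$1" -v r="$2" 'BEGIN { printf "%.0f", f / r }'
}

echo "at 2000/hour: $(index_hours 200000 2000) hours"
echo "at 1500/hour: $(index_hours 200000 1500) hours"
```

Under those assumptions the batch would need roughly four to five and a half days, which fits with files still being missing after two days.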

Syncing without --fast-list still operates normally, and assuming nothing goes wrong I should be able to start using --fast-list again once google catches up.
