How to use dedupe on gdrive correctly?


#1

Hello,

I’ve stumbled upon a problem. I have a gdrive account with over 22 TB. I’m trying to upload a folder from my root server (Arch) to gdrive, but I get spammed with ‘found dupe file and ignoring it’ messages. I searched Google for a solution and found the ‘dedupe’ command, but I am not quite sure how to use it correctly.

If I use the command:

./rclone dedupe remote: -vv

I instantly hit the ‘User Rate Limit’, and after several hours of waiting I canceled it because there seemed to be no progress anymore.

2018/02/11 23:55:49 DEBUG : Using config file from "/a/b/.config/rclone/rclone.conf"
2018/02/11 23:55:49 DEBUG : rclone: Version "v1.39" starting with parameters ["asdf" "dedupe" "remote:" "-vv"]
2018/02/11 23:55:50 INFO  : Google drive root '': Modify window is 1ms
2018/02/11 23:55:50 INFO  : Google drive root '': Looking for duplicates using interactive mode.
2018/02/11 23:55:56 DEBUG : pacer: Rate limited, sleeping for 1.107571469s (1 consecutive low level retries)
2018/02/11 23:55:56 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded, userRateLimitExceeded)
2018/02/11 23:55:56 DEBUG : pacer: Rate limited, sleeping for 2.455559224s (2 consecutive low level retries)
2018/02/11 23:55:56 DEBUG : pacer: low level retry 2/10 (error googleapi: Error 403: User Rate Limit Exceeded, userRateLimitExceeded)
etc
etc
etc

This is my moveto command:

./rclone moveto '/x/y/z' remote:'/m/n' -c --fast-list -v

I saw the dedupe options here, so if I just want to rename the dupes, do I have to run this?

./rclone dedupe --dedupe-mode rename remote: -vv

But how does this help me if I try to upload a file with the same name via moveto again? Shouldn’t it rename at that point, like ‘oh, I tried to upload that file, but it already exists, so I’m going to rename it’?

Can someone explain this to me, please?
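
Some background that may help here: Google Drive identifies files by ID rather than by path, so a single folder can hold several entries with the same name. rclone works with paths, so it warns about the extra copies and ignores them, and dedupe is the command that cleans them up on the remote. A minimal sketch of what "duplicate" means in this sense, assuming a hypothetical listing of (path, id) pairs; `find_duplicates` and the IDs are illustrative only and not part of rclone:

```python
from collections import defaultdict


def find_duplicates(entries):
    """Group (path, file_id) pairs by path and keep paths seen more than once."""
    by_path = defaultdict(list)
    for path, file_id in entries:
        by_path[path].append(file_id)
    return {path: ids for path, ids in by_path.items() if len(ids) > 1}


# Hypothetical listing: two distinct Drive objects share the path "m/n/a.bin".
listing = [
    ("m/n/a.bin", "id-1"),
    ("m/n/a.bin", "id-2"),
    ("m/n/b.bin", "id-3"),
]
print(find_duplicates(listing))
```

Only "m/n/a.bin" is reported, because it is the only path backed by more than one object.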


#2

I don’t know how to help you, but I’m struggling a lot with the Google Drive user request limit myself.
Does anyone know exactly how --fast-list helps?
I could understand it helping with rclone size --fast-list or rclone ls --fast-list, but what does --fast-list do for move/copy/moveto, and how?

Should I start adding it to all my Google Drive commands? I have plenty of spare RAM, but I hit user rate limits almost every second, all day.


#3

You might want to use, say, --tpslimit 1 to slow rclone down a bit. Also, if you have subdirectories, you could break the work up by checking them individually first.
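
One way to picture both suggestions together is a small helper that builds one dedupe command per subdirectory, each with --tpslimit set. This is only a sketch: `dedupe_commands` is a hypothetical helper, the subdirectory names are placeholders, and nothing is executed; it just prints the command lines so they can be reviewed first.

```python
def dedupe_commands(remote, subdirs, mode="rename", tpslimit=1):
    """Build one `rclone dedupe` argv per subdirectory (nothing is executed)."""
    commands = []
    for sub in subdirs:
        target = f"{remote}:{sub}"
        commands.append([
            "rclone", "dedupe",
            "--dedupe-mode", mode,
            "--tpslimit", str(tpslimit),
            "-v",
            target,
        ])
    return commands


# Placeholder subdirectories; substitute the real top-level folders.
for argv in dedupe_commands("remote", ["m/n", "m/o"]):
    print(" ".join(argv))
```

Running the printed commands one at a time keeps each listing small, which makes it easier to see which folder, if any, is the one that stalls.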

There is a bug in rclone, or possibly in Google Drive, which sometimes causes duplicated files or directories. I’ve noticed that even the image backup from Android makes duplicates in Drive, so maybe it isn’t rclone’s fault.

I don’t think there is anything wrong with your command. If you can reliably create a duplicate, then please make a new issue on GitHub with instructions to reproduce it and I’ll try to fix it.


#4

It’s possible, though I’m not entirely sure yet, that any rclone command run against a Google Drive remote that needs a dedupe will fail.

I’ve been having trouble running rclone size on a remote with 3 duplicate directories that rclone warns me about. I just get endless rate limit error messages, but also endless “Resetting sleep to minimum 10ms on success” messages. It’s almost as if these duplicate directories can put rclone into an infinite loop. I should mention that when I target other locations with the same Google account, I can use them normally.

The duplicate directories are on a team drive, though, while all the normal usage without these endless errors is on a my drive, which somewhat taints the experiment. I haven’t tried dedupe yet, because I was hoping to use a move command instead and worry about dedupe issues later. Of course I can’t use the move command either, but that might be because my chosen destination Google Drive has hit its daily quota. The other account I’m using hasn’t hit its quota, and I’ve continued to migrate data to it all day.

After 24 hours have passed, so I can be 100% sure it’s not quota related (even though I’m quite sure different Google accounts each have their own quota), I’ll probably try dedupe, or honestly maybe just go straight to the purge command.

-vv doesn’t put out any useful information other than rate limit, user rate limit, and resetting on success, over and over. Is there a level of verbosity beyond -vv, like -vvv? Usually I just use -vv; this time I tried -vvvvv and it didn’t seem to do anything:

2018/02/13 03:30:14 DEBUG : pacer: Rate limited, sleeping for 1.083340936s (1 consecutive low level retries)
2018/02/13 03:30:14 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:14 DEBUG : pacer: Rate limited, sleeping for 2.34823252s (2 consecutive low level retries)
2018/02/13 03:30:14 DEBUG : pacer: low level retry 2/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:15 DEBUG : pacer: Rate limited, sleeping for 4.423798463s (3 consecutive low level retries)
2018/02/13 03:30:15 DEBUG : pacer: low level retry 3/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:18 DEBUG : pacer: Resetting sleep to minimum 10ms on success
2018/02/13 03:30:18 INFO : Google drive root ‘crypt3/159.nzEuFNJ WrK 2.1kS NvJKvIE uzxzKrC sCrtB uIzMv wIFD 3121’: Modify window is 1ms
2018/02/13 03:30:24 DEBUG : pacer: Rate limited, sleeping for 1.132041335s (1 consecutive low level retries)
2018/02/13 03:30:24 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:24 DEBUG : pacer: Rate limited, sleeping for 2.787567565s (2 consecutive low level retries)
2018/02/13 03:30:24 DEBUG : pacer: low level retry 2/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:25 DEBUG : pacer: Rate limited, sleeping for 4.55727093s (3 consecutive low level retries)
2018/02/13 03:30:25 DEBUG : pacer: low level retry 3/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:28 DEBUG : pacer: Resetting sleep to minimum 10ms on success
2018/02/13 03:30:33 DEBUG : pacer: Rate limited, sleeping for 1.932540156s (1 consecutive low level retries)
2018/02/13 03:30:33 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded, userRateLimitExceeded)
2018/02/13 03:30:33 DEBUG : pacer: Rate limited, sleeping for 2.147214095s (2 consecutive low level retries)
2018/02/13 03:30:33 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded, userRateLimitExceeded)
2018/02/13 03:30:33 DEBUG : pacer: Resetting sleep to minimum 10ms on success
2018/02/13 03:30:33 DEBUG : pacer: Rate limited, sleeping for 1.125659894s (1 consecutive low level retries)
2018/02/13 03:30:33 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded, userRateLimitExceeded)
2018/02/13 03:30:33 DEBUG : pacer: Resetting sleep to minimum 10ms on success
2018/02/13 03:30:35 DEBUG : pacer: Rate limited, sleeping for 1.662077443s (1 consecutive low level retries)
2018/02/13 03:30:35 DEBUG : pacer: low level retry 2/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:35 DEBUG : pacer: Rate limited, sleeping for 2.179840358s (2 consecutive low level retries)
2018/02/13 03:30:35 DEBUG : pacer: low level retry 2/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:35 DEBUG : pacer: Rate limited, sleeping for 4.069973936s (3 consecutive low level retries)
2018/02/13 03:30:35 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:36 DEBUG : pacer: Resetting sleep to minimum 10ms on success
2018/02/13 03:30:41 DEBUG : pacer: Rate limited, sleeping for 1.367549377s (1 consecutive low level retries)
2018/02/13 03:30:41 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded, userRateLimitExceeded)
2018/02/13 03:30:41 DEBUG : pacer: Rate limited, sleeping for 2.739896094s (2 consecutive low level retries)
2018/02/13 03:30:41 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded, userRateLimitExceeded)
2018/02/13 03:30:41 DEBUG : pacer: Resetting sleep to minimum 10ms on success
2018/02/13 03:30:41 DEBUG : pacer: Rate limited, sleeping for 1.663870139s (1 consecutive low level retries)
2018/02/13 03:30:41 DEBUG : pacer: low level retry 2/10 (error googleapi: Error 403: User Rate Limit Exceeded, userRateLimitExceeded)
2018/02/13 03:30:41 DEBUG : pacer: Resetting sleep to minimum 10ms on success
2018/02/13 03:30:42 DEBUG : pacer: Rate limited, sleeping for 1.813196165s (1 consecutive low level retries)
2018/02/13 03:30:42 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded, userRateLimitExceeded)
2018/02/13 03:30:42 DEBUG : pacer: Resetting sleep to minimum 10ms on success
2018/02/13 03:30:42 DEBUG : pacer: Rate limited, sleeping for 1.549292195s (1 consecutive low level retries)
2018/02/13 03:30:42 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded, userRateLimitExceeded)
2018/02/13 03:30:42 DEBUG : pacer: Resetting sleep to minimum 10ms on success
2018/02/13 03:30:42 DEBUG : pacer: Rate limited, sleeping for 1.668501047s (1 consecutive low level retries)
2018/02/13 03:30:42 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: User Rate Limit Exceeded, userRateLimitExceeded)
2018/02/13 03:30:42 DEBUG : pacer: Resetting sleep to minimum 10ms on success
2018/02/13 03:30:44 DEBUG : pacer: Rate limited, sleeping for 1.426000817s (1 consecutive low level retries)
2018/02/13 03:30:44 DEBUG : pacer: low level retry 2/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:44 DEBUG : pacer: Rate limited, sleeping for 2.960099615s (2 consecutive low level retries)
2018/02/13 03:30:44 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:44 DEBUG : pacer: Rate limited, sleeping for 4.962188815s (3 consecutive low level retries)
2018/02/13 03:30:44 DEBUG : pacer: low level retry 2/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:44 DEBUG : pacer: Rate limited, sleeping for 8.199656769s (4 consecutive low level retries)
2018/02/13 03:30:44 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
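
Reading the sleep times in the log above, they roughly double with each consecutive retry (about 1s, 2s, 4s, 8s) plus a fraction of a second, and drop back to 10ms after a success. That looks like ordinary exponential backoff with jitter. The sketch below only reproduces that observed pattern; the formula is an assumption inferred from these log lines, not rclone’s actual pacer code, and `backoff_sleep` is a hypothetical name.

```python
import random


def backoff_sleep(consecutive_retries, rng=random.random):
    """Sleep time in seconds for the n-th consecutive retry (n >= 1).

    Assumption inferred from the log: a base that doubles on each
    consecutive retry, plus up to one second of random jitter.
    """
    if consecutive_retries < 1:
        raise ValueError("consecutive_retries must be >= 1")
    return 2 ** (consecutive_retries - 1) + rng()


for n in range(1, 5):
    print(n, round(backoff_sleep(n), 3))
```

Under that reading, a run of "1 consecutive" entries followed by a reset means requests are getting through between retries, just slowly.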

My conclusion is based on a LOT of wild guesses, though. I don’t really know why things suddenly stopped working for this particular directory. Thankfully I still have a copy of all this data on my ACD as well as on my GCC disk. (I had copied it all to this new location, and this error cropped up right around the time I went to check it over. I think I was going to use a size command before a move command?)

Oh, also, this issue doesn’t seem to affect rclone lsd at all, even when targeting one of the duplicate directories. I have no idea how to target the other duplicate, though, so of course using an rclone command on the prime duplicate works. I guess the issues arise when using rclone on the parent of a duplicate, because rclone gets confused between the prime and subprime duplicates? Somehow? Maybe? Maybe the loop rclone gets stuck in isn’t infinite, just long, and since I’ve been a heavy gdrive user this week overall, I can’t get through it. That’s why I’m going to take 24 hours off before coming back to this issue (even though my other tasks suggest I’m not currently affected by the daily quota, as far as I can tell).

I’d be interested to hear how the OP solves his problem.

TL;DR
I’d be especially interested to hear whether the OP is using a mydrive or a teamdrive, because I’ve never seen this issue in a mydrive, but I think I might be suffering from it myself in a teamdrive that I uploaded to from two different G Suite accounts.


#5

That is a normal part of Google Drive doing rate limiting, unfortunately. You should find that either you get an ERROR or it succeeds eventually. I don’t see any ERRORs in your log.

I don’t think it is related to the duplicates; I think it is just rate limiting :( Note that you can use, say, --tpslimit 1 to do your own pre-emptive rate limiting, which some people have reported success with.
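
For a long -vv log, a quick way to check the "ERROR or eventual success" point is to count the line types. A minimal sketch, assuming the log line format shown earlier in this thread; `summarize` is a hypothetical helper, and the sample lines are copied from the log posted in #4.

```python
import re

# "2018/02/13 03:30:14 DEBUG : message" (the level may be padded with spaces)
LINE_RE = re.compile(
    r"^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (DEBUG|INFO|NOTICE|ERROR)\s*: (.*)$"
)


def summarize(log_text):
    """Count low level retries, pacer resets and ERROR lines in an rclone log."""
    counts = {"retries": 0, "resets": 0, "errors": 0}
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not a recognisable log line
        level, message = match.groups()
        if level == "ERROR":
            counts["errors"] += 1
        elif message.startswith("pacer: low level retry"):
            counts["retries"] += 1
        elif message.startswith("pacer: Resetting sleep"):
            counts["resets"] += 1
    return counts


sample = """\
2018/02/13 03:30:14 DEBUG : pacer: Rate limited, sleeping for 1.083340936s (1 consecutive low level retries)
2018/02/13 03:30:14 DEBUG : pacer: low level retry 1/10 (error googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded)
2018/02/13 03:30:18 DEBUG : pacer: Resetting sleep to minimum 10ms on success
"""
print(summarize(sample))
```

Many retries and resets with zero errors would match the description above: requests are being slowed down, not rejected outright.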


#6

Thank you very much for your answers guys.

Well, I will try the

--tpslimit 1

flag next time and see whether it does any good.

There is a bug in rclone or possibly in google drive which sometimes causes duplicated files or directories

Yeah, I have seen folders with doubled content several times. I thought maybe the creator did some weird stuff, because I simply upload the files that get transferred to me. I cannot check the content of all those files, but now and then I click through my Google Drive and see that buggy behaviour. For now it is ‘okay’ for me at least, because I do not see it that often and the duplicates do not create that much ‘waste traffic’ (I hope).

Unfortunately I cannot reproduce this bug. But if I can, I will immediately open a bug report.

@left1000

I use mydrive, not a team account. I have written a simple Python script (I start it manually so I can watch over the process) that removes the duplicated folders/files from my local disk. It is a dirty fix, but it works for now and does not take much time.
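
The script itself isn’t shown in this thread. As a rough sketch of the idea only, assuming the remote’s file list has already been exported as relative paths (for example with something like rclone lsf -R --files-only), the hypothetical helper `already_uploaded` below reports which local files already exist on the remote. It only compares paths and deletes nothing; a real cleanup would want to compare sizes or checksums before removing anything.

```python
from pathlib import Path


def already_uploaded(local_root, remote_paths):
    """Return local files whose path relative to local_root is in remote_paths.

    remote_paths is assumed to hold '/'-separated paths relative to the
    upload destination. Nothing is deleted here; the caller decides.
    """
    root = Path(local_root)
    remote = set(remote_paths)
    hits = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.relative_to(root).as_posix() in remote:
            hits.append(path)
    return hits
```

Printing the returned list and reviewing it before deleting anything keeps the "dirty fix" from removing files that only share a name with something on the remote.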

Additional thoughts on the User Rate Limit stuff: I would bet money that Google just shadow-limits some actions. Maybe you get limited in how many files you can create/edit/read per second (it is ~30 per second for me without the shadow limit and ~2 per second with it). There is nothing we can do about it on our side.


#7

I’ve moved several terabytes through other Google accounts during the time this one has been “stuck”. I haven’t gotten this account to do anything meaningful in roughly 36-48 hours.
So I just went to the Google Drive website and deleted these directories. So far this hasn’t seemed to help.
Maybe this account has been banned for more than 24 hours due to heavy usage? Has anyone ever heard of that happening? I think this account did manage to move 800 GB or so in one day, but the other accounts have managed 750 GB per day for multiple days in a row.

edit: So I just gave up. I’m now moving data from a hard drive to this broken account’s mydrive. It’s working flawlessly: transferring data, with the normal amount of pacer messages and the normal amount of copy new messages.

Yet trying to rclone size that one broken team drive directory failed for an hour straight. There’s definitely something wrong with that team drive, or that directory, but I have no idea what. If a -vvv ever gets added, and I still have access to this broken team drive directory, I’ll try running that, because I have no idea what’s going on with it. Even switching from --tpslimit 1 to --transfers 5 with no --tpslimit at all didn’t help.

But if the OP is not using a team drive, I have no idea what’s going on. I think Google Drive directories can just rarely get bugged out? For comparison, I was able to make 90,000 checks (size and date only) on mydrive in 30 minutes, and 0 in 27 minutes on this broken directory.


#8

You have to keep in mind that they have other limits (and I guess shadow limits as well) for team drives. I can’t find the link on the Google help site anymore, but it listed the maximum number of files you can have. I think it was around 250k files.

edit: See “Google Team Drive file limit reached” for more info about that.

You could contact Google support and ask them if something is wrong with your broken team drive, but usually they just reply with copy-and-paste answers, and none of them help.

Like I mentioned before, there is nothing we can do to pinpoint the problem with all those ‘maybe shady’ limits involved. We can just pray and obey.

I have a gdrive account that has uploaded the full 750 GB nearly every day for over 60 days with no ‘ban’. If they started banning accounts for heavy usage, it would outrage many, many users, and I would have heard of it. I also check r/DataHoarder/ and 4chan on a daily basis; if there were any bans going on for heavy usage, I would know. I’ve also seen users with hundreds of TB and even some PB stored, and they have had no problems so far, just the usual shadow limits.
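
To put those numbers in perspective, here is the arithmetic as a small sketch. The 750 GB per day figure is the one quoted in this thread, decimal units (1 TB = 1000 GB) are an assumption, and the 22 TB total is just the account size from #1 used as an example.

```python
import math

DAILY_UPLOAD_GB = 750  # per-account daily upload figure quoted in this thread


def uploaded_tb(days, daily_gb=DAILY_UPLOAD_GB):
    """Total uploaded in TB after `days` full days (1 TB = 1000 GB assumed)."""
    return days * daily_gb / 1000


def days_to_upload(total_tb, daily_gb=DAILY_UPLOAD_GB):
    """Whole days needed to upload `total_tb` at the daily cap."""
    return math.ceil(total_tb * 1000 / daily_gb)


print(uploaded_tb(60))     # 60 days at the full cap
print(days_to_upload(22))  # days needed for a 22 TB collection
```

Sixty full days at the cap comes to 45 TB, and a 22 TB collection needs 30 days through a single account.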

edit2: Found the link with their limits:
https://support.google.com/a/answer/7338880


#9

I linked that answer in another thread.
As of right now I am 100% sure I have created a directory or subdirectory on gdrive that rclone is entirely incapable of interacting with.
rclone is in fact stuck in an infinite loop, and the rate limit errors are only a result of that loop, triggering only 50% of the time. It’s the successes, which reset the timer, that are the true errors.
rclone would need an infinite number of successes to produce any feedback or result at all.

In order to coordinate with the OP, I am now going to detail what I have done, hoping there is some common thread that has caused us to have these “polluted” drives.

I am in the process of using odrive to download ACD to GCC and then upload to gdrive. From now on, references to “local” drives refer to GCC persistent disks; the actual physical media in this case is a decade old and sitting unused in a closet.

  1. I began uploading to gdrive somewhat randomly, for a short period of time, small amounts of data from each of multiple local disks, using rclone copy.
  2. I stopped this, realizing the 750 GB quota would be a problem, and created googleaccount2 alongside googleaccount1.
  3. I completely uploaded localdisk1, a 2 TB disk, to teamdrive using googleaccount1 and rclone copy.
  4. I almost completed uploading localdisk2, a 1 TB disk, to teamdrive using googleaccount2 and rclone copy.
  5. I began uploading localdisk3, a 2 TB disk, to teamdrive using googleaccount1 and rclone copy.
  6. I saw an error on the screen while uploading localdisk2. I googled this error and learned that teamdrives have a maximum depth of 20 subdirectories, and I’d gone past that limit.
  7. I stopped uploading anything to teamdrives.
  8. I used Google’s own website to move localdisk1 from teamdrive to googleaccount1:mydrive instantly.
  9. This resulted in a lot of errors and funny business addressed in other threads, so I then isolated my actions in relation to localdisk2 to the mydrive of googleaccount2. (The funny errors included bypassing the 250,000 file limit on teamdrives as well as the hierarchy limit I’d reached earlier.)
  10. I then decided to leave localdisk1 alone for the time being. Sunday evening I used an rclone move command to move all of localdisk2 from teamdrive to mydrive for googleaccount2.
  11. Monday morning I got up and saw that there had been roughly 1 error and 90,000 successes (although my memory is hazy on this, and I didn’t bother to record that log).
  12. I attempted to use rclone on teamdrive:localdisk2 and no command worked EXCEPT lsd.
  13. Today, Tuesday evening, I decided that teamdrive:localdisk2 was somehow “polluted”, so I did an rclone copy of localdisk2 to googleaccount2:mydrive and got roughly 90,000 checks and 100-200 transfers. This means that the polluted directory teamdrive:localdisk2 contains at most a few hundred files (although the directory substructure of roughly 6,000 directories might still all exist).
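
The two team drive limits mentioned in steps 6 and 9 could be checked against a file list before uploading. This is only a sketch: the 20-level and 250,000-file figures are the ones reported in this thread, how Google actually counts depth is an assumption (here, the number of folders above a file), and `check_limits` is a hypothetical helper.

```python
from pathlib import PurePosixPath

MAX_DEPTH = 20       # nesting limit as reported in step 6
MAX_FILES = 250_000  # file-count limit as reported in step 9


def check_limits(file_paths, max_depth=MAX_DEPTH, max_files=MAX_FILES):
    """Flag files nested deeper than max_depth and an overall count above max_files.

    Depth is counted as the number of folders above the file, which is
    an assumption about how the limit is applied.
    """
    too_deep = [
        p for p in file_paths
        if len(PurePosixPath(p).parent.parts) > max_depth
    ]
    return {
        "file_count": len(file_paths),
        "over_file_limit": len(file_paths) > max_files,
        "too_deep": too_deep,
    }


# Placeholder paths: one shallow file and one nested 21 folders deep.
paths = ["a/b/c.txt", "/".join(["d"] * 21) + "/f.txt"]
print(check_limits(paths)["too_deep"])
```

Only the second placeholder path is reported, since it sits under 21 folders.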

At this point I might try using combinations of lsd and size to find which subdirectories specifically fail or pass each of these commands, but either way I’ve discovered an infinite loop for rclone. rclone size remotedrive: shouldn’t be able to target a drive with at most a few hundred files and return tens of thousands of successful actions without generating any feedback or completing.
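
That lsd/size probing could be scripted along these lines. It is a sketch under assumptions: `probe` and `probe_subdirs` are hypothetical helpers, the timeout is an arbitrary placeholder, and a command that never finishes is reported as "timeout" rather than left to spin.

```python
import subprocess


def probe(argv, timeout_s):
    """Run one command; return 'pass', 'fail' or 'timeout'."""
    try:
        result = subprocess.run(argv, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "pass" if result.returncode == 0 else "fail"


def probe_subdirs(remote, subdirs, timeout_s=600, run=probe):
    """Try `rclone lsd` and `rclone size` on each subdirectory of a remote."""
    results = {}
    for sub in subdirs:
        target = f"{remote}:{sub}"
        results[sub] = {
            "lsd": run(["rclone", "lsd", target], timeout_s),
            "size": run(["rclone", "size", target], timeout_s),
        }
    return results
```

Calling something like probe_subdirs("teamdrive", [...]) with the real subdirectory names, then repeating on the children of whichever entry times out, would narrow the problem down one level at a time.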

In conclusion, though: I blame Google entirely; I do not think rclone is at all to blame here. I was sure that this error was caused by my use of the teamdrive and the teamdrive limits. But since the OP isn’t using teamdrives at all, I’m clueless about what we did in common. Maybe, if this is at all readable, he can guess.

My initial guess was that when I got the error for going past 20 subdirectories of depth, that entry was corrupted on my teamdrive, and when rclone tries to navigate past it, it can’t, because Google won’t let it. But rclone has no feedback to handle this case and so says nothing, even under -vv.

This theory doesn’t explain why or how the OP could have the same issue, though, if he never used teamdrives.

TL;DR I still had my local copy of these files, I don’t need to repair this error at all. I am just curious to know what caused it and if there’s a cure, just in case it ever were to strike again.

edit: It is possible this Google account and team drive have some sort of “shadow” ban like the OP said, although this is not a ban anyone has disclosed the existence of before. It doesn’t limit API requests or bandwidth in any way through that Google account, it doesn’t appear to end after 24 hours, and I have no idea what triggered it. Since I never plan to repeat most of the above steps, though, I doubt I’ll run into it again. If it is a “shadow” ban, it’s still putting rclone into an infinite loop by sending rclone some sort of silent “shadow” error that rclone can’t feed back to the user.

edit2: Never mind, none of this is true. I was able to do all the things I said I couldn’t do. I don’t think it had anything to do with quotas or bans, though. The step that eventually gave me access was manually deduping the duplicate folders, at least in my mind. Now that it’s fixed, it’s hard to say: did I have a 36-48 hour shadow ban, or was the manual dedupe the solution? Given that I was entirely wrong, maybe I should delete my posts, but I won’t unless someone suggests it. I prefer to log even incorrect information.

I will elaborate on why I’m not sure whether manual deduping mattered. I did the manual deduping, then tried rclone size and rclone move, and they didn’t work. Then I moved on and used the account for other things, and it was clear I wasn’t banned in terms of API requests or bandwidth. At the end of the day I tried the broken directory again and it just worked, roughly 36-48 hours after it last worked. However, nick has said, and I’ve noticed, that manual changes made via the G Suite Drive website can sometimes take effect after some inexact delay. So I’m not sure if it was some secret, selective, weird-duration ban, or if manual deduping helped with a delayed effect.