Google Drive dedupe behaviour change?

Problem

A bash script I have run daily via cron for ages does an rclone sync from Google Drive to local, a dedupe if duplicates are found, and then a second sync. It has suddenly found a lot of duplicates to chew over, although all except one of the files had not been touched since yesterday. (The flow is sketched below.)
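For context, the flow is roughly this (a minimal sketch, not the actual script - the remote name, paths and the exact duplicate log message are my assumptions):

#!/usr/bin/env bash
# Minimal sketch of the daily cron flow: sync, dedupe only if the sync
# reported duplicate names, then sync again. Names/paths are placeholders.
set -euo pipefail
REMOTE="gdrive:"            # hypothetical remote name
DEST="/backup/gdrive"       # local ext4 mirror
LOG="/tmp/rclone-sync.log"

rclone sync "$REMOTE" "$DEST" --log-file "$LOG" -v

# rclone logs clashing names as "Duplicate object found in source - ignoring"
if grep -q "Duplicate object found" "$LOG"; then
    rclone -v dedupe "$REMOTE" --dedupe-mode rename
    rclone sync "$REMOTE" "$DEST" -v
fi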

Until now, identically named files in the same folder were given distinct names by the dedupe to make them compatible with the locally synced ext4 copy.

I am away from the computer and only have access to a sort of log copied to Drive by the script, but what seems to be happening is that dedupe is acting on some identical files in different folders (well, with multiple Google Drive labels).
Dedupe was last run a couple of days before and acted as expected.

What is your rclone version (output from rclone version)

rclone v1.51.0

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Fedora 31, 64 bit

Which cloud storage system are you using? (eg Google Drive)

Google Drive

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone dedupe

A log from the command with the -vv flag (eg output from rclone -vv copy /tmp remote:tmp)

Not at the computer. The extracts of log I have seem to suggest the path part of the name is being truncated, so that identical files in different longish folders appear to be in the same truncated folder.

My assumption was that Google's end had broken, but now I am wondering whether it is a 1.51.0 thing, and when rclone dedupe was last required/run, now that rclone is automatically (great, in a good way) updated by the Fedora repo via dnf.

These are all files with long 'paths' or labels.

It is sort of hard to tell, because the same script had been spitting out loads of errors for the previous few days, caused by barfed Google Mail behaviour breaking gyb syncing and creating logs that sendgrid choked on.

Hmm.

Can you share an example of the dedupe with the problem and a debug log? I'm not following what full command you are running, so it's hard to understand without a log to look at.

I am running an rclone sync from Google Drive to a local Linux ext4 drive. That worked fine today, as usual, but the log properly recognised there were duplicate file names (scanned files with names based on the nearest minute, done in quick succession), so in the normal way the script runs a dedupe:

rclone -v dedupe $googledrivesource --dedupe-mode rename

Instead of just picking up today's uploaded files with duplicate names, dedupe processes a bunch of other files which are identical but in different (long) folders and had been sitting there minding their own business for many months.

For instance, this extract of output has been chewed up by sed but might give an idea. I have tried to reset the line breaks manually:-

2020/03/19 19:44:33 INFO : Google drive root '': Looking for duplicates using rename mode.
2020/03/19 19:45:40 NOTICE: JP/Tax/NRLT/2013/Q3 to dec 2012/National grid wayleave oct 2012 115.30gbp.PDF: Found 3 duplicates - deleting identical copies
2020/03/19 19:45:40 NOTICE: JP/Tax/NRLT/2013/Q3 to dec 2012/National grid wayleave oct 2012 115.30gbp.PDF: Deleting 2/3 identical duplicates (MD5 "369c9353560212695a4bdc0d397ae98a")
2020/03/19 19:45:41 INFO : JP/Tax/NRLT/2013/Q3 to dec 2012/National grid wayleave oct 2012 115.30gbp.PDF: Deleted
2020/03/19 19:45:42 INFO : JP/Tax/NRLT/2013/Q3 to dec 2012/National grid wayleave oct 2012 115.30gbp.PDF: Deleted
2020/03/19 19:45:42 NOTICE: JP/Tax/NRLT/2013/Q3 to dec 2012/National grid wayleave oct 2012 115.30gbp.PDF: All duplicates removed

The log claims to have deleted duplicate files and may have left one instance somewhere (I have backups and today's dedupe log info, so no heat), but it refers each time in the log to the same file/path, and essentially the behaviour is just to delete files where multiple instances occur at different locations. It has done this in today's run for dozens of files that have been sitting there happily for years.

Something has changed behaviour and might be going rogue, and I wanted to see if it was a problem on other installations.

All best, Ed.

This is definitely not the intended behaviour of dedupe (it is not at all related to how general deduplication software works; rather, it is a tool to fix the duplicate-file issues that can arise on certain cloud systems).

It also sounds very unlikely to be a simple bug in the code, as it would actually need more logic to do this than the simple check it does now.

@ncw I guess you can tell us pretty quickly if there has even been a change to that code recently. It would be nice to rule this out right at the start. (A quick way to check is sketched below.)
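If anyone wants to look for themselves, something like this lists the commits that touched the dedupe code between the two releases (a sketch; I'm assuming the code still lives in fs/operations/dedupe.go):

# List commits touching the dedupe code between v1.50.0 and v1.51.0.
git clone --quiet https://github.com/rclone/rclone.git
cd rclone
git log --oneline v1.50.0..v1.51.0 -- fs/operations/dedupe.go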

I think this problem will be very hard to properly diagnose without a debug log (-vv) that can show us exactly what is happening under the hood.


There have been two changes to dedupe from 1.50 to 1.51

and

If one causes a problem, it will be the first change, I think.

Can you see if the problem still exists with 1.50? Something like the sketch below lets you test it without touching your packaged install.
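A sketch - substitute your remote name and whichever 1.50.x point release you prefer:

# Fetch a standalone v1.50.2 binary from the official download site,
# leaving the dnf-installed 1.51.0 untouched.
cd /tmp
curl -LO https://downloads.rclone.org/v1.50.2/rclone-v1.50.2-linux-amd64.zip
unzip rclone-v1.50.2-linux-amd64.zip
cd rclone-v1.50.2-linux-amd64
./rclone version                                      # should report v1.50.2
# Interactive mode so nothing is changed without confirmation.
./rclone -v dedupe remote: --dedupe-mode interactive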

There was another bug report about dedupe deleting stuff it shouldn't have here, which may or may not be related.

If you could make a small reproduction of the problem somehow that would be really really helpful.

Will do what I can when I get back. For now I have remotely disabled the script. Going through the log of what it has suddenly decided to delete, whilst reinstating, might give me a clue about what makes those files or their context 'special', so I can create a small example.

Btw, rclone is really neat. Most of the myriad use cases are of no interest to me, but syncing cloud data to local as the start of a chain to proper backup helps me sleep at night.

My script only calls dedupe occasionally, when sync reports duplicates in a folder. It has only run twice since Fedora dnf upgraded the machine from rclone 1.50 to 1.51. The other time it wrongly deleted 170 files, which I spotted, restored and blamed on Google having a wobbly - those files were not shared with anyone, but had been in the past. Seems to be linked to 1.51.0.

Thanks for taking a look.

I can't easily replicate this - any help much appreciated!

I wonder if this is Google deduplicating the files behind the scenes... so:

Files were originally

  • file1 -> ID abcdefg
  • file1 -> ID ghijhsdf

Google notices that it is storing two copies of the same file and makes it

  • file1 -> ID abcdefg
  • file1 -> ID abcdefg

Instead.

When rclone dedupe comes to delete the file, it deletes it by ID (because that is how the API works) and boom, both files are gone with one delete - the sketch below illustrates this.
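To illustrate, the Drive v3 API deletes by file ID rather than by path, so one call removes the object from every folder it appears in (a sketch with placeholder ID and token):

# Deleting by ID removes the file from all of its parent folders at once.
ACCESS_TOKEN="ya29...."     # placeholder OAuth token with Drive scope
FILE_ID="abcdefg"           # placeholder ID from the example above
curl -X DELETE \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  "https://www.googleapis.com/drive/v3/files/$FILE_ID"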

That doesn't explain why rclone 1.50 didn't find them though...

From your report above

This should have left a file called JP/Tax/NRLT/2013/Q3 to dec 2012/National grid wayleave oct 2012 115.30gbp.PDF behind - did it not?

I have finally got round to looking at this properly. It left one instance of the file

JP/Tax/NRLT/2013/Q3 to dec 2012/National grid wayleave oct 2012 115.30gbp.PDF

and I think I have manually reinstated another instance in another folder, but that is somewhat confused by subsequent changes, so I have started looking at a different file.

The problem area I came across, and this might account for many of the situations in the earlier issues, is an example of a file that exists in several locations/has several labels. Each location/label of the file is different. The file is, in this case, called About-24.docx. (It is a native Google Drive text document, but similar issues occur for PDF files.) It started life as About.docx, but I think dedupe mangled its name earlier - it doesn't matter in this case.

[user@user ~]$ rclone -v dedupe edrclone: --dedupe-mode interactive
2020/03/30 14:56:29 INFO : Google drive root '': Looking for duplicates using interactive mode.
2020/03/30 14:57:48 NOTICE: Ed/Ed edu/Ed Wye/Cricket Week 1985/Commem Ball 1985/0000 - About-24.docx: Found 24 duplicates - deleting identical copies
Ed/Ed edu/Ed Wye/Cricket Week 1985/Commem Ball 1985/0000 - About-24.docx: 24 duplicates remain
1: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
2: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
.................. snip ...............................
20: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
21: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
22: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
23: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
24: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
s) Skip and do nothing
k) Keep just one (choose which in next step)
r) Rename all to be different (by changing file.jpg to file-1.jpg)
s/k/r>

So now add the file to another arbitrary folder not already containing it (from the Google Drive web page: select it, press Shift-Z, and pick a folder):

[user@user ~]$ rclone -v dedupe edrclone: --dedupe-mode interactive
2020/03/30 15:05:06 INFO : Google drive root '': Looking for duplicates using interactive mode.
2020/03/30 15:06:15 NOTICE: Ed/Ed edu/Ed Wye/Cricket Week 1985/Commem Ball 1985/0000 - About-24.docx: Found 25 duplicates - deleting identical copies
Ed/Ed edu/Ed Wye/Cricket Week 1985/Commem Ball 1985/0000 - About-24.docx: 25 duplicates remain
1: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
2: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
3: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
...................... snip ...............................
22: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
23: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
24: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
25: -1 bytes, 2020-01-30 20:53:41.583000000, MD5
s) Skip and do nothing
k) Keep just one (choose which in next step)
r) Rename all to be different (by changing file.jpg to file-1.jpg)
s/k/r>

Now it finds 25, not 24. They are not duplicates in the same folder, but the same Google Drive file in several folders. I'm guessing that if dedupe were run with --dedupe-mode rename, the file would become 0000 - About-25.docx.

So to create a simple case: create a Google Docs file, add it to another folder (the Shift-Z thing again) and then run dedupe:-

[user@user ~]$ rclone -v dedupe edrclone: --dedupe-mode interactive
2020/03/30 15:34:07 INFO : Google drive root '': Looking for duplicates using interactive mode.
2020/03/30 15:35:29 NOTICE: Test for rclone.docx: Found 2 duplicates - deleting identical copies
Test for rclone.docx: 2 duplicates remain
1: -1 bytes, 2020-03-30 15:31:49.951000000, MD5
2: -1 bytes, 2020-03-30 15:31:49.951000000, MD5
s) Skip and do nothing
k) Keep just one (choose which in next step)
r) Rename all to be different (by changing file.jpg to file-1.jpg)
s/k/r>

And keep going. Choose rename for those files (I may have done all that twice) and rerun dedupe. The file seems to have been renamed, but dedupe still thinks it is a duplicate, so you get:

[user@user ~]$ rclone -v dedupe edrclone: --dedupe-mode interactive
2020/03/30 15:40:09 INFO : Google drive root '': Looking for duplicates using interactive mode.
2020/03/30 15:41:19 NOTICE: Test for rclone-2-2.docx: Found 2 duplicates - deleting identical copies
Test for rclone-2-2.docx: 2 duplicates remain
1: -1 bytes, 2020-03-30 15:31:49.951000000, MD5
2: -1 bytes, 2020-03-30 15:31:49.951000000, MD5
s) Skip and do nothing
k) Keep just one (choose which in next step)
r) Rename all to be different (by changing file.jpg to file-1.jpg)
s/k/r>

I do appreciate that having a file in more than one folder was always going to be problematic, but this is legacy data which rclone seemed to tolerate OK until 1.51.0.

That may not cover all the issues, and certainly not the deleted files, but it is a start.

  1. Rclone dedupe is now, and I posit wrongly, treating Google Drive files that are in more than one folder as if they were duplicate filenames in the same folder - there is a crazy, flawed symmetry. Rclone sync still gets this bit right; the flawed logic is only in rclone dedupe.
  2. The subsequent rclone dedupe rename behaviour just repeatedly, incrementally renames the same file, so next time around rclone still thinks there is a duplicate. (A scripted form of the repro above is sketched below.)
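For completeness, the manual repro above can be scripted against the Drive v3 API - a rough sketch (token, folder IDs and the use of jq are my assumptions; addParents is the API equivalent of Shift-Z):

# Create a native Google Doc in one folder, add a second parent folder,
# then run dedupe. Token and folder IDs are placeholders.
TOKEN="ya29...."             # OAuth token with Drive scope
PARENT_A="folderIdA"
PARENT_B="folderIdB"

# Create an empty native Docs file in folder A.
FILE_ID=$(curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"name\":\"Test for rclone\",\"mimeType\":\"application/vnd.google-apps.document\",\"parents\":[\"$PARENT_A\"]}" \
  "https://www.googleapis.com/drive/v3/files" | jq -r .id)

# The API equivalent of Shift-Z: give the file a second parent.
curl -s -X PATCH \
  -H "Authorization: Bearer $TOKEN" \
  "https://www.googleapis.com/drive/v3/files/$FILE_ID?addParents=$PARENT_B"

# 1.51.0 now sees the single file as two duplicates.
rclone -v dedupe edrclone: --dedupe-mode interactive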

Ed.

I tried doing that but I can't replicate the problem. I:

  • created a small Google test doc
  • selected it and pressed SHIFT-Z
  • created the dupe in a subdirectory of my test directory
  • ran rclone dedupe on the test directory and the subdirectory - both found no duplicates.

Can you post the config for edrclone: (with secrets removed)? Maybe you've got some setting in there. Are you using a team drive?
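Something like this prints just that section for redacting (a sketch, assuming rclone config show accepts a remote name, as current versions do):

rclone config show edrclone
# then redact client_id, client_secret, token and root_folder_id before posting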

Thanks for checking this. I have replicated it again below, and included a cat of rclone.conf, which should be about as basic as you can get for Drive. The Google account is just a consumer one with extra paid-for storage. It is quite old, going back to early Writely, so Google will have done a lot of devops to my online data and the underlying structures could have gained some weirdness that way. This time I did the rename (twice!) so you have a log of that. I'm deduping manually now and have commented rclone dedupe out of my cron script, so this is no real issue for me - maybe wait and see if anyone else encounters anything?

My guess is that it is something at Google's end that has changed, and major data loss is always an accident waiting to happen for anyone relying on these cloud data stores - warm glow inside because I have a local mirror of my cloud data courtesy of rclone, gyb etc., which I back up properly.

UPDATE - More weirdness at the end. Although Google's web interface shows only one instance, running rclone ls again doubles everything up - see the end of the transcript below.

All best in these strange times.

Ed


[user@user ~]$ rclone --version
rclone v1.51.0
- os/arch: linux/amd64
- go version: go1.13.6
[user@user ~]$ echo "Creating a root subfolder test" > /dev/null
[user@user ~]$ echo "Create file dedupe_test as native Docs text file in there, and a subfolder testsub" > /dev/null
[user@user ~]$ rclone ls edrclone:/test
       -1 dedupe_test.docx
[user@user ~]$ rclone -v dedupe edrclone:/test --dedupe-mode interactive
2020/03/30 19:59:11 INFO : Google drive root 'test': Looking for duplicates using interactive mode.
[user@user ~]$ echo "Use shift Z to put dedupe_test also into subfolder" > /dev/null
[user@user ~]$ rclone -v dedupe edrclone:/test --dedupe-mode interactive
2020/03/30 20:01:29 INFO : Google drive root 'test': Looking for duplicates using interactive mode.
2020/03/30 20:01:33 NOTICE: dedupe_test.docx: Found 2 duplicates - deleting identical copies
dedupe_test.docx: 2 duplicates remain
1: -1 bytes, 2020-03-30 19:56:25.765000000, MD5
2: -1 bytes, 2020-03-30 19:56:25.765000000, MD5
s) Skip and do nothing
k) Keep just one (choose which in next step)
r) Rename all to be different (by changing file.jpg to file-1.jpg)
s/k/r> r
2020/03/30 20:01:56 INFO : dedupe_test-1.docx: renamed from: dedupe_test.docx
2020/03/30 20:02:00 INFO : dedupe_test-2.docx: renamed from: dedupe_test.docx
2020/03/30 20:02:00 NOTICE: testsub/dedupe_test.docx: Found 2 duplicates - deleting identical copies
testsub/dedupe_test.docx: 2 duplicates remain
1: -1 bytes, 2020-03-30 19:56:25.765000000, MD5
2: -1 bytes, 2020-03-30 19:56:25.765000000, MD5
s) Skip and do nothing
k) Keep just one (choose which in next step)
r) Rename all to be different (by changing file.jpg to file-1.jpg)
s/k/r> r
2020/03/30 20:02:20 INFO : testsub/dedupe_test-1.docx: renamed from: testsub/dedupe_test.docx
2020/03/30 20:02:22 INFO : testsub/dedupe_test-2.docx: renamed from: testsub/dedupe_test.docx
[user@user ~]$ echo "Hmm... That is weird" > /dev/null
[user@user ~]$ cat ~/.config/rclone/rclone.conf
[edrclone]
type = drive
client_id = 2410------------.apps.googleusercontent.com
client_secret = --------------
scope = drive
token = {"access_token":"-------------","token_type":"Bearer","refresh_token":"1---------------","expiry":"2020-03-30T20:42:03.231710041+03:00"}
root_folder_id = --------------------
[user@user ~]$

[user@user ~]$ rclone ls edrclone:/test
       -1 dedupe_test-2.docx
       -1 dedupe_test-2.docx
       -1 testsub/dedupe_test-2.docx
       -1 testsub/dedupe_test-2.docx
[user@user ~]$

I've worked out why I've been having trouble reproducing this: it is because it is fixed in the latest beta, and that is what I've been testing with!

The fix must have been a side effect of some other fix - I think this is the one: https://github.com/rclone/rclone/commit/dfc7215bf9b7840f438d006bf7b399974f5ebf42

Duplicated directory listings explain everything.

$ rclone-v1.51.0 ls drive:dedupetest
       -1 dedupetest.docx
       -1 dedupetest.docx
       -1 subfolder/dedupetest.docx
       -1 subfolder/dedupetest.docx
$ rclone ls drive:dedupetest
       -1 dedupetest.docx
       -1 subfolder/dedupetest.docx
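
If anyone else wants to confirm, fetching the latest beta should be enough (a sketch, assuming the usual beta.rclone.org download layout):

# Fetch the current beta, which includes the listing fix, and re-check.
cd /tmp
curl -LO https://beta.rclone.org/rclone-beta-latest-linux-amd64.zip
unzip rclone-beta-latest-linux-amd64.zip
cd rclone-v*-linux-amd64
./rclone ls drive:dedupetest        # each file should now be listed once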

Fantastic. Keep up the good work. Ed.


Thanks for bearing with me - now I know what is happening and what fixed it, I can close some more issues!

