Rclone dedupe stopped working recently

What is the problem you are having with rclone?

rclone not removing duplicate files via dedupe command.

What is your rclone version (output from rclone version)

rclone v1.52.1
- os/arch: linux/amd64
- go version: go1.14.4

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Debian Buster 64 bit

Which cloud storage system are you using? (eg Google Drive)

Google Drive

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone dedupe skip GD2_API1:'/01 - Portable Apps' --dry-run --log-level "DEBUG"

The rclone config contents with secrets removed.

[GD2_API1]
type = drive
client_id = ***REDACTED***.apps.googleusercontent.com
client_secret = ***REDACTED***
token = {"access_token":"***REDACTED***","token_type":"Bearer","refresh_token":"***REDACTED***","expiry":"2020-06-14T01:01:04.878186594+01:00"}
root_folder_id = ***REDACTED***

A log from the command with the -vv flag

2020/06/14 00:19:17 DEBUG : rclone: Version "v1.52.1" starting with parameters ["/usr/bin/rclone" "dedupe" "skip" "GD2_API1:/01 - Portable Apps" "--dry-run" "--log-level" "DEBUG"]
2020/06/14 00:19:17 DEBUG : Using config file from "/root/.rclone.conf"
2020/06/14 00:19:21 DEBUG : fs cache: renaming cache item "GD2_API1:/01 - Portable Apps" to be canonical "GD2_API1:01 - Portable Apps"
2020/06/14 00:19:21 INFO  : Google drive root '01 - Portable Apps': Looking for duplicates using skip mode.
2020/06/14 00:19:45 NOTICE: TeraCopy230/Options.ini: Found 2 duplicates - deleting identical copies
2020/06/14 00:19:45 NOTICE: TeraCopy230/Whatsnew.txt: Found 2 duplicates - deleting identical copies
2020/06/14 00:19:45 NOTICE: TeraCopy230/Transfer.log: Found 2 duplicates - deleting identical copies
2020/06/14 00:19:45 DEBUG : 18 go routines active

The dedupe command has been working fine for me for months. I have a cron job that runs periodically to clean up any duplicates, and another cron job that runs every day at 4 am to install the latest stable version of rclone.

At some point (I don't know exactly when) the dedupe command stopped working. To rule things out, I downloaded the Options.ini file above, created a new directory, uploaded that file 4 times using the web UI, and ran the command again on this new folder. As you can see, this bizarrely worked fine, so I am at a loss as to what the issue is.

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone dedupe skip GD2_API1:'/Rclone Debug/dedupe' --dry-run --log-level "DEBUG"

A log from the command with the -vv flag

2020/06/14 00:27:01 DEBUG : rclone: Version "v1.52.1" starting with parameters ["/usr/bin/rclone" "dedupe" "skip" "GD2_API1:/Rclone Debug/dedupe" "--dry-run" "--log-level" "DEBUG"]
2020/06/14 00:27:01 DEBUG : Using config file from "/root/.rclone.conf"
2020/06/14 00:27:08 DEBUG : fs cache: renaming cache item "GD2_API1:/Rclone Debug/dedupe" to be canonical "GD2_API1:Rclone Debug/dedupe"
2020/06/14 00:27:08 INFO  : Google drive root 'Rclone Debug/dedupe': Looking for duplicates using skip mode.
2020/06/14 00:27:09 NOTICE: Options.ini: Found 4 duplicates - deleting identical copies
2020/06/14 00:27:09 NOTICE: Options.ini: Deleting 3/4 identical duplicates (MD5 "53143360267806d6478a3a76ba4afc00")
2020/06/14 00:27:09 NOTICE: Options.ini: Not deleting as --dry-run
2020/06/14 00:27:09 NOTICE: Options.ini: Not deleting as --dry-run
2020/06/14 00:27:09 NOTICE: Options.ini: Not deleting as --dry-run
2020/06/14 00:27:09 NOTICE: Options.ini: All duplicates removed
2020/06/14 00:27:09 DEBUG : 4 go routines active

Ah wait, I figured it out for myself already! I dropped the skip part of the command and ran it interactively; the output is shown below. The files reported when skip is in the command aren't actually duplicates: the filenames are the same, but the MD5 hashes are different. Despite this, it still reports the files as duplicates, which is what threw me.
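For anyone else confused by this, here is a rough sketch (in Python, just my understanding of the rule, not rclone's actual code) of how skip mode decides what to delete: files are grouped by name, but only copies whose hash matches the kept file are removed; same-name files with different hashes are left alone.

```python
from collections import defaultdict

def dedupe_skip(files):
    """Sketch of dedupe 'skip' mode. `files` is a list of (name, md5)
    tuples. Within each same-name group, only copies whose MD5 matches
    the first (kept) copy are deleted; differing hashes are skipped."""
    by_name = defaultdict(list)
    for name, md5 in files:
        by_name[name].append(md5)

    deleted = []
    for name, hashes in by_name.items():
        if len(hashes) < 2:
            continue  # no same-name duplicates
        keep = hashes[0]
        for md5 in hashes[1:]:
            if md5 == keep:
                deleted.append((name, md5))  # identical copy -> delete
            # different hash -> skipped, nothing deleted

    return deleted

# Same name, different hashes: nothing is deleted, matching the log above.
print(dedupe_skip([("Options.ini", "17bb..."), ("Options.ini", "5314...")]))
# -> []
```

So the NOTICE line "Found 2 duplicates" only means two files share a name; whether anything is deleted depends on the hashes.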

A log from the command with the -vv flag

2020/06/14 00:33:12 DEBUG : rclone: Version "v1.52.1" starting with parameters ["/usr/bin/rclone" "dedupe" "GD2_API1:/01 - Portable Apps" "--dry-run" "--log-level" "DEBUG"]
2020/06/14 00:33:12 DEBUG : Using config file from "/root/.rclone.conf"
2020/06/14 00:33:24 DEBUG : fs cache: renaming cache item "GD2_API1:/01 - Portable Apps" to be canonical "GD2_API1:01 - Portable Apps"
2020/06/14 00:33:24 INFO  : Google drive root '01 - Portable Apps': Looking for duplicates using interactive mode.
2020/06/14 00:33:37 NOTICE: TeraCopy230/Transfer.log: Found 2 duplicates - deleting identical copies
TeraCopy230/Transfer.log: 2 duplicates remain
  1:         2706 bytes, 2016-12-04 16:40:32.889000000, MD5 98dd6e47ffbd19488da2ae2434f64fe9
  2:         5080 bytes, 2014-10-05 12:36:02.000000000, MD5 3fb3583b002da4bc1bfc622443e70f78
s) Skip and do nothing
k) Keep just one (choose which in next step)
r) Rename all to be different (by changing file.jpg to file-1.jpg)
s/k/r> k
Enter the number of the file to keep> 1
2020/06/14 00:34:25 NOTICE: TeraCopy230/Transfer.log: Not deleting as --dry-run
2020/06/14 00:34:25 NOTICE: TeraCopy230/Transfer.log: Deleted 1 extra copies
2020/06/14 00:34:25 NOTICE: TeraCopy230/Options.ini: Found 2 duplicates - deleting identical copies
TeraCopy230/Options.ini: 2 duplicates remain
  1:          487 bytes, 2016-12-04 16:40:32.758000000, MD5 17bba9c5e9affa1d9519bc712ac048d6
  2:          487 bytes, 2014-10-05 12:36:47.000000000, MD5 53143360267806d6478a3a76ba4afc00
s) Skip and do nothing
k) Keep just one (choose which in next step)
r) Rename all to be different (by changing file.jpg to file-1.jpg)
s/k/r>

If I add skip back in, along with --checksum, it still reports duplicates but deletes nothing.

A log from the command with the -vv flag

2020/06/14 00:38:51 DEBUG : rclone: Version "v1.52.1" starting with parameters ["/usr/bin/rclone" "dedupe" "skip" "GD2_API1:/01 - Portable Apps" "--dry-run" "--checksum" "--log-level" "DEBUG"]
2020/06/14 00:38:51 DEBUG : Using config file from "/root/.rclone.conf"
2020/06/14 00:38:56 DEBUG : fs cache: renaming cache item "GD2_API1:/01 - Portable Apps" to be canonical "GD2_API1:01 - Portable Apps"
2020/06/14 00:38:56 INFO  : Google drive root '01 - Portable Apps': Looking for duplicates using skip mode.
2020/06/14 00:39:11 NOTICE: TeraCopy230/Whatsnew.txt: Found 2 duplicates - deleting identical copies
2020/06/14 00:39:11 NOTICE: TeraCopy230/Options.ini: Found 2 duplicates - deleting identical copies
2020/06/14 00:39:11 NOTICE: TeraCopy230/Transfer.log: Found 2 duplicates - deleting identical copies
2020/06/14 00:39:11 DEBUG : 18 go routines active

This is a known issue with skip:

Ah OK, this must be a new bug then, as I have only just noticed it!

So, dedupe is broken only when used with Crypt?

I just replied to the issue. I don't think dedupe is broken, I think the logging is!

Plus the fact that identical crypt files will have different MD5 sums, as the encryption chooses a random nonce each time.
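To illustrate the nonce effect (a toy example only; this is NOT rclone's actual crypt format and not real crypto): any scheme that mixes a random nonce into the ciphertext will produce different MD5 sums for identical plaintexts.

```python
import hashlib
import os

def toy_encrypt(plaintext: bytes) -> bytes:
    """Toy stand-in for a nonce-based cipher: prepend a random nonce
    and XOR the data with a keystream derived from it. Illustration
    only -- not rclone crypt, not secure."""
    nonce = os.urandom(16)
    stream = hashlib.sha256(nonce).digest()
    body = bytes(b ^ stream[i % len(stream)] for i, b in enumerate(plaintext))
    return nonce + body

data = b"identical file contents"
md5_a = hashlib.md5(toy_encrypt(data)).hexdigest()
md5_b = hashlib.md5(toy_encrypt(data)).hexdigest()
print(md5_a == md5_b)  # False: same plaintext, different ciphertext MD5s
```

Which is why skip mode, comparing MD5 sums, never sees two crypt uploads of the same file as identical.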

It's probably worth making the documentation a bit clearer, as you really want to always run dedupe on a non-encrypted remote. If you have a true duplicate (same MD5 sum), skip mode would clean it up, as that is how skip works.

If you use newest, it does work on the crypt, but to be honest I'm still unsure how anyone ever gets duplicates, as I never have. The only way I can create them is by renaming something in the Drive web app.

I'm not using Crypt in the above example.

The same applies. For skip, identical file names do not mean identical files; the MD5 sums must match as well.

felix@gemini:~$ rclone md5sum GD:test
157c632175c4251012e651322736f0ae  dupe.mkv
9d462dd04bf1f85ed21f8d9a51c709eb  dupe.mkv

felix@gemini:~$ rclone dedupe GD:test -vv --dedupe-mode skip
2020/06/14 08:04:47 DEBUG : rclone: Version "v1.52.1" starting with parameters ["rclone" "dedupe" "GD:test" "-vv" "--dedupe-mode" "skip"]
2020/06/14 08:04:47 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2020/06/14 08:04:47 INFO  : Google drive root 'test': Looking for duplicates using skip mode.
2020/06/14 08:04:48 NOTICE: dupe.mkv: Found 2 duplicates - deleting identical copies
2020/06/14 08:04:48 DEBUG : 4 go routines active
felix@gemini:~$

So it would never delete them.

But something like newest would:

felix@gemini:~$ rclone dedupe GD:test -vv --dedupe-mode newest
2020/06/14 08:05:56 DEBUG : rclone: Version "v1.52.1" starting with parameters ["rclone" "dedupe" "GD:test" "-vv" "--dedupe-mode" "newest"]
2020/06/14 08:05:56 DEBUG : Using config file from "/opt/rclone/rclone.conf"
2020/06/14 08:05:58 INFO  : Google drive root 'test': Looking for duplicates using newest mode.
2020/06/14 08:06:00 NOTICE: dupe.mkv: Found 2 duplicates - deleting identical copies
2020/06/14 08:06:01 INFO  : dupe.mkv: Deleted
2020/06/14 08:06:01 NOTICE: dupe.mkv: Deleted 1 extra copies
2020/06/14 08:06:01 DEBUG : 4 go routines active
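The newest rule can be sketched like this (again just an illustration of the behaviour, not rclone's code): keep the copy with the latest modification time and delete the rest, regardless of hash.

```python
def dedupe_newest(copies):
    """Sketch of dedupe 'newest' mode. `copies` is a list of
    (md5, mtime) tuples for files sharing one name. The copy with
    the latest mtime is kept; everything else is deleted, and the
    hashes are never compared."""
    keep = max(copies, key=lambda c: c[1])
    return [c for c in copies if c is not keep]

# Two same-name copies with different hashes: the older one is deleted,
# matching the "Deleted 1 extra copies" line in the log above.
doomed = dedupe_newest([("157c...", "2020-06-01"), ("9d46...", "2020-06-10")])
print(doomed)  # [('157c...', '2020-06-01')]
```

That's why newest works on a crypt remote where skip does nothing: it never looks at the MD5 sums at all.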

Oh, I know that already; I wasn't suggesting it should delete duplicates based on the filename.

I was responding to Harry as he said "So, dedupe is broken only when using with Crypt?"

I wasn't responding to you or anyone else, sorry for the confusion.

That is a good idea, as it will clear any identical files, though I'm not sure that you get them with crypt. I think dupes are caused by an intermittent bug in Drive which doesn't show recently uploaded files in the listing, causing them to be uploaded again.

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.