Rclone delete all duplicates by hash skipping duplicates

I have an rclone dedupe command which works perfectly fine. However, if I make an exact copy of that command and add the --by-hash flag to it, it no longer removes duplicates.

I have some duplicates in one folder. The filenames are all different, but comparing the hashes shows duplicate hashes under different filenames. I'd like to be able to run a dedupe command that removes duplicate hashes regardless of whether the filenames are the same or different.

When I try this it mentions that duplicate hashes exist but then just skips them all. When I run it interactively it gives me the option to pick which file I'd like to keep/delete, but I'm not really bothered which one survives; I'd like rclone to just remove all but one, as it does without --by-hash.

Can you post the command you ran and the output it gave please? rclone dedupe --by-hash should be working as you wish...

@ncw I'd forgotten I had --dry-run on my command when I was running it previously. The reason I forgot was that this wasn't mentioned in the log; it didn't say "x would have been deleted if --dry-run wasn't on", or words to that effect.

The information below, including the log file, is from runs with --dry-run on.

What is the problem you are having with rclone?

rclone dedupe --by-hash skips deleting duplicate hashes

What is your rclone version (output from rclone version)

rclone v1.55.0
- os/type: linux
- os/arch: amd64
- go/version: go1.16.2
- go/linking: static
- go/tags: cmount

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Distributor ID: Debian
Description:    Debian GNU/Linux 9.13 (stretch)
Release:        9.13
Codename:       stretch

Linux ml110-1 4.9.0-14-amd64 #1 SMP Debian 4.9.246-2 (2020-12-17) x86_64 GNU/Linux

Which cloud storage system are you using? (eg Google Drive)

Google Drive

The command you were trying to run (eg rclone copy /tmp remote:tmp)

/usr/bin/rclone dedupe skip GD:/ECS --dry-run --buffer-size 500M --by-hash --checkers 7 --check-first --checksum --drive-acknowledge-abuse --drive-chunk-size 8M --drive-pacer-min-sleep 100ms --fast-list --log-level DEBUG --low-level-retries 9999 --retries 9999 --retries-sleep 2s --stats 0 --tpslimit 7 --tpslimit-burst 7 --transfers 7 --use-mmap --user-agent 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

The rclone config contents with secrets removed.

[GD]
type = drive
client_id = ***REDACTED***.apps.googleusercontent.com
client_secret = ***REDACTED***
token = {"access_token":"***REDACTED***","token_type":"Bearer","refresh_token":"***REDACTED***","expiry":"2021-04-23T17:56:39.251992879+01:00"}
root_folder_id = ***REDACTED***

A log from the command with the -vv flag

https://paste.ee/p/WK5zf

It says

2021/04/23 17:44:40 NOTICE: d01a2644947f55d3bf621bcb98ce05a3: Skipping 2 files with duplicate MD5 hashes
2021/04/23 17:44:40 NOTICE: c88d7c58edf485b28a0bd688634429ef: Found 2 files with duplicate MD5 hashes

It doesn't get as far as the delete stage, as I think it would normally ask the user what to do at this point.

Hopefully it works without --dry-run?

You can always try it with -i, which will allow you to confirm each action.
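
For example, something along these lines should do it (using your remote path from above, with the extra tuning flags dropped for readability):

rclone dedupe --by-hash -i GD:/ECS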

@ncw Oh yes, it does work in interactive mode, but I wanted to run it with --dry-run first and have it tell me what would normally be deleted, like rclone copy or rclone sync do: with --dry-run those commands tell you what would be transferred/synced/deleted. Could the dedupe command tell us what will be deleted when using --dry-run please, similar to how the rclone copy and rclone sync commands do?

I wanted to run rclone dedupe skip --by-hash unattended, as I don't mind which file is deleted as long as one remains.

You've put dedupe skip in the command line - this will skip all the deletions as the files aren't identical (the names are different).

If you use dedupe newest with --dry-run it will do what you want I think.
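
So something roughly like this (same remote as in your command, other flags omitted for brevity):

rclone dedupe newest --by-hash --dry-run GD:/ECS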

@ncw But I have --by-hash; that should silently override the comparison, ignore the filenames and instead use the hashes of the files, so that skip works. The documentation isn't great on the use of --by-hash and makes no mention that skip doesn't work with that flag.

I will try newest, but it's an ugly hack; better to fix the problem, in my opinion.

The documentation could certainly do with improvement!

This section could do with a clarification that "removes identical files" means files with the same name and contents and isn't in effect when using --by-hash.

I don't think skip should be deleting files with --by-hash - the user needs to choose which file they want to keep. Remember they might be in completely different directories or have completely different names.

My thought is that it should: what you describe is exactly what I have. I have files with duplicate hashes under different filenames, and also in different directories. I'm happy for the duplicates to just be deleted unattended.

Also newest did what I wanted, just a shame skip isn't able to do that too!
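
For reference, what I ended up running was roughly my original command with skip swapped for newest, i.e. something like:

rclone dedupe newest --by-hash GD:/ECS

plus the tuning flags, and with --dry-run removed once I was happy with the output.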
