Rclone delete all duplicates by hash skipping duplicates

I have an rclone dedupe command which works perfectly fine. However, if I make an exact copy of that command and add the --by-hash flag to it, it no longer removes duplicates.

I have some duplicates in one folder. The filenames are all different, but comparing the hashes shows duplicate hashes under different filenames. I'd like to be able to run a dedupe command that removes duplicate hashes regardless of whether the filenames are the same or different.

When I try this it mentions that duplicate hashes exist but then just skips them all. When I run it interactively it gives me the option to pick which file I'd like to keep/delete, but I'm not really bothered which one survives; I'd like rclone to just remove all but one, as it does without --by-hash.

Can you post the command you ran and the output it gave please? rclone dedupe --by-hash should be working as you wish...

@ncw I'd forgotten I had --dry-run on my command when I was running it previously. The reason I forgot was that this wasn't mentioned in the log; it didn't say "x would have been deleted if --dry-run wasn't on", or words to that effect.

The information below, including the log file, is from runs with --dry-run on.

What is the problem you are having with rclone?

rclone dedupe --by-hash skips deleting duplicate hashes

What is your rclone version (output from rclone version)

rclone v1.55.0
- os/type: linux
- os/arch: amd64
- go/version: go1.16.2
- go/linking: static
- go/tags: cmount

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Distributor ID: Debian
Description:    Debian GNU/Linux 9.13 (stretch)
Release:        9.13
Codename:       stretch

Linux ml110-1 4.9.0-14-amd64 #1 SMP Debian 4.9.246-2 (2020-12-17) x86_64 GNU/Linux

Which cloud storage system are you using? (eg Google Drive)

Google Drive

The command you were trying to run (eg rclone copy /tmp remote:tmp)

/usr/bin/rclone dedupe skip GD:/ECS --dry-run --buffer-size 500M --by-hash --checkers 7 --check-first --checksum --drive-acknowledge-abuse --drive-chunk-size 8M --drive-pacer-min-sleep 100ms --fast-list --log-level DEBUG --low-level-retries 9999 --retries 9999 --retries-sleep 2s --stats 0 --tpslimit 7 --tpslimit-burst 7 --transfers 7 --use-mmap --user-agent 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

The rclone config contents with secrets removed.

[GD]
type = drive
client_id = ***REDACTED***.apps.googleusercontent.com
client_secret = ***REDACTED***
token = {"access_token":"***REDACTED***","token_type":"Bearer","refresh_token":"***REDACTED***","expiry":"2021-04-23T17:56:39.251992879+01:00"}
root_folder_id = ***REDACTED***

A log from the command with the -vv flag

https://paste.ee/p/WK5zf

It says

2021/04/23 17:44:40 NOTICE: d01a2644947f55d3bf621bcb98ce05a3: Skipping 2 files with duplicate MD5 hashes
2021/04/23 17:44:40 NOTICE: c88d7c58edf485b28a0bd688634429ef: Found 2 files with duplicate MD5 hashes

It doesn't get as far as the delete stage, as I think it would normally ask the user what to do at this point.

Hopefully it works without --dry-run?

You can always try it with -i, which will allow you to confirm each action.
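
For example, something along these lines should do it (using your remote path from above, with the extra tuning flags dropped for readability):

rclone dedupe --by-hash -i GD:/ECS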

@ncw Oh yes, it does work in interactive mode, but I wanted to run it with --dry-run first and have it tell me what would normally be deleted, like rclone copy or rclone sync do: with --dry-run those commands tell you what would be transferred/synced/deleted. Could the dedupe command tell us what will be deleted when using --dry-run please, similar to how the rclone copy and rclone sync commands do?

I wanted to run rclone dedupe skip --by-hash unattended, as I don't mind which file is deleted as long as one remains.

You've put dedupe skip in the command line - this will skip all the deletions as the files aren't identical (the names are different).

If you use dedupe newest with --dry-run it will do what you want I think.
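
So something roughly like this (same remote as in your command, other flags omitted for brevity):

rclone dedupe newest --by-hash --dry-run GD:/ECS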

@ncw But I have --by-hash; that should silently override the comparison, ignore the filenames and instead use the hashes of the files, so that skip works. The documentation isn't great on the use of --by-hash and makes no mention that skip doesn't work with that flag.

I will try newest, but it's an ugly hack; better to fix the problem, in my opinion.

The documentation could certainly do with improvement!

This section could do with a clarification that "removes identical files" means files with the same name and contents and isn't in effect when using --by-hash.

I don't think skip should be deleting files with --by-hash - the user needs to choose which file they want to keep. Remember they might be in completely different directories or have completely different names.

My thought is that it should: what you describe is exactly what I have. I have files with duplicate hashes under different filenames, and also in different directories. I'm happy for the duplicates to just be deleted unattended.

Also newest did what I wanted, just a shame skip isn't able to do that too!
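
For reference, what I ended up running was roughly my original command with skip swapped for newest, i.e. something like:

rclone dedupe newest --by-hash GD:/ECS

plus the tuning flags, and with --dry-run removed once I was happy with the output.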
