Rclone dedupe and --checksum

Does using --checksum flag with rclone dedupe do anything at all?

I tried it out of curiosity and it seemed not to, but I have no idea what actually goes on in the background.

Dedupe just looks for files with the exact same name in the same path. I don’t believe --checksum will do anything with dedupe.

Oh, the exact same name only, not the exact same size and name.

I misunderstood. It would have been handy in some obscure use cases to have it find duplicates with the same checksum - it could then find duplicates that aren’t necessarily in the same folder.

You can do that manually with lsjson and script it. Pull down a full listing of checksums and paths and sort on the checksum values to pull out the duplicates.
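Something along these lines might work, as a rough Go sketch - remote:path is a placeholder for your own remote, and the hash type names lsjson reports vary by backend:

```go
// Rough sketch: group a remote's files by checksum using rclone lsjson.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
	"sort"
)

// entry holds the fields we need from each lsjson record.
type entry struct {
	Path   string            `json:"Path"`
	IsDir  bool              `json:"IsDir"`
	Hashes map[string]string `json:"Hashes"`
}

func main() {
	// -R recurses, --hash asks rclone to include checksums in the listing.
	out, err := exec.Command("rclone", "lsjson", "-R", "--hash", "remote:path").Output()
	if err != nil {
		log.Fatal(err)
	}
	var entries []entry
	if err := json.Unmarshal(out, &entries); err != nil {
		log.Fatal(err)
	}
	// Group paths by hash value, picking the same hash type for every file
	// by sorting the type names (they vary by backend).
	byHash := make(map[string][]string)
	for _, e := range entries {
		if e.IsDir || len(e.Hashes) == 0 {
			continue
		}
		types := make([]string, 0, len(e.Hashes))
		for t := range e.Hashes {
			types = append(types, t)
		}
		sort.Strings(types)
		h := e.Hashes[types[0]]
		byHash[h] = append(byHash[h], e.Path)
	}
	// Any hash shared by more than one path is a duplicate.
	for h, paths := range byHash {
		if len(paths) > 1 {
			fmt.Println(h, paths)
		}
	}
}
```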

Oh I see, I will have a look just to satisfy my curiosity.

Sadly it wouldn’t be a one-liner, as it would be if the dedupe command could do it.

Just to add to everything that has been said: dedupe basically exists for Google Drive only, as it is the only service that lets you have files with the same names in the same directory.

From the docs:

By default dedupe interactively finds duplicate files and offers to delete all but one or rename them to be different. Only useful with Google Drive which can have duplicate file names.

This is this issue, which I agree would be nice to have. Fancy working on it?


@ncw Ok great, I do fancy working on that, yes, but I have a couple of questions, as I haven’t used or had the chance to look at Go yet.

Whereabouts do you suggest I start? I don’t mean with Go itself, as you answered that question in another topic - I mean working on this issue specifically.

Great!

The algorithm should be something like:

  • pick a supported hash for the remote - give up if none found
  • iterate through the entire file system
  • make a map from hash to slice of fs.Object
  • when the iteration is complete, look through the map to find the duplicates (shown by more than one fs.Object in the slice)

If you look at operations.Dedupe you’ll see it has a very similar structure to the above, except the map it builds is keyed on filename rather than hash.
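For a feel of that structure, here’s a minimal, self-contained Go sketch. The Object interface and memObject type are just stand-ins for rclone’s fs.Object (not the real API), and a real implementation would collect the objects by iterating the remote:

```go
package main

import "fmt"

// Object is a stand-in for rclone's fs.Object: Remote returns the path
// within the remote, Hash the checksum for the chosen hash type.
type Object interface {
	Remote() string
	Hash() (string, error)
}

// findDupesByHash builds a map from hash to the objects sharing it, then
// keeps only the hashes with more than one object - the duplicates. This
// mirrors operations.Dedupe, which keys the same kind of map on filename.
func findDupesByHash(objs []Object) map[string][]Object {
	byHash := make(map[string][]Object)
	for _, o := range objs {
		h, err := o.Hash()
		if err != nil || h == "" {
			continue // skip objects the remote can't hash
		}
		byHash[h] = append(byHash[h], o)
	}
	for h, group := range byHash {
		if len(group) < 2 {
			delete(byHash, h) // unique file, not a duplicate
		}
	}
	return byHash
}

// memObject is demo data standing in for objects listed from a remote.
type memObject struct{ path, sum string }

func (m memObject) Remote() string        { return m.path }
func (m memObject) Hash() (string, error) { return m.sum, nil }

func main() {
	objs := []Object{
		memObject{"a/file1.txt", "d41d8cd9"},
		memObject{"b/file2.txt", "d41d8cd9"}, // same content, different folder
		memObject{"c/file3.txt", "9e107d9d"},
	}
	for h, group := range findDupesByHash(objs) {
		for _, o := range group {
			fmt.Println(h, o.Remote())
		}
	}
}
```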

Let’s assume for a moment that it is implemented by a flag, say --by-hash. If this was set, dedupe would need to build its map keyed on hash rather than filename, as in the algorithm above.

Though some thought would be needed about the remote parameter passed to the sub actions - eg dedupeRename. This should probably be removed, and the sub actions should use Object.Remote() where appropriate.
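As a sketch only - dedupeRenameSketch is a hypothetical name, and the real dedupeRename in fs/operations/dedupe.go looks different - the shape of the change might be:

```go
package operations

import (
	"fmt"

	"github.com/rclone/rclone/fs"
)

// dedupeRenameSketch is hypothetical: rather than receiving one shared
// remote name, it derives each path from Object.Remote(), which still
// works when --by-hash groups objects whose names differ.
func dedupeRenameSketch(f fs.Fs, objs []fs.Object) {
	for i, o := range objs {
		newName := fmt.Sprintf("%s-%d", o.Remote(), i+1)
		fs.Logf(o, "would rename to %q on %v", newName, f)
		// ... do the actual server-side move here
	}
}
```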

Also, the ‘default’ for dedupe is to remove duplicates non-interactively if they really are duplicates. The default for this should probably be less impactful, since it is potentially more destructive - the files are not really ‘duplicates’ in that sense of the word. Thoughts?

The default is interactive. However, we should remove the part where it removes identical ones - that is what I was getting at above.

Ah yes. That’s the aspect I was thinking of.

Sure, I agree with everything you’ve mentioned. I’ll have a go, though I may make slow progress, as I need to get the hang of Go and learn more about Rclone too!

Look forward to seeing the result! Let me know if you want more help.