Improved messages on duplicate files during copy/sync/move

If a copy/sync/move encounters a duplicate file name, it currently results in successful completion with a notification like this:

NOTICE: somefile.txt: Duplicate object found in source - ignoring

This is a bit cryptic, might not be noticed, and doesn't flag a potential data loss.

I therefore propose we change this to an error with a text like this:

ERROR:  somefile.txt: Duplicate file name found in source. Content might differ. Use rclone dedupe to find and fix duplicate names
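
For reference, the fix the message points at is the existing rclone dedupe command; remote:path below is a placeholder:

```sh
# Interactively inspect and resolve duplicate names
rclone dedupe remote:path

# Or fix them non-interactively, e.g. by renaming the duplicates
rclone dedupe rename remote:path
```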

The proposal is based on the discussions in this post:

I have a local branch ready to make a PR, if everybody likes the proposal.


This seems like a good idea to me.

Should this also fail the sync? As in, cause rclone to return a non-zero exit code?

I think that would be too backwards incompatible but I thought I'd mention it for discussion.

My reasoning for failing the sync:

Two duplicates (A1 / A2).

I move/sync one pass: A2 goes over and A2 is deleted from the source. A soft error that no one sees.
I move/sync the next pass: no duplicates this time, so A1 overwrites A2, and the A2 on the destination is lost, replaced by A1.

You've lost data silently without knowing it.
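
A sketch of those two passes (source: and dest: are placeholders, A1/A2 are two source objects sharing one name):

```sh
# Pass 1: source: holds two objects with the same name (call them A1 and A2).
rclone move source: dest:   # A2 is transferred and deleted from source:,
                            # A1 is skipped with only a NOTICE.

# Pass 2: only A1 is left, so no duplicate is detected.
rclone move source: dest:   # A1 overwrites A2 on dest: - A2 is gone for good.
```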

I'd also argue that duplicates shouldn't be a thing at all, as I can't imagine how or why people are creating them, but they clearly do.

If I was concerned about my data, I'd abort on duplicates till the source was sorted.

How about a new flag, based on --error-on-no-transfer:

--error-on-duplicates

By default, rclone will exit with return code 0 if duplicate files are found.

With this option, rclone instead returns exit code 10 if duplicate files are found.
This allows using rclone in scripts and triggering follow-on actions when duplicates are detected.

NB: Enabling this option turns a usually non-fatal error notice into a potentially fatal one.
Please check and adjust your scripts accordingly!
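
A script sketch of such a follow-on action; note that --error-on-duplicates and exit code 10 are this proposal, not current rclone behaviour:

```sh
#!/bin/sh
# Sync, then react to the proposed duplicate-specific exit code.
rclone sync --error-on-duplicates source:path dest:path
if [ $? -eq 10 ]; then
    # Follow-on action: flag the source for manual inspection,
    # e.g. with: rclone dedupe source:path
    echo "duplicate names found in source:path - sync needs attention" >&2
    exit 1
fi
```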

That's quite a strong argument for aborting the sync straight away if a duplicate is found.

I guess there are 3 options

  1. Print ERROR message (currently a NOTICE) and carry on. At the end rclone will return a 0 exit code. This is what happens at the moment.
  2. Print ERROR, mark the sync as failed, and carry on. At the end rclone will return a non-zero exit code.
  3. Return a fatal error - this will stop the sync immediately, log an ERROR and return with a non-zero exit code.

I was thinking option 2, but maybe it could be configurable with --error-on-duplicates off|atexit|fatal or something like that?

The default could be fatal, and the error could instruct the user how to fix it, either with dedupe or with the flag?
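
Spelled out, the hypothetical modes would map onto the three options like this:

```sh
rclone sync source: dest: --error-on-duplicates off      # option 1: message only, exit 0 (today)
rclone sync source: dest: --error-on-duplicates atexit   # option 2: finish the sync, exit non-zero
rclone sync source: dest: --error-on-duplicates fatal    # option 3: abort immediately (proposed default)
```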

I like the fail-fast approach of option 3 and suggest a boolean flag called --ignore-duplicates which reverts to the current behavior.

However, I'm not sure I know enough about duplicates and the relevant backends. Are there situations where duplicates are to be expected, e.g. Google Photos?

Check out Overview of cloud storage systems -> the 'Duplicates' field :slight_smile:

That I know :grinning: It was more like the full depth of https://rclone.org/googlephotos/#duplicates :thinking:

Good idea and very clear.

Google Photos has code to rename the duplicates (which could potentially be used in drive).

The only situation where duplicates might be OK is if syncing from one Google Drive to another Google Drive, say. I don't think this works very well at the moment though, as which file gets synced to which file isn't well defined (there is an issue about this, I'm sure).

So yes, let's go for fail fast with --ignore-duplicates to turn it off. @Ole would you like to make an issue about this?

I accidentally posted this in the wrong section, so I'm reposting it here:

To me, dupes are two separate issues.

Scenario 1:

There are times when you do not want to overwrite or delete the source if there are dupes, so you either want a report or to stop the process. In this instance I see it more as: if a duplicate is found, skip the file/folder and continue.

So in this example, say remote1:/FolderName and remote2:/FolderName have the same subfolder names. Rclone would not skip or delete those folders automatically; it would check the contents of each folder. Where dupes were found (meaning the same filename and size), it would not move them or delete them from the source; it would only move files from the source that were not in the destination. If a source folder ended up empty, rclone could delete it using the existing flag for that.

By not deleting the matching source files/folders, the source would be left containing only the files that matched the destination.

So when rclone is done, a person could inspect the source to see what is left inside. The key here is that the source and destination have been synced, but the dupes have been left alone for further investigation.

Scenario 2:
I've come across an issue where folders appear like this:
FolderName
FolderName(1)
FolderName(2)

The folders are basically supposed to have the same contents, but they actually do not: there can be files in each folder that are missing from the others. So one folder has 10 files, another 20, and another 50, but they are all supposed to be inside 1 folder, not 3. Each folder can also contain its own duplicates.

This is where check-dupes comes into play. Unlike Scenario 1, in this scenario you want to delete the source folders/files if everything is the same, leaving files in the source only where the filenames and sizes do NOT match.

The intent of check-dupes is that the source and destination are supposed to have the same folder and file names; something just broke on Gdrive and allowed multiple folders to be created.

So maybe flags like this, just as an idea:

skip-dupes: when invoked, checks every folder/file and only moves folders/files from the source whose names and sizes do not match anything in the destination

check-dupes: does the opposite; if files match on name and size, it deletes them from the source and merges everything into FolderName

These 2 different flags make it possible for people to handle duplicates in the way they desire.
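
A usage sketch of the two proposed flags; neither --skip-dupes nor --check-dupes exists in rclone, and the paths just follow the examples above:

```sh
# Scenario 1: move only files whose name+size are missing from the destination;
# matching source files are left in place for manual inspection.
rclone move remote1:FolderName remote2:FolderName --skip-dupes

# Scenario 2: source and destination are meant to be the same folder;
# delete source files that match on name+size, keep only the mismatches.
rclone move "remote1:FolderName(1)" remote1:FolderName --check-dupes
```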

One way dupes happen on Gdrive is using the WebUI to move files or folders: if someone moves a folder from one area/drive to another and the folder already exists, it still moves it under the same name instead of merging. For GDrive this is where the (1) (2) etc. comes into play most of the time.