Improved messages on duplicate files during copy/sync/move

If a copy/sync/move encounters a duplicate file name, it currently results in successful completion with a notification like this:

NOTICE: somefile.txt: Duplicate object found in source - ignoring

This is a bit cryptic, might not be noticed, and doesn't flag a potential data loss.

I therefore propose we change this to an error with a text like this:

ERROR:  somefile.txt: Duplicate file name found in source. Content might differ. Use rclone dedupe to find and fix duplicate names

The proposal is based on the discussions in this post:

I have a local branch ready to make a PR, if everybody likes the proposal.

This seems like a good idea to me.

Should this also fail the sync? As in, cause rclone to return a non-zero error code?

I think that would be too backwards-incompatible, but I thought I'd mention it for discussion.

My reasoning for failing the sync:

2 duplicates (A1 / A2).

1. I move/sync one pass: A2 goes over and A2 is deleted from the source. Soft error that no one sees.
2. I move/sync the next pass: no duplicates remain, so A1 goes over and overwrites the A2 on the destination - the data from A2 is lost, replaced by A1.

You've lost data silently without knowing it.
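
A rough sketch of that sequence, assuming remote:dir sits on a backend that allows duplicate names (e.g. Google Drive) and holds two files both called A (the A1/A2 above); the paths are just placeholders:

> rclone move remote:dir /local/dir   # pass 1: A2 is transferred and deleted from the source - the duplicate only produces a NOTICE
> rclone move remote:dir /local/dir   # pass 2: no duplicate left, so A1 is transferred and overwrites the A2 already in /local/dir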

I'd also argue that duplicates shouldn't be a thing, as I can't imagine how or why people are creating them - but they clearly do exist.

If I was concerned about my data, I'd abort on duplicates till the source was sorted.

How about a new flag, based on --error-on-no-transfer:

--error-on-duplicates

By default, rclone will exit with return code 0 if duplicate files are found.

This option allows rclone to return exit code 10 if duplicate files are found.
This allows using rclone in scripts, and triggering follow-on actions if duplicates are found.

NB: Enabling this option turns a usually non-fatal error notice into a potentially fatal one.
Please check and adjust your scripts accordingly!
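
As an illustration, a script could then branch on the proposed exit code. Note that --error-on-duplicates and exit code 10 are only the proposal above, not an existing rclone flag, and source:path/dest:path are placeholders:

#!/bin/sh
# Sketch of the proposed behaviour only - the flag and exit code do not exist yet.
rclone sync source:path dest:path --error-on-duplicates
if [ $? -eq 10 ]; then
    echo "Duplicate names found in source:path - run rclone dedupe before the next sync" >&2
    exit 1
fi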

That's quite a strong argument for aborting the sync straight away if a duplicate is found.

I guess there are 3 options:

  1. Print ERROR message (currently a NOTICE) and carry on. At the end rclone will return a 0 exit code. This is what happens at the moment.
  2. Print ERROR, mark the sync as failed, and carry on. At the end rclone will return a non-zero exit code.
  3. Return a fatal error - this will stop the sync immediately, log an ERROR and return with a non-zero exit code.

I was thinking 2 but maybe it could be configurable with --error-on-duplicates off|atexit|fatal or something like that?

The default could be fatal and the error could instruct the user on how to fix it either with dedupe or the flag?
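
For reference, the dedupe part of that instruction already works today, e.g. (remote:path is a placeholder):

> rclone dedupe remote:path          # interactive: lists duplicate names and asks what to do with each
> rclone dedupe rename remote:path   # non-interactive: removes identical copies and renames the rest so the names become unique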

I like the fail-fast approach of option 3 and suggest a boolean flag called --ignore-duplicates which reverts to the current behavior.

However, I'm not sure I know enough about duplicates and the relevant backends. Are there situations where duplicates are to be expected, e.g. Google Photos?

Check out the Overview of cloud storage systems -> the 'duplicates' field :slight_smile:

That I know :grinning: It was more like the full depth of https://rclone.org/googlephotos/#duplicates :thinking:

Good idea and very clear.

Google Photos has code to rename the duplicates (which could potentially be used in drive).

The only situation where duplicates might be OK is when syncing from one Google Drive to another, say. I don't think this works very well at the moment though, as which file gets synced to which file isn't well defined (there is an issue about this, I'm sure).

So yes, let's go for fail fast with --ignore-duplicates to turn it off. @Ole would you like to make an issue about this?

I accidentally posted this in the wrong section, so I am reposting it here:

To me, dupes are 2 separate issues.

Scenario 1:

There are times when you do not want to overwrite or delete the source if there are dupes, so you either want a report or want to stop the process. In this instance I see it more as: if a duplicate is found, skip the file/folder and continue.

So in this example, say remote1:/FolderName and remote2:/FolderName have the same subfolder names. Rclone would not skip or delete them automatically; it would check the contents of each folder, and where dupes were found (meaning matching filename and size) it would not move the file or delete it from the source - it would only move files from the source that were not in the destination. If a source folder ended up empty, rclone could then delete it using the existing flag for that.

By not deleting the matching source files/folders, the source would be left containing only the files that matched the destination.

So when rclone completes, a person could then inspect the source to see what is left inside. The key here is that the source and destination have been synced, but the dupes have been left alone for further investigation.
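
Part of Scenario 1 can be approximated with existing flags, though --ignore-existing matches on name only (not name and size), so this is just a rough sketch:

> rclone move remote1:/FolderName remote2:/FolderName --ignore-existing --delete-empty-src-dirs
> rclone lsl remote1:/FolderName    # whatever is left in the source is what needs further investigation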

Scenario 2:
I've come across an issue where folders appear like this:
FolderName
FolderName(1)
FolderName(2)

The folders are basically supposed to have the same contents, but they actually do not - there can be files in each folder that are missing from the others. So one folder has 10 files, another 20, and another 50, but they are all supposed to be inside 1 folder, not 3. There can also be duplicates within each folder.

This is where check-dupes comes into play. Unlike Scenario 1, here you want to delete the source folders/files if everything is the same, leaving only the files in the source where the filenames and sizes do NOT match.

The intent of check-dupes is that the source and destination are supposed to be the same folder and file names; something just broke on Gdrive and allowed multiple folders to be created.

So maybe flags like this, just as an idea:

skip-dupes - when invoked, checks every folder/file and only moves the folders/files from the source whose names and sizes do not match the destination

check-dupes does the opposite - it says: if files match on name and size, delete the source copies and merge into FolderName

These 2 different flags make it possible for people to handle duplicates in the way they desire.
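
Spelled out as hypothetical command lines (neither flag exists, this is just the idea above):

> rclone move remote1:/FolderName remote2:/FolderName --skip-dupes      # proposed: leave files that match the destination in the source
> rclone move "remote:FolderName(1)" remote:FolderName --check-dupes    # proposed: delete matching source copies and merge the rest into FolderName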

One way dupes happen on Gdrive is using the WebUI to move files or folders: if someone moves a folder from one area/drive to another and the folder already exists, it still moves it with the same name instead of merging. For GDrive this is where the (1), (2), etc. come into play most of the time.

I think you should make a separate feature request for this, because my proposal in this thread only aims at making rclone warn/error and skip/stop when duplicates are encountered.

Your request is far more ambitious and will require major changes as explained here:
https://forum.rclone.org/t/moving-files-from-one-google-workspace-shared-drive-to-another-possible-bug/35585/29

@Ole do you want to make your proposal into an issue? I think it is a very good start.

Hi Nick,

I would very much like rclone's handling of duplicate names to be predictable, visible, and to error out whenever there is a risk of unexpected data loss. I thought this was already the general approach, except for the notifications during sync - that is why I made this feature proposal.

While testing my fix, however, I realized that there are many situations where duplicate names are being ignored without notice and without a common set of rules.

I therefore consider it misleading to introduce a flag like --ignore-duplicates, because it would give a wrong impression of rclone's behavior in all the situations where the flag isn't used, e.g. rclone cat remote:hello.

Taking all of this into account, I propose a revised change that is more in line with rclone's current behavior: an improved wording of the current NOTICE to something like this:

NOTICE: someDuplicateName: Duplicate name found in source - ignoring %type%. Use rclone dedupe to find and fix duplicate names

and a fix of the duplicate name comparison in copy/sync/move to detect when a file and a folder have the same name.

Here is a small example showing both in action:

> echo "Hello Duplicate File!"   > ./testfolder1/hello
> echo "Hello Duplicate Folder!" > ./testfolder2/hello/duplicates

> rclone copy ./testfolder1 remote: -v --stats=0
2023/02/01 12:02:38 INFO  : hello: Copied (new)

> rclone copy ./testfolder2 remote: -v --stats=0
2023/02/01 12:02:44 INFO  : hello/duplicates: Copied (new)

> rclone lsl remote:
    23 2023-02-01 12:02:29.539000000 hello
    25 2023-02-01 12:02:32.135000000 hello/duplicates

> rclone copy remote: ./testfolder3 -v --stats=0
2023/02/02 14:40:32 NOTICE: hello: Duplicate name found in source - ignoring file. Use rclone dedupe to find and fix duplicate names
2023/02/02 14:40:33 INFO  : hello/duplicates: Copied (new)

Very nice :slight_smile:

I wonder if this might break existing workflows, for instance copying from drive -> drive or s3 -> s3?

It will in this very specific situation, but by doing this we allow all other workflows to succeed as intended, without this hard-to-guess error:

> rclone copy remote: ./testfolder3 -v --stats=0 --retries=1
2023/02/02 16:03:45 INFO  : hello: Copied (new)
2023/02/02 16:03:45 ERROR : hello/duplicates: Failed to copy: mkdir ...\testfolder3\hello: The system cannot find the path specified.
2023/02/02 16:03:45 ERROR : Local file system at .../testfolder3: not deleting files as there were IO errors
2023/02/02 16:03:45 ERROR : Local file system at .../testfolder3: not deleting directories as there were IO errors
2023/02/02 16:03:45 ERROR : Attempt 1/1 failed with 1 errors and: mkdir ...\testfolder3\hello: The system cannot find the path specified.

I made the decision after seeing how the code to sort the directory listing explicitly ensured priority for directories in case of name clashes like this, so I think it is a bug. My proposed fix is to remove the comparison of DirEntryType in this line.

Seems reasonable! There is a line further down which needs fixing too. I'd probably put the types of both in the log message to make it very obvious that there is a duplicate file and dir.

Good idea, now looks like this:

> rclone copy remote: ./testfolder3 -v --stats=0
2023/02/03 09:31:12 NOTICE: hello: Duplicate names in source, found a directory and a file - ignoring the last. Use rclone dedupe to find and fix duplicate names
2023/02/03 09:31:13 INFO  : hello/duplicates: Copied (new)

Agree!

Changing it will, however, break some of the current tests, as it seems to reverse the main purpose of PR #3220, which solved an s3 -> gcp sync issue - an issue that probably would have been better solved by a flag to activate the (partial) duplicate support needed to do complete object -> object copies.

My immediate take is that (part of) #3220 should be reverted by this PR. If somebody still needs the object -> object copy functionality, then they will have to implement a solution that can also pass the tests written for backends without support for duplicates - which seems to have been forgotten. Right now it seems like the general functionality has been hijacked by specific s3/object functionality.

How do you think we should handle this?
