Unique Unicode Characters lead to Duplicate object found

What is the problem you are having with rclone?

When syncing a local directory that contains two nearly identically named files to Backblaze B2, I receive a "Duplicate object found in source - ignoring" notice.

I've read many forum posts and GitHub issues about Unicode characters, but none that explain this specific phenomenon. It is possible that the issue is related to Backblaze, but I wondered if there was an obvious reason that rclone is failing to treat these files as unique when syncing.

I've confirmed that my locale is LANG=en_US.UTF-8

The offending characters are é and é which evaluate to 0x65+0x301 and 0xE9 respectively.
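For anyone following along, here is a minimal Python sketch (not rclone code; the variable and function names are only for illustration) showing that the two spellings are different code point sequences that become equal once they are NFC-normalized:

```python
import unicodedata

# The two spellings of "é" from the filenames above.
decomposed = "e\u0301"   # U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE


def code_points(s):
    """Return the code points of s as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in s]


print(code_points(decomposed))    # ['U+0065', 'U+0301']
print(code_points(precomposed))   # ['U+00E9']

# As raw strings the two are different...
print(decomposed == precomposed)  # False

# ...but they compare equal after normalizing to the composed form (NFC).
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

So whether the two filenames count as "the same" depends entirely on whether the comparison normalizes first.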

My hypothesis is that for some reason rclone is viewing these characters as identical, which prevents it from seeing the filenames as unique.

It could also be my misunderstanding of Unicode/language, in that these characters should indeed be treated as identical, and that it is a failure of our organizational methods to eliminate the seemingly duplicate name. I could imagine that the difference between the two forms of é would be similar to a difference in character case, and should technically be ignored or clobbered in the default case.

What is your rclone version & operating system?

rclone v1.51.0
- os/arch: linux/amd64
- go version: go1.13.7

Which cloud storage system are you using?

Backblaze B2

The command you were trying to run

rclone sync /my_server/archive/_protected/_webtools/_archive/recruiting.old/resumes backblaze:my-bucket-name/my_server/archive/_protected/_webtools/_archive/recruiting.old/resumes --transfers 16 --links --exclude "._*" --exclude ".DS_Store" --progress --local-no-unicode-normalization

I tried this with and without --local-no-unicode-normalization

A log from the command with the -vv flag

2020/05/12 02:52:38 ERROR : The --local-no-unicode-normalization flag is deprecated and will be removed
2020-05-12 02:52:40 NOTICE: 2009.06.18.22.48-DREDACTEDr_Résumé.doc: Duplicate object found in source - ignoring
2020-05-12 02:52:40 NOTICE: 2009.06.30.21.49-SREDACTEDé PREDACTEDe.doc: Duplicate object found in source - ignoring
2020-05-12 02:52:40 INFO  : B2 bucket my-bucket-name path my_server/archive/_protected/_webtools/_archive/recruiting.old/resumes: Waiting for checks to finish
2020-05-12 02:52:40 DEBUG : 2009.01.17.04.25-BREDACTEDo_2008_resume.pdf: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.01.17.04.25-BREDACTEDo_2008_resume.pdf: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.06.18.22.48-DREDACTEDr_Résumé.doc: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.06.18.22.48-DREDACTEDr_Résumé.doc: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.06.30.21.49-SREDACTEDé PREDACTEDe.doc: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.06.30.21.49-SREDACTEDé PREDACTEDe.doc: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.07.16.10.54-ZREDACTEDe.pdf: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.07.16.10.54-ZREDACTEDe.pdf: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.07.20.17.37-JREDACTEDs-Resume.jpg: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.07.20.17.37-JREDACTEDs-Resume.jpg: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.07.22.16.12-AREDACTEDrt resume09.doc: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.07.22.16.12-AREDACTEDrt resume09.doc: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.07.23.22.02-PREDACTEDiResume.pdf: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.07.23.22.02-PREDACTEDiResume.pdf: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.11.07.22.28-RREDACTEDs \'09-B.ai: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.11.07.22.28-RREDACTEDs \'09-B.ai: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.11.07.22.28-RREDACTEDs '09-B.ai: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.11.07.22.28-RREDACTEDs '09-B.ai: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.11.11.14.49-SREDACTEDe\'s Resume.doc: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.11.11.14.49-SREDACTEDe\'s Resume.doc: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.11.11.14.49-SREDACTEDe's Resume.doc: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.11.11.14.49-SREDACTEDe's Resume.doc: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.12.09.14.55-CREDACTEDa_resume_2009_6.pdf: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.12.09.14.55-CREDACTEDa_resume_2009_6.pdf: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.12.11.13.53-JREDACTEDs_Resume.pdf: Size and modification time the same (differ by 0s, within tolerance 1ms)
2020-05-12 02:52:40 DEBUG : 2009.12.11.13.53-JREDACTEDs_Resume.pdf: Unchanged skipping
2020-05-12 02:52:40 DEBUG : 2009.11.14.23.27-AREDACTEDa(Resume).rtf: Unchanged skipping
2020-05-12 02:52:40 INFO  : B2 bucket my-bucket-name path my_server/archive/_protected/_webtools/_archive/recruiting.old/resumes: Waiting for transfers to finish
2020-05-12 02:52:40 INFO  : Waiting for deletions to finish
2020/05/12 02:52:40 DEBUG : 7 go routines active

I'm pretty sure rclone does some normalization of Unicode, as there are a number of fixes/issues related to it.

Can you just run a rclone ls on the source where the issue is rather than trying to copy it? What does the local listing look like when comparing?

They evaluate to separate names locally:

rclone ls /my_server/archive/_protected/_webtools/_archive/recruiting.old/resumes/ | grep 2009.06.18
    52736 2009.06.18.22.48-DREDACTEDr_Résumé.doc
    52736 2009.06.18.22.48-DREDACTEDr_Résumé.doc
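Since the two names look identical in a terminal, it can help to find such pairs programmatically. The following is a rough Python sketch, assuming NFC is the normalization form being applied; `normalization_collisions` is a hypothetical helper name and the sample names are shortened stand-ins for the real files:

```python
import unicodedata
from collections import defaultdict


def normalization_collisions(names):
    """Group names that differ as raw strings but are equal after NFC normalization.

    Returns a dict mapping each NFC-normalized name to the sorted list of
    distinct raw names that normalize to it (only groups with 2+ names).
    """
    groups = defaultdict(set)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].add(name)
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


# Shortened stand-ins for the two files in the listing above.
names = ["Re\u0301sume\u0301.doc", "R\u00e9sum\u00e9.doc", "resume.pdf"]

for key, raw_names in normalization_collisions(names).items():
    # Print escaped forms so the difference is visible in any terminal.
    print(key, [n.encode("unicode_escape").decode("ascii") for n in raw_names])
```

To check a real directory, the `names` list could be replaced with the result of `os.listdir(path)`.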

So that sounds like B2 is doing something with the translation.

You could try with --backend-encoding None and see how that works, or play around more with the encoding and see if something else works.

I'm not much of a Unicode person: my process is to use none, and I try to avoid spaces too.

https://rclone.org/overview/#encoding

I configured my backend with encoding = None. That didn't affect the clobber.

Then I tested using the standard Backblaze B2 CLI, and saw that the files are uploaded as-is, so I think that confirms the bug is with rclone.

Each file uploaded unchanged, and the JSON output echoes the Unicode characters I referenced above:

b2 upload_file my-test-bucket /my_server/archive/_protected/_webtools/_archive/recruiting.old/resumes/2009.06.18.22.48-DREDACTEDr_Résumé.doc my-test-folder/2009.06.18.22.48-DREDACTEDr_Résumé.doc
b2 upload_file my-test-bucket /my_server/archive/_protected/_webtools/_archive/recruiting.old/resumes/2009.06.18.22.48-DREDACTEDr_Résumé.doc my-test-folder/2009.06.18.22.48-DREDACTEDr_Résumé.doc

Retrieving the contents of the directory using the Backblaze API confirms this as well:

b2 list-file-names my-test-bucket my-test-folder
{
 "files": [
   {
     "accountId": "*****", "action": "upload", "bucketId": "*****", "contentLength": 52736,
     "contentMd5": "b6ae2eae2e933004147bcf882803e3ed", "contentSha1": "c51e7f7a65516290a7e6e6eba3ddfe39b46502be",
     "contentType": "application/msword",
     "fileId": "4_z*****_f109f56e16d8e404e_d20200512_m223001_c000_v0001069_t0041",
     "fileInfo": { "src_last_modified_millis": "1271790530000" },
     "fileName": "my-test-folder/2009.06.18.22.48-DREDACTEDr_Re\u0301sume\u0301.doc",
     "uploadTimestamp": 1589322601000
   },
   {
     "accountId": "*****", "action": "upload", "bucketId": "*****", "contentLength": 52736,
     "contentMd5": "b6ae2eae2e933004147bcf882803e3ed", "contentSha1": "c51e7f7a65516290a7e6e6eba3ddfe39b46502be",
     "contentType": "application/msword",
     "fileId": "4_z*****_f1031545d9b011da0_d20200512_m224224_c000_v0001066_t0012",
     "fileInfo": { "src_last_modified_millis": "1271822930000" },
     "fileName": "my-test-folder/2009.06.18.22.48-DREDACTEDr_R\u00e9sum\u00e9.doc",
     "uploadTimestamp": 1589323344000
   }
 ],
 "nextFileName": null
}

That is correct. When doing a sync, rclone does a Unicode normalization on the strings.

This isn't part of the backend encoding; it is part of the sync routine.

These are almost always generated by macOS which stores the decomposed form as its filenames.

Some cloud storage systems will recompose the unicode which is why rclone has that normalization step in the sync routines.

This normally saves people on macOS from having duplicate files like the ones you have!

To avoid this we'd need another flag to skip the normalization in the sync routines.
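To make the idea concrete, here is a toy Python model of the behaviour described above. It is not rclone's actual code: `dedupe_source` and its `normalize` parameter are hypothetical names, and the choice of NFC is an assumption. It only illustrates how comparing normalized names makes one of the two files look like a duplicate, and how skipping the normalization keeps both:

```python
import unicodedata


def dedupe_source(names, normalize=True):
    """Toy model of a duplicate check over a source listing.

    With normalize=True, names are compared by their NFC-normalized form;
    with normalize=False, they are compared as raw strings.
    Returns (kept, ignored), keeping the first name seen for each key.
    """
    seen = set()
    kept, ignored = [], []
    for name in sorted(names):
        key = unicodedata.normalize("NFC", name) if normalize else name
        if key in seen:
            ignored.append(name)  # would be reported as a duplicate and skipped
        else:
            seen.add(key)
            kept.append(name)
    return kept, ignored


# Shortened stand-ins for the decomposed and precomposed filenames.
names = ["Re\u0301sume\u0301.doc", "R\u00e9sum\u00e9.doc"]

kept, ignored = dedupe_source(names, normalize=True)
print(len(kept), len(ignored))   # 1 1 -> one file is treated as a duplicate

kept, ignored = dedupe_source(names, normalize=False)
print(len(kept), len(ignored))   # 2 0 -> both files are kept
```

The trade-off is the one mentioned above: normalizing protects macOS users from accidental duplicates, while not normalizing preserves names that are genuinely distinct at the byte level.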

This would be straightforward to implement if you wanted to help.

Aha! We have our explanation.

I'd love to help implement this. I'll move this into a GitHub issue and start digging.

Great! Please make a new issue on GitHub and I'll give you some hints if you need them! (The code is in fs/walk/walk.go, I think.)

The issue was fixed by adding --no-unicode-normalization in the following pull request. This is currently only available in the beta.
