Enable --no-unicode-normalization by default for the stickyimport and import backend commands

lxy · November 15, 2024, 5:06am

What is the problem you are having with rclone?

When using the stickyimport and import backend commands of the hasher backend, the default Unicode normalization breaks filename matching. Imported filenames that contain certain characters are stored in normalized form, so they no longer match the actual filenames, and the hashes for those files cannot be checked. This breaks any workflow that relies on exact filename matching.

Run the command 'rclone version' and share the full output of the command.

rclone v1.68.1

os/version: slackware 15.0+ (64 bit)
os/kernel: 6.1.106-Unraid (x86_64)
os/type: linux
os/arch: amd64
go/version: go1.23.1
go/linking: static
go/tags: none

Which cloud storage system are you using? (eg Google Drive)

WebDAV

The command you were trying to run (eg `rclone copy /tmp remote:tmp`)

rclone backend stickyimport Hasher1: md5 hasher1.md5
rclone check Hasher1: Hasher2: --checksum
rclone --no-unicode-normalization backend stickyimport Hasher1: md5 hasher1.md5

The rclone config contents with secrets removed.

[Hasher1]
type = hasher
remote = AList:/
hashes = md5
max_age = off

[Hasher2]
type = hasher
remote = /mnt/disk1/
hashes = md5
max_age = off

A log from the command with the `-vv` flag

NOTICE: hasher::Hasher1:: 4 hashes could not be checked

asdffdsa · November 15, 2024, 3:49pm

welcome to the forum,

is there a bug or are you asking for a new feature/change of default behavior?

lxy · November 16, 2024, 4:24pm

Thank you for the warm welcome!
This issue occurs because some Japanese filenames are not in normalized Unicode form. During import, they are saved to the backend in normalized form. A non-normalized filename is a different string from its normalized counterpart, so the MD5 comparison cannot locate the corresponding file and reports a mismatch. The only way I have found to avoid this is to pass the --no-unicode-normalization option explicitly. Therefore, I believe this should be classified as a bug.
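
To illustrate the lookup failure described above, here is a minimal Python sketch. It is not rclone's code: `import_sums` is a hypothetical stand-in for the import step, the store is a plain dictionary, NFC is assumed as the normalization form, and the MD5 value is only a placeholder.

```python
import unicodedata

# Hypothetical model of an import step that keys hashes by filename.
# This is NOT rclone's implementation, only a sketch of the failure mode.
def import_sums(entries, normalize=True):
    store = {}
    for name, md5 in entries:
        # Assumption: normalization means NFC (composed form).
        key = unicodedata.normalize("NFC", name) if normalize else name
        store[key] = md5
    return store

# Filename as it exists on the remote: base letter + combining mark.
listed_name = "\u30D5\u3099.txt"
# Placeholder hash (MD5 of empty input), for illustration only.
sums = [(listed_name, "d41d8cd98f00b204e9800998ecf8427e")]

normalized_store = import_sums(sums, normalize=True)
raw_store = import_sums(sums, normalize=False)

print(listed_name in normalized_store)  # False: key was rewritten on import
print(listed_name in raw_store)         # True: key kept exactly as listed
```

In this model, the record imported with normalization can never be found again by the name the remote actually lists, which matches the "hashes could not be checked" symptom.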

lxy · November 16, 2024, 4:37pm

For example, when importing \u30D5\u3099 (フ followed by a combining dakuten) with the default options, it is stored as the normalized form \u30D6 (ブ). At this point, the stored record no longer matches the actual filename and becomes invalid. Only with the --no-unicode-normalization option is the name stored unchanged, so the record remains valid and can be matched correctly.
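
The difference between the two forms can be checked with Python's standard `unicodedata` module. This is only a sketch of the Unicode behavior, assuming the normalization applied on import is NFC; it does not call rclone.

```python
import unicodedata

decomposed = "\u30D5\u3099"  # two code points: base letter + combining mark
composed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+30D5', 'U+3099']
print([f"U+{ord(c):04X}" for c in composed])    # ['U+30D6']
print(decomposed == composed)                   # False
```

Both strings render as the same character, but they are different sequences of code points, so an exact string comparison between them fails.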