Possible special character encoding issue on macOS

What is the problem you are having with rclone?

Hello!

I seem to hit a special character encoding issue on macOS using a crypto destination.

Copying the same file twice, once with the destination mounted and once using rclone copy, yields two files in the encrypted destination and two cleartext files with the same name.

I've put the full repro and verbose logs in the attached text file here.
I will attempt to paste the useful things back into this post as well.
repro-steps.txt (45.6 KB)

Please let me know if I have missed anything.

Thank you for taking the time!

Run the command 'rclone version' and share the full output of the command.

rclone v1.62.2
- os/version: darwin 13.4 (64 bit)
- os/kernel: 22.5.0 (arm64)
- os/type: darwin
- os/arch: arm64 (ARMv8 compatible)
- go/version: go1.20.2
- go/linking: dynamic
- go/tags: cmount 

Which cloud storage system are you using? (eg Google Drive)

Local storage. Crypto destination.

The command you were trying to run (eg rclone copy /tmp remote:tmp)

In the following example directory:

$ find .
.
./test
./test/Tést.txt
./enc
./mnt
./rclone.conf

I copy with rclone

rclone --config rclone.conf copy -vv test crypt:

Then mount and copy with cp

rclone --config rclone.conf mount -vv crypt: mnt

Resulting crypto structure:

$ find enc
enc
enc/4mvk0n0mss2apim05t4a72hbc4
enc/5bqaptil47irc1lcspojffi2gk

Resulting cleartext structure (verbosity trimmed. Please see attached file for full log):

$ rclone --config rclone.conf ls -vv crypt:
        5 Tést.txt
        5 Tést.txt

The rclone config contents with secrets removed.

[crypt]
type = crypt
remote = enc
password = H7D215KjcM7EsNqvjj5i3V2JEy9GCmqiJqEd

A log from the command with the -vv flag

Please see attached file for full log.

There is an issue for sure. I immediately thought about normalization issues - which would not be perfect but are a fact of life. However, what is worse is that Tést.txt copied to the crypt remote using rclone copy not only has a different encrypted name but is also not visible in the mount. All files are visible when doing rclone lsl:

$ rclone lsl crypt:
        4 2023-06-15 07:10:04.652237305 Tést.txt
        4 2023-06-15 07:10:04.652237305 test_copy/Tést.txt
        4 2023-06-15 07:10:04.652237296 test/Tést.txt
$ rclone mount crypt: mount
$ cd mount
$ find .
.
./test
./test/Tést.txt
./test_copy

and actual local remote:

$ find .
.
./jtk5unr795oqf0gk9gfja801o8
./jtk5unr795oqf0gk9gfja801o8/4mvk0n0mss2apim05t4a72hbc4 <------------- `rclone copy`
./7fbqbud3k1fhoeu7a7lc8b5hi4
./7fbqbud3k1fhoeu7a7lc8b5hi4/5bqaptil47irc1lcspojffi2gk <------- `cp in rclone mount`
./4mvk0n0mss2apim05t4a72hbc4  <------------- `rclone copy`

Because rclone lsl works without any issues, I suspect some problem with macOS FUSE - I need to do more tests.

Hi @kapitainsky,

Thank you for looking into this issue. I really appreciate it!

I forgot to mention, but you already discovered it, that the reason I started debugging this is indeed that I copied files with rclone copy/sync and they didn't show up when mounted. I was worried about data loss.

Thanks!

You could try uninstalling macFUSE and using https://www.fuse-t.org/ instead - it might be a temporary workaround - as I really suspect FUSE is the problem here.

Nowadays we have two options for mount on macOS.

The good news is that lsl shows that no data is lost.

I have copied the whole encrypted remote to a Linux (Debian) machine and there everything works:

remote content:

$ find .
.
./jtk5unr795oqf0gk9gfja801o8
./jtk5unr795oqf0gk9gfja801o8/4mvk0n0mss2apim05t4a72hbc4 <------------- `rclone copy`
./7fbqbud3k1fhoeu7a7lc8b5hi4
./7fbqbud3k1fhoeu7a7lc8b5hi4/5bqaptil47irc1lcspojffi2gk <------- `cp in rclone mount`
./4mvk0n0mss2apim05t4a72hbc4  <------------- `rclone copy`

rclone lsl:

$ rclone lsl crypt:
        4 2023-06-15 09:06:42.002559773 Tést.txt
        4 2023-06-15 09:06:42.890550142 test_copy/Tést.txt
        4 2023-06-15 09:06:42.906549968 test/Tést.txt

rclone mount:

$ rclone mount crypt: mount
$ cd mount
$ find .
.
./Tést.txt
./test
./test/Tést.txt
./test_copy
./test_copy/Tést.txt

Thanks for confirming that there is no data loss.

I tested on Windows 11 previously and was unable to reproduce the issue there as well.

FYI

Looks like this is an issue which surfaces from time to time:

A different encrypted name is expected because the file names are not the same... welcome to the Unicode world. There is not just one representation - even if the names look the same. More details about this "phenomenon" and related issues can be found here.

And here is the macOS solution - you have to add the -o modules=iconv,from_code=UTF-8,to_code=UTF-8 flag to your mount:

rclone mount crypt: mountPoint -o modules=iconv,from_code=UTF-8,to_code=UTF-8

This is already mentioned in the docs. But it seems that nowadays it also has to be added with macFUSE, not only with FUSE-T. So maybe it can be made the default moving forward.

The lesson here is also that for mission-critical data and applications (especially when working across platforms) it is better to stick to ASCII characters only - this is still the reality in 2023.

And now the not-so-good news.

With macFUSE at least, -o modules=iconv,from_code=UTF-8,to_code=UTF-8 makes files visible in Finder but they are not accessible:

file copied to mount directly:

$ cat Tést.txt
123

the same file copied by rclone copy:

$ ls -l
total 8
-rw-r--r--  1 kptsky  staff  4 Jun 15 07:10 Tést.txt
drwxr-xr-x  1 kptsky  staff  0 Jun 15 07:13 test
drwxr-xr-x  1 kptsky  staff  0 Jun 15 07:17 test_copy

$ cat Tést.txt
cat: Tést.txt: No such file or directory

So we have an issue with macFUSE.

I have uninstalled macFUSE and installed FUSE-T - there is exactly the same problem.

Without -o modules=iconv,from_code=UTF-8,to_code=UTF-8, files copied with rclone copy are not visible. With the extra mount flag all files are visible, but those same files are not accessible.

rclone mount on macOS seems to be partially broken then :(

Is it possible that rclone copy is actually not doing the right thing?

The original file seems to use c3 a9 for é.

ls test/Tést.txt | hexdump -C
00000000  74 65 73 74 2f 54 c3 a9  73 74 2e 74 78 74 0a     |test/T..st.txt.|
0000000f

If I copy the file in Finder I get this: still c3 a9 for é.

ls mnt/Tést.txt | hexdump -C
00000000  6d 6e 74 2f 54 c3 a9 73  74 2e 74 78 74 0a        |mnt/T..st.txt.|
0000000e

If I then copy the file with rclone copy I get this result: Rclone correctly replaces the file with the identical one.

rclone --config rclone.conf copy -vv test crypt:
2023/06/15 07:50:27 DEBUG : rclone: Version "v1.62.2" starting with parameters ["rclone" "--config" "rclone.conf" "copy" "-vv" "test" "crypt:"]
2023/06/15 07:50:27 DEBUG : Creating backend with remote "test"
2023/06/15 07:50:27 DEBUG : Using config file from "/tmp/repro/rclone.conf"
2023/06/15 07:50:27 DEBUG : fs cache: renaming cache item "test" to be canonical "/tmp/repro/test"
2023/06/15 07:50:27 DEBUG : Creating backend with remote "crypt:"
2023/06/15 07:50:27 DEBUG : Creating backend with remote "enc"
2023/06/15 07:50:27 DEBUG : fs cache: renaming cache item "enc" to be canonical "/tmp/repro/enc"
2023/06/15 07:50:27 DEBUG : Encrypted drive 'crypt:': Waiting for checks to finish
2023/06/15 07:50:27 DEBUG : Tést.txt: Modification times differ by -28ns: 2023-06-14 21:38:09.096275 -0700 PDT, 2023-06-14 21:38:09.096274972 -0700 PDT
2023/06/15 07:50:27 DEBUG : Encrypted drive 'crypt:': Waiting for transfers to finish
2023/06/15 07:50:27 DEBUG : Tést.txt: md5 = 945017222999911ea0a868c707ff7d63 OK
2023/06/15 07:50:27 INFO  : Tést.txt: Copied (replaced existing) to: Tést.txt
2023/06/15 07:50:27 INFO  : 
Transferred:   	         53 B / 53 B, 100%, 0 B/s, ETA -
Checks:                 1 / 1, 100%
Transferred:            1 / 1, 100%
Elapsed time:         0.0s

2023/06/15 07:50:27 DEBUG : 7 go routines active

rclone ls confirms it's still using c3 a9

rclone --config rclone.conf ls crypt: | hexdump -C
00000000  20 20 20 20 20 20 20 20  35 20 54 c3 a9 73 74 2e  |        5 T..st.|
00000010  74 78 74 0a                                       |txt.|
00000014

However if I start over fresh and use rclone copy first in an empty folder, it replaces c3 a9 with 'e' + cc 81 and we get the issue:

rclone --config rclone.conf copy test crypt:
rclone --config rclone.conf ls crypt: | hexdump -C
00000000  20 20 20 20 20 20 20 20  35 20 54 65 cc 81 73 74  |        5 Te..st|
00000010  2e 74 78 74 0a                                    |.txt.|
00000015

Why would rclone copy sometimes change the character but sometimes not? Strange...
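For anyone who wants to check names without reading hexdumps, here is a minimal sketch in Python (standard library only, Python 3.8+). It is not part of rclone; the helper name `describe` is made up for illustration. It prints the UTF-8 bytes of a name and reports which normalization form the name is in, using the two spellings of Tést.txt seen in the dumps above.

```python
import unicodedata

def describe(name: str) -> str:
    """Return the UTF-8 bytes of a name (hex) and the normalization forms it satisfies."""
    raw = name.encode("utf-8").hex(" ")
    forms = [f for f in ("NFC", "NFD") if unicodedata.is_normalized(f, name)]
    return f"{raw}  [{', '.join(forms) or 'neither'}]"

composed = "T\u00e9st.txt"      # é as a single code point (bytes c3 a9)
decomposed = "Te\u0301st.txt"   # 'e' followed by a combining acute accent (bytes 65 cc 81)

print(describe(composed))    # 54 c3 a9 73 74 2e 74 78 74  [NFC]
print(describe(decomposed))  # 54 65 cc 81 73 74 2e 74 78 74  [NFD]
```

A plain ASCII name would be reported as both NFC and NFD, since the two forms only differ for characters that can be composed or decomposed.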

FYI, I cannot reproduce the issue on Linux.

./rclone --version
rclone v1.62.2
- os/version: debian 11.6 (64 bit)
- os/kernel: 5.10.0-20-arm64 (aarch64)
- os/type: linux
- os/arch: arm64 (ARMv8 compatible)
- go/version: go1.20.2
- go/linking: static
- go/tags: none
ls test/Tést.txt | hexdump -C
00000000  74 65 73 74 2f 54 c3 a9  73 74 2e 74 78 74 0a     |test/T..st.txt.|
0000000f
ls mnt/Tést.txt | hexdump -C
00000000  6d 6e 74 2f 54 c3 a9 73  74 2e 74 78 74 0a        |mnt/T..st.txt.|
0000000e
./rclone --config rclone.conf copy -vv test crypt:
2023/06/15 08:36:11 DEBUG : rclone: Version "v1.62.2" starting with parameters ["./rclone" "--config" "rclone.conf" "copy" "-vv" "test" "crypt:"]
2023/06/15 08:36:11 DEBUG : Creating backend with remote "test"
2023/06/15 08:36:11 DEBUG : Using config file from "/tmp/repro/rclone.conf"
2023/06/15 08:36:11 DEBUG : fs cache: renaming cache item "test" to be canonical "/tmp/repro/test"
2023/06/15 08:36:11 DEBUG : Creating backend with remote "crypt:"
2023/06/15 08:36:11 DEBUG : Creating backend with remote "enc"
2023/06/15 08:36:11 DEBUG : fs cache: renaming cache item "enc" to be canonical "/tmp/repro/enc"
2023/06/15 08:36:11 DEBUG : Tést.txt: Modification times differ by 2m17.882520067s: 2023-06-15 08:32:33.859621236 -0700 PDT, 2023-06-15 08:34:51.742141303 -0700 PDT
2023/06/15 08:36:11 DEBUG : Encrypted drive 'crypt:': Waiting for checks to finish
2023/06/15 08:36:11 DEBUG : Encrypted drive 'crypt:': Waiting for transfers to finish
2023/06/15 08:36:11 DEBUG : Tést.txt: md5 = 5e6dbe627d73e9a05447946851889c60 OK
2023/06/15 08:36:11 INFO  : Tést.txt: Copied (replaced existing)
2023/06/15 08:36:11 INFO  : 
Transferred:   	         54 B / 54 B, 100%, 0 B/s, ETA -
Checks:                 1 / 1, 100%
Transferred:            1 / 1, 100%
Elapsed time:         0.1s

2023/06/15 08:36:11 DEBUG : 5 go routines active
./rclone --config rclone.conf ls crypt: | hexdump -C
00000000  20 20 20 20 20 20 20 20  36 20 54 c3 a9 73 74 2e  |        6 T..st.|
00000010  74 78 74 0a                                       |txt.|
00000014

After starting fresh.

./rclone --config rclone.conf copy test crypt:
./rclone --config rclone.conf ls crypt: | hexdump -C
00000000  20 20 20 20 20 20 20 20  36 20 54 c3 a9 73 74 2e  |        6 T..st.|
00000010  74 78 74 0a                                       |txt.|
00000014

This leads me to believe there is some sort of Unicode normalization that takes place on the rclone copy side on macOS, but only if the file doesn't already exist?

I am trying to understand this issue :) It has been around for a very long time and it would be good to finally fix it. But I am still trying to grasp the problem.

Try to mount with -o modules=iconv,from_code=UTF-8,to_code=UTF-8; then all files are visible in the mount.

Now try to edit these two files in the terminal - only one will work.

Then try from Finder - also only one will be accessible - but not the same one as in the terminal.

What a mess.

I did more tests - crypt/no crypt, macFUSE/FUSE-T, local/remote - and my conclusion is that rclone mount on macOS is simply broken. Some past workarounds, like setting iconv in macFUSE, are no better than snake oil - they fix some issues but create new ones. IMHO it is not as simple as a FUSE problem or an rclone problem. There is a fundamental issue with how these two programs work together on macOS.

None of these issues is present on Linux or Windows - at the same time there are many other programs using macOS FUSE that work fine. So the problem to fix is subtle - which is an issue on its own.

This is something I've spent quite a lot of time on in the past!

The problem is that macOS stores its file names in Unicode NFD format rather than NFC, the format everyone else uses.

This is the difference between the two forms of the Tést.txt file.

All the cloud providers (and in fact everyone else in the entire universe) use NFC format. This is the é \xc3\xa9 format rather than the NFD format, which is e\xcc\x81. rclone copy goes to some effort to match the two types of normalization up. rclone mount doesn't, though.
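To illustrate what "matching the two types of normalization up" could look like, here is a rough Python sketch. This is not rclone's sync code, and `match_listings` is a hypothetical helper; it only shows the general idea of pairing source and destination names by a normalized key while leaving the stored names untouched.

```python
import unicodedata

def nfc(name: str) -> str:
    """Normalize a name to NFC so both spellings map to the same key."""
    return unicodedata.normalize("NFC", name)

def match_listings(src_names, dst_names):
    """Pair each source name with the destination name that is equal after
    NFC normalization, or with None if the destination has no such name."""
    dst_by_key = {nfc(n): n for n in dst_names}
    return [(s, dst_by_key.get(nfc(s))) for s in src_names]

src = ["Te\u0301st.txt"]        # NFD spelling on the source side
existing = ["T\u00e9st.txt"]    # NFC spelling already on the destination

print(match_listings(src, existing))  # the NFD source pairs with the NFC destination name
print(match_listings(src, []))        # no match: the second element is None
```

Under this assumption, a file that already exists keeps its destination spelling when replaced, while a file with no match has no existing spelling to reuse. Whether that is the reason a fresh copy in this thread ended up as NFD has not been established here.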

What -o modules=iconv,from_code=UTF-8,to_code=UTF-8 does is tell FUSE not to touch the UTF-8 format rclone uses.

The default here is -o modules=iconv,from_code=UTF-8,to_code=UTF-8-MAC, which tells FUSE to convert the UTF-8 rclone uses into the NFD UTF-8 which macOS likes.

This used to work fine! However, I believe that newer macOS versions don't actually need the NFD form any more, or something has changed in macFUSE.

Note in your example above, ls gave the file name as Tést.txt (54 65 cc 81 73 74 2e 74 78 74, which is NFD) but you typed Tést.txt (54 c3 a9 73 74 2e 74 78 74, which is NFC). I think if you'd cut and pasted exactly what you got from ls it would have worked.

That is macOS doing the changing, not rclone.

Maybe rclone should be doing the NFD->NFC itself in rclone mount on macOS so you can use either normalisation.
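A toy Python sketch of that idea, with a dictionary standing in for a directory. None of this is rclone or FUSE code, and the names `stored`, `exact_lookup` and `normalized_lookup` are invented for the example. A byte-exact lookup fails when the typed name and the stored name use different forms, while a lookup that normalizes both sides accepts either spelling.

```python
import unicodedata

# A pretend directory: the name is stored in NFD form ('e' + combining accent).
stored = {"Te\u0301st.txt": b"123\n"}

def exact_lookup(name: str):
    """Byte-for-byte match, the analogue of 'No such file or directory'."""
    return stored.get(name)

def normalized_lookup(name: str):
    """Compare names after converting both sides to NFC."""
    key = unicodedata.normalize("NFC", name)
    for stored_name, data in stored.items():
        if unicodedata.normalize("NFC", stored_name) == key:
            return data
    return None

typed = "T\u00e9st.txt"   # NFC, as typed on the keyboard

print(exact_lookup(typed))        # None
print(normalized_lookup(typed))   # b'123\n'
```

The linear scan keeps the sketch short; a real implementation would more likely keep an index keyed by the normalized name.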

Anyway this is a can of worms which you can thank Apple for!

All you said is right, but:

I have to spend more time to get a better understanding and do more testing :)

It is true that Apple using NFD and everybody else using NFC creates some funny problems. And that the Apple filesystem actually does not apply any normalization - filenames are just a ‘bag of bytes’ - which moves the problem to user space.

Still, I think we can improve it or at least document it better. I will report back when I have some facts.

Yes this needs a bit of experimentation and some more docs I think.

I'm a bit hampered because I don't have a Mac, so any help is much appreciated!

Thank you @ncw and @kapitainsky for taking the time to look into this!

I am so far unable to reproduce this issue with a debugger attached.
fs/sync/sync.go @ 886: NoUnicodeNormalization is indeed set to false and I can see both versions of the string being correctly transformed to c3 a9

However, in my original example and in @kapitainsky's quote at the bottom, the un-normalized 'e' + cc 81 slipped through an rclone copy into the crypt somehow.

rclone copy produced 4mvk0n0mss2apim05t4a72hbc4 which decodes to:

rclone --config rclone.conf cryptdecode crypt: "4mvk0n0mss2apim05t4a72hbc4" | hexdump -C
00000000  34 6d 76 6b 30 6e 30 6d  73 73 32 61 70 69 6d 30  |4mvk0n0mss2apim0|
00000010  35 74 34 61 37 32 68 62  63 34 20 09 20 54 65 cc  |5t4a72hbc4 . Te.|
00000020  81 73 74 2e 74 78 74 0a                           |.st.txt.|
00000028

rclone mount produced 5bqaptil47irc1lcspojffi2gk

rclone --config rclone.conf cryptdecode crypt: "5bqaptil47irc1lcspojffi2gk" | hexdump -C
00000000  35 62 71 61 70 74 69 6c  34 37 69 72 63 31 6c 63  |5bqaptil47irc1lc|
00000010  73 70 6f 6a 66 66 69 32  67 6b 20 09 20 54 c3 a9  |spojffi2gk . T..|
00000020  73 74 2e 74 78 74 0a                              |st.txt.|
00000027

From @kapitainsky :

$ find .
.
./jtk5unr795oqf0gk9gfja801o8
./jtk5unr795oqf0gk9gfja801o8/4mvk0n0mss2apim05t4a72hbc4 <------------- `rclone copy`
./7fbqbud3k1fhoeu7a7lc8b5hi4
./7fbqbud3k1fhoeu7a7lc8b5hi4/5bqaptil47irc1lcspojffi2gk <------- `cp in rclone mount`
./4mvk0n0mss2apim05t4a72hbc4  <------------- `rclone copy`

My tests so far show that it does not matter whether it is Linux or macOS: using rclone I can create content with either NFD- or NFC-encoded names. The difference is that Linux does not care - any content works.

macOS rclone mount has no problem dealing with NFC names - either folders or files. No special options are required for FUSE-T. Content is accessible in the shell and in Finder.

When there are NFD names, we have to add -o modules=iconv,from_code=UTF-8,to_code=UTF-8 to make the content visible in the mount. But then there are new problems - NFD files are not accessible in the shell and NFC files are not accessible in Finder... far from usable.

It looks to me like FUSE-T/rclone should convert NFD->NFC to make it work on macOS.
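Until something like that exists, a small audit script can at least reveal whether a listing already contains names that differ only by normalization, like the two Tést.txt entries at the top of this thread. The following is a sketch in Python using only the standard library; `find_collisions` is a hypothetical helper, not an rclone feature.

```python
import unicodedata
from collections import defaultdict

def find_collisions(names):
    """Group names by their NFC form and return only the groups that contain
    more than one distinct spelling."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    return {key: variants for key, variants in groups.items()
            if len(set(variants)) > 1}

# Sample listing: the same visible name in NFC and NFD, plus two plain names.
listing = ["T\u00e9st.txt", "Te\u0301st.txt", "test", "test_copy"]

for key, variants in find_collisions(listing).items():
    # Show the raw bytes so the two spellings can be told apart.
    print(key, [v.encode("utf-8").hex(" ") for v in variants])
```

To check a real directory, the sample list could be replaced with the result of `os.listdir(path)`, with the caveat that the operating system or the mount layer may already have changed the form of the names it returns.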