Rclone mount encoding converted from UTF-8 to TIS-620?

green · July 10, 2017, 7:11am

I have a directory named Christian Löffler. When I rclone sync it to an rclone crypt, it still appears as Christian Löffler, except that it is now encoded as TIS-620, whereas on my local drive it is UTF-8.

This is on Debian 9. The local filesystem is ext4.

My sync and mount commands are not exotic; they use the defaults…

$ rclone --version
rclone v1.36

and

$ echo $LANG
en_US.UTF-8

Using Python to check the rclone mount:

>>> import chardet
>>> import os
>>> chardet.detect(os.popen("ls Christian*").read())
{'confidence': 0.99, 'encoding': 'TIS-620'}

Comparing to the local filesystem:

>>> import chardet
>>> import os
>>> chardet.detect(os.popen("ls Christian*").read())
{'confidence': 0.938125, 'encoding': 'utf-8'}

I am comparing these files with bash, so the different encoding messes with my script.

ncw · July 10, 2017, 9:59pm

Can you paste the results of ls for the two systems?

I suspect the difference will be unicode normalization rather than a different encoding.
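To illustrate what a normalization difference looks like, here is a minimal sketch using only Python's standard `unicodedata` module. It assumes the two names differ only in whether the ö is stored precomposed (NFC) or as `o` plus a combining diaeresis (NFD); which form ends up on which side in this thread is not established here.

```python
import unicodedata

# The escape keeps this example independent of how the source file itself is normalized.
name = "Christian L\u00f6ffler"

nfc = unicodedata.normalize("NFC", name)  # ö as one code point, U+00F6
nfd = unicodedata.normalize("NFD", name)  # 'o' followed by U+0308 (combining diaeresis)

# Both render identically, but they are different strings and different UTF-8 bytes.
print(nfc == nfd)           # False
print(nfc.encode("utf-8"))  # b'Christian L\xc3\xb6ffler'
print(nfd.encode("utf-8"))  # b'Christian Lo\xcc\x88ffler'
```

Both byte strings are valid UTF-8, so an encoding detector that sees the second form may simply be guessing wrong from an unusual byte pattern.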

If you use the latest beta then you can disable rclone’s unicode normalization with

  --local-no-unicode-normalization    Don't apply unicode normalization to paths and filenames

I’m going to re-visit the normalization in #1477 for 1.38.

green · July 13, 2017, 9:10am

Thanks for the details about the beta and --local-no-unicode-normalization. I tried it and it does make the difference. I assume there is a good reason for unicode normalisation though? Is there anything specific I should be aware of when using rclone crypt?

The ideal would be a way for bash to do its own unicode normalisation. Are you aware of any such mechanism? In my testing I found that mv, for example, appeared to treat the variations of the ö as the same. I can, however, create identical-looking sub-directories in the same directory using the different ö.

Also, for the record, I think the tools mis-reported the encodings: the remote (crypt) is UTF-8 and the local is something like ISO-8859-2, which makes a lot more sense than TIS-620.
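As far as I know bash has no built-in normalization, so one workaround is to normalize names in a small helper before comparing them. The sketch below is only an illustration: `nfc`, `same_name` and `find_collisions` are hypothetical helper names, and choosing NFC as the common form is an assumption, not something rclone requires.

```python
import unicodedata

def nfc(name: str) -> str:
    """Return the name in NFC form (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)

def same_name(a: str, b: str) -> bool:
    """Compare two names while ignoring normalization differences."""
    return nfc(a) == nfc(b)

def find_collisions(names):
    """Group names that differ only by normalization form."""
    groups = {}
    for n in names:
        groups.setdefault(nfc(n), []).append(n)
    # Keep only the groups with more than one spelling.
    return {key: spellings for key, spellings in groups.items() if len(spellings) > 1}
```

Passing the output of `os.listdir()` to `find_collisions` would list sub-directories that look identical but use different forms of the ö. A bash script could call such a helper on both listings before diffing them.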
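One way to check this without relying on a detector's guess is to look at the raw filename bytes directly. A rough sketch, assuming Python 3; `describe` is a hypothetical helper that reports whether a name is valid UTF-8 and, if so, which normalization form it is already in:

```python
import unicodedata

def describe(raw: bytes) -> dict:
    """Report UTF-8 validity and normalization form of a raw filename."""
    try:
        text = raw.decode("utf-8")  # strict decoding: fails on non-UTF-8 bytes
    except UnicodeDecodeError:
        return {"utf8": False, "nfc": False, "nfd": False}
    return {
        "utf8": True,
        "nfc": unicodedata.normalize("NFC", text) == text,
        "nfd": unicodedata.normalize("NFD", text) == text,
    }
```

Calling `os.listdir(b".")` returns names as bytes, so running `describe` over that listing in both the local directory and the mount would show whether either side really holds non-UTF-8 names, or whether both are UTF-8 in different normalization forms.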