Unicode Normalization in Log Output

(This is a "General Support" question; the template does not work.)

Question:

  1. When rclone syncs a file that has a Unicode composed sequence in the name, it normalizes the name prior to syncing (source: commit "fs: stop normalizing file names but do a normalized compare in the sync", rclone/rclone@7f9a11d on GitHub).

  2. If logging is enabled, which filepath is reported in the log for each sync event: the normalized one, or the one that the local filesystem had, prior to rclone normalization?

Goal:

I'm writing a Mac app. I want to consume the paths reported in an rclone log, then use those paths to look up objects in a standard Dictionary-like data structure.

To avoid VERY costly "fuzzy" string comparisons, I need to know the exact normalization strategy that is applied to logged paths, so that the keys I assign have the same Unicode normalization that rclone will use in the log. Thank you!

I moved it to dev discussions as that's more fitting.

It sounds like you are looking for a definitive answer, and we'll probably have to wait for @ncw to stop by for that. While waiting: I've had a look at the relevant code, and it seems to me the answer is that it will log the original names.

The "march" traversal code applies a transform which, in the words of its own comment, "converts a name into a form which is used for comparison".

This includes Unicode normalization, and in some cases also lowercasing (when case-insensitive comparisons are needed).

And then collects objects:

// matchEntry is an entry plus transformed name
type matchEntry struct {
    entry fs.DirEntry
    leaf  string
    name  string
}

The name field is where the result of the Unicode normalization is kept, and it is used for matching during the traversal. But further on, the sync and other code use the entry. Log messages such as "Copied (replaced existing)" and "Moved (server-side)" will use the entry's name, not the transformed name.

Hi Bryan,

You may be lucky, but I don’t think there is an explicit strategy or standard for easily parsable (Unicode-normalized) file names in log output.

I personally haven’t paid attention to this in the (few debug) log messages that I have added, nor have I seen any unit tests to ensure adherence to a log output standard.

Your chances are probably best if you rely on output from a few select print statements in the same area of code, such as the “march” code mentioned above.

Beware of UTF/character translation differences between different remotes. I ditched a hash comparison script after hitting that bump, though I don't remember the specifics anymore.

Thanks! I figured there wasn’t a standard here, which is what I was hoping to address.

It seems like the logs are meant to be parseable; there’s a JSON output option, for instance. Perhaps the best approach is to guarantee that the format of paths in the log will match the format as it was on the source platform? That way, any scripts or apps interfacing with rclone can count on getting the exact same Unicode sequence whether they read the path from disk or from rclone’s log.
