

15 Unicode Support

The standard units data file is in Unicode, using UTF-8 encoding. Most definitions use only ASCII characters (i.e., code points U+0000 through U+007F); definitions using non-ASCII characters appear in blocks beginning with ‘!utf8’ and ending with ‘!endutf8’.

When units starts, it checks the locale to determine the character set. If units is compiled with Unicode support and the character set is UTF-8, units includes the UTF-8 definitions; otherwise these definitions are ignored. When Unicode support is active, units checks every line of all of the units data files for invalid or non-printing UTF-8 sequences; if such sequences occur, units ignores the entire line. In addition to checking validity, units determines the display width of non-ASCII characters to ensure proper positioning of the pointer in some error messages and to align columns for the ‘search’ and ‘?’ commands.
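The per-line check described above can be approximated in a few lines of Python. This is only a sketch of the idea, not the code units itself uses: the names line_is_acceptable and usable_lines are hypothetical, and the choice to test printability only for non-ASCII characters (so that tabs between fields still pass) is an assumption.

```python
def line_is_acceptable(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 and its non-ASCII characters are printable."""
    try:
        text = raw.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    # Assumption: only non-ASCII characters are tested, so ASCII tabs still pass.
    return all(ch.isprintable() for ch in text if ord(ch) > 0x7F)


def usable_lines(raw_lines):
    """Keep only lines that pass the check; a failing line is dropped as a whole."""
    return [raw.decode("utf-8") for raw in raw_lines if line_is_acceptable(raw)]
```

A line containing a single bad byte is discarded entirely rather than repaired, which mirrors the behavior described in the text.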

At present, units does not support Unicode under Microsoft Windows. The UTF-16 and UTF-32 encodings are not supported on any system.

If definitions that contain non-ASCII characters are added to a units data file, those definitions should be enclosed within ‘!utf8...!endutf8’ to ensure that they are only loaded when Unicode support is available. As usual, the ‘!’ must appear as the first character on the line. As discussed in Units Data Files, it's usually best to put such definitions in supplemental data files linked by an ‘!include’ command or in a personal units data file.
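To make the effect of a ‘!utf8...!endutf8’ block concrete, the following Python sketch filters the lines of a data file the way the text describes. It is an illustration under stated assumptions, not the actual loader: directive, select_lines, and the two sample definitions are hypothetical, and every other ‘!’ command is passed through untouched.

```python
def directive(line):
    """Return the '!' command that starts a line, or '' if there is none."""
    # The '!' must be the first character on the line, so nothing is stripped.
    return line.split(maxsplit=1)[0] if line.startswith("!") else ""


def select_lines(lines, unicode_active):
    """Yield lines to process, skipping '!utf8' blocks when Unicode is inactive."""
    in_utf8_block = False
    for line in lines:
        command = directive(line)
        if command == "!utf8":
            in_utf8_block = True
        elif command == "!endutf8":
            in_utf8_block = False
        elif unicode_active or not in_utf8_block:
            yield line


# Illustrative data-file fragment; the definitions are examples only.
sample = [
    "angstrom   1e-10 m",
    "!utf8",
    "Å          angstrom",
    "!endutf8",
]
```

With unicode_active set to False, only the ASCII definition survives; with it set to True, both definitions are yielded and the two ‘!’ lines are consumed.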

When Unicode support is not active, units makes no assumptions about character encoding, except that characters in the range 00–7F hexadecimal correspond to ASCII encoding. Non-ASCII characters are simply sequences of bytes, and have no special meanings; for definitions in supplementary units data files, you can use any encoding consistent with this assumption. For example, if you wish to use non-ASCII characters in definitions when running units under Windows, you can use a character set such as Windows “ANSI” (code page 1252 in the US and Western Europe). You can even use UTF-8, though some messages may be improperly aligned, and units will not detect invalid UTF-8 sequences. If you use UTF-8 encoding when Unicode support is not active, you should place any definitions with non-ASCII characters outside ‘!utf8...!endutf8’ blocks; otherwise, they will be ignored.
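The point that non-ASCII characters are "simply sequences of bytes" can be seen by encoding the same text in several character sets. The Python sketch below is illustrative only; encoded_forms is a hypothetical helper, and the three encodings listed are just the ones mentioned in this section.

```python
def encoded_forms(text, encodings=("ascii", "cp1252", "utf-8")):
    """Map each encoding name to the bytes for text, or None if not encodable."""
    forms = {}
    for enc in encodings:
        try:
            forms[enc] = text.encode(enc)
        except UnicodeEncodeError:
            forms[enc] = None  # the character does not exist in this encoding
    return forms
```

For an ASCII name such as "m" all three encodings give the same single byte, which is the one assumption units relies on. For "µ", code page 1252 gives one byte and UTF-8 gives two, so a definition written in one encoding will presumably not match input typed in the other.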

Typeset material other than code examples usually uses the Unicode minus (U+2212) rather than the ASCII hyphen-minus operator (U+002D) used in units; the figure dash (U+2012) and en dash (U+2013) are also occasionally used. To allow such material to be copied and pasted for interactive use or in units data files, units converts these characters to U+002D before further processing. Because of this, none of these characters can appear in unit names.
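The conversion described above amounts to a three-entry character translation. The Python sketch below shows the idea; it is not taken from the units source, and the names DASH_TO_HYPHEN and normalize_dashes are hypothetical.

```python
# The three code points named in the text, each mapped to U+002D.
DASH_TO_HYPHEN = str.maketrans({
    "\u2212": "-",  # minus sign
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
})


def normalize_dashes(text: str) -> str:
    """Replace the Unicode minus, figure dash, and en dash with ASCII hyphen-minus."""
    return text.translate(DASH_TO_HYPHEN)
```

Because the replacement happens before any other processing, a name containing one of these characters could never be matched, which is why they cannot appear in unit names. Other dash-like characters, such as the em dash (U+2014), are not in the table and pass through unchanged.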