Lexichem TK 2.6.0

New features

The OEFromBritish function has been added for completeness. The name perception code already handled British-style input names without this conversion function; however, the function is useful for higher-level APIs that dispatch based on input language.

Handling of character sets for non-English language names has been improved. Internally, all non-ASCII characters are now represented as \u escaped Unicode codepoints. Previously, Latin-1 characters (e.g., in the output of OEToFrench) were stored as their Latin-1 bytes (0x80 - 0xff). This change eliminates the character encoding problems that occurred in the Java and Python language wrappers when intermediate name strings were processed. Now, only the following final character set encoding/decoding functions need to handle non-ASCII bytes: OEFromHTML, OEFromKOI8R, OEFromLatin1, OEFromUTF8, OEToEUCJP, OEToHTML, OEToSJIS, OEToLatin1, and OEToUTF8.
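
The difference between the old and new internal representations can be illustrated with a minimal sketch in plain Python. It does not call Lexichem TK. The helper name escape_non_ascii is hypothetical, and the four-hex-digit \uXXXX form is an assumption, since these notes do not spell out the exact escape syntax.

```python
# Illustration only: plain Python, no Lexichem TK calls.
# Assumption: escapes take the form backslash-u plus four hex digits.

def escape_non_ascii(text: str) -> str:
    # Keep ASCII characters as-is; escape every other BMP character.
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)
        elif cp <= 0xFFFF:
            parts.append("\\u{:04x}".format(cp))
        else:
            raise ValueError("codepoint outside the BMP: U+{:X}".format(cp))
    return "".join(parts)


# Old style: the French name "éthanol" held as raw Latin-1 bytes.
latin1_bytes = b"\xe9thanol"

# New style: the same name as a pure-ASCII string with an escaped codepoint.
escaped = escape_non_ascii(latin1_bytes.decode("latin-1"))
print(escaped)  # \u00e9thanol
```

Because the escaped form contains only ASCII bytes, it can pass through wrapper layers without any of them having to guess a character encoding.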

Name-to-structure conversions for certain uncommon functional groups (nitramide, nitrile oxide, oxycyano) previously resulted in structures with 5-valent nitrogen. These now result in charge-separated N(III) structures, consistent with other nitrogen functional groups.

The documentation has been updated to indicate the range of Unicode codepoints that are handled for each of the language translation functions. This allows users to determine the appropriate output character set conversion. For example, the Chinese language translation results in CJK characters in the Unicode Basic Multilingual Plane (BMP). Therefore, it is only appropriate to convert Chinese language names to UTF8 or HTML, which can handle the full Unicode BMP. Since the French language translation results in ASCII and Latin-1 characters, it is reasonable to use Latin1, UTF8, or HTML as the output character set.

The example programs have been updated to better illustrate character set conversions.
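
The character set guidance above (Chinese names need UTF8 or HTML, while French names also fit in Latin1) can be checked mechanically. The following is a rough sketch using only Python's built-in codecs, not the Lexichem TK conversion functions. The helper name narrowest_charset is hypothetical, and the sample names are ordinary Unicode strings chosen for illustration.

```python
# Illustration only: uses Python's built-in codecs, not Lexichem TK.

def narrowest_charset(name: str) -> str:
    # Try the narrowest character sets first; fall back to UTF-8,
    # which can encode every codepoint in the BMP.
    for codec in ("ascii", "latin-1"):
        try:
            name.encode(codec)
            return codec
        except UnicodeEncodeError:
            continue
    return "utf-8"


french = "éthanol"   # ASCII plus one Latin-1 character
chinese = "乙醇"      # CJK characters in the BMP

print(narrowest_charset(french))   # latin-1
print(narrowest_charset(chinese))  # utf-8
```

A check of this kind can guard against selecting a Latin-1 output conversion for a language whose names fall outside the Latin-1 range.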