Introduction

The OpenEye LEXICHEM product contains three applications: nam2mol, mol2nam and translate.

  1. The nam2mol program is an application for converting compound names into chemical structures. The program currently converts text files containing a single name per line, in either American or British English, into a database of molecules in a choice of file formats, including MDL SD, SMILES, SLN and Tripos .mol2. The input compound name does not have to be the preferred IUPAC name of a compound; the program also works with a variety of traditional names and alternate IUPAC forms.
  2. The mol2nam program is an application that performs the opposite conversion, translating chemical structures into reasonable compound names.
  3. Finally, the translate utility program converts chemical names from one language to another.

Input Name Representation

The oeiupac library currently processes NUL (zero) terminated ASCII character strings; therefore Greek characters, symbols, fonts and superscripts must be transliterated into the printable subset of ASCII. When parsing compound names, the oeiupac library treats space and tab characters as interchangeable, and any number of consecutive whitespace characters is treated as a single space.

Currently, the name parsing is case insensitive, allowing arbitrary mixing of upper and lower case characters, e.g. initial letter capitalization.
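The whitespace and case rules above can be mimicked when building a comparison key for input names, for example to spot duplicate lines in an input file. The following is a minimal sketch of that idea only; `normalize_name` is a hypothetical helper, not part of the oeiupac library, and the library's own handling is internal to its parser.

```python
import re

def normalize_name(name: str) -> str:
    """Build a comparison key for a compound name.

    Hypothetical helper: collapses any run of spaces and tabs into a
    single space and folds the result to lower case.
    """
    collapsed = re.sub(r"[ \t]+", " ", name)
    return collapsed.lower()

print(normalize_name("2-Chloro \t  Benzoic\tACID"))  # 2-chloro benzoic acid
```

Two input lines that yield the same key differ only in spacing or capitalization, so under the rules described above they should be parsed identically.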

Greek characters are understood in a number of different representations. For example, the strings ‘$a’, ‘${a}’, ‘alpha’, ‘.alpha.’, ‘&alpha;’, ‘&#945;’ and ‘&#x3b1;’ are all understood to represent the Greek character alpha (α).
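As an illustration of how several spellings can map onto one character, the sketch below rewrites the plain-ASCII spellings of alpha to the ‘$a’ form. It is an assumption-laden example rather than the library's algorithm: `canonical_alpha` is a hypothetical name, only alpha is handled, and only the four spellings ‘$a’, ‘${a}’, ‘alpha’ and ‘.alpha.’ are covered.

```python
import re

# Longer spellings are listed first so that '.alpha.' is consumed whole
# rather than leaving stray dots around a bare 'alpha' match.
_ALPHA_SPELLINGS = re.compile(r"\.alpha\.|\$\{a\}|alpha")

def canonical_alpha(name: str) -> str:
    """Rewrite the ASCII spellings of alpha to '$a' (hypothetical helper)."""
    return _ALPHA_SPELLINGS.sub("$a", name)

for spelling in ("$a", "${a}", "alpha", ".alpha."):
    print(canonical_alpha(spelling + "-methylstyrene"))  # $a-methylstyrene each time
```

A plain substring match on ‘alpha’ is only adequate for a sketch; a real parser decides from the surrounding name whether the letters denote the Greek character.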

There is no special representation for italic characters. Prefixes that are conventionally typeset in italics, such as the ‘tert’ in ‘tert-butyl’ and the ‘p’ in ‘p-aminobenzamidine’, are simply written as plain text. Both the long and short forms of prefixes can be used, allowing the above examples to also be written as ‘t-butyl’ and ‘para-aminobenzamidine’.
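The equivalence of short and long prefix forms can be sketched as a small rewrite step. The example below covers only the two pairs mentioned above (‘t-’/‘tert-’ and ‘p-’/‘para-’); `expand_prefixes` and its table are hypothetical and say nothing about how the library stores its prefixes.

```python
import re

# Only the two short/long pairs named in the text; the table is illustrative.
SHORT_TO_LONG = {"t": "tert", "p": "para"}

# A short prefix counts only when it is not the tail of a longer word,
# so the 't-' at the end of 'tert-' is left alone.
_SHORT_PREFIX = re.compile(r"(?<![A-Za-z])(t|p)-")

def expand_prefixes(name: str) -> str:
    """Expand short prefixes to their long forms (hypothetical helper)."""
    return _SHORT_PREFIX.sub(lambda m: SHORT_TO_LONG[m.group(1)] + "-", name)

print(expand_prefixes("t-butyl"))             # tert-butyl
print(expand_prefixes("p-aminobenzamidine"))  # para-aminobenzamidine
```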

Output Name Representation

Unrecognized functional groups, linkers or ring systems are denoted in the generated name by the string ‘BLAH’. As much of the name as possible is generated, resulting in compound names such as ‘dichloroBLAHcarboxylic acid’. Generated compound names are otherwise lower case, with no initial capitalization; upper case characters appear only in locants and, as described above, in BLAH.
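Because the ‘BLAH’ marker is written in upper case inside an otherwise lower case name, incomplete names can be filtered out of a batch of generated names with a simple substring test. This is a minimal sketch; `has_unnamed_fragment` is a hypothetical helper, not a library function.

```python
def has_unnamed_fragment(generated_name: str) -> bool:
    """Return True if a generated name contains the 'BLAH' placeholder."""
    return "BLAH" in generated_name

names = ["dichloroBLAHcarboxylic acid", "propan-2-one"]
incomplete = [n for n in names if has_unnamed_fragment(n)]
print(incomplete)  # ['dichloroBLAHcarboxylic acid']
```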

When generating Greek characters in compound names, the oeiupac library currently uses a dollar character followed by a single letter. In this formalism, ‘$a’ represents the Greek character alpha (α), ‘$b’ beta (β), ‘$g’ gamma (γ) and ‘$l’ lambda (λ).

When generating superscripts, the oeiupac library currently uses a caret followed by curly braces. Hence ‘$l^{5}’ represents the Greek character lambda followed by a superscript five, i.e. λ⁵. Similarly, ‘pentacyclo[4.2.0.0^{2,5}.0^{3,8}.0^{4,7}]octane’ is the von Baeyer system name for cubane, in which the locant pairs 2,5, 3,8 and 4,7 are typeset as superscripts.
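For display purposes, the dollar and caret notation can be post-processed into markup. The sketch below converts a generated name to an HTML fragment; `to_html` and its Greek table are assumptions made for illustration, limited to the four letters listed above, and are not the toolkit's own conversion routines.

```python
import re

# The four dollar codes described in the text, mapped to Unicode letters.
GREEK = {"a": "α", "b": "β", "g": "γ", "l": "λ"}

def to_html(name: str) -> str:
    """Render '$x' Greek codes and '^{...}' superscripts as HTML (sketch)."""
    name = re.sub(r"\$([abgl])", lambda m: GREEK[m.group(1)], name)
    return re.sub(r"\^\{([^}]*)\}", r"<sup>\1</sup>", name)

print(to_html("$l^{5}"))  # λ<sup>5</sup>
print(to_html("pentacyclo[4.2.0.0^{2,5}.0^{3,8}.0^{4,7}]octane"))
```

The second call wraps each of the locant pairs 2,5, 3,8 and 4,7 in a `<sup>` element and leaves the rest of the name unchanged.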

Multiple components in a disconnected molecule, apart from common salts and counter ions, are separated from each other by a semicolon followed by a space. Mixtures containing salts are written with the cations first, then the compound name, then the anions, and finally any common neutral molecules (e.g. hydrate or hydrochloride).
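The ordering rule for salts can be sketched as a stable sort over pre-classified components. In this example the classification labels, `order_components` and `join_independent` are all hypothetical; how the library classifies components, and how it joins the parts of a salt, is not described here, so the salt example returns an ordered list rather than a finished name.

```python
# Rank of each component kind, following the order given in the text.
RANK = {"cation": 0, "compound": 1, "anion": 2, "neutral": 3}

def order_components(components):
    """Order (name, kind) pairs: cations, compound, anions, neutrals.

    sorted() is stable, so components of the same kind keep their
    original relative order.
    """
    ranked = sorted(components, key=lambda item: RANK[item[1]])
    return [name for name, _kind in ranked]

def join_independent(names):
    """Join components that are not salts or counter ions with '; '."""
    return "; ".join(names)

parts = [("hydrate", "neutral"), ("acetate", "anion"), ("sodium", "cation")]
print(order_components(parts))                   # ['sodium', 'acetate', 'hydrate']
print(join_independent(["ethanol", "benzene"]))  # ethanol; benzene
```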

Output Name Styles

The LEXICHEM compound naming functionality supports the generation of several styles of compound names. The currently predefined name styles are OpenEye (the default), IUPAC, CAS, Traditional and Systematic. OpenEye names loosely correspond to the kinds of names familiar to a medicinal chemist, such as those found in a Sigma-Aldrich catalog or a Journal of Medicinal Chemistry article. These names are intended to be a subset of the names acceptable under the IUPAC 2005 standard, but not necessarily the PIN (Preferred IUPAC Name).

IUPAC names are intended to follow the IUPAC 2005 recommendations for the Preferred IUPAC Name (PIN). Future releases of LEXICHEM may further refine this definition to provide IUPAC2005, IUPAC93 and IUPAC79 name styles that reflect the corresponding standard’s preferred name.

The LEXICHEM CAS name style is intended to follow the Chemical Abstracts Service’s naming conventions, where they differ from IUPAC’s.

The Traditional name style corresponds to forms of compound naming that are no longer acceptable under the IUPAC rules. When a trivial or common name is acceptable to IUPAC but not preferred, the boundary between OpenEye and Traditional is blurred, with OpenEye attempting to follow the more prevalent usage.

Finally, Systematic names correspond to the fully systematic IUPAC names that the IUPAC preferred names are slowly converging towards.

Examples of Name Style Differences

Some of the concepts explained in the previous section are probably best clarified through some real examples.

Example OpenEye vs. IUPAC vs. Systematic Differences

  • The SMILES string O is called ‘water’ by the OpenEye name style, but ‘oxidane’ by the IUPAC and Systematic name styles.
  • The SMILES C#C is called ‘acetylene’ by the OpenEye and IUPAC name styles, but ‘ethyne’ by the Systematic name style.
  • The SMILES prefix *Nc1ccccc1 is called ‘anilino’ by the OpenEye and IUPAC name styles, but ‘phenylamino’ by the Systematic name style.
  • The SMILES prefix *O[N+]#[C-] is called ‘fulminato’ by the OpenEye name style, but ‘isocyanooxy’ by the IUPAC and Systematic name styles.
  • The SMILES prefix *C(=O)C is called ‘acetyl’ in the OpenEye and IUPAC name styles, but ‘ethanoyl’ in the Systematic name style.
  • The SMILES string CC(=O)C is called ‘acetone’ in the OpenEye name style, but ‘propan-2-one’ in the IUPAC and Systematic name styles.
  • The SMILES string C(=O)O is called ‘formic acid’ in the OpenEye and IUPAC name styles, but ‘methanoic acid’ in the Systematic name style.

Example OpenEye/IUPAC vs. CAS Differences

  • The SMILES string c1ccccc1CCCCCCC is named as ‘1-phenylheptane’ by the OpenEye and IUPAC name styles, but as ‘heptylbenzene’ by the CAS name style.
  • The SMILES prefix *[BH2] is called ‘boranyl’ by the OpenEye and IUPAC name styles, but ‘boryl’ by the CAS name style.

Example OpenEye/IUPAC vs. Traditional Differences

  • The SMILES prefix *S is called ‘sulfanyl’ by the OpenEye and IUPAC name styles, but ‘mercapto’ by the Traditional name style.
  • The SMILES string CCCCCCCCC(=O)O is called ‘nonanoic acid’ by the OpenEye and IUPAC name styles, but ‘pelargonic acid’ by the Traditional name style.
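The whole-molecule examples in this section can be collected into a small lookup table, which is convenient as expected values when checking name style output. The table below contains only names stated above; `name_for` is a hypothetical helper that does an exact string lookup on the SMILES as written, with no canonicalization, and returns None where this section gives no name for a style.

```python
# Names taken from the examples in this section; styles not mentioned
# for a given structure are deliberately left out.
STYLE_EXAMPLES = {
    "O": {"OpenEye": "water", "IUPAC": "oxidane", "Systematic": "oxidane"},
    "C#C": {"OpenEye": "acetylene", "IUPAC": "acetylene", "Systematic": "ethyne"},
    "CC(=O)C": {"OpenEye": "acetone", "IUPAC": "propan-2-one",
                "Systematic": "propan-2-one"},
    "C(=O)O": {"OpenEye": "formic acid", "IUPAC": "formic acid",
               "Systematic": "methanoic acid"},
    "c1ccccc1CCCCCCC": {"OpenEye": "1-phenylheptane", "IUPAC": "1-phenylheptane",
                        "CAS": "heptylbenzene"},
    "CCCCCCCCC(=O)O": {"OpenEye": "nonanoic acid", "IUPAC": "nonanoic acid",
                       "Traditional": "pelargonic acid"},
}

def name_for(smiles: str, style: str):
    """Look up the documented example name, or None if not listed."""
    return STYLE_EXAMPLES.get(smiles, {}).get(style)

print(name_for("CC(=O)C", "OpenEye"))     # acetone
print(name_for("CC(=O)C", "Systematic"))  # propan-2-one
print(name_for("O", "CAS"))               # None
```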