SMILES Line Notation
An overview of SMILES syntax can be found in the SMILES documentation on the Daylight Chemical Information Systems site.
Unfortunately, there are a number of ambiguities in the original paper describing the Daylight SMILES syntax (refs below). These have led different SMILES parser implementations to accept or reject different SMILES. The following historical table illustrates some of these corner cases, which continue to be an issue with SMILES parsers.
| SMILES | Daylight 4.4-4.9 | Corina 1.6 | Corina WWW | Concord 3.2.1 | COBRA 3.21A | Synopsis 4.0 | OEChem 4.1.0.0 |
|---|---|---|---|---|---|---|---|
|  | Y | Y | Y | N | N | Y | Y |
|  | Y | Y | Y | N | N | N | Y |
|  | Y | Y | Y | N | N | Y | Y |
|  | Y | N | N | N | Y | N | Y |
|  | Y | Y | Y | N | N | Y | Y |
|  | Y | Y | N | Y | Y | Y | Y |
|  | N | N | N | N | N | Y | Y |
|  | N | N | N | N | N | Y | Y |
|  | N | N | N | Y | Y | N | Y |
|  | N | Y | N | Y | Y | N | Y |
|  | N | Y | Y | Y | Y | Y | Y |
|  | N | Y | N | N | Y | N | N |
|  | N | Y | N | Y | N | Y | Y |
|  | N | Y | N | Y | N | N | Y |
|  | N | Y | N | N | N | N | Y |
|  | N | N | N | N | N | N | Y |
|  | N | Y | N | N | N | N | Y |
|  | N | Y | N | N | N | N | Y |
|  | N | N | N | N | N | N | Y |
|  | N | N | N | N | N | N | Y |
Because of these types of SMILES ambiguities, the OEChem TK SMILES parser has two modes. The default is `relaxed`, which is more forgiving and accepts SMILES as per the table above and enables the SMILES extensions described in the next section. The SMILES parser also has a `strict` mode that is far less forgiving about dubious SMILES strings.
Extensions to Daylight SMILES
The OEChem TK SMILES parsers support several minor extensions to Daylight syntax. Each extension and its motivation are described below.
- Quadruple Bond
  In addition to `-`, `=` and `#` for specifying single, double and triple bonds respectively, OEChem TK also supports `$` for specifying quadruple bonds. An example is octachlorodirhenate(III), which is written as `[Re-](Cl)(Cl)(Cl)(Cl)$[Re-](Cl)(Cl)(Cl)Cl`.
- Unquoted and Additional Elements
  In addition to the standard Daylight unquoted elements, B, C, N, O, F, P, S, Cl, Br and I, OEChem TK’s SMILES readers also allow `H`, `D` and `T` to specify hydrogen, deuterium and tritium. Additionally, to support Syracuse SMILES, ‘CL’ and ‘BR’ are considered ‘Cl’ and ‘Br’. The periodic table is also extended from 102 to 109 elements, e.g. `[Sg]` for seaborgium, with the addition of `[D]` and `[T]` representing `[2H]` and `[3H]` respectively.

  OEChem TK may support `Na`, `Li` and `K` as unquoted elements to support Syracuse SMILES at some point in the future.
- Aromatic Tellurium
  In order to support OpenEye’s aromaticity model, which allows tellurium to be aromatic, the SMILES parser has been extended to support `[te]`, such as in tellurophene, `[te]1cccc1`, which follows in the sequence furan (`o1cccc1`), thiophene (`s1cccc1`) and selenophene (`[se]1cccc1`).
- Atom Maps in Molecules
  Traditionally, SMILES atom maps, e.g. `[Pb:1]`, are only ever used and specified in reaction molecules, e.g. `[Pb:1]>>[Au:1]`. However, OEChem TK extends this notion to allow atom maps to be used in discrete molecules. This is often useful for denoting significant sites or attachment points in a molecule. Traditionally in SMILES, isotopes of element zero have been used to perform this role; in OEChem TK, however, both `[*:1]` and `[1*]` may be used.
- RGroup Attachment Points
  As a shorthand for specifying templates for combinatorial libraries, and to support existing Cactus and JChem/Marvin usage, OEChem TK allows `[R2]` to be used as shorthand for `[*:2]`. For inquiring minds, the SMILES `[R2:3]` is interpreted as `[*:3]` or `[R3]`, with the last specification taking priority.
- External Bond Attachment Points
  OEChem TK SMILES also supports external closures, or potentially unsatisfied ring closures. The syntax is an ampersand followed by a ring closure specification, i.e. an optional bond order followed by either a single digit or a `%` character and two digits. The index spaces of external bonds and ring closures are separate, so the ring closure `2` and the external attachment point `&2` are unrelated.

  When external attachment points are paired within a SMILES string, they behave identically to ring closures, using a separate index space. Hence, the SMILES `c&1ccccc&1` is interpreted the same way as `c1ccccc1`, and `C&1.C&1` is interpreted like `C1.C1`, i.e. the SMILES `CC`.

  However, unlike ring closures, unpaired external attachment points are allowed and are interpreted like RGroup attachment points above. Hence, the SMILES `CC&1` (on its own) is equivalent to the RGroup attachment SMILES `CC[R1]`, which is equivalent to the atom-mapped molecule `CC[*:1]`.

  The major advantage of these semantics, inspired by Daylight’s CHUCKLES, is that they allow convenient enumeration of combinatorial libraries using string concatenation. For example, three components of a library may be specified as `C&1CCC&2`, `F&1` and `Br&2`. The combined string `C&1CCC&2.F&1.Br&2` is interpreted as the reaction product, i.e. `FCCCCBr`.

  As with ring closures, bond orders may be specified after the ampersand and before the closure index, e.g. `C&=1`, and two-digit closures are indicated by a `%` prefix, e.g. `C&%12` or `C&=%12`.
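The RGroup and external attachment point conventions above can be illustrated with a small, library-independent sketch. It does not call OEChem TK and is not its parser; it only manipulates strings according to the rules as described in this section. The function names (`external_indices`, `unpaired_indices`, `expand_rgroups`, `enumerate_library`) are hypothetical, and the set of bond symbols accepted after the ampersand is an assumption. It is intended for plain SMILES, not SMARTS, where `&` has a different meaning.

```python
import re
from collections import Counter
from itertools import product

# '&', an optional bond symbol, then one digit or '%' plus two digits.
# The exact set of bond symbols accepted here is an assumption of this sketch.
_EXTERNAL = re.compile(r"&([-=#$:]?)(%\d{2}|\d)")

# '[R2]' or '[R2:3]'; when a map index follows ':', it takes priority.
_RGROUP = re.compile(r"\[R(\d+)(?::(\d+))?\]")


def external_indices(smiles):
    """Return every external attachment index in the string, in order."""
    return [int(m.group(2).lstrip("%")) for m in _EXTERNAL.finditer(smiles)]


def unpaired_indices(smiles):
    """Return the sorted indices left unpaired (those seen an odd number of times)."""
    counts = Counter(external_indices(smiles))
    return sorted(index for index, n in counts.items() if n % 2 == 1)


def expand_rgroups(smiles):
    """Rewrite the '[R2]' shorthand as the atom-mapped form '[*:2]'."""
    return _RGROUP.sub(lambda m: "[*:%s]" % (m.group(2) or m.group(1)), smiles)


def enumerate_library(scaffold, *substituent_lists):
    """Yield one dot-joined SMILES string per combination of substituents."""
    for combo in product(*substituent_lists):
        yield ".".join((scaffold,) + combo)


if __name__ == "__main__":
    # Each product string pairs up every attachment point, so nothing is left open.
    for smi in enumerate_library("C&1CCC&2", ["F&1", "Cl&1"], ["Br&2"]):
        print(smi, unpaired_indices(smi))
```

A string for which `unpaired_indices` returns an empty list has all of its external attachment points paired and so describes a complete molecule under these semantics; a non-empty result lists the attachment points that would be read as RGroups.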
See also