Substructure Search with MDL Queries

Section Substructure Search describes how to perform substructure search initialized with a SMILES or SMARTS string. OEChem TK also provides the ability to interpret and utilize query structures expressed in the MDL query file format (see MDL query example in Figure: Example of MDL query substructure). Listing 1 shows how to initialize an OESubSearch object from a MDL query file and perform a substructure search.

../_images/MDLQuery.png

Example of MDL query substructure

Listing 1: Example of substructure search using MDL query file

using System;
using OpenEye.OEChem;

public class MDLQuerySearchSingleMatch
{
    public static int Main(string[] args)
    {
        oemolistream qfile = new oemolistream("query.mol");
        oemolistream tfile = new oemolistream("targets.sdf");

        // set the same aromaticity model for the query and the target file
        uint aromodel = OEIFlavor.Generic.OEAroModelMDL;
        uint qflavor = qfile.GetFlavor(qfile.GetFormat());
        qfile.SetFlavor(qfile.GetFormat(), (qflavor | aromodel));
        uint tflavor = tfile.GetFlavor(tfile.GetFormat());
        tfile.SetFlavor(tfile.GetFormat(), (tflavor | aromodel));

        // read MDL query and initialize the substructure search
        uint opts = OEMDLQueryOpts.Default | OEMDLQueryOpts.SuppressExplicitH;
        OEQMol qmol = new OEQMol();
        OEChem.OEReadMDLQueryFile(qfile, qmol, opts);
        OESubSearch ss = new OESubSearch(qmol);
        // loop over target structures
        int tindex = 1;
        OEGraphMol tmol = new OEGraphMol();

        while (OEChem.OEReadMolecule(tfile, tmol))
        {
            OEChem.OEPrepareSearch(tmol, ss);
            if (ss.SingleMatch(tmol))
            {
                Console.WriteLine("hit target = " + tindex);
            }

            tindex += 1;
        }
        return 0;
    }
}

After opening the MDL query and the target files, the model used to assign aromaticity to the imported structures can be adjusted.

uint aromodel = OEIFlavor.Generic.OEAroModelMDL;
uint qflavor = qfile.GetFlavor(qfile.GetFormat());
qfile.SetFlavor(qfile.GetFormat(), (qflavor | aromodel));
uint tflavor = tfile.GetFlavor(tfile.GetFormat());
tfile.SetFlavor(tfile.GetFormat(), (tflavor | aromodel));

If the aromaticity model is not specified for the input files, then the OpenEye’s aromaticity model is used by default. For more information about the various aromaticity models of OEChem TK see Aromaticity Perception.

OEChem.OEReadMDLQueryFile(qfile, qmol, opts);
OESubSearch ss = new OESubSearch(qmol);

In general, the aromaticity model chosen should be consistent between the query and target molecules to be searched. Using different aromaticity models may produce false negatives as aromatic systems may be treated differently. Section Aromaticity further explains the effects of using various aromaticity models when performing a substructure search.

OEReadMDLQueryFile function reads the MDL query directly into a OEQMolBase object , which then can be used to initialize an OESubSearch instance.

The MDL query structure can also be read into a OEMolBase object (see code snippet below). In this case, the OEReadMDLQueryFile function attaches the query features present in the input MDL file to the related atoms and bonds of the OEMolBase object. The OEQMolBase object can be subsequently created by calling the OEBuildMDLQueryExpressions function.

OEGraphMol mol = new OEGraphMol();
OEChem.OEReadMDLQueryFile(qfile,mol);
// mol can be manipulated here
OEQMol qmol = new OEQMol();
// build OEQMol with OEMDLQueryOpts.Default option
OEChem.OEBuildMDLQueryExpressions(qmol,mol);
OESubSearch ss = new OESubSearch(qmol);

The declaration of these functions are:

bool OEReadMDLQueryFile(oemolistream ifs, OEMolBase mol);
bool OEReadMDLQueryFile(oemolistream ifs, OEQMolBase qmol, int opts);
// opts (default value = OEMDLQueryOpts.Default)
bool OEBuildMDLQueryExpressions(OEQMolBase qmol, OEMolBase mol, uint opts);
// opts (default value = OEMDLQueryOpts.Default)

MDL Query Interpretation

The opts parameter defines how the MDL query is interpreted when an OEQMolBase object is constructed. The following options are present in the OEMDLQueryOpts namespace:

  1. OEMDLQueryOpts.Default

    Only constraints explicitly specified in the MDL file are added to the OEQMolBase query structure. See section Supported MDL Query Features for the supported MDL query features.

  2. OEMDLQueryOpts.SuppressExplicitH

    This option controls how the explicit hydrogens of the query are matched to the explicit/implicit hydrogens of the target structures. For more information see Explicit Hydrogens.

  3. OEMDLQueryOpts.AddBondAliphaticConstraint

    If this option is specified, then an aliphatic query bond can only be mapped to the aliphatic bonds in the target structure. Figure: Interpretation A shows how the MDL query structure is interpreted when the OEMDLQueryOpts.AddBondAliphaticConstraint option is used.

    Query ‘A’ will match all three of the target compounds displayed in Figure: Interpretation A. If the OpenEye model is used to perceive aromatic rings, then query ‘B’ substructure is present only in target ‘2’. If the MDL aromaticity model is used, then target ‘3’ is also a hit, since five-membered heterocycles are not considered aromatic in this model. For more information about different aromaticity models and their effects on substructure searching see section Aromaticity.

    ../_images/mdlq-ConstAliphatic.png

    Interpretation A

    Interpretation of the MDL query with ‘AddBondAliphaticConstraint’ option.

  4. OEMDLQueryOpts.AddBondTopologyConstraint

    By default, a bond that is part of any ring system in the query structure can only be mapped to ring bonds in the target structure. If the OEMDLQueryOpts.AddBondTopologyConstraint option is specified, constraints are also added to the chain bonds of the query in order to map them to only chain bonds in the target.

    Figure: Interpretation B shows how the MDL query structure is interpreted when the OEMDLQueryOpts.AddBondTopologyConstraint option is used. Query ‘A’ will match all three of the target compounds displayed in Figure: Interpretation B, while query ‘B’ is present only in target ‘3’.

    If the ‘ring’ property is specified in the MDL query file for a particular chain bond in the query structure, then its topology constraint is overridden.

    ../_images/mdlq-ConstTopology.png

    Interpretation B

    Interpretation of the MDL query with `AddBondTopologyConstrain’ option.

  5. OEMDLQueryOpts.MatchAtomStereo

    When this option is enabled, an S/R atom stereo configuration in the query will only match any S/R configuration in the target molecule, but not to a R/S (opposite) or an unspecified one. If an atom stereo configuration is undefined in the query, it is allowed to match to any target atom regardless of its stereo configuration.

  6. OEMDLQueryOpts.MatchIsotope

    When this option is enabled, query atoms with an isotope can match to a target atom only if they have the same atomic mass. If the query atom has no specified isotope number, it will match to any target atom regardless of its atomic mass.

Supported MDL Query Features

The table below summarizes the MDL query features currently supported by the OEChem TK. These query feature can only be read with the low level OEReadMDLQueryFile function and not the high-level OEReadMolecule function.

Summary of supported MDL query features in V2000 and V3000

CTFile V2K

CTFile V3K

Notes

M  CHG

CHG=

formal charge

<atomblock>

HCOUNT=

number of allowed hydrogens V2K:0=off, 1=H0, 2-5 for H1-H4; V3K:0=off, -1=H0, 1-5 for H1-H5

M  ISO

MASS=

mass difference

M  RBC

RBCNT=

limit of number of allowed ring bonds attached to an atom

M  SUB

SUBST=

number of allowed substitutions of an atom

M  ALS

[aaa,bbb,..]

list alternative atom types for an atom

<atomblock>

STBOX=

stereo care (e.g. for c/t doubles)

L,A,Q

L,A,Q

L=atom list, A=Any atom except hydrogen, Q=Any atom except hydrogen and carbon

M  UNS

UNSAT=

specifies whether or not an atom is unsaturated, i.e., having at least one multiple bond

<bondblock>

STBOX=

bond stereochemistry that is considered, if both ends of the bond are marked with stereo care flags in the atom block

<bondblock>

TOPO=

bond topology (0=off, 1=ring, 2=chain)

<bondblock>

<bondblock>

4-aromatic, 5=single or double, 6=single or aromatic, 7=double or aromatic, 8=any

../_images/mdlq-Example-Query.png

Query

Example of MDL query structure.

See also

Aromaticity

Perceiving aromaticity in the query and the target structures is important in order to ensure that the result of a substructure search is independent of the different Kekulé representations of the participating structures. A query bond which is part of an aromatic ring system can be mapped to any aromatic bonds in the target. Figure: Aromaticity Match shows an example where both Kekulé representations of the benzene-1,2-diol substructure are present in the two target structures.

../_images/mdlq-Aromaticity-Kekule.png

Aromaticity Match

The result of substructure search is independent of the Kekulé representations of the participating structures.

Altering the aromaticity model will affect the results of a substructure search. Figure: Aromaticity AFigure: Aromaticity C show several examples where different results were obtained by applying the MDL and the OpenEye aromaticity models. This is a consequence of the fact that in the MDL aromaticity model, five-membered heterocycles are not considered aromatic.

Note

It is highly recommended to apply the same aromaticity model to the query and the target structures.

Listing 1 shows an example of how to change the aromaticity flavor of input files.

../_images/mdlq-Aromaticity-A.png

Aromaticity A

Example of a substructure search applying different aromaticity models.

../_images/mdlq-Aromaticity-B.png

Aromaticity B

Example of a substructure search applying different aromaticity models.

../_images/mdlq-Aromaticity-C.png

Aromaticity C

Example of a substructure search applying different aromaticity models.

Aromaticity with Generic Atoms

If a ring in the query structure contains generic atom(s) (see example in Figure: Generic atom example A), then the aromaticity of the ring can not be perceived. In order to maintain the independence from the Kekulé representation, 6-membered rings with alternating single/double bonds are assumed to be aromatic.

../_images/mdlq-Aromaticity-Mem6-GenericAtom.png

Generic atom example A

Example of aromaticity assumption in a 6-membered ring with generic atoms.

Similarly, a 5-membered ring with generic atom(s) is considered aromatic if it is composed of two single and two double bonds. See Example in Figure: Generic atom example B.

../_images/mdlq-Aromaticity-Mem5-GenericAtom.png

Generic atom example B

Example of aromaticity assumption in a 5-membered ring with generic atoms.

Explicit Hydrogens

During the substructure search, each query atom has to be mapped to a target atom in order to detect subgraph isomorphism. Therefore, a problem can arise if the query structure contains explicit hydrogens or an atom list with hydrogens (see example in Figure: Query), but the target structure has implicit hydrogens Figure: Targets)

../_images/MDLQueryExplicitH.png

Example of MDL query structure with explicit hydrogens

Listing 2: Example of substructure search with accessing atom mapping

using System;
using OpenEye.OEChem;

public class MDLQuerySearchMapping
{
    public static int Main(string[] args)
    {
        oemolistream qfile = new oemolistream("query.mol");
        oemolistream tfile = new oemolistream("targets.sdf");

        //read MDL query and initialize the substructure search
        OEQMol qmol = new OEQMol();
        OEChem.OEReadMDLQueryFile(qfile, qmol);
        OESubSearch ss = new OESubSearch(qmol);

        // loop over target structures
        int tindex = 1;
        OEGraphMol tmol = new OEGraphMol();
        while (OEChem.OEReadMolecule(tfile, tmol))
        {
            OEChem.OEAddExplicitHydrogens(tmol);
            OEChem.OEPrepareSearch(tmol, ss);
            foreach (OEMatchBase match in ss.Match(tmol, true))
            {
                Console.Write("hit target = " + tindex);

                foreach (OEAtomBase a in match.GetTargetAtoms())
                {
                    Console.Write(" " + a.GetIdx() + OEChem.OEGetAtomicSymbol(a.GetAtomicNum()));
                }

                Console.WriteLine();
            }

            tindex += 1;
        }
        return 0;
    }
}

This problem can be solved in two ways:

  1. The explicit hydrogens in the query molecule can be suppressed during OEQMolBase construction by using the OEMDLQueryOpts.SuppressExplicitH option (see example in Listing 1). A query atom can only be mapped to a target atom if it has at least as many implicit hydrogens as explicit hydrogens were removed from the query atom. This solution is recommended only if the presence or absence of the query substructure is of interest, but not the complete mapping between the query and the target. The SingleMatch function returns whether or not the query is present in the target, but the mapping is not accessible.

  1. If the complete mapping between the query and the target is of interest, the OEAddExplicitHydrogens function has to be called for each target structure before performing the substructure search. (See example in Listing 2).

In both cases, the query presented in Figure: Query will match only target ‘C’ and ‘D’ shown in Figure: Targets. Figure: Atom mapping shows the three detected substructures in target ‘C’ when adding explicit hydrogens to the target structures.

The execution of the substructure search is significantly faster if the hydrogens are suppressed in the target structures, since the search space can be an order of magnitude smaller.

../_images/MDLQueryExplicitHMatch.png

Targets – Example of target structures with implicit hydrogens

../_images/MDLQueryExplicitHMapping.png

Atom mapping – Example of substructure search with atom mapping