Substructure Search with MDL Queries

Section Substructure Search describes how to perform a substructure search initialized with a SMILES or SMARTS string. OEChem can also interpret and use query structures expressed in the MDL query file format (see the MDL query example in Figure: Example of MDL query substructure). Listing 1 shows how to initialize an OESubSearch object from an MDL query file and perform a substructure search.

../_images/MDLQuery.png

Example of MDL query substructure

Listing 1: Example of substructure search using MDL query file

#!/usr/bin/env python
from __future__ import print_function

from openeye.oechem import *

qfile = oemolistream("query.mol")
tfile = oemolistream("targets.sdf")

# set the same aromaticity model for the query and the target file
aromodel = OEIFlavor_Generic_OEAroModelMDL
qflavor = qfile.GetFlavor(qfile.GetFormat())
qfile.SetFlavor(qfile.GetFormat(), (qflavor | aromodel))
tflavor = tfile.GetFlavor(tfile.GetFormat())
tfile.SetFlavor(tfile.GetFormat(), (tflavor | aromodel))

# read MDL query and initialize the substructure search
opts = OEMDLQueryOpts_Default | OEMDLQueryOpts_SuppressExplicitH
qmol = OEQMol()

OEReadMDLQueryFile(qfile, qmol, opts)
ss = OESubSearch(qmol)

# loop over target structures
tindex = 1
for tmol in tfile.GetOEGraphMols():
    OEPrepareSearch(tmol, ss)
    if ss.SingleMatch(tmol):
        print("hit target =", tindex)
    tindex += 1

After opening the MDL query and the target files, the model used to assign aromaticity to the imported structures can be adjusted.

aromodel = OEIFlavor_Generic_OEAroModelMDL
qflavor = qfile.GetFlavor(qfile.GetFormat())
qfile.SetFlavor(qfile.GetFormat(), (qflavor | aromodel))
tflavor = tfile.GetFlavor(tfile.GetFormat())
tfile.SetFlavor(tfile.GetFormat(), (tflavor | aromodel))
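Because the same three-step flavor update is applied to both streams, it can be factored into a small helper. The following is a minimal sketch, not part of OEChem: the helper name add_input_flavor and the FakeStream class are hypothetical, and FakeStream only imitates the GetFormat, GetFlavor and SetFlavor calls used above so that the sketch runs without the toolkit.

```python
def add_input_flavor(stream, extra_bits):
    """OR extra flavor bits into the input flavor of the stream's current format."""
    fmt = stream.GetFormat()
    stream.SetFlavor(fmt, stream.GetFlavor(fmt) | extra_bits)


class FakeStream(object):
    """Hypothetical stand-in for oemolistream (illustration only)."""

    def __init__(self, fmt, flavor):
        self._fmt = fmt
        self._flavors = {fmt: flavor}

    def GetFormat(self):
        return self._fmt

    def GetFlavor(self, fmt):
        return self._flavors[fmt]

    def SetFlavor(self, fmt, flavor):
        self._flavors[fmt] = flavor


# Made-up format id and flavor bits, chosen only to show the bitwise OR.
stream = FakeStream(fmt=1, flavor=0b0001)
add_input_flavor(stream, 0b0100)
print(bin(stream.GetFlavor(1)))  # 0b101
```

With real streams, the same helper would be called as add_input_flavor(qfile, aromodel) and add_input_flavor(tfile, aromodel), which keeps the query and the target on the same aromaticity model.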

If the aromaticity model is not specified for the input files, then the OpenEye aromaticity model is used by default. For more information about the various aromaticity models of OEChem see Aromaticity Perception.

In general, the aromaticity model chosen should be consistent between the query and target molecules to be searched. Using different aromaticity models may produce false negatives, because the same ring system may be perceived as aromatic in one structure and as non-aromatic in the other. Section Aromaticity further explains the effects of using various aromaticity models when performing a substructure search.

The OEReadMDLQueryFile function reads the MDL query directly into an OEQMolBase object, which can then be used to initialize an OESubSearch instance.

OEReadMDLQueryFile(qfile, qmol, opts)
ss = OESubSearch(qmol)

The MDL query structure can also be read into an OEMolBase object (see code snippet below). In this case, the OEReadMDLQueryFile function attaches the query features present in the input MDL file to the related atoms and bonds of the OEMolBase object. The OEQMolBase object can subsequently be created by calling the OEBuildMDLQueryExpressions function.

mol = OEGraphMol()
OEReadMDLQueryFile(qfile, mol)
# mol can be manipulated here
qmol = OEQMol()
# build OEQMol with OEMDLQueryOpts_Default option
OEBuildMDLQueryExpressions(qmol, mol)
ss = OESubSearch(qmol)

The declarations of these functions are:

OEReadMDLQueryFile(ifs, mol)
# ifs-oemolistream, mol-OEMolBase
# returns true or false
OEReadMDLQueryFile(ifs, qmol, opts)
# ifs-oemolistream, qmol-OEQMolBase, opts-integer with OEMDLQueryOpts_Default
# returns true or false
OEBuildMDLQueryExpressions(qmol, mol, opts)
# qmol-OEQMolBase, mol-OEMolBase, opts-integer with OEMDLQueryOpts_Default
# returns true or false
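Since these functions report failure through their return value, a caller may want to check it before building the search object. The sketch below is one possible way to do that, under stated assumptions: the helper name load_mdl_query is hypothetical, the guarded import only exists so that the snippet can be loaded on a machine without the toolkit, and raising Python exceptions is a choice made for this illustration rather than an OEChem convention.

```python
try:
    from openeye import oechem
except ImportError:  # toolkit not installed; the helper below then raises
    oechem = None


def load_mdl_query(path, suppress_explicit_h=True):
    """Read an MDL query file and return an OESubSearch built from it."""
    if oechem is None:
        raise RuntimeError("the OpenEye toolkits are not installed")

    qfile = oechem.oemolistream()
    if not qfile.open(path):
        raise IOError("cannot open query file: %s" % path)

    # Start from the default interpretation and add options as needed.
    opts = oechem.OEMDLQueryOpts_Default
    if suppress_explicit_h:
        opts |= oechem.OEMDLQueryOpts_SuppressExplicitH

    qmol = oechem.OEQMol()
    if not oechem.OEReadMDLQueryFile(qfile, qmol, opts):
        raise ValueError("cannot read an MDL query from: %s" % path)

    return oechem.OESubSearch(qmol)
```

A call such as ss = load_mdl_query("query.mol") would then replace the unchecked read in Listing 1.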

MDL Query Interpretation

The opts parameter defines how the MDL query is interpreted when an OEQMolBase object is constructed. The following options are present in the OEMDLQueryOpts namespace:

  1. OEMDLQueryOpts_Default

    Only constraints explicitly specified in the MDL file are added to the OEQMolBase query structure. See section Supported MDL Query Features for the supported MDL query features.

  2. OEMDLQueryOpts_SuppressExplicitH

    This option controls how the explicit hydrogens of the query are matched to the explicit/implicit hydrogens of the target structures. For more information see Explicit Hydrogens.

  3. OEMDLQueryOpts_AddBondAliphaticConstraint

    If this option is specified, then an aliphatic query bond can only be mapped to the aliphatic bonds in the target structure. Figure: Interpretation A shows how the MDL query structure is interpreted when the OEMDLQueryOpts_AddBondAliphaticConstraint option is used.

    Query ‘A’ will match all three of the target compounds displayed in Figure: Interpretation A. If the OpenEye model is used to perceive aromatic rings, then query ‘B’ substructure is present only in target ‘2’. If the MDL aromaticity model is used, then target ‘3’ is also a hit, since five-membered heterocycles are not considered aromatic in this model. For more information about different aromaticity models and their effects on substructure searching see section Aromaticity.

../_images/mdlq-ConstAliphatic.png

Interpretation A

Interpretation of the MDL query with the ‘AddBondAliphaticConstraint’ option.

  4. OEMDLQueryOpts_AddBondTopologyConstraint

    By default, a bond that is part of any ring system in the query structure can only be mapped to ring bonds in the target structure. If the OEMDLQueryOpts_AddBondTopologyConstraint option is specified, constraints are also added to the chain bonds of the query in order to map them to only chain bonds in the target.

    Figure: Interpretation B shows how the MDL query structure is interpreted when the OEMDLQueryOpts_AddBondTopologyConstraint option is used. Query ‘A’ will match all three of the target compounds displayed in Figure: Interpretation B, while query ‘B’ is present only in target ‘3’.

    If the ‘ring’ property is specified in the MDL query file for a particular chain bond in the query structure, then its topology constraint is overridden.

../_images/mdlq-ConstTopology.png

Interpretation B

Interpretation of the MDL query with the ‘AddBondTopologyConstraint’ option.

  5. OEMDLQueryOpts_MatchIsotope
  6. OEMDLQueryOpts_MatchAtomStereo
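The bond topology rules described for OEMDLQueryOpts_AddBondTopologyConstraint can be summarized as a small decision function. This is only a conceptual model of the text above, not OEChem code: the function name, its parameters, and the assumption that an explicit ‘ring’ or ‘chain’ property in the MDL file always takes precedence are illustrative readings of the description.

```python
def topology_allows(query_in_ring, target_in_ring,
                    add_topology_constraint=False,
                    explicit_topology=None):
    """Return True if a query bond may be mapped to a target bond.

    explicit_topology is None when the MDL file sets no topology for the
    bond, otherwise "ring" or "chain" (assumed to override everything else).
    """
    if explicit_topology == "ring":
        return target_in_ring
    if explicit_topology == "chain":
        return not target_in_ring
    if query_in_ring:
        # Default behavior: ring bonds of the query match only ring bonds.
        return target_in_ring
    if add_topology_constraint:
        # With the option set, chain bonds of the query match only chain bonds.
        return not target_in_ring
    # Without the option, a chain bond of the query is unconstrained.
    return True


print(topology_allows(False, True))                                # True
print(topology_allows(False, True, add_topology_constraint=True))  # False
```

The two calls show the effect of the option on a chain bond of the query that is compared with a ring bond of the target.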

Supported MDL Query Features

Supported atom query features:

  1. The 8th column in the atom block is used to define the number of allowed hydrogens for an atom.
  2. Query atom types: A = any type except hydrogen, Q = any type except hydrogen and carbon, L = atom list.
  3. M  ALS line in the property block is used to list alternative atom types for an atom.
  4. M  CHG line in the property block is used to define atom formal charges.
  5. M  RBC line in the property block is used to limit the number of allowed ring bonds attached to an atom.
  6. M  SUB line in the property block is used to set the number of allowed substitutions of an atom.
  7. M  UNS line in the property block is used to specify whether or not an atom is unsaturated, i.e., having at least one multiple bond.
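To make the property block lines above more concrete, the following sketch collects the per-atom values from M  CHG, M  RBC, M  SUB and M  UNS lines of a V2000 file. It assumes the usual layout of these lines (an entry count followed by atom number and value pairs) and does not handle M  ALS lines; the function name, the dictionary keys and the example lines are made up for illustration. OEChem performs this parsing internally, so the code only illustrates the file format.

```python
PAIR_TAGS = {
    "CHG": "charge",
    "RBC": "ring_bond_count",
    "SUB": "substitution_count",
    "UNS": "unsaturated",
}


def parse_atom_properties(lines):
    """Map atom number -> {property name: raw value} for the supported tags."""
    props = {}
    for line in lines:
        if not line.startswith("M  "):
            continue
        tag = line[3:6]
        if tag not in PAIR_TAGS:
            continue
        fields = line[6:].split()
        count = int(fields[0])  # number of (atom, value) pairs on this line
        for i in range(count):
            atom_idx = int(fields[1 + 2 * i])
            value = int(fields[2 + 2 * i])
            props.setdefault(atom_idx, {})[PAIR_TAGS[tag]] = value
    return props


example = [
    "M  CHG  1   3  -1",
    "M  RBC  1   2   2",
    "M  SUB  1   5   1",
    "M  UNS  1   2   1",
    "M  END",
]
print(parse_atom_properties(example))
```

The values are returned as stored in the file; their interpretation follows the CTFile specification.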

Supported bond query features:

  1. Alternative bond types in the bond block (4 = aromatic, 5 = single or double, 6 = single or aromatic, 7 = double or aromatic, 8 = any bond).
  2. The 6th column in the bond block describes bond topology (0 = either, 1 = ring, 2 = chain).
  3. Double bond stereochemistry is considered if both ends of the bond are marked with stereo care flags in the atom block.

../_images/mdlq-Example-Query.png

Query

Example of MDL query structure.

Figure: Query shows an MDL query structure example with several different query features. For more details on atom and bond query features, please refer to the Accelrys CTFile Formats document (http://accelrys.com/products/collaborative-science/biovia-draw/ctfile-no-fee.html - registration required).
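The bond query features listed above are stored in fixed-width, three-character fields of each V2000 bond line. The sketch below decodes the bond type (third field) and the bond topology (sixth field) from such a line. The helper names and the two example lines are made up for illustration, and a missing trailing field is assumed to mean 0. OEChem does this decoding itself when reading the query.

```python
BOND_TYPES = {
    1: "single", 2: "double", 3: "triple", 4: "aromatic",
    5: "single or double", 6: "single or aromatic",
    7: "double or aromatic", 8: "any",
}
TOPOLOGY = {0: "either", 1: "ring", 2: "chain"}


def _field(line, start):
    """Read one 3-character integer field; blank or missing means 0."""
    text = line[start:start + 3].strip()
    return int(text) if text else 0


def parse_bond_line(line):
    """Decode atom numbers, bond type and topology from a V2000 bond line."""
    return {
        "begin": _field(line, 0),
        "end": _field(line, 3),
        "type": BOND_TYPES[_field(line, 6)],
        "topology": TOPOLOGY[_field(line, 15)],
    }


print(parse_bond_line("  1  2  5  0  0  1  0"))
print(parse_bond_line("  2  3  8  0"))
```

The first line describes a "single or double" bond restricted to rings; the second is an "any" bond with no topology restriction.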

Aromaticity

Perceiving aromaticity in the query and the target structures is important in order to ensure that the result of a substructure search is independent of the different Kekulé representations of the participating structures. A query bond that is part of an aromatic ring system can be mapped to any aromatic bond in the target. Figure: Aromaticity Match shows an example where both Kekulé representations of the benzene-1,2-diol substructure are present in the two target structures.

../_images/mdlq-Aromaticity-Kekule.png

Aromaticity Match

The result of substructure search is independent of the Kekulé representations of the participating structures.

Altering the aromaticity model will affect the results of a substructure search. Figure: Aromaticity A through Figure: Aromaticity C show several examples where different results were obtained by applying the MDL and the OpenEye aromaticity models. This is a consequence of the fact that in the MDL aromaticity model, five-membered heterocycles are not considered aromatic.

Note

It is highly recommended to apply the same aromaticity model to the query and the target structures.

Listing 1 shows an example of how to change the aromaticity flavor of input files.

../_images/mdlq-Aromaticity-A.png

Aromaticity A

Example of a substructure search applying different aromaticity models.

../_images/mdlq-Aromaticity-B.png

Aromaticity B

Example of a substructure search applying different aromaticity models.

../_images/mdlq-Aromaticity-C.png

Aromaticity C

Example of a substructure search applying different aromaticity models.

Aromaticity with Generic Atoms

If a ring in the query structure contains generic atom(s) (see example in Figure: Generic atom example A), then the aromaticity of the ring cannot be perceived. In order to maintain the independence from the Kekulé representation, 6-membered rings with alternating single/double bonds are assumed to be aromatic.

../_images/mdlq-Aromaticity-Mem6-GenericAtom.png

Generic atom example A

Example of aromaticity assumption in a 6-membered ring with generic atoms.
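The 6-membered ring rule can be expressed as a check on the bond orders met while walking once around the ring. The function below is a conceptual sketch of that rule only; its name and its input (a list of integer bond orders in ring order) are assumptions made for illustration, and it does not reproduce how OEChem stores or evaluates query rings.

```python
def assumed_aromatic_six_ring(ring_bond_orders):
    """True for a 6-membered ring whose bonds alternate single/double."""
    n = len(ring_bond_orders)
    if n != 6:
        return False
    if any(order not in (1, 2) for order in ring_bond_orders):
        return False
    # Every bond must differ from its neighbor, including the ring closure.
    return all(ring_bond_orders[i] != ring_bond_orders[(i + 1) % n]
               for i in range(n))


print(assumed_aromatic_six_ring([1, 2, 1, 2, 1, 2]))  # True
print(assumed_aromatic_six_ring([2, 1, 2, 1, 2, 1]))  # True
print(assumed_aromatic_six_ring([1, 1, 2, 1, 2, 1]))  # False
```

Both Kekulé forms give the same answer, which is the independence from the Kekulé representation that the assumption is meant to preserve.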

Similarly, a 5-membered ring with generic atom(s) is considered aromatic if it is composed of two single and two double bonds. See the example in Figure: Generic atom example B.

../_images/mdlq-Aromaticity-Mem5-GenericAtom.png

Generic atom example B

Example of aromaticity assumption in a 5-membered ring with generic atoms.

Explicit Hydrogens

During the substructure search, each query atom has to be mapped to a target atom in order to detect subgraph isomorphism. Therefore, a problem can arise if the query structure contains explicit hydrogens or an atom list with hydrogens (see example in Figure: Query), but the target structure has implicit hydrogens (see Figure: Targets).

../_images/MDLQueryExplicitH.png

Example of MDL query structure with explicit hydrogens

Listing 2: Example of substructure search with accessing atom mapping

#!/usr/bin/env python
from __future__ import print_function
from openeye.oechem import *

qfile = oemolistream("query.mol")
tfile = oemolistream("targets.sdf")

# read MDL query and initialize the substructure search
qmol = OEQMol()
OEReadMDLQueryFile(qfile, qmol)
ss = OESubSearch(qmol)

# loop over target structures
tindex = 1
for tmol in tfile.GetOEGraphMols():
    OEAddExplicitHydrogens(tmol)
    OEPrepareSearch(tmol, ss)
    for mi in ss.Match(tmol, True):
        print("hit target = %d" % tindex, end=" ")
        for ai in mi.GetTargetAtoms():
            print("%d%s" % (ai.GetIdx(), OEGetAtomicSymbol(ai.GetAtomicNum())), end=" ")
        print()
    tindex += 1

This problem can be solved in two ways:

  1. The explicit hydrogens in the query molecule can be suppressed during OEQMolBase construction by using the OEMDLQueryOpts_SuppressExplicitH option (see example in Listing 1). A query atom can then only be mapped to a target atom that has at least as many implicit hydrogens as the number of explicit hydrogens removed from the query atom. This solution is recommended only if the presence or absence of the query substructure is of interest, but not the complete mapping between the query and the target. The SingleMatch function returns whether or not the query is present in the target, but the mapping is not accessible.
  2. If the complete mapping between the query and the target is of interest, the OEAddExplicitHydrogens function has to be called for each target structure before performing the substructure search (see example in Listing 2).
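The hydrogen count condition of the first solution can be illustrated with a few lines of plain Python. This is a conceptual sketch of the rule as stated above, not the OEChem implementation; the function name and the list of implicit hydrogen counts are hypothetical.

```python
def candidate_targets(removed_query_h, target_implicit_h_counts):
    """Indices of target atoms that keep enough implicit hydrogens.

    removed_query_h: explicit hydrogens suppressed on the query atom.
    target_implicit_h_counts: implicit hydrogen count of each target atom.
    """
    return [idx for idx, h_count in enumerate(target_implicit_h_counts)
            if h_count >= removed_query_h]


# A query atom that lost two explicit hydrogens can only be mapped to
# target atoms carrying at least two implicit hydrogens.
print(candidate_targets(2, [3, 2, 1, 0]))  # [0, 1]
```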

In both cases, the query presented in Figure: Query will match only targets ‘C’ and ‘D’ shown in Figure: Targets. Figure: Atom mapping shows the three detected substructures in target ‘C’ when adding explicit hydrogens to the target structures.

The execution of the substructure search is significantly faster if the hydrogens are suppressed in the target structures, since the search space can be an order of magnitude smaller.

../_images/MDLQueryExplicitHMatch.png

Targets – Example of target structures with implicit hydrogens

../_images/MDLQueryExplicitHMapping.png

Atom mapping – Example of substructure search with atom mapping