Pattern Matching

OEChem TK includes facilities to perform different types of pattern (graph) matching. Graph matching is based on node (atom) and edge (bond) correspondences. An atom which satisfies the conditions of a node in a query graph is said to match. Likewise, a bond which satisfies the conditions of an edge in a query graph is said to match. Pattern matching is the process of identifying groupings of matching nodes and edges. Substructure search, or subgraph isomorphism, is the process of finding a graph match which is less than or equal to a larger graph. Maximum common substructure search is the process of identifying the maximal graph correspondence between two graphs. Clique detection is the process of finding all possible correspondences between two graphs within a set of bounds.

Exhaustive and approximate MCSS

The maximum common substructure search can be performed in two different modes: a very fast method, OEMCSType::Approximate, or a more comprehensive method, OEMCSType::Exhaustive.

The type of the OEMCSSearch can be set at initialization. The default value is OEMCSType::Exhaustive.

The approximate method is based on traversing through pre-defined paths of the query structure and trying to map the visited query atoms into target atoms. Because these pre-defined paths represent only a fraction of all possible paths of a compound, it is not guaranteed that the approximate method can find the largest and all common substructures. Significant difference between the detected matches of the two methods could exist in cases when the query or target structure contains complex ring systems (fused or bridged) or stereo centers. However, comparing the two methods for thousands of structures revealed that these cases are rare and the approximate method provides a good trade-off between identifying MCS matches accurately and doing it 3 to 6 times faster than the exhaustive method.

Figure: Approximate MCSS and Figure: Exhaustive MCSS, show an example where the substructure identified by the approximate method is smaller by one atom then the solution identified by the exhaustive method.

../_images/OEMCSSearchApproximate.png

Approximate MCSS

Example for maximum common substructure identified by the approximate method.

../_images/OEMCSSearchExhaustive.png

Exhaustive MCSS

Example for maximum common substructure identified by the exhaustive method.

Note

Using the approximate MCS is recommended if the speed of the search is crucial.

MCS scoring functions

OEMCSFunc is an abstract base-class that defines the API used for scoring subgraph matches. The scores generated by implementations of OEMCSFunc influence the sorting and retention of maximum common subgraph matches generated by the OEMCSSearch class. The scoring function is set by the OEMCSSearch::SetMCSFunc method.

It is important to mention that using different scoring functions does not alter the way the search space is traversed to identify common substructures. It affects only how these identified substructures are evaluated.

Four implementations of the OEMCSFunc class are available in OEChem TK:

OEMCSMaxAtoms

The OEMCSMaxAtoms class is designed to order maximum common substructure matches by the maximum number of atoms included in the graph match. If two matches have the same number of atoms, then the tie is split based on the number of bonds contained in the match. (See example in Scoring A.)

Scoring function:

\(num.\ of\ mapped\ atoms + \frac{num.\ of\ mapped\ bonds}{100}\)

../_images/OEMCSSearchOEMCSMaxAtoms.png

Scoring A

Retrieved matches using ‘OEMCSMaxAtoms’ as scoring function.

OEMCSMaxBonds

The OEMCSMaxBonds class is designed to order maximum common substructure matches by the maximum number of bonds included in the graph match. If two matches have the same number of bonds, then the tie is split based on the number of atoms contained in the match. (See example in Scoring B.)

Scoring function:

\(num.\ of\ mapped\ bonds + \frac{num.\ of\ mapped\ atoms}{100}\)

../_images/OEMCSSearchOEMCSMaxBonds.png

Scoring B

Retrieved matches using ‘OEMCSMaxBonds’ as scoring function.

OEMCSMaxAtomsCompleteCycles

The OEMCSMaxAtomsCompleteCycles class is the same as the OEMCSMaxAtoms with the addition of penalizing cyclic query bonds that are not mapped to any target bonds, thereby giving priority to matches which contain complete cycles common to both the pattern and the target structure. (See example in Scoring C.)

Scoring function:

\(num.\ of\ mapped\ atoms + \frac{num.\ of\ mapped\ bonds}{100} - penalty \times num.\ of\ unmapped\ cyclic\ query\ bonds\)

The default penalty for each unmapped cyclic query bond is 1.0.

../_images/OEMCSSearchOEMCSMaxAtomsCompleteCycles.png

Scoring C

Retrieved matches using ‘OEMCSMaxAtomsCompleteCycles’ as scoring function.

OEMCSMaxBondsCompleteCycles

The OEMCSMaxBondsCompleteCycles class is the same as the OEMCSMaxBonds class with the addition of penalizing cyclic query bonds that are not mapped to any target bonds, thereby giving priority to matches which contain complete cycles common to both the pattern and the target structure. (See example in Scoring D.)

Scoring function:

\(num.\ of\ mapped\ bonds + \frac{num.\ of\ mapped\ atoms}{100} - penalty \times num.\ of\ unmapped\ cyclic\ query\ bonds\)

The default penalty for each unmapped cyclic query bond is 1.0.

../_images/OEMCSSearchOEMCSMaxAtomsCompleteCycles.png

Scoring D

Retrieved matches using ‘OEMCSMaxBondsCompleteCycles’ as scoring function.

It is important to remember that not only matches with the highest score are retained, but also matches with scores higher than the best score rounded down to the highest integer. In the example shown in Scoring B three common substructures are detected using the OEMCSMaxBonds scoring function. The first two matches are scored 5.06, since they are composed of 5 mapped bonds and 6 mapped atoms. There is only one other match which scored higher than 5.0, this is the third retained match with a score of 5.05.

OEExprOpts Namespace

Pattern matching in OEChem TK is always done using query molecules or query graphs. Non-query molecules, i.e. those that are defined only by OEMolBase abstract base-class must be converted into a query molecule. Conversion into a query molecule is controlled using the values in the OEExprOpts namespace. Expression options can either be specified in the constructor for an OEQMol, or using the convenience constructors in the pattern matching classes, OESubSearch, OEMCSSearch, OECliqueSearch which take expression options as arguments.

Figure: Default example shows an example where the maximum common substructure search is performed using the OEExprOpts::DefaultAtoms and OEExprOpts::DefaultBonds options.

../_images/OEExprOpts-Default-Default.png

Default example

Example of maximum common substructure search with ‘DefaultAtoms’ and ‘DefaultBonds’ options

The OEExprOpts::DefaultAtoms option means that two atoms are considered to be equivalent if they have the same atomic number, aromaticity, and formal charge. The OEExprOpts::DefaultBonds option means that two bonds can be mapped to each other if they have the same bond order and aromaticity.

Listing 7: MCSS with atom and bond expressions

#include <openeye.h>
#include <oeplatform.h>
#include <oesystem.h>
#include <oechem.h>

using namespace OEPlatform;
using namespace OESystem;
using namespace OEChem;

int main()
{
  OEGraphMol pattern;
  OEGraphMol target;
  OESmilesToMol(pattern, "c1(cc(nc2c1C(CCC2)Cl)CCl)O");
  OESmilesToMol(target,  "c1(c2c(nc(n1)CF)COC=C2)N");

  const unsigned int atomexpr = OEExprOpts::DefaultAtoms;
  const unsigned int bondexpr = OEExprOpts::DefaultBonds;

  OEQMol patternQ(pattern);
  // generate query with atom and bond expression options
  patternQ.BuildExpressions(atomexpr, bondexpr);
  OEMCSSearch mcss(patternQ.QMol());

  const bool unique = true;
  unsigned int count = 1u;
  // loop over matches
  for (OEIter<const OEMatchBase> match = mcss.Match(target, unique); match; ++match)
  {
    oeout << "Match " << count << ':' << oeendl;
    oeout << "Number of matched atoms: " << match->NumAtoms() << oeendl;
    oeout << "Number of matched bonds: " << match->NumBonds() << oeendl;
    // create match subgraph
    OEGraphMol mol;
    const bool adjustHCount = true;
    OESubsetMol(mol, match, adjustHCount);
    oeout << "match smiles = " << OEMolToSmiles(mol) << oeendl;
    ++count;
  }

  return 0;
}

The best way to understand how various atom and bond expressions influence the pattern matching is to change the atom and bond expressions in Listing 7 and compare the obtained matches.

const unsigned int atomexpr = OEExprOpts::DefaultAtoms;
const unsigned int bondexpr = OEExprOpts::DefaultBonds;

After constructing the pattern molecule, the BuildExpressions method defines the level of atom and bond matching between the pattern molecule and any target molecule.

patternQ.BuildExpressions(atomexpr, bondexpr);

By modifying the atom and bond expression options, very diverse pattern matching can be performed. Figure: Example AFigure: Example E show several examples where maximum common substructure searches are performed for the same query and target molecules, but with various atom and bond expression options.

In the first example in Figure: Example A, the OEExprOpts::ExactAtoms expression option is used to give a higher degree of discrimination of the equivalence of atoms, i.e. atoms can only be mapped to each other if they have the same degree, number of hydrogens, chirality, mass, and ring membership, in addition to the requirements of the OEExprOpts::DefaultAtoms option.

../_images/OEExprOpts-Exact-Default.png

Example A

`ExactAtoms` and ‘DefaultBonds`

Figure: Example BFigure: Example E show examples where the discrimination capability of the OEExprOpts::DefaultAtoms is decreased by adding various modifiers. For example, using the OEExprOpts::EqAromatic modifier, atoms in any aromatic ring systems are considered equivalent. As a result, the pyridine and pyrimidine ring can be mapped to each other in Figure: Example B. Similarly, OEExprOpts::EqHalogen (Figure: Example C) and OEExprOpts::EqONS (Figure: Example D) define equivalency between halogen atoms and oxygen-nitrogen-sulfur atoms, respectively. Using OEExprOpts::EqCAliphaticONS (Figure: Example E) an aliphatic query carbon atom is considered equivalent to any oxygen, nitrogen, or sulfur atom.

../_images/OEExprOpts-DefaultEqArom-Default.png

Example B

‘DefaultAtoms|EqAromatic’ and ‘DefaultBonds’

../_images/OEExprOpts-DefaultEqHal-Default.png

Example C

‘DefaultAtoms|EqHalogen’ and ‘DefaultBonds’

../_images/OEExprOpts-DefaultEqONS-Default.png

Example D

‘DefaultAtoms|EqONS’ and ‘DefaultBonds’

../_images/OEExprOpts-DefaultEqCAlipONS-Default.png

Example E

‘DefaultAtoms|EqCAliphaticONS’ and ‘DefaultBonds’

Similar modifiers exist for altering bond equivalency. Figure: Example F shows an example where single and double bonds are considered identical when OEExprOpts::EqSingleDouble modifier is utilized.

../_images/OEExprOpts-Default-DefaultEqSingleDouble.png

Example F

‘DefaultAtoms’ and ‘DefaultBonds|EqSingleDouble’

The last example in Figure: Example G represents a very unrestrained search, where both the atom and bond expression options have weak discrimination power.

../_images/EqAromEqCAlONSEqHalEqONSEqSnglDbl.png

Example G

‘DefaultAtoms|EqAromatic|EqCAliphaticONS|EqHalogen|EqONS’ and ‘DefaultBonds|EqSingleDouble’

Even though only maximum common substructure search examples are presented here, atom and bond expression options can be similarly used with substructure searches or clique detections. For a full description of expression options and their usage please refer to the OEExprOpts namespace section in OEChem TK API.