Matched Molecular Pair Analysis

This overview is intended for users who have a working knowledge of matched pair analysis in the context of medicinal chemistry workflows. If the concept of matched pair analysis is unfamiliar, the following references are highly recommended as background to this powerful strategy ([Griffen-2011], [Kramer-2014], [Papadatos-2010], [Warner-2010]).

The OEMedChem TK provides the ability to index a set of input structures and identify matched molecular pairs. Matched Molecular Pair Analysis (MMPA) is increasingly recognized as a powerful tool for extracting the effects of chemical changes on properties of interest in large data sets. One strength of the MMPA approach stems from the assumption that it is easier to predict differences in an activity or a property than to predict an actual value. A matched molecular pair is a pair of compounds that differ by a small, localized change in a chemical substituent. The magnitude of the localized change that is considered acceptable is highly specific to the chemist or analyst. For a given matched molecular pair, the chemical difference between the two compounds can be treated as a virtual chemical transformation with an accompanying change in a property of interest. For a homologous series of input structures, the set of matched molecular pairs for each transformation can be identified and analyzed for statistical relevance, sorting and/or ranking. Similarly, large input data sets can be mined for interesting or desired property changes as a function of structure.

A common method of identifying matched molecular pairs is an NxM pairwise maximum common subgraph (MCS) analysis of an input data set. Owing to the complexity and performance of MCS algorithms generally, OpenEye instead chose to use a molecular fragmentation approach (cf. [Hussain-2010]) to identify matched molecular pairs and to deliver robust performance characteristics over a wider variety of input data set sizes. Additionally, this approach is entirely data-driven and does not require a priori defined cores. A small set of fragmentation strategies is currently provided, and future releases will expose additional fragmentation types and capabilities. MMPA as a technique is relatively new and there have been spirited discussions about the applicability of the method when applied to various measured properties. OpenEye encourages users to critically examine the results available from such an analysis and to evaluate the applicability or non-applicability of property predictions with sound statistical analyses.

The general steps involved in an MMP analysis are:

Prepare the input structures
Prune input structure data to the property or properties of interest
Select indexing filters to identify the desired “localized change” in structures
Index the input structures
Extract transformations and/or matched molecular pairs and data
Optionally save the index for subsequent analyses

The internal index captures the common cores for the identified matched molecular pairs and the set of substituent changes associated with each core. Once the index has been generated, it can be used to:

Extract the virtual chemical transformations and set of delta properties for the MMPs
Extract the MCS cores of MMPs for clique or binning analysis
Extract statistics for property changes from a series of substituents on a common core
Identify or remove singletons or unrelated chemical frameworks from the input structures

Matched molecular pair transformations inherently have the concept of a chemical context for any given set of matched pairs. At the site (or sites) of the substituent differences in the two compounds, one can consider the accompanying effects of the adjacent common core environment for the pair. To this end, the extraction of the matched pair transformations supports the ability to tune the amount of the adjacent common core to be included as part of the transformation. As more chemical context is included, more differentiation among the set of transformations occurs, generally resulting in fewer matched molecular pairs associated with each transformation. Thus, requesting an OEMatchedPairContext::Bond0 context when extracting transformations from an index is likely to produce the largest set of matched pairs associated with each transform, but at the expense of coalescing possibly unrelated chemical environments into the same transformation bin. That is, a Cl>>Br transformation on a ring or aromatic ring is likely quite different from the same transformation in a non-ring portion of the structure with regard to the properties of interest for an MMP analysis.

OEMatchedPairAnalyzer is the utility class that provides the bulk of the matched pair analysis capabilities in OEMedChem TK; see its API documentation for additional details.

Introductory Usage of Matched Molecular Pair Analysis

As a trivial example, suppose a chemist is interested in extracting all the “simple” substituent changes in a large input set. One possible way to accomplish this is to define some limits to what constitutes a “simple” substituent change and run an MMP analysis of the input structures. Here we define a “simple” substituent change to be singly attached groups only, and where the size of the group cannot exceed 20% of the input structure heavy atom count.

  // create options class with defaults
  OEMatchedPairAnalyzerOptions mmpOpts;

  // for 'simple' pairs, alter default indexing options
  // - single cuts only, heavy atom substituents only (HMember indexing off)
  mmpOpts.SetOptions(OEMatchedPairOptions::SingleCuts |
                     OEMatchedPairOptions::ComboCuts  |
                     OEMatchedPairOptions::UniquesOnly);
  // - limit substituent size to no more than 20% of input structure
  mmpOpts.SetIndexableFragmentRange(80.f, 100.f);

  // create analyzer class with nondefault options
  OEMatchedPairAnalyzer mmpAnalyzer(mmpOpts);

  // ignore common index status returns
  std::string sIgnoreStatus = "FragmentRangeFilter,DuplicateStructure,FragmentationLimitFilter,HeavyAtomFilter";

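  // index the input structures
  //   (ims is the already-open input molecule stream and maxrecs an optional
  //    record limit; both are set up outside this excerpt)
  int recindex = 0;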
  OEGraphMol mol;
  while (OEReadMolecule(ims, mol))
  {
    ++recindex;
    // consider only the largest input fragment
    OEDeleteEverythingExceptTheFirstLargestComponent(mol);
    // ignore stereochemistry
    OEUncolorMol(mol,
                 (OEUncolorStrategy::RemoveAtomStereo |
                  OEUncolorStrategy::RemoveBondStereo));

    // explicitly provide a 1-based index to refer to indexed structures
    //   - to allow references back to external data elsewhere
    int status = mmpAnalyzer.AddMol(mol, recindex);
    if (status != recindex)
    {
      // report only the statuses that are not in the ignore list
      if (sIgnoreStatus.find(OEMatchedPairIndexStatusName(status)) == std::string::npos)
        OEThrow.Warning("%d: molecule indexing error, status=%s",
                        recindex,OEMatchedPairIndexStatusName(status));
    }
    // if limiting input, quit after limit
    if (maxrecs && recindex >= maxrecs)
      break;
  }
  printf("Index complete, matched pairs = %d\n", mmpAnalyzer.NumMatchedPairs());

  OEMatchedPairTransformExtractOptions extractOptions;
  // specify amount of chemical context at the site of the substituent change
  //   in the transform
  extractOptions.SetContext(OEMatchedPairContext::Bond0);
  // specify how transforms are extracted (direction and allowed properties)
  extractOptions.SetOptions(OEMatchedPairTransformExtractMode::Sorted |
                            OEMatchedPairTransformExtractMode::NoSMARTS |
                            OEMatchedPairTransformExtractMode::AddMCSCorrespondence);

  // walk the transforms and print the matched pairs
  unsigned int xfmidx = 0u;
  for (OEIter<OEMatchedPairTransform> mmpxform = OEMatchedPairGetTransforms(mmpAnalyzer,extractOptions);
       mmpxform;
       ++mmpxform)
  {
    ++xfmidx;
    printf("%2u %s\n", xfmidx, mmpxform->GetTransform().c_str());
    // dump matched molecular pairs and index identifiers
    //   (recindex from indexing loop above)
    for (OEIter<OEMatchedPair> mmppair = mmpxform->GetMatchedPairs(); mmppair; ++mmppair)
      printf("\tmatched pair molecule indices=(%d,%d)\n",mmppair->FromIndex(),mmppair->ToIndex());
  }

Sample output of the example above is the following:

Index complete, matched pairs = 146
 1 [c:1]Br>>[c:1]F
  matched pair molecule indices=(4793,1225)
  matched pair molecule indices=(1304,9796)
  matched pair molecule indices=(9515,6129)
  matched pair molecule indices=(4201,2611)
 2 [c:1]Br>>[c:1]OCC
  matched pair molecule indices=(7846,9727)
  matched pair molecule indices=(1547,2939)
  matched pair molecule indices=(5128,8928)
 3 [c:1]Cl>>[c:1]F
  matched pair molecule indices=(8116,7661)
  matched pair molecule indices=(239,6631)
  matched pair molecule indices=(9225,8434)
 4 [c:1]F>>[c:1]OC
  matched pair molecule indices=(3862,8684)
  matched pair molecule indices=(1349,1434)
  matched pair molecule indices=(416,3900)
 5 [c:1]Br>>[c:1]SC
  matched pair molecule indices=(6364,8527)
  matched pair molecule indices=(1999,6845)
 6 [c:1]C#N>>[c:1]F
  matched pair molecule indices=(2760,5231)
  matched pair molecule indices=(6677,241)
 7 [c:1]C>>[c:1]Cl
  matched pair molecule indices=(1191,1605)
  matched pair molecule indices=(9591,4139)
 8 [c:1]F>>[c:1]N(C)C
  matched pair molecule indices=(4012,853)
  matched pair molecule indices=(1779,1233)
 9 [C:1]C>>[C:1]O
  matched pair molecule indices=(2072,3429)
10 [C:1]C>>[C:1]OCC
  matched pair molecule indices=(9803,9538)
...