Matched Molecule Pair Analysis
This overview is intended for users who have a working knowledge of matched pair analysis in the context of medicinal chemistry workflows. If the concept of matched pair analysis is unfamiliar, the following references are highly recommended as background to this powerful strategy ([Griffen-2011], [Kramer-2014], [Papadatos-2010], [Warner-2010]).
The OEMedChem TK provides the ability to index a set of input structures and identify matched molecular pairs. Matched Molecular Pair Analysis (MMPA) is becoming recognized as a powerful tool for extracting the effects of chemical changes on properties of interest in large data sets. One of the strengths of the MMPA approach stems from the assumption that it is easier to predict differences in an activity or a property than to predict an actual value. A matched molecular pair is a pair of compounds that differ by a small, localized change in a chemical substituent. The magnitude of the localized change that is considered acceptable is highly specific to the chemist or analyst. For a given matched molecular pair, one can consider the chemical difference between the two compounds to embody a virtual chemical transformation with an accompanying change in a property of interest. For a homologous series of input structures, the set of matched molecular pairs for each transformation can be identified, analyzed for statistical relevance, and sorted and/or ranked. Similarly, large input data sets can be mined for interesting or desired property changes as a function of structure.
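The idea of a transformation carrying a property change can be illustrated with a minimal, toolkit-independent sketch. The data below are hypothetical toy values, and `summarize_deltas` is an illustrative helper, not part of OEMedChem TK. It groups the property difference (to minus from) of each pair by its transformation and reports the count, mean, and sample standard deviation.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy data (not from any real data set):
# (transformation, property of the "from" compound, property of the "to" compound)
pairs = [
    ("Cl>>Br", 5.5, 6.0),
    ("Cl>>Br", 6.0, 6.25),
    ("Cl>>Br", 4.5, 5.25),
    ("H>>F", 7.0, 6.5),
]


def summarize_deltas(pairs):
    """Group property deltas (to - from) by transformation and summarize them."""
    deltas = defaultdict(list)
    for transform, prop_from, prop_to in pairs:
        deltas[transform].append(prop_to - prop_from)

    summary = {}
    for transform, values in deltas.items():
        # a sample standard deviation needs at least two observations
        spread = stdev(values) if len(values) > 1 else None
        summary[transform] = (len(values), mean(values), spread)
    return summary


summary = summarize_deltas(pairs)
for transform, (count, avg, spread) in sorted(summary.items()):
    print(transform, count, avg, spread)
```

A transformation supported by a single pair, such as `H>>F` here, has no spread estimate at all, which is one reason to inspect pair counts before trusting an average delta.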
A common method of identifying matched molecular pairs is an NxM
pairwise maximum common subgraph (MCS) analysis of an input data set.
Owing to the complexity and performance cost of MCS algorithms in general,
OpenEye instead chose a molecular fragmentation approach
(cf. [Hussain-2010]) to identify matched molecular pairs and to deliver robust
performance over a wider range of input data set
sizes. Additionally, this approach is entirely data-driven and does
not require a priori defined cores. A small set of fragmentation
strategies is currently provided, and future releases will expose
additional fragmentation types and capabilities. MMPA as a technique
is relatively new and there have been spirited discussions about the
applicability of the method when applied to various measured
properties. OpenEye encourages users to critically examine the
results available from such an analysis and to evaluate the
applicability or non-applicability of property predictions with
sound statistical analyses.
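The general idea behind fragment-based pair identification can be sketched without any toolkit. This is a toy illustration of the indexing concept only, not the OEMedChem TK implementation: the fragmentations are hand-written, hypothetical string labels rather than structures fragmented by a real algorithm, and `index_by_core` and `matched_pairs` are illustrative names. Each molecule is stored under its core key, and pairs are then read off from molecules that share a core, so no pairwise MCS comparison is needed.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical, hand-written single-cut fragmentations:
# molecule id -> list of (core, substituent) labels.
# A real toolkit would derive these from the structures themselves.
fragmentations = {
    1: [("c1ccccc1[*]", "[*]Cl")],
    2: [("c1ccccc1[*]", "[*]Br")],
    3: [("c1ccccc1[*]", "[*]F")],
    4: [("C1CCCCC1[*]", "[*]Cl")],
}


def index_by_core(fragmentations):
    """Build a core -> [(molecule id, substituent), ...] index."""
    index = defaultdict(list)
    for mol_id, frags in fragmentations.items():
        for core, substituent in frags:
            index[core].append((mol_id, substituent))
    return index


def matched_pairs(index):
    """List (from id, to id, transformation) for molecules sharing a core."""
    pairs = []
    for members in index.values():
        for (id_a, sub_a), (id_b, sub_b) in combinations(members, 2):
            if sub_a != sub_b:
                pairs.append((id_a, id_b, f"{sub_a}>>{sub_b}"))
    return pairs


index = index_by_core(fragmentations)
pairs = matched_pairs(index)
for from_id, to_id, transform in pairs:
    print(from_id, to_id, transform)
```

Molecule 4 shares its core with no other molecule in this toy set, so it contributes no pairs.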
The general steps involved in an MMP analysis are:
Prepare the input structures
Prune input structure data to the property or properties of interest
Select indexing filters to identify the desired “localized change” in structures
Index the input structures
Extract transformations and/or matched molecular pairs and data
Optionally save the index for subsequent analyses
The internal index captures the common cores for the identified matched molecular pairs and the set of substituent changes associated with each core. Once the index has been generated, it can be used to:
Extract the virtual chemical transformations and set of delta properties for the MMPs
Extract the MCS cores of MMPs for clique or binning analysis
Extract statistics for property changes from a series of substituents on a common core
Identify or remove singletons or unrelated chemical frameworks from the input structures
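As a small illustration of the last use above, singletons can be found from nothing more than the indices supplied at indexing time and the index pairs reported for each matched pair. This is a minimal sketch in plain Python; `find_singletons` is a hypothetical helper, not an OEMedChem TK function, and the indices are made-up values.

```python
def find_singletons(all_indices, pair_indices):
    """Return the indexed molecules that take part in no matched pair."""
    paired = set()
    for from_idx, to_idx in pair_indices:
        paired.add(from_idx)
        paired.add(to_idx)
    return sorted(set(all_indices) - paired)


# Hypothetical example: six indexed molecules, three matched pairs
indexed = range(1, 7)
pair_indices = [(1, 2), (2, 3), (5, 1)]
singletons = find_singletons(indexed, pair_indices)
print(singletons)
```

In a real workflow the pair tuples would come from the (from, to) indices of the extracted matched pairs, and the full index list from the identifiers passed in when the structures were added.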
Matched molecular pair transformations inherently have the concept of
a chemical context for any given set of matched pairs. At the site
(or sites) of the substituent differences in the two compounds one
can consider the accompanying effects of the adjacent common core
environment for the pair. To this end, the extraction of the matched
pair transformations supports the ability to tune the amount of the
adjacent common core to be included as part of the transformation. As
more chemical context is included, more differentiation among the set
of transformations occurs, generally resulting in fewer
matched molecular pairs associated with each transformation. Thus,
requesting an OEMatchedPairContext_Bond0
context when extracting transformations from an index is likely to
produce the largest set of matched pairs associated with each
transform, but at the expense of coalescing possibly unrelated
chemical environments into the same transformation bin. That is, a
Cl>>Br
transformation on a ring or aromatic ring is likely to differ substantially
from the same transformation in a non-ring portion of
the structure with regard to the properties of interest for an MMP
analysis.
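The trade-off between context and pair counts can be made concrete with a toy binning sketch. Everything here is hypothetical: the environment labels are hand-written, and `depth` is an illustrative notion of "how many shells of adjacent core to include" that does not map one-to-one onto the OEMatchedPairContext constants. The sketch only shows the qualitative effect: as more context enters the bin key, the same pairs split into more, smaller bins.

```python
from collections import defaultdict

# Hypothetical matched pairs: (transformation,
#                              core environment listed outward from the cut bond,
#                              (from index, to index))
pairs = [
    ("Cl>>Br", ("aromatic C", "aromatic C"), (1, 2)),
    ("Cl>>Br", ("aromatic C", "aromatic N"), (3, 4)),
    ("Cl>>Br", ("aliphatic C", "aliphatic C"), (5, 6)),
    ("Cl>>Br", ("aliphatic C", "carbonyl C"), (7, 8)),
]


def group_by_context(pairs, depth):
    """Bin pairs by transformation plus `depth` shells of core environment."""
    bins = defaultdict(list)
    for transform, environment, ids in pairs:
        bins[(transform, environment[:depth])].append(ids)
    return dict(bins)


for depth in range(3):
    bins = group_by_context(pairs, depth)
    largest = max(len(members) for members in bins.values())
    print("depth", depth, "bins", len(bins), "largest bin", largest)
```

With no context, the aromatic and aliphatic `Cl>>Br` pairs land in one bin; one shell of context separates them, and a second shell separates them further at the cost of one pair per bin.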
The OEMatchedPairAnalyzer is the utility class that provides the bulk of the Matched Pair Analysis capabilities in OEMedChem TK; see its documentation for additional details.
Introductory Usage of Matched Molecular Pair Analysis
As a trivial example, suppose a chemist is interested in extracting all the “simple” substituent changes in a large input set. One possible way to accomplish this is to define some limits to what constitutes a “simple” substituent change and run an MMP analysis of the input structures. Here we define a “simple” substituent change to be singly attached groups only, and where the size of the group cannot exceed 20% of the input structure heavy atom count.
from openeye import oechem
from openeye import oemedchem

# NOTE: 'ims' (an open oechem.oemolistream) and 'maxrecs' (an optional
# record limit; 0 or None for no limit) are assumed to be defined by the caller

# create options class with defaults
mmpOpts = oemedchem.OEMatchedPairAnalyzerOptions()

# for 'simple' pairs, alter default indexing options
# - single cuts only, heavy atom substituents only (HMember indexing off)
mmpOpts.SetOptions(oemedchem.OEMatchedPairOptions_SingleCuts |
                   oemedchem.OEMatchedPairOptions_ComboCuts |
                   oemedchem.OEMatchedPairOptions_UniquesOnly)
# - limit substituent size to no more than 20% of input structure
mmpOpts.SetIndexableFragmentRange(80., 100.)

# create analyzer class with nondefault options
mmpAnalyzer = oemedchem.OEMatchedPairAnalyzer(mmpOpts)

# ignore common index status returns
sIgnoreStatus = 'FragmentRangeFilter,DuplicateStructure,'
sIgnoreStatus += 'FragmentationLimitFilter,HeavyAtomFilter'

# index the input structures
for recindex, mol in enumerate(ims.GetOEGraphMols(), start=1):
    # consider only the largest input fragment
    oechem.OEDeleteEverythingExceptTheFirstLargestComponent(mol)

    # ignore stereochemistry
    oechem.OEUncolorMol(mol,
                        (oechem.OEUncolorStrategy_RemoveAtomStereo |
                         oechem.OEUncolorStrategy_RemoveBondStereo))

    # explicitly provide a 1-based index to refer to indexed structures
    # - to allow references back to external data elsewhere
    status = mmpAnalyzer.AddMol(mol, recindex)
    if status != recindex:
        statusName = oemedchem.OEMatchedPairIndexStatusName(status)
        if statusName not in sIgnoreStatus:
            oechem.OEThrow.Warning("{0}: molecule indexing error, status={1}"
                                   .format(recindex, statusName))

    # if limiting input, quit after limit
    if maxrecs and recindex >= maxrecs:
        break

print("Index complete, matched pairs = {0}"
      .format(mmpAnalyzer.NumMatchedPairs()))

# specify how transforms are extracted (direction and allowed properties)
extractMode = (oemedchem.OEMatchedPairTransformExtractMode_Sorted |
               oemedchem.OEMatchedPairTransformExtractMode_NoSMARTS |
               oemedchem.OEMatchedPairTransformExtractMode_AddMCSCorrespondence)
extractOptions = oemedchem.OEMatchedPairTransformExtractOptions()
# specify amount of chemical context at the site of the substituent change
# in the transform
extractOptions.SetContext(oemedchem.OEMatchedPairContext_Bond0)
extractOptions.SetOptions(extractMode)

# walk the transforms and print the matched pairs
xfmidx = 0
for mmpxform in oemedchem.OEMatchedPairGetTransforms(mmpAnalyzer, extractOptions):
    xfmidx += 1
    print("{0:2} {1}".format(xfmidx, mmpxform.GetTransform()))
    # dump matched molecular pairs and index identifiers
    # (recindex from indexing loop above)
    for mmppair in mmpxform.GetMatchedPairs():
        print("\tmatched pair molecule indices=({0},{1})"
              .format(mmppair.FromIndex(), mmppair.ToIndex()))
Sample output of the example above is the following:
Index complete, matched pairs = 146
1 [c:1]Br>>[c:1]F
    matched pair molecule indices=(4793,1225)
    matched pair molecule indices=(1304,9796)
    matched pair molecule indices=(9515,6129)
    matched pair molecule indices=(4201,2611)
2 [c:1]Br>>[c:1]OCC
    matched pair molecule indices=(7846,9727)
    matched pair molecule indices=(1547,2939)
    matched pair molecule indices=(5128,8928)
3 [c:1]Cl>>[c:1]F
    matched pair molecule indices=(8116,7661)
    matched pair molecule indices=(239,6631)
    matched pair molecule indices=(9225,8434)
4 [c:1]F>>[c:1]OC
    matched pair molecule indices=(3862,8684)
    matched pair molecule indices=(1349,1434)
    matched pair molecule indices=(416,3900)
5 [c:1]Br>>[c:1]SC
    matched pair molecule indices=(6364,8527)
    matched pair molecule indices=(1999,6845)
6 [c:1]C#N>>[c:1]F
    matched pair molecule indices=(2760,5231)
    matched pair molecule indices=(6677,241)
7 [c:1]C>>[c:1]Cl
    matched pair molecule indices=(1191,1605)
    matched pair molecule indices=(9591,4139)
8 [c:1]F>>[c:1]N(C)C
    matched pair molecule indices=(4012,853)
    matched pair molecule indices=(1779,1233)
9 [C:1]C>>[C:1]O
    matched pair molecule indices=(2072,3429)
10 [C:1]C>>[C:1]OCC
    matched pair molecule indices=(9803,9538)
...
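The 20% substituent limit in the example corresponds to an indexable fragment range of 80 to 100. One way to read that setting, taken from the comment in the example and treated here as an assumption, is that the retained core must account for 80% to 100% of the heavy atoms of the input structure. The sketch below works through that arithmetic in plain Python; `core_percent` and `passes_fragment_range` are hypothetical helpers, and whether the toolkit treats the bounds as inclusive is not stated in the text above.

```python
def core_percent(total_heavy_atoms, substituent_heavy_atoms):
    """Percentage of the structure's heavy atoms that remain in the core."""
    core_atoms = total_heavy_atoms - substituent_heavy_atoms
    return 100.0 * core_atoms / total_heavy_atoms


def passes_fragment_range(total_heavy_atoms, substituent_heavy_atoms,
                          low=80.0, high=100.0):
    """Assumed reading of the range: core percentage within [low, high]."""
    percent = core_percent(total_heavy_atoms, substituent_heavy_atoms)
    return low <= percent <= high


# A 25 heavy atom structure: a 5-atom substituent is exactly 20% of it,
# while a 6-atom substituent exceeds the limit.
print(core_percent(25, 5), passes_fragment_range(25, 5))
print(core_percent(25, 6), passes_fragment_range(25, 6))
```

Under this reading, larger input structures tolerate proportionally larger substituents, so the filter scales with molecule size rather than imposing a fixed atom count.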