You want to find the core fragment in a molecule series i.e. find the largest common substructure of the molecules in a series. See example in Figure 1.
The obvious choice to solve this problem is to use a maximum common substructure (MCS) search algorithm.
MCS is commonly used in cheminformatics to assess molecule similarity. One advantage of MCS over the fingerprinting methods, that it provides an atom - atom mapping between two molecules that is essential to solve this problem.
However, MCS identification belongs to the class of NP-complete problems for which no algorithm with polynomial-time complexity is known for the general case. The identification of all subgraphs containing k nodes that are common to two graphs containing n and m nodes, respectively, requires
atom-by-atom comparisons. The identification of the largest common subgraph is achieved by carrying out this set of comparisons with \(\forall k : 1 < k < min(m, n)\) until it is not possible to identify a larger common substructure. Even though chemical structures do not have high connectivity, which would lead to exponential behavior, it still can be a very expensive operation for large molecules.
The other problem is that the MCS search algorithm implemented in the OEMCSSearch class only works for two molecules. Identifying a MCS for a molecule set would require performing \(n x n\) comparisons. Combining the returned \(n x n\) MCS atom mappings is not a trivial problem either. See example in Figure 3.
However the problem we want to solve is quite specific that does not require to perform \(n x n\) MCS searches. If we can assume that a common core does exist in the molecule series than the following process can identify this core fragment:
The first step of finding a core of a molecule series is to select an arbitrary reference molecule. The GetReferenceMolecule function simply selects the molecule with the smallest number of atoms.
1 2 3 4 5 6 7 8
def GetReferenceMolecule(mollist): refmol = None for mol in mollist: if refmol is None or mol.NumAtoms() < refmol.NumAtoms(): refmol = mol return refmol
The GetFragments function is then called with the reference molecule to enumerate all of its fragments by using functions of GraphSim TK. When generating fingerprints, a molecule graph is exhaustively traversed, enumerating various fragments (using the path, circular or tree method), and then hashing them into a fixed-length bit-vector. The OEGetFPCoverage function of GraphSim TK provides access to these fragments by returning an iterator over OEAtomBondSet objects, each of which stores the atoms and bonds of a specific fragment. See example in Figure 4 that shows how tree fragments are enumerated up to a given length.
The GetFragments function first generates a tree fingerprint type from a string representation using the OEGetFPType function. Then it loops over the OEAtomBondSet objects returned by the OEGetFPCoverage function and extracts the corresponding molecule fragment using the OESubsetMol function.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
def GetFragments(mol, minbonds, maxbonds): frags =  fptype = oegraphsim.OEGetFPType("Tree,ver=2.0.0,size=4096,bonds=%d-%d,atype=AtmNum,btype=Order" % (minbonds, maxbonds)) for abset in oegraphsim.OEGetFPCoverage(mol, fptype, True): fragatompred = oechem.OEIsAtomMember(abset.GetAtoms()) frag = oechem.OEGraphMol() adjustHCount = True oechem.OESubsetMol(frag, mol, fragatompred, adjustHCount) oechem.OEFindRingAtomsAndBonds(frag) frags.append(oechem.OEGraphMol(frag)) return frags
When using the OEGetFPCoverage function only fragment size parameter of the fingerprint type is considered. The other parameters such as atom and bond typing are utilized only when hashing the enumerated fragment into a fixed-length bit-vector.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
def GetCommonFragments(mollist, frags, atomexpr=oechem.OEExprOpts_DefaultAtoms, bondexpr=oechem.OEExprOpts_DefaultBonds): corefrags =  for frag in frags: ss = oechem.OESubSearch(frag, atomexpr, bondexpr) if not ss.IsValid(): continue validcore = True for mol in mollist: validcore = ss.SingleMatch(mol) if not validcore: break if validcore: corefrags.append(frag) return corefrags
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
def GetCoreFragment(mols, minbonds, maxbonds, atomexpr=oechem.OEExprOpts_DefaultAtoms, bondexpr=oechem.OEExprOpts_DefaultBonds): print("Number of molecules = %d" % len(mols)) refmol = GetReferenceMolecule(mols) frags = GetFragments(refmol, minbonds, maxbonds) if len(frags) == 0: oechem.OEThrow.Error("No fragment is enumertated with bonds %d-%d!" % (minbonds, maxbonds)) print("Number of fragments = %d" % len(frags)) commonfrags = GetCommonFragments(mols, frags, atomexpr, bondexpr) if len(commonfrags) == 0: oechem.OEThrow.Error("No common fragment is found!") print("Number of common fragments = %d" % len(commonfrags)) core = None for frag in commonfrags: if core is None or GetFragmentScore(core) < GetFragmentScore(frag): core = frag return core
GetFragmentScore function gives priority to larger fragments and fragments with more ring atoms.
1 2 3 4 5 6 7
def GetFragmentScore(mol): score = 0.0 score += 2.0 * oechem.OECount(mol, oechem.OEAtomIsInRing()) score += 1.0 * oechem.OECount(mol, oechem.OENotAtom(oechem.OEAtomIsInRing())) return score
The following example shows the output of the getcore.py script:
prompt > python3 getcore.py -in getcore-test.smi Number of molecules = 459 Number of fragments = 522 Number of common fragments = 372 Core fragment = Cc1cc2cc(ccc2nc1N)O
By modifying the atom and bond expression options, very diverse pattern matching can be performed. Example(B) - Example (D) in Table 1 show examples where the discrimination capability of the OEExprOpts_DefaultAtoms and OEExprOpts_DefaultBonds options are decreased by using various modifiers. This results in identifying larger and larger “common” core fragments. For example, using the OEExprOpts_EqAromatic modifier, atoms in any aromatic ring systems are considered equivalent. As a result, a larger core fragment is identified since the pyridine and pyrimidine rings are considered equivalent.
|Example (A) – DefaultAtoms and DefaultBonds|
|Example (B) – DefaultAtoms | EqAromatic and DefaultBonds|
|Example (C) – DefaultAtoms | EqHalogen and DefaultBonds|
|Example (D) – DefaultAtoms | EqAromatic | EqCAliphaticONS and DefaultBonds | EqSingleDouble|