User-defined Fingerprint

The previous Fingerprint Generation chapter showed how to create circular, path and tree fingerprints with default parameters. These default parameters are calibrated on the Briem-Lessel [Briem-Lessel-2000], Hert-Willett [Hert-Willett-2004] and Grant [Grant-2006] benchmarks.

However, the GraphSim TK also provides facilities to construct user-defined fingerprints. When constructing a user-defined fingerprint, the following parameters have to be considered:

  1. Atom and bond typing, which defines the atom and bond properties that are encoded into the fingerprint (see the Atom and Bond Typing section)
  2. Size of the fragments that are exhaustively enumerated during fingerprint generation (see the Fragment Size section)
  3. Size of the generated fingerprint in bits (see the Fingerprint Size section)

The following code snippet shows how to generate a 1024-bit fingerprint that encodes paths from 0 to 5 bonds in length, using the default atom and bond properties defined by the OEFPAtomType.DefaultAtom and OEFPBondType.DefaultBond constants, respectively.

// fp is an OEFingerPrint and mol is a molecule, both created beforehand (see Listing 14)
uint numbits = 1024;
uint minbonds = 0;
uint maxbonds = 5;
OEGraphSim.OEMakePathFP(fp,
                        mol,
                        numbits,
                        minbonds,
                        maxbonds,
                        OEFPAtomType.DefaultAtom,
                        OEFPBondType.DefaultBond);

Warning

Two fingerprints which are generated with different parameters will have different fingerprint types!

In Listing 14, two fingerprints are generated with different parameters: they have a different number of bits. As a result they also have different types, so no similarity value can be calculated between them.

Listing 14: Example of different path fingerprint types

using System;
using OpenEye.OEChem;
using OpenEye.OEGraphSim;

public class PathFPType
{
    public static int Main(string[] args)
    {
        OEGraphMol mol = new OEGraphMol();
        OEChem.OESmilesToMol(mol, "c1ccccc1");

        OEFingerPrint fpA = new OEFingerPrint();
        uint numbits = 1024;
        uint minbonds = 0;
        uint maxbonds = 5;
        OEGraphSim.OEMakePathFP(fpA,
                                mol,
                                numbits,
                                minbonds,
                                maxbonds,
                                OEFPAtomType.DefaultAtom,
                                OEFPBondType.DefaultBond);
        OEFingerPrint fpB = new OEFingerPrint();
        numbits = 2048;
        OEGraphSim.OEMakePathFP(fpB,
                                mol,
                                numbits,
                                minbonds,
                                maxbonds,
                                OEFPAtomType.DefaultAtom,
                                OEFPBondType.DefaultBond);

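        // The two fingerprints differ in size, so their types differ as well.
        Console.WriteLine("same fingerprint types = " + OEGraphSim.OEIsSameFPType(fpA, fpB));
        // Deliberately compares mismatched types to show the resulting fatal error.
        Console.WriteLine("{0:0.000}", OEGraphSim.OETanimoto(fpA, fpB));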
        return 0;
    }
}

The output of Listing 14 is the following:

same fingerprint types = False
Fatal: fingerprint type mismatch!
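
Since a type mismatch ends in a fatal error, it can help to check compatibility before asking for a similarity score. The following is a minimal sketch that uses only the calls already shown in Listing 14. The SafeTanimoto class and the TanimotoOrNull helper are hypothetical names introduced for illustration and are not part of the GraphSim TK; the double? return type assumes that OETanimoto returns a floating-point value.

```csharp
using System;
using OpenEye.OEChem;
using OpenEye.OEGraphSim;

public class SafeTanimoto
{
    // Hypothetical helper (not part of GraphSim TK): returns null instead of
    // calling OETanimoto when the two fingerprints have different types.
    public static double? TanimotoOrNull(OEFingerPrint fpA, OEFingerPrint fpB)
    {
        if (!OEGraphSim.OEIsSameFPType(fpA, fpB))
            return null;
        return OEGraphSim.OETanimoto(fpA, fpB);
    }

    public static int Main(string[] args)
    {
        OEGraphMol mol = new OEGraphMol();
        OEChem.OESmilesToMol(mol, "c1ccccc1");

        uint minbonds = 0;
        uint maxbonds = 5;
        uint bitsA = 1024;
        uint bitsB = 2048;

        // Same molecule and typing, but different sizes, as in Listing 14.
        OEFingerPrint fpA = new OEFingerPrint();
        OEFingerPrint fpB = new OEFingerPrint();
        OEGraphSim.OEMakePathFP(fpA, mol, bitsA, minbonds, maxbonds,
                                OEFPAtomType.DefaultAtom, OEFPBondType.DefaultBond);
        OEGraphSim.OEMakePathFP(fpB, mol, bitsB, minbonds, maxbonds,
                                OEFPAtomType.DefaultAtom, OEFPBondType.DefaultBond);

        double? score = TanimotoOrNull(fpA, fpB);
        if (score.HasValue)
            Console.WriteLine("Tanimoto(A,B) = {0:0.000}", score.Value);
        else
            Console.WriteLine("fingerprint types differ, similarity skipped");
        return 0;
    }
}
```

Returning null rather than a number keeps a skipped comparison from being mistaken for a genuine score of zero.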

Atom and Bond Typing

Listing 15 shows how to generate fingerprints for two molecules (depicted in Figure: Example molecules) with various atom and bond typings. Reducing the number of encoded atom and bond properties increases the Tanimoto similarity between the two molecules. In the limit, when only the topology of the two molecules is considered, i.e., whether or not their atoms and bonds belong to any ring system, the fingerprints of the two molecules become identical.

These effects are illustrated in Table: Examples of depiction molecule similarity based on fingerprints.

Listing 15: Similarity calculation with various atom/bond typing

using System;
using OpenEye.OEChem;
using OpenEye.OEGraphSim;

public class FPAtomTyping
{
    public static int Main(string[] args)
    {
        OEGraphMol molA = new OEGraphMol();
        OEChem.OESmilesToMol(molA, "Oc1c2c(cc(c1)CF)CCCC2");
        OEGraphMol molB = new OEGraphMol();
        OEChem.OESmilesToMol(molB, "c1ccc2c(c1)c(cc(n2)CCl)N");

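        // Default atom and bond typing
        PrintTanimoto(molA, molB, OEFPAtomType.DefaultAtom, OEFPBondType.DefaultBond);
        // Default typing combined with the EqAromatic atomic number modifier
        PrintTanimoto(molA, molB, OEFPAtomType.DefaultAtom | OEFPAtomType.EqAromatic, OEFPBondType.DefaultBond);
        // Atoms typed by aromaticity only
        PrintTanimoto(molA, molB, OEFPAtomType.Aromaticity, OEFPBondType.DefaultBond);
        // Atoms and bonds typed by ring membership only
        PrintTanimoto(molA, molB, OEFPAtomType.InRing, OEFPBondType.InRing);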
        return 0;
    }

    private static void PrintTanimoto(OEMolBase molA, OEMolBase molB, uint atype, uint btype)
    {
        OEFingerPrint fpA = new OEFingerPrint();
        OEFingerPrint fpB = new OEFingerPrint();
        uint numbits = 2048;
        uint minb = 0;
        uint maxb = 5;
        OEGraphSim.OEMakePathFP(fpA, molA, numbits, minb, maxb, atype, btype);
        OEGraphSim.OEMakePathFP(fpB, molB, numbits, minb, maxb, atype, btype);
        Console.WriteLine("Tanimoto(A,B) = {0:0.000}", OEGraphSim.OETanimoto(fpA, fpB));
    }
}

../_images/FingerPrintAtomTypingMolecules.png

Example molecules

The output of Listing 15 is the following:

Tanimoto(A,B) = 0.166
Tanimoto(A,B) = 0.241
Tanimoto(A,B) = 0.592
Tanimoto(A,B) = 1.000

Examples of depiction molecule similarity based on fingerprints

../_images/FingerPrintAtomTypingExampleA.png ../_images/FingerPrintAtomTypingExampleB.png
../_images/FingerPrintAtomTypingExampleC.png ../_images/FingerPrintAtomTypingExampleD.png
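
The scores in Listing 15 rise as the typing becomes coarser because more fragments of the two molecules become indistinguishable and therefore set the same bits. As a toolkit-independent illustration, the Tanimoto coefficient of two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below computes it on plain sets of bit positions. TanimotoSketch is a hypothetical name, the bit positions are made up, and returning 0 for two empty fingerprints is a convention chosen for this sketch only; OETanimoto remains the function to use on OEFingerPrint objects.

```csharp
using System;
using System.Collections.Generic;

public static class TanimotoSketch
{
    // Tanimoto = |A and B| / |A or B|, computed over the positions of the "on" bits.
    public static double Tanimoto(ISet<int> onBitsA, ISet<int> onBitsB)
    {
        int common = 0;
        foreach (int bit in onBitsA)
        {
            if (onBitsB.Contains(bit))
                common++;
        }
        int either = onBitsA.Count + onBitsB.Count - common;
        // Convention of this sketch: two empty fingerprints score 0.
        return either == 0 ? 0.0 : (double)common / either;
    }

    public static void Main()
    {
        // Made-up bit positions: bits 1 and 5 are shared.
        var a = new HashSet<int> { 1, 5, 9, 12 };
        var b = new HashSet<int> { 1, 5, 20 };
        Console.WriteLine("Tanimoto(A,B) = {0:0.000}", Tanimoto(a, b));
    }
}
```

In this example two bits are shared out of five distinct on bits, so the coefficient is 2/5. Every additional fragment that both molecules have in common moves a bit from the "either" count into the "common" count, which is why dropping distinguishing properties pushes the score towards 1.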

Table: Atom typing options and Table: Bond typing options list the currently available typing options.

Atom typing options

atom typing constant            encoded atom property
------------------------------  ---------------------------
OEFPAtomType.Aromaticity        OEAtomBase.IsAromatic
OEFPAtomType.AtomicNumber       OEAtomBase.GetAtomicNum
OEFPAtomType.Chiral             OEAtomBase.IsChiral
OEFPAtomType.FormalCharge       OEAtomBase.GetFormalCharge
OEFPAtomType.HCount             OEAtomBase.GetTotalHCount
OEFPAtomType.HvyDegree          OEAtomBase.GetHvyDegree
OEFPAtomType.Hybridization      OEAtomBase.GetHyb
OEFPAtomType.InRing             OEAtomBase.IsInRing

Atomic number modifiers
OEFPAtomType.EqAromatic
OEFPAtomType.EqHalogen
OEFPAtomType.EqHBondAcceptor
OEFPAtomType.EqHBondDonor

Bond typing options

bond typing constant            encoded bond property
------------------------------  ---------------------------
OEFPBondType.BondOrder          OEBondBase.GetOrder
OEFPBondType.Chiral             OEBondBase.IsChiral
OEFPBondType.InRing             OEBondBase.IsInRing

Fragment Size

Circular, path and tree-based fingerprint generation involves molecular graph traversal to identify all unique radial, linear or branched fragments, respectively. When a path or tree fingerprint is initialized, the minimum and maximum number of bonds of the fragments that are encoded into the fingerprint can be specified. See Figure: Example of enumerated path fragments with increasing number of bonds and Figure: Example of enumerated tree fragments with increasing number of bonds.

../_images/PathEnumerationLengths.png

Example of enumerated path fragments with increasing number of bonds

../_images/TreeEnumerationLengths.png

Example of enumerated tree fragments with increasing number of bonds

In the case of a circular fingerprint, the minimum and maximum radius of the enumerated fragments can be specified. See Figure: Example of enumerated circular fragments with increasing radius.

../_images/CircularEnumerationLengths.png

Example of enumerated circular fragments with increasing radius
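
The fragment size is passed the same way for all three fingerprint kinds: a minimum and a maximum, counted in bonds for path and tree fingerprints and as a radius for circular fingerprints. The sketch below assumes that OEGraphSim.OEMakeTreeFP and OEGraphSim.OEMakeCircularFP take their arguments in the same order as the OEMakePathFP call used throughout this chapter; these two signatures are not shown in this chapter, so check them against the API reference before relying on them. The size limits chosen here are arbitrary illustration values, not recommended settings.

```csharp
using System;
using OpenEye.OEChem;
using OpenEye.OEGraphSim;

public class FragmentSizeSketch
{
    public static int Main(string[] args)
    {
        OEGraphMol mol = new OEGraphMol();
        OEChem.OESmilesToMol(mol, "c1ccncc1");

        uint numbits = 2048;
        uint minsize = 0;
        uint maxbonds = 4;   // arbitrary upper limit on tree fragment size (bonds)
        uint maxradius = 3;  // arbitrary upper limit on circular fragment radius

        // Tree fingerprint: fragment size given as a range of bond counts.
        // Assumption: same argument order as OEMakePathFP.
        OEFingerPrint treeFP = new OEFingerPrint();
        OEGraphSim.OEMakeTreeFP(treeFP, mol, numbits, minsize, maxbonds,
                                OEFPAtomType.DefaultAtom, OEFPBondType.DefaultBond);

        // Circular fingerprint: fragment size given as a range of radii.
        // Assumption: same argument order as OEMakePathFP.
        OEFingerPrint circFP = new OEFingerPrint();
        OEGraphSim.OEMakeCircularFP(circFP, mol, numbits, minsize, maxradius,
                                    OEFPAtomType.DefaultAtom, OEFPBondType.DefaultBond);

        // Fingerprints built with different parameters are not comparable.
        Console.WriteLine("same fingerprint types = " + OEGraphSim.OEIsSameFPType(treeFP, circFP));
        return 0;
    }
}
```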

For example, when generating a fingerprint of the molecule shown in Figure: Example molecule with the minimum and maximum path length set to 0 and 3, respectively, only the paths listed in the first four rows of Table: Enumerated paths are encoded into the fingerprint.

../_images/PathLengthExampleMolecule.png

Example molecule

Enumerated paths

Path length (in bonds)  Generated unique paths
----------------------  ------------------------------------------------------------------
0                       C, N, O
1                       C-C, C-N, C-O
2                       C-C-C, C-C-N, C-C-O, C-N-C, N-C-O
3                       C-C-C-C, C-C-C-N, C-C-C-O, C-C-N-C, C-N-C-O
4                       C-C-C-C-C, C-C-C-C-N, C-C-C-C-O, C-C-C-N-C, C-C-N-C-C, O-C-N-C-C
5                       C-C-C-C-C-N, C-C-C-C-C-O, C-C-C-C-N-C, C-C-C-N-C-C, C-C-C-N-C-O

Figure: Example of enumerated paths depicts the six unique paths of length four that are generated for the example molecule. Each unique path is encoded only once, regardless of how many times it occurs in the molecule.

../_images/PathLengthExnumerationExample.png

Example of enumerated paths
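
To make the enumeration concrete, the sketch below lists the unique linear paths of a small hypothetical chain, N-C-C-O, rather than the example molecule above. It is a simplified illustration, not the toolkit's algorithm: atoms are typed by element symbol only, bond types are ignored, no atom is visited twice within a path, and a path read in either direction counts as the same path. PathEnumerationSketch and its methods are names invented for this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class PathEnumerationSketch
{
    // For each length in [minBonds, maxBonds], collects the unique paths as label strings.
    public static SortedDictionary<int, SortedSet<string>> UniquePaths(
        string[] labels, int[][] neighbors, int minBonds, int maxBonds)
    {
        var result = new SortedDictionary<int, SortedSet<string>>();
        var path = new List<int>();
        var visited = new bool[labels.Length];
        for (int start = 0; start < labels.Length; start++)
            Extend(start, labels, neighbors, minBonds, maxBonds, path, visited, result);
        return result;
    }

    // Depth-first extension of the current path by one atom.
    private static void Extend(int atom, string[] labels, int[][] neighbors,
        int minBonds, int maxBonds, List<int> path, bool[] visited,
        SortedDictionary<int, SortedSet<string>> result)
    {
        path.Add(atom);
        visited[atom] = true;
        int bonds = path.Count - 1;
        if (bonds >= minBonds)
        {
            if (!result.ContainsKey(bonds))
                result[bonds] = new SortedSet<string>(StringComparer.Ordinal);
            // The set keeps each path once, however often it occurs.
            result[bonds].Add(Canonical(path, labels));
        }
        if (bonds < maxBonds)
        {
            foreach (int next in neighbors[atom])
            {
                if (!visited[next])
                    Extend(next, labels, neighbors, minBonds, maxBonds, path, visited, result);
            }
        }
        visited[atom] = false;
        path.RemoveAt(path.Count - 1);
    }

    // A path and its reverse are the same fragment: keep the smaller spelling.
    private static string Canonical(List<int> path, string[] labels)
    {
        var symbols = new List<string>();
        foreach (int a in path)
            symbols.Add(labels[a]);
        string forward = string.Join("-", symbols);
        symbols.Reverse();
        string backward = string.Join("-", symbols);
        return string.CompareOrdinal(forward, backward) <= 0 ? forward : backward;
    }

    public static void Main()
    {
        // Hypothetical chain N-C-C-O, given as element labels and an adjacency list.
        string[] labels = { "N", "C", "C", "O" };
        int[][] neighbors = { new[] { 1 }, new[] { 0, 2 }, new[] { 1, 3 }, new[] { 2 } };
        foreach (var entry in UniquePaths(labels, neighbors, 0, 3))
            Console.WriteLine("{0}  {1}", entry.Key, string.Join(", ", entry.Value));
    }
}
```

For this chain the sketch finds three unique paths of length 0 (C, N, O), three of length 1 (C-C, C-N, C-O), two of length 2 (C-C-N, C-C-O) and one of length 3 (N-C-C-O). Raising the minimum length simply drops the leading rows, which mirrors how the minimum and maximum bond counts select rows of the table above.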

In the example shown in Listing 16, fingerprints with various minimum and maximum path lengths are generated for pyridine and pyrrole. When only paths shorter than four bonds are enumerated, the fingerprints generated for the two molecules are identical. Once paths of four bonds are included, the pattern ccccc, which is present in pyridine but not in pyrrole, makes the fingerprints differ, resulting in a lower Tanimoto similarity score.

Listing 16: Similarity calculation with various path lengths

using System;
using OpenEye.OEChem;
using OpenEye.OEGraphSim;

public class FPPathLength
{
    public static int Main(string[] args)
    {
        OEGraphMol molA = new OEGraphMol();
        OEChem.OESmilesToMol(molA, "c1ccncc1");
        OEGraphMol molB = new OEGraphMol();
        OEChem.OESmilesToMol(molB, "c1cc[nH]c1");

        PrintTanimoto(molA, molB, 0, 3);
        PrintTanimoto(molA, molB, 1, 3);
        PrintTanimoto(molA, molB, 0, 4);
        PrintTanimoto(molA, molB, 0, 5);
        return 0;
    }

    private static void PrintTanimoto(OEMolBase molA, OEMolBase molB, uint minb, uint maxb)
    {
        OEFingerPrint fpA = new OEFingerPrint();
        OEFingerPrint fpB = new OEFingerPrint();
        uint numbits = 2048;
        uint atype = OEFPAtomType.DefaultAtom;
        uint btype = OEFPBondType.DefaultBond;
        OEGraphSim.OEMakePathFP(fpA, molA, numbits, minb, maxb, atype, btype);
        OEGraphSim.OEMakePathFP(fpB, molB, numbits, minb, maxb, atype, btype);
        Console.WriteLine("Tanimoto(A,B) = {0:0.000}", OEGraphSim.OETanimoto(fpA, fpB));
    }
}

The output of Listing 16 is the following:

Tanimoto(A,B) = 1.000
Tanimoto(A,B) = 1.000
Tanimoto(A,B) = 0.950
Tanimoto(A,B) = 0.731

Fingerprint Size

The previous sections explain how the atom and bond typing and the encoded fragment size can affect the similarity scores. Selecting an adequate fingerprint size is also crucial. The number of unique circular, path or tree fragments present in molecular structures can be extremely large, so the generated fragments have to be hashed into the fixed-length fingerprint. This means that a bit in a fingerprint does not correspond exclusively to a unique pattern (as it does in a structural key). A bit also has no particular structural meaning, i.e., each bit represents the presence of any of a number of structural patterns.

The smaller the fingerprint, the denser it becomes, which raises the probability of collisions. A collision occurs when different fragments are mapped to the same bit. Collisions inherently lose information and weaken the power to discriminate between structurally similar and dissimilar molecules. Conversely, a very large fingerprint is sparse, which reduces information loss but increases the time spent calculating similarity scores.

The following table shows the number of unique paths generated for benzylpenicillin (depicted in Figure: Benzylpenicillin).

Note

The more atom and bond properties that are taken into account and the larger the size of paths to enumerate, the larger the size of the fingerprint has to be in order to encode the enumerated fragments without a significant number of bit collisions.

../_images/Benzylpenicillin.png

Benzylpenicillin

Number of unique paths generated for Benzylpenicillin

Atom/Bond typing                                            path 0-3  path 0-5  path 0-7
----------------------------------------------------------  --------  --------  --------
AtomicNumber, BondOrder                                     56        149       297
AtomicNumber | HvyDegree, BondOrder                         111       265       453
AtomicNumber | HvyDegree | Aromaticity, BondOrder | InRing  126       297       499
DefaultAtom, DefaultBond                                    147       362       617
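
To get a feel for how these path counts relate to fingerprint size, the sketch below estimates how many bits would be set if every unique path were hashed to a single bit position chosen uniformly and independently. That hashing model is an assumption made for illustration only; this chapter does not describe how the GraphSim TK actually hashes fragments, so the numbers are indicative at best. Under the model, n fragments hashed into m bits set m(1 - (1 - 1/m)^n) bits on average, and the shortfall from n is the expected number of fragments that land on an already occupied bit. CollisionSketch and ExpectedOnBits are hypothetical names.

```csharp
using System;

public static class CollisionSketch
{
    // Expected number of distinct bits set when 'fragments' unique fragments are
    // each hashed uniformly and independently to one of 'numBits' positions.
    // Assumption of this sketch: one bit per fragment, ideal uniform hash.
    public static double ExpectedOnBits(int fragments, int numBits)
    {
        return numBits * (1.0 - Math.Pow(1.0 - 1.0 / numBits, fragments));
    }

    public static void Main()
    {
        // Path counts from the DefaultAtom, DefaultBond row (paths 0-3, 0-5, 0-7).
        int[] fragmentCounts = { 147, 362, 617 };
        int[] sizes = { 512, 1024, 2048, 4096 };
        foreach (int n in fragmentCounts)
        {
            foreach (int m in sizes)
            {
                double on = ExpectedOnBits(n, m);
                Console.WriteLine(
                    "{0} fragments in {1} bits: about {2:0.0} bits on, about {3:0.0} fragments colliding",
                    n, m, on, n - on);
            }
        }
    }
}
```

Under this model the expected number of colliding fragments grows with the number of fragments and shrinks as bits are added, which is the trade-off described in the note above.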