Similarity Measures
Measuring molecular similarity or dissimilarity has two basic components: the representation of molecular characteristics (such as shape and color) and the similarity coefficient that is used to quantify the degree of resemblance between two such representations. Different similarity coefficients quantify different types of structural resemblance.
The table below defines the basic terms that are used in shape-based similarity calculations.
| Symbol | Description |
|---|---|
| \(selfA\) | Self-overlap or self-color score for molecule A |
| \(selfB\) | Self-overlap or self-color score for molecule B |
| \(overlapAB\) | Overlap or color score between molecules A and B |
Tanimoto
Formula:
\(Tanimoto_{A,B} = \frac{overlapAB}{selfA + selfB - overlapAB}\)
The Tanimoto similarity measure is symmetric and always has a value between 0.0 and 1.0 for both shape and color.
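The formula can be sketched in a few lines of plain Python. This is only an illustration, not the toolkit's implementation: the function name `tanimoto` and the numeric scores are hypothetical, and the three inputs are assumed to be precomputed self-overlap and overlap (or color) scores.

```python
def tanimoto(self_a: float, self_b: float, overlap_ab: float) -> float:
    """Tanimoto similarity from precomputed self-overlap and overlap scores."""
    denominator = self_a + self_b - overlap_ab
    if denominator <= 0.0:
        raise ValueError("selfA + selfB - overlapAB must be positive")
    return overlap_ab / denominator


# Hypothetical scores, chosen only to make the arithmetic easy to follow.
self_a, self_b, overlap_ab = 100.0, 80.0, 60.0

score = tanimoto(self_a, self_b, overlap_ab)
print(score)  # 60 / (100 + 80 - 60) = 0.5

# Swapping the two molecules does not change the result (symmetry).
print(tanimoto(self_b, self_a, overlap_ab) == score)  # True
```

Because the denominator treats \(selfA\) and \(selfB\) identically, the order in which the two molecules are supplied does not matter.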
Tversky
Formula:
\(Tversky_{A,B} = \frac{overlapAB}{\alpha * selfA + \beta * selfB}\)
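The Tversky similarity measure is asymmetric in general. Setting the parameters \(\alpha = \beta = 0.5\) makes it symmetric; the formula then reduces to \(\frac{2 \, overlapAB}{selfA + selfB}\) (the Dice coefficient), which ranks molecule pairs in the same order as the Tanimoto measure, although the numerical values differ.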
The factor \(\alpha\) weights the contribution of the first molecule (A, the reference molecule), and \(\beta\) weights the contribution of the second molecule (B). The larger \(\alpha\) becomes, the more weight is put on the self-overlap of the reference molecule.
Unlike Tanimoto similarity, Tversky similarity is not guaranteed to lie between 0.0 and 1.0. This is true for both shape and color. Depending on the relative numbers of atoms in molecules A and B, it is possible to have \(|overlapAB| > |selfA|\), and that, combined with certain values of \(\alpha\) and \(\beta\), can lead to \(Tversky_{A,B} > 1.0\).
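A small Python sketch makes the asymmetry and the possibility of values above 1.0 concrete. The function name `tversky` and the scores below are hypothetical and chosen for illustration only; they are not toolkit output.

```python
def tversky(self_a: float, self_b: float, overlap_ab: float,
            alpha: float, beta: float) -> float:
    """Tversky similarity from precomputed self-overlap and overlap scores."""
    denominator = alpha * self_a + beta * self_b
    if denominator <= 0.0:
        raise ValueError("alpha * selfA + beta * selfB must be positive")
    return overlap_ab / denominator


# Hypothetical scores: a small molecule A compared with a much larger
# molecule B, so that overlapAB exceeds selfA.
self_a, self_b, overlap_ab = 50.0, 200.0, 70.0

# Weighting almost entirely on A: 70 / (0.95 * 50 + 0.05 * 200) = 70 / 57.5
heavy_on_a = tversky(self_a, self_b, overlap_ab, alpha=0.95, beta=0.05)
print(heavy_on_a > 1.0)  # True

# Weighting almost entirely on B gives a very different value (asymmetry).
heavy_on_b = tversky(self_a, self_b, overlap_ab, alpha=0.05, beta=0.95)
print(heavy_on_b < heavy_on_a)  # True

# Equal weights give a symmetric score: 70 / (0.5 * 50 + 0.5 * 200) = 0.56
symmetric = tversky(self_a, self_b, overlap_ab, alpha=0.5, beta=0.5)
print(round(symmetric, 2))  # 0.56
```

Code that consumes Tversky scores should therefore not assume an upper bound of 1.0.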
Note
The default settings for RefTversky use \(\alpha = 0.95\) and \(\beta = 0.05\), and the default settings for FitTversky use \(\alpha = 0.05\) and \(\beta = 0.95\).
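The effect of these defaults can be sketched as follows. The helper names `ref_tversky` and `fit_tversky` are hypothetical, not toolkit functions; the sketch assumes molecule A is the reference molecule and molecule B is the fit molecule, and the scores are made up for illustration.

```python
def tversky(self_a: float, self_b: float, overlap_ab: float,
            alpha: float, beta: float) -> float:
    """Tversky similarity from precomputed self-overlap and overlap scores."""
    return overlap_ab / (alpha * self_a + beta * self_b)


def ref_tversky(self_ref: float, self_fit: float, overlap: float) -> float:
    """Tversky with the RefTversky defaults (alpha = 0.95, beta = 0.05)."""
    return tversky(self_ref, self_fit, overlap, alpha=0.95, beta=0.05)


def fit_tversky(self_ref: float, self_fit: float, overlap: float) -> float:
    """Tversky with the FitTversky defaults (alpha = 0.05, beta = 0.95)."""
    return tversky(self_ref, self_fit, overlap, alpha=0.05, beta=0.95)


# Hypothetical scores: a small fit molecule that sits almost entirely
# inside a larger reference molecule.
self_ref, self_fit, overlap = 100.0, 40.0, 38.0

ref_score = ref_tversky(self_ref, self_fit, overlap)  # 38 / (95 + 2) = 38 / 97
fit_score = fit_tversky(self_ref, self_fit, overlap)  # 38 / (5 + 38) = 38 / 43

# The fit-weighted score is high because most of the fit molecule is covered;
# the reference-weighted score is low because much of the reference is not.
print(fit_score > ref_score)  # True
```

In this example the reference-weighted score asks how much of the reference is matched, while the fit-weighted score asks how much of the fit molecule is matched.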