Similarity Measures

Measuring molecular similarity or dissimilarity has two basic components: the representation of molecular characteristics (such as shape and color) and the similarity coefficient that is used to quantify the degree of resemblance between two such representations. Different similarity coefficients quantify different types of structural resemblance.

The table below defines the basic terms that are used in shape-based similarity calculations.

Basic components of similarity calculation

Symbol             Description
\(selfA\)          Self-overlap or self-color score for molecule A
\(selfB\)          Self-overlap or self-color score for molecule B
\(overlapAB\)      Overlap or color score between molecules A and B

Tanimoto

Formula:

\(Tanimoto_{A,B} = \frac{overlapAB}{selfA + selfB - overlapAB}\)

The Tanimoto similarity measure is symmetric and always has a value between 0.0 and 1.0 for both shape and color.
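
As a minimal sketch of the formula above, the Tanimoto score can be computed directly from the three quantities in the table. The function name `tanimoto`, its parameter names, and the numbers are illustrative assumptions, not part of any toolkit API.

```python
def tanimoto(self_a: float, self_b: float, overlap_ab: float) -> float:
    """Tanimoto similarity from two self scores and their mutual overlap.

    Hypothetical helper for illustration; it mirrors the formula only.
    """
    denominator = self_a + self_b - overlap_ab
    if denominator <= 0.0:
        raise ValueError("self_a + self_b - overlap_ab must be positive")
    return overlap_ab / denominator


# Illustrative numbers only, not taken from a real overlay.
identical = tanimoto(100.0, 100.0, 100.0)  # a molecule scored against itself
partial = tanimoto(100.0, 80.0, 60.0)      # 60 / (100 + 80 - 60)
swapped = tanimoto(80.0, 100.0, 60.0)      # same pair with A and B exchanged
```

Exchanging the roles of A and B leaves the denominator unchanged, which is why `partial` and `swapped` are equal.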

Tversky

Formula:

\(Tversky_{A,B} = \frac{overlapAB}{\alpha * selfA + \beta * selfB}\)

The Tversky similarity measure is asymmetric. Setting the parameters \(\alpha = \beta = 0.5\) makes it symmetric and closely related to the Tanimoto measure: the two rank molecule pairs in the same order, although their numerical values generally differ.
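
As a quick check of the symmetric case, substituting \(\alpha = \beta = 0.5\) into the Tversky formula and comparing with the Tanimoto formula gives, in the notation of the table:

\[
Tversky_{A,B}\Big|_{\alpha = \beta = 0.5} = \frac{2 \, overlapAB}{selfA + selfB} = \frac{2 \, Tanimoto_{A,B}}{1 + Tanimoto_{A,B}}
\]

The right-hand side increases monotonically with \(Tanimoto_{A,B}\), and the two values coincide only at 0.0 and 1.0.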

The factor \(\alpha\) weights the contribution of the reference molecule (molecule A in the formula above), and \(\beta\) weights that of molecule B. The larger \(\alpha\) becomes, the more weight is put on the self-overlap of the reference molecule.

Unlike Tanimoto similarity, Tversky similarity is not guaranteed to lie between 0.0 and 1.0, for either shape or color. When molecules A and B differ in size (number of atoms), it is possible to have \(|overlapAB| > |selfA|\); combined with a large \(\alpha\) and a correspondingly small \(\beta\), this can lead to \(Tversky_{A,B} > 1.0\).
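
The following sketch shows how a score above 1.0 can arise. The helper `tversky` and the input values are hypothetical and chosen only so that the overlap exceeds the self score of the smaller molecule A.

```python
def tversky(self_a: float, self_b: float, overlap_ab: float,
            alpha: float, beta: float) -> float:
    """Tversky similarity as defined by the formula in this section.

    Hypothetical helper for illustration; not a toolkit function.
    """
    denominator = alpha * self_a + beta * self_b
    if denominator <= 0.0:
        raise ValueError("alpha * self_a + beta * self_b must be positive")
    return overlap_ab / denominator


# Illustrative numbers: a small molecule A overlaid on a much larger B,
# so that overlap_ab (70) is greater than self_a (50).
score = tversky(50.0, 200.0, 70.0, alpha=0.95, beta=0.05)

# Exchanging A and B with the same weights gives a different value,
# which is the asymmetry described above.
swapped = tversky(200.0, 50.0, 70.0, alpha=0.95, beta=0.05)
```

Here the denominator is roughly 0.95 * 50 + 0.05 * 200 = 57.5, so `score` is about 1.22, while `swapped` is about 0.36.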

Note

The default settings for RefTversky use \(\alpha = 0.95\) and \(\beta = 0.05\), and the default settings for FitTversky use \(\alpha = 0.05\) and \(\beta = 0.95\).
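
The two default weightings can be sketched as thin wrappers around the Tversky formula. The names `ref_tversky`, `fit_tversky`, and the constants below are assumptions made for this example; they only reproduce the default \(\alpha\) and \(\beta\) values stated in the note and are not the toolkit's own functions.

```python
# Default (alpha, beta) pairs as stated in the note above.
REF_TVERSKY_WEIGHTS = (0.95, 0.05)
FIT_TVERSKY_WEIGHTS = (0.05, 0.95)


def tversky(self_a, self_b, overlap_ab, alpha, beta):
    """Tversky similarity following the formula in this section."""
    return overlap_ab / (alpha * self_a + beta * self_b)


def ref_tversky(self_a, self_b, overlap_ab):
    """Hypothetical wrapper: weights mostly the self score of molecule A."""
    alpha, beta = REF_TVERSKY_WEIGHTS
    return tversky(self_a, self_b, overlap_ab, alpha, beta)


def fit_tversky(self_a, self_b, overlap_ab):
    """Hypothetical wrapper: weights mostly the self score of molecule B."""
    alpha, beta = FIT_TVERSKY_WEIGHTS
    return tversky(self_a, self_b, overlap_ab, alpha, beta)


# Illustrative numbers only: self_a = 100, self_b = 80, overlap_ab = 60.
ref_score = ref_tversky(100.0, 80.0, 60.0)  # 60 / (95 + 4)
fit_score = fit_tversky(100.0, 80.0, 60.0)  # 60 / (5 + 76)
```

With these inputs the two defaults give different scores for the same pair (about 0.61 and 0.74), because each one emphasizes the self score of a different molecule.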