Similarity Measures

Measuring molecular similarity or dissimilarity has two basic components: the representation of molecular characteristics (such as shape and color) and the similarity coefficient that is used to quantify the degree of resemblance between two such representations. Different similarity coefficients quantify different types of structural resemblance.

The table below defines the basic terms that are used in shape-based similarity calculations:

Basic components of similarity calculation

Symbol             Description
\(selfA\)          Self overlap or self color score for molecule A
\(selfB\)          Self overlap or self color score for molecule B
\(overlapAB\)      Overlap or color score between molecules A and B

Tanimoto

Formula:

\(Tanimoto_{A,B} = \frac{overlapAB}{selfA + selfB - overlapAB}\)

The Tanimoto similarity measure is symmetric, and always has a value between 0.0 and 1.0 for both shape and color.
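The Tanimoto formula can be sketched in a few lines of Python. This is a minimal illustration, not a toolkit API: the function name `tanimoto`, its parameter names, and the numeric scores are all assumptions made up for the example.

```python
def tanimoto(self_a: float, self_b: float, overlap_ab: float) -> float:
    """Tanimoto similarity from two self scores and one cross score.

    Hypothetical helper: the arguments mirror selfA, selfB and overlapAB
    from the table of basic components.
    """
    denominator = self_a + self_b - overlap_ab
    if denominator <= 0.0:
        raise ValueError("self_a + self_b - overlap_ab must be positive")
    return overlap_ab / denominator


# Hypothetical scores, chosen only for illustration.
self_a, self_b, overlap_ab = 100.0, 80.0, 60.0

score = tanimoto(self_a, self_b, overlap_ab)    # 60 / (100 + 80 - 60) = 0.5
swapped = tanimoto(self_b, self_a, overlap_ab)  # swapping A and B gives the same value
identical = tanimoto(self_a, self_a, self_a)    # a molecule compared with itself gives 1.0
```

Because the denominator treats `self_a` and `self_b` identically, the order of the two molecules does not matter.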

Tversky

Formula:

\(Tversky_{A,B} = \frac{overlapAB}{\alpha * selfA + \beta * selfB}\)

The Tversky similarity measure is asymmetric. Setting the parameters \(\alpha = \beta = 0.5\) makes it symmetric; it then ranks molecule pairs in the same order as the Tanimoto measure, although the two values generally differ.
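As a short worked derivation (the symbol \(D\) is introduced here only for convenience), substituting \(\alpha = \beta = 0.5\) into the Tversky formula gives an expression that has the form of the Dice coefficient:

\(D = \frac{overlapAB}{0.5 * selfA + 0.5 * selfB} = \frac{2 * overlapAB}{selfA + selfB}\)

Solving this for \(overlapAB\) and substituting into the Tanimoto formula yields:

\(Tanimoto_{A,B} = \frac{D}{2 - D}\)

This mapping increases with \(D\), so the two measures order molecule pairs identically, but their values coincide only at 0.0 and 1.0.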

The factor \(\alpha\) weights the contribution of the first molecule, A, which is the reference molecule. The larger \(\alpha\) becomes, the more weight is put on the self overlap of the reference molecule.

Like the Tanimoto similarity, the Tversky similarity always has a value between 0.0 and 1.0 for shape. However, this is not always true for color. Depending on the number and types of color atoms in molecules A and B, it is possible to have \(|overlapAB| > |selfA|\), which, combined with certain values of \(\alpha\), can lead to \(Tversky_{A,B} > 1.0\).
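The asymmetry and the color case described above can be illustrated with a small Python sketch. The function `tversky`, its default parameters, and the scores are hypothetical; the numbers were picked so that the cross score exceeds the self score of molecule A, and they do not come from a real calculation.

```python
def tversky(self_a: float, self_b: float, overlap_ab: float,
            alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky similarity as overlapAB / (alpha * selfA + beta * selfB).

    Hypothetical helper; alpha weights molecule A, beta weights molecule B.
    """
    denominator = alpha * self_a + beta * self_b
    if denominator <= 0.0:
        raise ValueError("alpha * self_a + beta * self_b must be positive")
    return overlap_ab / denominator


# Hypothetical color scores with overlapAB > selfA (illustration only).
self_a, self_b, overlap_ab = 4.0, 10.0, 5.0

# Heavy weight on A: 5 / (0.95 * 4 + 0.05 * 10) = 5 / 4.3, which is above 1.0.
ab = tversky(self_a, self_b, overlap_ab, alpha=0.95, beta=0.05)

# Same pair with the roles swapped: 5 / (0.95 * 10 + 0.05 * 4) = 5 / 9.7, below 1.0.
ba = tversky(self_b, self_a, overlap_ab, alpha=0.95, beta=0.05)

# Equal weights: 5 / (0.5 * 4 + 0.5 * 10) = 5 / 7, the same in either order.
balanced = tversky(self_a, self_b, overlap_ab)
```

With these made-up inputs, `ab` exceeds 1.0 only because a large \(\alpha\) lets the small `self_a` dominate the denominator; with equal weights the value falls back below 1.0.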