Shape Distance Matrix

Calculates the distance between every pair of molecules in a database. There is only one entry per molecule, although all of its conformers are considered in the comparison. As a result, the conformer behind the values in a particular row or column of the matrix is not necessarily the same from one entry to the next. The complete distance matrix is written to clusters.csv in comma-separated format, which is convenient for feeding into downstream clustering software.
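As an illustration of the downstream step, here is a minimal sketch of reading such a matrix back into Python with only the standard library. It assumes the file holds purely numeric rows with no header row and no label column, which this page does not confirm; adjust the parsing if the real file differs. The function name load_distance_matrix and the 3x3 sample values are made up for the example.

```python
import csv
import io


def load_distance_matrix(handle):
    """Read a square matrix of floats from an open CSV file handle.

    Assumes every non-empty row contains only numbers (no header, no labels).
    """
    rows = [[float(cell) for cell in row] for row in csv.reader(handle) if row]
    n = len(rows)
    for i, row in enumerate(rows):
        if len(row) != n:
            raise ValueError(f"row {i} has {len(row)} entries, expected {n}")
    return rows


# Tiny made-up 3x3 example standing in for a real clusters.csv.
sample = io.StringIO("0.0,0.8,1.5\n0.8,0.0,1.1\n1.5,1.1,0.0\n")
matrix = load_distance_matrix(sample)
```

For a real run, the same function could be called with open("clusters.csv", newline="") in place of the in-memory sample.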

Note

The output values are distances, not Tanimoto scores, so a perfect match is 0.0 rather than 1.0 (shape Tanimoto) or 2.0 (Tanimoto Combo). Tanimoto Combo is used by default; pass the -shapeOnly flag to get only the shape distance.
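The relationship between a similarity score and a distance can be sketched as follows. This assumes the distance is simply the maximum possible score minus the similarity, which is consistent with a perfect match giving 0.0 but is not confirmed here as the script's exact formula. The function and constant names are hypothetical.

```python
# Maximum attainable scores: Tanimoto Combo is the sum of two Tanimoto
# terms, so it tops out at 2.0; shape Tanimoto alone tops out at 1.0.
MAX_TANIMOTO_COMBO = 2.0
MAX_SHAPE_TANIMOTO = 1.0


def to_distance(similarity, shape_only=False):
    """Convert a similarity score to a distance (assumed: max score - score)."""
    max_score = MAX_SHAPE_TANIMOTO if shape_only else MAX_TANIMOTO_COMBO
    return max_score - similarity


perfect_combo = to_distance(2.0)                    # identical molecules
perfect_shape = to_distance(1.0, shape_only=True)   # identical shapes
```

Under this assumption, distances range from 0.0 to 2.0 by default and from 0.0 to 1.0 with -shapeOnly.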

Warning

Both the amount of data generated and the runtime scale as O(N^2) in the number of molecules, so this is not a cheap script to run. It can generally handle thousands to tens of thousands of molecules in a reasonable timeframe on a modern GPU, given a machine with adequate memory and disk space.
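To get a feel for the quadratic growth, the following sketch counts matrix entries and gives a rough file-size estimate. The bytes_per_entry default of 5 (for example "0.53" plus a comma) is purely an assumption for illustration; the real output precision is not stated on this page.

```python
def matrix_entries(n_molecules):
    """Number of values in a full N x N distance matrix."""
    return n_molecules * n_molecules


def estimated_csv_bytes(n_molecules, bytes_per_entry=5):
    """Rough CSV size; bytes_per_entry is a guess, not a measured value."""
    return matrix_entries(n_molecules) * bytes_per_entry


for n in (1_000, 10_000):
    entries = matrix_entries(n)
    megabytes = estimated_csv_bytes(n) / 1e6
    print(f"{n} molecules: {entries} entries, roughly {megabytes:.0f} MB")
```

Going from 1,000 to 10,000 molecules multiplies the number of entries by 100, not by 10.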

Code

prompt> ShapeDistanceMatrix.py [-shapeOnly] [-dbase] <database> [-matrix] <clusters.csv>

Download code

ShapeDistanceMatrix.py