Dataset Similarity and Clustering – DBSCAN

This Floe clusters datasets based on pre-generated fingerprints, using Tanimoto similarity and scikit-learn (sklearn) DBSCAN clustering. The Floe can generate the following datasets:

Members dataset that will contain each molecule from the input dataset with cluster id, cluster size, and cluster member id data fields.
Cores dataset that contains one representative from each cluster with the cluster id and cluster size data fields.
Singletons dataset that contains clusters with only one member.

When writing the singletons dataset, neither the members nor the cores output datasets will contain the singleton records.
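
The routing of records between the three outputs can be sketched as follows. This is only an illustration: the function name route_clusters, the "name" key, and the choice of the first member as the cluster representative are assumptions, since this page does not say how the representative is picked.

```python
def route_clusters(clusters, output_singletons):
    """Split clustered records into members, cores, and singletons lists.

    clusters: list of clusters, each a list of record names.
    output_singletons: mirrors the 'Output Singletons' switch.
    """
    members, cores, singletons = [], [], []
    for cluster_id, names in enumerate(clusters):
        size = len(names)
        rows = [
            {
                "name": name,
                "Cluster ID": cluster_id,
                "Cluster Size": size,
                "Cluster Member": member_id,
            }
            for member_id, name in enumerate(names)
        ]
        if size == 1 and output_singletons:
            # Singletons go only to their own output.
            singletons.extend(rows)
            continue
        members.extend(rows)
        # Assumption: the first member stands in as the representative.
        cores.append(rows[0])
    return members, cores, singletons


clusters = [["a", "b", "c"], ["d"], ["e", "f"]]
with_singletons = route_clusters(clusters, output_singletons=True)
without_singletons = route_clusters(clusters, output_singletons=False)
print([len(part) for part in with_singletons])
print([len(part) for part in without_singletons])
```

With the switch on, record "d" appears only in the singletons list; with it off, "d" is kept in both the members and cores lists.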

If two molecules have a fingerprint Tanimoto similarity score greater than the Minimum Similarity Score Cutoff parameter, they can be clustered together.
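
As a small worked example of the similarity test, the sketch below computes the Tanimoto score for fingerprints represented as sets of "on" bit indices. The set representation, the example fingerprints, and the cutoff value of 0.5 are assumptions made for illustration; the Floe itself works on the stored fingerprint field.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 7, 12}
fp_c = {2, 3, 20}

cutoff = 0.5  # hypothetical Minimum Similarity Score Cutoff

sim_ab = tanimoto(fp_a, fp_b)  # 3 shared bits out of 5 distinct bits
sim_ac = tanimoto(fp_a, fp_c)  # no shared bits

print(sim_ab, sim_ab > cutoff)  # similar enough to share a cluster
print(sim_ac, sim_ac > cutoff)  # too dissimilar to be linked directly
```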

Increasing the default cutoff value will result in more clusters with fewer members and more singletons.
Decreasing the default cutoff value will result in fewer clusters with more members and fewer singletons.
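
The effect of the cutoff can be reproduced on a toy dataset with sklearn's DBSCAN and a precomputed distance matrix. This is a minimal sketch, not the Floe's implementation: it assumes the cutoff is translated to eps = 1 - cutoff on a Tanimoto distance (1 - similarity), that min_samples is 2, and that DBSCAN noise points (label -1) correspond to singletons. The helper names are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy binary fingerprints: rows are molecules, columns are bits.
fps = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 0, 1],
], dtype=bool)


def tanimoto_matrix(fingerprints):
    """Dense NxN Tanimoto similarity matrix for binary fingerprints."""
    f = fingerprints.astype(np.int64)
    common = f @ f.T                       # shared on-bits per pair
    counts = f.sum(axis=1)                 # on-bits per molecule
    union = counts[:, None] + counts[None, :] - common
    return np.where(union > 0, common / np.maximum(union, 1), 0.0)


def cluster(fingerprints, cutoff):
    """Assumed mapping: eps = 1 - cutoff on the Tanimoto distance."""
    distance = 1.0 - tanimoto_matrix(fingerprints)
    model = DBSCAN(eps=1.0 - cutoff, min_samples=2, metric="precomputed")
    return model.fit_predict(distance)


def summarize(labels):
    """Return (number of clusters, number of singletons)."""
    values = labels.tolist()
    return len(set(values) - {-1}), values.count(-1)


for cutoff in (0.1, 0.7, 0.9):
    print(cutoff, summarize(cluster(fps, cutoff)))
```

On this toy data, a cutoff of 0.1 merges all six molecules into one cluster, 0.7 yields two clusters and one singleton, and 0.9 leaves every molecule as a singleton.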

This Floe has to cache all records and generate an NxN similarity matrix. It is therefore not suitable for very large datasets (more than 100K records).
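
A rough estimate shows why the NxN matrix limits dataset size. The sketch assumes a dense matrix of 8-byte floating-point values; the Floe's actual storage format is not described on this page, so treat the numbers as an order-of-magnitude guide only.

```python
def similarity_matrix_gib(n_records, bytes_per_value=8):
    """Memory needed for a dense NxN matrix, in GiB (assumes 8-byte floats)."""
    return n_records * n_records * bytes_per_value / 2**30


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} records: {similarity_matrix_gib(n):.2f} GiB")
```

Because the cost grows with the square of the record count, 100K records need roughly 100 times the memory of 10K records under this assumption.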

The fingerprint can be pre-generated using the following two Floes in this package:

Fingerprint Generation
Fingerprint Generation (User Defined) Floe

Extra Required Parameters

Input Dataset (data_source) : Dataset to cluster

Cluster Size Field (Field Type: Int) : The name for the field that will contain the size of the cluster the molecule belongs to.

Default: Cluster Size

Fingerprint Field (Field Type: Chem.FingerPrint) : Tag name for the field that stores fingerprints.

Default: Fingerprint

2D Chemical Space X (Field Type: Float) : Tag name for the field that stores the X coordinates of the 2D chemical space based on fingerprints.

Default: Cluster X

2D Chemical Space Y (Field Type: Float) : Tag name for the field that stores the Y coordinates for the 2D chemical space based on fingerprints.

Default: Cluster Y

Fingerprint Field (Field Type: Chem.FingerPrint) : Tag name for the field that stores fingerprints.

Default: Fingerprint

Fingerprint Set (Field Type: RecordVec) : Fingerprint record sets

Log Field (Field Type: String) : The field that stores messages for the Floe report.

Default: Log Field

UUID (Field Type: String) : The field to store unique identifiers for fingerprints and molecules.

Cluster ID Field (Field Type: Int) : The name for the field that will contain the unique cluster ID.

Default: Cluster ID

Cluster Member Field (Field Type: Int) : The name for the field that will contain the unique number ID of the molecule in the cluster.

Default: Cluster Member

Cluster Size Field (Field Type: Int) : The name for the field that will contain the size of the cluster the molecule belongs to.

Default: Cluster Size

Fingerprint Field (Field Type: Chem.FingerPrint) : Tag name for the field that stores fingerprints.

Default: Fingerprint

Histogram Bin Centers (Field Type: FloatVec) : The field to store histogram bin centers of similarity calculation.

Default: Histogram Bin Centers

Histogram Counts (Field Type: FloatVec) : The field to store histogram counts of similarity calculation.

Default: Histogram Counts
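
The two histogram fields hold parallel vectors: one bin center and one count per bin. A minimal sketch of how such vectors could be produced with NumPy is shown below; the number of bins, the [0, 1] range, and the function name are assumptions, as this page does not state how the Floe bins the scores.

```python
import numpy as np


def similarity_histogram(scores, n_bins=10):
    """Return (bin_centers, counts) as plain float lists for similarity scores."""
    counts, edges = np.histogram(scores, bins=n_bins, range=(0.0, 1.0))
    centers = (edges[:-1] + edges[1:]) / 2.0  # midpoint of each bin
    return centers.tolist(), counts.astype(float).tolist()


# Hypothetical pairwise similarity scores.
scores = [0.75, 0.5, 0.75, 0.25, 0.2, 0.25, 0.75, 0.2, 0.0]
centers, counts = similarity_histogram(scores, n_bins=4)
print(centers)
print(counts)
```

Storing centers rather than edges keeps both vectors the same length, which matches the pairing of the two FloatVec fields.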

Output Members Dataset (dataset_out) : Output dataset to write to

Default: members

Fingerprint Field (Field Type: Chem.FingerPrint) : Tag name for the field that stores fingerprints.

Default: Fingerprint

Fingerprint Set (Field Type: RecordVec) : Fingerprint record sets

Histogram Bin Centers (Field Type: FloatVec) : The field to store histogram bin centers of similarity calculation.

Default: Histogram Bin Centers

Histogram Counts (Field Type: FloatVec) : The field to store histogram counts of similarity calculation.

Default: Histogram Counts

Similarity Score Field (Field Type: Float) : Name for the field that stores fingerprint similarity scores.

UUID (Field Type: String) : The field to store unique identifiers for fingerprints and molecules.

Cluster ID Field (Field Type: Int) : The name for the field that will contain the unique cluster ID.

Default: Cluster ID

Cluster Member Field (Field Type: Int) : The name for the field that will contain the unique number ID of the molecule in the cluster.

Default: Cluster Member

Cluster Size Field (Field Type: Int) : The name for the field that will contain the size of the cluster the molecule belongs to.

Default: Cluster Size

Fingerprint Set (Field Type: RecordVec) : Fingerprint record sets

Log Field (Field Type: String) : The field that stores messages for the Floe report.

Default: Log Field

Output Cluster Cores (boolean) : If on, then one representative from each cluster will be sent to the ‘cores’ output dataset.

Default: False

Output Cluster Members (boolean) : If on, then each record with cluster id will be sent to the ‘members’ dataset.

Default: True

Output Singletons (boolean) : If on, then singletons will be sent only to the ‘singletons’ output dataset. Otherwise they will be emitted to both the ‘members’ and ‘cores’ output datasets with the other records.

Default: False

Similarity Score Field (Field Type: Float) : Name for the field that stores fingerprint similarity scores.

UUID (Field Type: String) : The field to store unique identifiers for fingerprints and molecules.