Dataset Similarity and Clustering – DBSCAN

This Floe clusters datasets based on pre-generated fingerprints using Tanimoto similarity calculation and sklearn DBSCAN clustering. The Floe can generate the following datasets:

  • Members dataset that will contain each molecule from the input dataset with cluster id, cluster size, and cluster member id data fields.

  • Cores dataset that contains one representative from each cluster with the cluster id and cluster size data fields.

  • Singletons dataset that contains clusters with only one member.

When the singletons dataset is written, neither the members nor the cores output dataset will contain the singleton records.
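
The relationship between the three outputs can be sketched in plain Python. This is only an illustration, not the Floe's implementation: the function name `split_outputs`, the dictionary keys, the input shape, and the choice of the first member as the cluster representative are all assumptions made for the example.

```python
from collections import defaultdict


def split_outputs(labels, output_singletons=False):
    """Split clustered molecules into members, cores, and singletons.

    labels: list of (molecule_name, cluster_id) pairs (hypothetical input shape).
    """
    clusters = defaultdict(list)
    for name, cluster_id in labels:
        clusters[cluster_id].append(name)

    members, cores, singletons = [], [], []
    for cluster_id, names in sorted(clusters.items()):
        size = len(names)
        records = [
            {"name": n, "cluster_id": cluster_id,
             "cluster_size": size, "cluster_member": i}
            for i, n in enumerate(names, start=1)
        ]
        if size == 1 and output_singletons:
            # Singletons go only to their own dataset.
            singletons.extend(records)
            continue
        members.extend(records)
        # Assumption: the first member stands in as the cluster representative.
        core = {k: records[0][k] for k in ("name", "cluster_id", "cluster_size")}
        cores.append(core)
    return members, cores, singletons


labels = [("mol_a", 0), ("mol_b", 0), ("mol_c", 1), ("mol_d", 0)]
members, cores, singletons = split_outputs(labels, output_singletons=True)
```

With `output_singletons=True`, `mol_c` appears only in `singletons`; with `output_singletons=False`, it is written to both `members` and `cores` alongside the other records.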

If two molecules have a fingerprint Tanimoto similarity score at or above the Minimum Similarity Score Cutoff parameter, they can be clustered together.

  • Increasing the default cutoff value will result in more clusters with fewer members and more singletons.

  • Decreasing the default cutoff value will result in fewer clusters with more members and fewer singletons.
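
The effect of the cutoff can be illustrated with a small pure-Python sketch. It assumes fingerprints are represented as sets of on-bit indices and that every molecule counts as a core point (the equivalent of `min_samples=1` in sklearn DBSCAN), in which case the clusters are the connected groups of molecules linked by a similarity at or above the cutoff. The Floe's actual DBSCAN settings are not stated here, so `cluster_by_cutoff` and the toy fingerprints are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def cluster_by_cutoff(fingerprints, cutoff):
    """Group fingerprints linked by similarity >= cutoff (connected components)."""
    n = len(fingerprints)
    labels = [-1] * n  # -1 means "not yet assigned"
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                # Assumption: a score exactly equal to the cutoff still links two molecules.
                if labels[j] == -1 and tanimoto(fingerprints[i], fingerprints[j]) >= cutoff:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6, 7}, {8, 9}]
loose = cluster_by_cutoff(fps, cutoff=0.3)   # one cluster of three plus one singleton
strict = cluster_by_cutoff(fps, cutoff=0.5)  # one cluster of two plus two singletons
```

Raising the cutoff from 0.3 to 0.5 breaks the third molecule away from the first cluster, which gives more clusters, fewer members per cluster, and more singletons.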

This Floe has to cache records and generate an NxN similarity matrix; therefore, it is not suitable for very large datasets (more than 100K records).
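
A back-of-the-envelope estimate shows why the NxN matrix limits the dataset size. The sketch below assumes a dense matrix with one 8-byte float per pair; the Floe's actual storage format is not documented here, so treat the numbers as an order-of-magnitude guide only.

```python
def dense_matrix_bytes(n_records, bytes_per_value=8):
    """Memory needed for a dense n x n similarity matrix."""
    return n_records * n_records * bytes_per_value


for n in (10_000, 100_000, 1_000_000):
    gigabytes = dense_matrix_bytes(n) / 1e9
    print(f"{n:>9,} records -> {gigabytes:,.1f} GB")
```

Under these assumptions, 100,000 records already require about 80 GB, and the cost grows quadratically with the number of records.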

The fingerprint can be pre-generated using the following two Floes in this package:

  • Fingerprint Generation

  • Fingerprint Generation (User Defined)

Extra Required Parameters

  • Input Dataset (data_source) : Dataset to cluster
  • Cluster Size Field (Field Type: Int) : The name for the field that will contain the size of the cluster the molecule belongs to.
    Default: Cluster Size
  • Fingerprint Field (Field Type: Chem.FingerPrint) : Tag name for the field that stores fingerprints.
    Default: Fingerprint
  • 2D Chemical Space X (Field Type: Float) : Tag name for the field that stores the X coordinates of the 2D chemical space based on fingerprints.
    Default: Cluster X
  • 2D Chemical Space Y (Field Type: Float) : Tag name for the field that stores the Y coordinates for the 2D chemical space based on fingerprints.
    Default: Cluster Y
  • Fingerprint Field (Field Type: Chem.FingerPrint) : Tag name for the field that stores fingerprints.
    Default: Fingerprint
  • Fingerprint Set (Field Type: RecordVec) : Fingerprint record sets
  • Log Field (Field Type: String) : The field that stores messages for the Floe report
    Default: Log Field
  • UUID (Field Type: String) : The field to store unique identifiers for fingerprints and molecules.
  • Cluster ID Field (Field Type: Int) : The name for the field that will contain the unique cluster ID.
    Default: Cluster ID
  • Cluster Member Field (Field Type: Int) : The name for the field that will contain the unique number ID of the molecule in the cluster.
    Default: Cluster Member
  • Cluster Size Field (Field Type: Int) : The name for the field that will contain the size of the cluster the molecule belongs to.
    Default: Cluster Size
  • Fingerprint Field (Field Type: Chem.FingerPrint) : Tag name for the field that stores fingerprints.
    Default: Fingerprint
  • Histogram Bin Centers (Field Type: FloatVec) : The field to store histogram bin centers of similarity calculation.
    Default: Histogram Bin Centers
  • Histogram Counts (Field Type: FloatVec) : The field to store histogram counts of similarity calculation.
    Default: Histogram Counts
  • Output Members Dataset (dataset_out) : Output dataset to write to
    Default: members
  • Fingerprint Field (Field Type: Chem.FingerPrint) : Tag name for the field that stores fingerprints.
    Default: Fingerprint
  • Fingerprint Set (Field Type: RecordVec) : Fingerprint record sets
  • Histogram Bin Centers (Field Type: FloatVec) : The field to store histogram bin centers of similarity calculation.
    Default: Histogram Bin Centers
  • Histogram Counts (Field Type: FloatVec) : The field to store histogram counts of similarity calculation.
    Default: Histogram Counts
  • Similarity Score Field (Field Type: Float) : Name for the field that stores fingerprint similarity scores.
  • UUID (Field Type: String) : The field to store unique identifiers for fingerprints and molecules.
  • Cluster ID Field (Field Type: Int) : The name for the field that will contain the unique cluster ID.
    Default: Cluster ID
  • Cluster Member Field (Field Type: Int) : The name for the field that will contain the unique number ID of the molecule in the cluster.
    Default: Cluster Member
  • Cluster Size Field (Field Type: Int) : The name for the field that will contain the size of the cluster the molecule belongs to.
    Default: Cluster Size
  • Fingerprint Set (Field Type: RecordVec) : Fingerprint record sets
  • Log Field (Field Type: String) : The field that stores messages for the Floe report
    Default: Log Field
  • Output Cluster Cores (boolean) : If on, then one representative from each cluster will be sent to the ‘cores’ output dataset
    Default: False
  • Output Cluster Members (boolean) : If on, then each record with cluster id will be sent to the ‘members’ dataset.
    Default: True
  • Output Singletons (boolean) : If on, then singletons will be sent only to the ‘singletons’ output dataset. Otherwise they will be emitted to both the ‘members’ and ‘cores’ output datasets with the other records.
    Default: False
  • Similarity Score Field (Field Type: Float) : Name for the field that stores fingerprint similarity scores.
  • UUID (Field Type: String) : The field to store unique identifiers for fingerprints and molecules.
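
The Histogram Bin Centers and Histogram Counts fields listed above describe the distribution of similarity scores. The sketch below shows one way such a pair of vectors could be produced; the bin count, the [0, 1] score range, and the function name `similarity_histogram` are assumptions, not the Floe's documented behavior.

```python
def similarity_histogram(scores, n_bins=10):
    """Return (bin_centers, counts) for scores in the range [0, 1]."""
    width = 1.0 / n_bins
    centers = [(i + 0.5) * width for i in range(n_bins)]
    counts = [0.0] * n_bins  # floats, mirroring a FloatVec-style field
    for score in scores:
        # Clamp so a score of exactly 1.0 lands in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        counts[index] += 1.0
    return centers, counts


centers, counts = similarity_histogram([0.05, 0.12, 0.18, 0.55, 1.0], n_bins=5)
```

A histogram like this can help when choosing the Minimum Similarity Score Cutoff, since it shows how many pairs would fall at or above a candidate value.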