v1.1.0 December 2022
This package is built using
The 2D Diverse Subset and 3D Diverse Subset floes were added, which find diverse subsets of the requested size from input molecule datasets using 2D and 3D clustering, respectively.
The 2D Hitlist Clustering and 3D Hitlist Clustering floes were added, which cluster large 2D and 3D hitlists using a provided score field to direct sphere exclusion clustering. They also provide output sorted by the clusters with the best scores.
The Large Scale 2D Similarity Clustering and Large Scale 3D Similarity Clustering floes were added, which can cluster large input datasets of over 100,000 molecules using directed sphere exclusion clustering.
The K-Medoids 2D Similarity Clustering and K-Medoids 3D Similarity Clustering floes were added, which cluster input datasets using OEGraphSim 2D similarity and ROCS 3D similarity scores, and scikit-learn k-medoids clustering.
The DBSCAN 2D Similarity Clustering and DBSCAN 3D Similarity Clustering floes were added, which use ROCS 3D similarity scores to cluster input datasets using scikit-learn DBSCAN or hierarchical clustering.
The Generate 2D Similarity Matrix and Generate 3D Similarity Matrix floes were added, which calculate similarity scores using OEGraphSim and ROCS, gather summary statistics on these scores, and optionally generate 2D or 3D similarity or distance matrices as files that can be input to other clustering floes.
The Calculate Average Precision floe was added, which calculates average precision on an input dataset using a binary classifier.
2D clustering floes now either accept pregenerated fingerprints or generate fingerprints within the floe.
The 2D and 3D DBSCAN, Hierarchical, and K-Medoids clustering floes can optionally output the distance matrix that was calculated for clustering.
The previously existing 2D Hierarchical and DBSCAN clustering floes have been optimized to run much faster and produce more accurate results.
The DBSCAN 2D Similarity Clustering and DBSCAN 3D Similarity Clustering floes were tuned to more accurately calculate a reasonable EPS automatically, using constraints on the largest cluster percentage, if EPS is not provided. These floes also now output outliers as singleton clusters, instead of ignoring them in the output dataset.
The 2D DBSCAN, Hierarchical, and K-Medoids clustering floes can now take a similarity or distance matrix NumPy binary file as input, for custom clustering applications.
Output for any of the clustering floes can now optionally sort clusters based on a selected score field for each molecule.
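The score-directed sphere exclusion clustering mentioned above for the hitlist and large-scale floes can be illustrated with a minimal sketch. This is not the floes' implementation: the function names, the set-of-on-bits fingerprint representation, the cutoff value, and the toy data are all assumptions made for illustration. The sketch visits items from best to worst score, makes each unassigned item a cluster center, and pulls in every remaining item whose Tanimoto similarity to that center meets the cutoff.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def directed_sphere_exclusion(fingerprints, scores, cutoff=0.5, lower_is_better=True):
    """Cluster items by visiting them from best to worst score.

    Each unassigned item becomes a cluster center, and every remaining
    unassigned item with similarity >= cutoff to that center joins it.
    Returns a list of clusters; each cluster is a list of item indices
    with the center first.
    """
    order = sorted(range(len(fingerprints)), key=lambda i: scores[i],
                   reverse=not lower_is_better)
    assigned = set()
    clusters = []
    for center in order:
        if center in assigned:
            continue
        members = [center]
        assigned.add(center)
        for other in order:
            if other in assigned:
                continue
            if tanimoto(fingerprints[center], fingerprints[other]) >= cutoff:
                members.append(other)
                assigned.add(other)
        clusters.append(members)
    return clusters


# Hypothetical toy data: on-bit sets and docking-style scores (lower is better).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
scores = [-7.0, -9.5, -8.0, -6.0, -5.0]
clusters = directed_sphere_exclusion(fps, scores, cutoff=0.5)
print(clusters)
```

Because centers are chosen in score order, the clusters come out ordered by the score of their best member, which is in the spirit of the sorted output described above.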
v1.0.0 July 2022
This package is built using
The MultistatePKaModel based Ionization states enumeration floe was added, which enumerates the reasonable ionization state(s) of input molecules at neutral/physiological pH (7.4) based on the pKa assessed using a multistate pKa model.
The Hierarchical 2D Similarity Clustering floe was added. This floe clusters datasets based on pre-generated fingerprints using Tanimoto similarity calculation and hierarchical clustering. Unlike the existing DBSCAN and sphere exclusion floes, this floe allows the user to specify the number of clusters they would like.
The Dataset Manipulation – Add Molecule Title Field floe was added. This floe updates a dataset with a title field for the primary molecule of that dataset.
The Dataset Manipulation – Add Title to Molecule Field floe was added. This floe updates the primary molecule field of a dataset with a title taken from a string field in that dataset.
The Dataset Filtering – Create Custom Filter floe was added. This floe creates a custom molecule filter file compatible with the OEMolProp toolkit.
All floes have a new brief description and are placed in the new Orion floe classification system.
A clustering tutorial was added that briefly describes how to run the clustering floes and analyze their data.
The Dataset Subsetting – Random Selection floe was combined with the Dataset Subsetting – Random Splitting floe into the Dataset Subsetting – Random Splitting or Selection floe. The combined floe has been redesigned so that more of the cubes can run in parallel.
The DBSCAN 2D Similarity Clustering floe was modified to give the user more control over the size of the clusters in the floe. It now has two optional parameters, minimum and maximum largest cluster percentage, which can be used in place of eps.
Error handling was improved in the Dataset Manipulation – Field Type Conversion floe.
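To illustrate what "specify the number of clusters" means for the hierarchical clustering floe noted above, here is a minimal agglomerative sketch. It is an assumption-laden illustration, not the floe's code: the average-linkage choice, the function names, and the set-of-on-bits fingerprints are all hypothetical. It starts with every item in its own cluster and repeatedly merges the two closest clusters by Tanimoto distance until the requested count remains.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bits."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return 1.0 - (shared / union if union else 0.0)


def agglomerate(fingerprints, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(fingerprints))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(tanimoto_distance(fingerprints[i], fingerprints[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical toy fingerprints.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {1, 2, 10}]
print(agglomerate(fps, 2))
```

This brute-force loop is cubic in the number of items and is only meant to show the stopping rule; a production workflow would use an optimized library routine.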
v0.2.4 December 2021
This package is built using
The Dataset Manipulation -- Field Type Conversion floe was added, which converts fields on records of basic types (boolean, integer, float, and string) to fields of another basic type.
The two dataset clustering floes were updated to include singleton counts in floe reports in all cases, even if writing to a singleton dataset.
A bug in the Dataset Similarity - Fingerprint Generation floe was fixed. The floe now writes molecules that failed fingerprinting to a failure dataset.
The minimum Tanimoto cutoff for clustering floes has been removed, so any cutoff as low as zero can be used.
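The field type conversion described above (boolean, integer, float, and string) can be sketched in plain Python. The parsing rules below, the dictionary-based records, and the split into converted and failed records are assumptions for illustration; the floe's actual conversion and failure behavior may differ.

```python
def convert_value(value, target):
    """Convert a bool, int, float, or str value to the named basic type."""
    if target == "string":
        return str(value)
    if target == "boolean":
        if isinstance(value, str):
            text = value.strip().lower()
            if text in ("true", "1", "yes"):
                return True
            if text in ("false", "0", "no"):
                return False
            raise ValueError(f"cannot interpret {value!r} as boolean")
        return bool(value)
    if target == "integer":
        # int() truncates floats and rejects strings such as "3.7" or "n/a".
        return int(value)
    if target == "float":
        return float(value)
    raise ValueError(f"unknown target type {target!r}")


def convert_field(records, field, target):
    """Return (converted, failed) lists; the input records are left untouched."""
    converted, failed = [], []
    for record in records:
        try:
            new_record = dict(record)
            new_record[field] = convert_value(record[field], target)
            converted.append(new_record)
        except (KeyError, ValueError, TypeError):
            failed.append(record)
    return converted, failed


# Hypothetical records: one clean value, one unparsable value, one missing field.
records = [{"id": "a", "count": "12"}, {"id": "b", "count": "n/a"}, {"id": "c"}]
converted, failed = convert_field(records, "count", "integer")
print(converted, failed)
```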
v0.2.3 June 2021
This package is built using OpenEye-Snowball==0.21.0 and the associated OpenEye-orionplatform.
The previously existing four separate subset floes were combined into a single floe, Dataset Subsetting, that can subset based on a string field, numerical field, dataset, or regex.
The Dataset Subsetting Based on String Keys Floe takes a dataset and two input parameters: a string field from that dataset, and a string. It splits the string parameter by line to create keys, and then emits records from the input dataset whose values of the specified string field match any of these keys.
Created two new floes that combine functionality in existing floes:
The Generate and Deduplicate SMILES for a Dataset Floe adds a new string data field to a dataset that stores the SMILES representation of the primary molecule of each record. It then deduplicates the dataset based on canonical SMILES.
The Generate and Deduplicate SMILES for One or More Datasets Floe does the same as the floe above, but also concatenates input datasets.
Extended the floes that use deduplication to be able to do this deduplication based on a string field; float field or int field, with numerical tolerance; or a molecule field.
Fixed a bug in the Dataset Classification -- Bemis-Murcko floe, which was restricting the name of the molecule input column. Now the column does not need to have a specific name.
Floes were standardized with the floe_endgame function from snowball, which abstracts the success and failure output behavior of a floe.
Floe descriptions were improved and extended.
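The string-key subsetting behavior described in this release (split a string by line to create keys, then emit the records whose field value matches any key) can be sketched as follows. The function name, the dictionary-based records, and the whitespace stripping are assumptions for illustration, not the floe's code.

```python
def subset_by_string_keys(records, field, keys_text):
    """Keep records whose `field` value equals one of the newline-separated keys."""
    # Assumption: blank lines are ignored and surrounding whitespace is stripped.
    keys = {line.strip() for line in keys_text.splitlines() if line.strip()}
    return [record for record in records if record.get(field) in keys]


# Hypothetical records and key list.
records = [
    {"name": "aspirin", "mw": 180.2},
    {"name": "caffeine", "mw": 194.2},
    {"name": "ibuprofen", "mw": 206.3},
]
selected = subset_by_string_keys(records, "name", "aspirin\nibuprofen\n")
print([record["name"] for record in selected])
```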
v0.2.2 December 2020
This package is built using OpenEye-Snowball==0.20.1 and the associated OpenEye-orionplatform.
This version solves some dependency resolution issues.
v0.2.1 November 2020
This package is built using OpenEye-Snowball==0.20.0 and the associated OpenEye-orionplatform.
v0.2.0 August 2020
Upgraded to use OpenEye-Snowball==0.19.0 and the associated
Minor bug fixes and improvements to default output dataset names for Floes
The DBSCAN Cube has been refactored, exposing the DBSCAN algorithm parameters epsilon and minimum samples. Furthermore, the automatic estimation of epsilon has been improved for cases where the user does not supply one.
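For readers unfamiliar with the two exposed parameters, the sketch below shows how epsilon and minimum samples act in the textbook DBSCAN algorithm, run here over a precomputed distance matrix. It is a generic illustration with hypothetical names and toy data, not the cube's implementation, and it does not attempt to reproduce the automatic epsilon estimation.

```python
def dbscan(dist, eps, min_samples):
    """Label points with cluster ids (0, 1, ...) or -1 for noise.

    dist is a square matrix of pairwise distances. A point is a core point
    when at least min_samples points (itself included) lie within eps.
    """
    n = len(dist)
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point: joins but is not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep growing the cluster
        cluster_id += 1
    return labels


# Hypothetical 1D toy data turned into a distance matrix.
points = [0, 1, 2, 10, 11, 12, 30]
dist = [[abs(a - b) for b in points] for a in points]
labels = dbscan(dist, eps=1.5, min_samples=2)
print(labels)
```

A larger epsilon merges neighborhoods into fewer, bigger clusters, while a larger minimum samples value turns more sparse points into noise.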
v0.1.1 April 2020
The package is built using OpenEye-Snowball==0.18.0 and the associated
Dataset Append -- Generating SMILES Field Floe has been added that adds a SMILES field to records.
Dataset Classification -- Bemis-Murcko Floe has been added that classifies molecules based on their Bemis-Murcko frameworks.
Dataset Manipulation -- Concatenation Floe has been added to concatenate datasets.
Dataset Filtering -- Built-in Filter Types Floe has been added that filters a dataset based on built-in filtering types.
Dataset Subsetting -- Random Selection Floe has been added that randomly selects N records.
Dataset Subsetting -- Random Splitting Floe has been added that randomly splits datasets.
Dataset Manipulation -- Field Rename Floe has been added that renames a record field.
Dataset Subsetting -- Based on Reference Dataset Floe has been added that subsets a dataset based on whether its molecules exist in a reference dataset.
Dataset Subsetting -- Based on Numerical Field Floe has been added that subsets a dataset based on a numerical (float/int) data field.
Dataset Subsetting -- Based on String Field Floe has been added that subsets a dataset based on a string data field.
Dataset Subsetting -- Based on String Field (Regex) Floe has been added that subsets a dataset based on whether a string data field matches a given regular expression.
Dataset Deduplication -- Based on String Field Floe has been added that deduplicates a dataset based on a user-defined string field.
Dataset Deduplication -- Based on SMILES Floe has been added that deduplicates a dataset based on canonical SMILES.