ROCS X - Initialize 3D Search
Description
This floe initializes a ROCS X model for running a 3D similarity search on a query. First, the floe reads in a ROCS X 3D library that contains 3D conformers for synthons and a sample of products. An initial FastROCS search is performed with the product sample against the query. The hit list of this initial FastROCS search is used to seed the model. The top-scoring products in the hit list are fragmented to construct “bandit arms.” These bandit arms form the decision space that’s used when running a 3D search with the model. A second FastROCS search is performed with the bandit arms as queries against the synthon conformers to associate synthons with the bandit arms.
Key Inputs and Outputs
The key inputs are the query to search (only single-conformer molecule queries are supported at this time) and the ROCS X 3D library to search (typically the output from the Multi-Stage ROCS X Preparation or ROCS X - Prepare 3D Library Floe). A hit list from an external search on the query can be supplied as an optional input (see the Speed Up Preparation of a Very Large ROCS X 3D Library How-To Guide in the documentation).
The key output is a ROCS X 3D Search Model, which is typically used in the ROCS X - Run 3D Search Floe to run a 3D similarity search on the query with the model.
Cost Considerations
The floe cost scales with two key parameters: the number of synthons in the input ROCS X 3D Library and the Maximum Number of Bandit Arms parameter. The computational bottleneck is typically the FastROCS search of the bandit arm queries against the synthon conformers in the Parallel FastROCS (Synthons) Cube, which runs on a GPU instance. If the Maximum Number of Bandit Arms is halved (for example, 500 instead of the default 1,000), the FastROCS search has half as many queries, so the cube should take about half as long to run. Using too few bandit arms in the model can degrade search results. It is recommended to keep the Maximum Number of Bandit Arms at the default unless you are running a quick search or debugging.
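As a rough illustration of the scaling described above, the following sketch estimates the cost of the bottleneck cube relative to a run with the default settings. It assumes strictly linear scaling in both the number of bandit arm queries and the number of synthons, and it ignores fixed overheads. The function name and the synthon_ratio argument are hypothetical and are not part of the floe.

```python
DEFAULT_MAX_ARMS = 1000  # default of Maximum Number of Bandit Arms (max_init_arms)


def relative_synthon_search_cost(max_arms, synthon_ratio=1.0):
    """Estimate bottleneck-cube cost relative to a run with the default settings.

    Assumption: cost is proportional to (number of bandit arm queries) x
    (number of synthons searched). synthon_ratio is the library size
    relative to some reference library (hypothetical helper argument).
    """
    if max_arms <= 0 or synthon_ratio <= 0:
        raise ValueError("max_arms and synthon_ratio must be positive")
    return (max_arms / DEFAULT_MAX_ARMS) * synthon_ratio


print(relative_synthon_search_cost(500))                      # 0.5
print(relative_synthon_search_cost(1000, synthon_ratio=2.0))  # 2.0
```

Because Maximum Number of Bandit Arms is an upper limit, a run that creates fewer bandit arms than the limit would cost less than this estimate suggests.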
Promoted Parameters
Title in user interface (promoted name)
Inputs
Query to Search (query_in): The dataset containing the query. Only single-conformer molecule queries are supported at this time. If a multi-conformer molecule is provided, the floe will fail immediately.
Required
Type: data_source
ROCS X 3D Library to Search (fastrocs_collection_in): The collection containing the 3D library to search.
Required
Type: collection_source
Initialize with Sampled Products (sample_products_flag):
Type: boolean
Default: True
Choices: [True, False]
Inputs: Model Initialization
External Hit List for Query (records_in): (Optional) Dataset containing a hit list of ROCS scores from an external search on the query.
Type: data_source
External Hit List Score Field (ext_hitlist_field_name): (Optional) Field for 3D similarity scores on the external hit list.
Type: field_parameter::any
Default: Tanimoto Combo
Outputs
ROCS X 3D Search Model Collection (collection_out): The name of the output collection containing the initialized model that can be run in a 3D search.
Required
Type: collection_sink
Default: ROCS X 3D Search Model
Hit List from Product Sample Dataset (initial_sample_hit_list_dataset): The name of the output hit list dataset from ROCS rescoring of the top hits in the initial FastROCS search of the product sample against the query.
Required
Type: dataset_out
Default: Product Sample Hit List
Temporary Collection (temporary_collection): This collection is created by the floe for internal use during the run and is automatically deleted by the floe when it finishes.
Required
Type: collection_sink
Default: Temporary Collection
Failures Collection (failures_out): The name of the output collection of failures.
Type: collection_sink
Default: ROCS X Initialize 3D Search Failures
Options: Basic Bandit Arm Search
Maximum Number of Bandit Arms (max_init_arms): Bandit arm creation will stop if the number of bandit arms exceeds this threshold. The cost of the floe scales with this parameter. If it is halved (for example, 500 instead of the default 1,000), the cube that is the computational bottleneck should take about half as long to run. Using too few bandit arms in the model can degrade search results. It is recommended to keep this at the default unless you are running a quick search or debugging.
Type: integer
Default: 1000
Number of Hits to Fragment (num_fragment): The number of top-scoring hits that will be used to create bandit arms for the model.
Type: integer
Default: 2500
Clustering Score Threshold (cluster_threshold): Threshold ROCS score for bandit arm clustering. Fragments scoring higher than the threshold will be clustered together in a bandit arm. Setting this higher may result in more bandit arms being discovered.
Required
Type: decimal
Default: 1.6
Synthon Search Score Threshold (searchlist_threshold): Threshold ROCS score for synthon search. Synthons scoring higher than the threshold will be associated with the search lists for the bandit arms. It is recommended to set this slightly lower than the Clustering Score Threshold.
Required
Type: decimal
Default: 1.4
Synthon Search List Size (num_searchlist): Maximum number of synthons to store on bandit arm search lists.
Type: integer
Default: 10000
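The interaction between Synthon Search Score Threshold and Synthon Search List Size can be sketched as follows. This is a minimal illustration, not the floe's implementation: the function name and the input dictionary are hypothetical, and keeping the highest-scoring synthons when the list is truncated is an assumption that the descriptions above do not state.

```python
def build_search_list(synthon_scores, searchlist_threshold=1.4, num_searchlist=10000):
    """Select synthons for one bandit arm's search list.

    synthon_scores: dict mapping a synthon identifier to its ROCS score
    against the bandit arm (hypothetical input format).
    """
    # Keep only synthons scoring higher than the threshold.
    passing = [(sid, score) for sid, score in synthon_scores.items()
               if score > searchlist_threshold]
    # Assumption: when the list is too long, the best-scoring synthons are kept.
    passing.sort(key=lambda item: item[1], reverse=True)
    return passing[:num_searchlist]


scores = {"s1": 1.55, "s2": 1.40, "s3": 1.72, "s4": 0.90}
print(build_search_list(scores, num_searchlist=2))
# [('s3', 1.72), ('s1', 1.55)]
```

In this example, s2 is dropped because its score is equal to the threshold rather than higher than it.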
Options: Overlay
FastROCS Similarity Type (fastrocs_similarity_type): Type of FastROCS similarity to use to rank molecules for the FastROCS hit lists.
Type: string
Default: Tanimoto Combo
Choices: [‘Tanimoto Combo’, ‘Ref Tversky’, ‘Fit Tversky’, ‘Shape Tanimoto’, ‘Shape Ref Tversky’, ‘Shape Fit Tversky’]
ROCS Score Field (score_field): Field on the record that contains ROCS scores against the query.
Required
Type: field_parameter::float
Default: Tanimoto Combo
Choices: [‘Tanimoto Combo’, ‘Tversky Combo’]
Number of FastROCS Random Starts (number_of_fastrocs_random_starts): If specified, FastROCS will use the specified number of random starting orientations for each conformer being overlaid with FastROCS. If unspecified, the default of 4 inertial starts will be used. Compute time and cost scale roughly linearly with the number of starts.
Type: integer
Shape Only FastROCS Overlay (shape_only_fastrocs_overlay): If set to On (True), FastROCS will overlay molecules using shape only, ignoring color. If set to Off (False), FastROCS will overlay molecules using shape and color. Note that this parameter affects the overlay process but not the scoring; that is, the overlay can be done with shape only while the scoring is done with shape and color.
Type: boolean
Default: False
Choices: [True, False]
Options: Re-scoring
Hit List Size (hit_list_size): Size of the product sample hit list. The maximum allowed value is 100,000. The minimum allowed value is 1,000.
Required
Type: integer
Default: 10000
Number of Molecules to Rescore (number_of_molecules_to_rescore): The number of top-scoring molecules from FastROCS that will be sent to any of the enabled post-processing methods (ROCS). Note that the output hit lists will still be of the size specified by the Hit List Size parameter, which is generally smaller than this number. The maximum allowed value is 100,000,000. The minimum allowed value is 100,000.
Type: integer
Default: 100000
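The allowed ranges for Hit List Size and Number of Molecules to Rescore can be checked before a job is submitted. The sketch below is a hypothetical pre-flight check written for illustration. The limits are taken from the descriptions above, and the function name is an assumption.

```python
HIT_LIST_SIZE_RANGE = (1_000, 100_000)    # allowed range for hit_list_size
RESCORE_RANGE = (100_000, 100_000_000)    # allowed range for number_of_molecules_to_rescore


def validate_rescoring(hit_list_size=10_000, number_of_molecules_to_rescore=100_000):
    """Return a list of problems found; an empty list means the values are in range."""
    problems = []
    low, high = HIT_LIST_SIZE_RANGE
    if not low <= hit_list_size <= high:
        problems.append(f"hit_list_size must be between {low} and {high}")
    low, high = RESCORE_RANGE
    if not low <= number_of_molecules_to_rescore <= high:
        problems.append(
            f"number_of_molecules_to_rescore must be between {low} and {high}")
    return problems


print(validate_rescoring())                   # []
print(validate_rescoring(hit_list_size=500))  # one problem reported
```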
Maximum Conformers (rescore_max_confs): Maximum number of conformations to generate for ROCS rescoring.
Required
Type: integer
Default: 50
Number of ROCS Rescoring Random Starts (number_of_rocs_rescoring_random_starts): If specified, ROCS rescoring will use the specified number of random starting orientations for each conformer being overlaid. If unspecified, the default of 4 inertial starts will be used. Compute time scales roughly linearly with the number of starts.
Type: integer
GPU Hardware
FastROCS Instance Type (fastrocs_instance_type): The instances excluded by default are known to be not cost-effective for FastROCS.
Type: string
Default: !g4dn.metal,!g5.12xlarge,!g5.24xlarge,!g5.48xlarge,!g4dn.12xlarge,!g3s.,!p3.
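To show how the default exclusion string might be read, the sketch below parses it and tests a few instance type names against it. The matching rule is an assumption made for this illustration: each entry with a leading "!" is treated as an exclusion, and an instance is considered excluded when its name starts with the pattern, so that an entry ending in "." such as "p3." covers a whole instance family. The platform's actual matching rules may differ, and the function names are hypothetical.

```python
DEFAULT_FILTER = ("!g4dn.metal,!g5.12xlarge,!g5.24xlarge,!g5.48xlarge,"
                  "!g4dn.12xlarge,!g3s.,!p3.")


def excluded_patterns(filter_string):
    """Return the patterns that carry a leading '!' (assumed to mean 'exclude')."""
    entries = [entry.strip() for entry in filter_string.split(",") if entry.strip()]
    return [entry[1:] for entry in entries if entry.startswith("!")]


def is_excluded(instance_type, filter_string=DEFAULT_FILTER):
    """Assumed rule: excluded if the instance type name starts with a pattern."""
    return any(instance_type.startswith(pattern)
               for pattern in excluded_patterns(filter_string))


for name in ("g4dn.xlarge", "g4dn.12xlarge", "g5.xlarge", "p3.2xlarge"):
    print(name, "excluded" if is_excluded(name) else "allowed")
```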
Spot Instance Policy for FastROCS GPU Instance (spot_instance_policy_for_fastrocs_gpu_instance): To run on SPOT instances, use the default setting of ‘Preferred’. To run on ON-DEMAND instances, set the value to ‘Prohibited’. ON-DEMAND instances typically cost 3–4 times more than SPOT instances, but are more readily available than SPOT instances when overall demand for GPUs on AWS is high.
Type: string
Default: Preferred
Choices: [‘Allowed’, ‘Preferred’, ‘NotPreferred’, ‘Prohibited’, ‘Required’]
Parallel Limit for FastROCS cubes (maxparallel): The maximum number of concurrently running copies of this cube.
Type: integer
Default: 100
Options: Advanced Bandit Arm Search
Color Force Field (color_force_field): The color force field to use.
Type: string
Default: ImplicitMillsDean
Choices: [‘ImplicitMillsDean’, ‘ExplicitMillsDean’, ‘ImplicitMillsDeanNoRings’, ‘ExplicitMillsDeanNoRings’]
Custom Color Force Field File (custom_cff_in): Custom color force field to use for FastROCS/ROCS overlays in the 3D search.
Type: file_in
Primary Connector Color FF Weight (weight1): The weight parameter (usually negative) for the color force field of the primary connector atom(s).
Type: decimal
Default: -1.0
Primary Connector Color FF Radius (radius1): The radius parameter for the color force field of the primary connector atom(s).
Type: decimal
Default: 10.0
Secondary Connector Color FF Weight (weight2): The weight parameter (usually negative) for the color force field of the secondary connector atom(s).
Type: decimal
Default: -1.0
Secondary Connector Color FF Radius (radius2): The radius parameter for the color force field of the secondary connector atom(s).
Type: decimal
Default: 10.0
Connector Distance Threshold (connector_dist_thr): Threshold distance (in Angstroms) for bandit arm clustering. If the distance between connector atoms is greater than the threshold, the fragments will not be clustered together in a bandit arm.
Required
Type: decimal
Default: 3.0
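The two clustering criteria, Clustering Score Threshold and Connector Distance Threshold, can be pictured as a single test applied to a pair of fragments. The sketch below is an illustration only: it assumes that both conditions must hold for two fragments to share a bandit arm and that connector atom positions are available as (x, y, z) coordinates in Angstroms. The function name and argument layout are hypothetical.

```python
import math


def can_cluster(score, connector_a, connector_b,
                cluster_threshold=1.6, connector_dist_thr=3.0):
    """Decide whether two fragments may be clustered into one bandit arm.

    score: ROCS score between the two fragments (assumed input).
    connector_a, connector_b: (x, y, z) positions of the connector atoms.
    """
    distance = math.dist(connector_a, connector_b)
    # Fragments must score higher than the clustering threshold, and their
    # connector atoms must not be farther apart than the distance threshold.
    return score > cluster_threshold and distance <= connector_dist_thr


print(can_cluster(1.7, (0.0, 0.0, 0.0), (0.0, 0.0, 2.5)))  # True
print(can_cluster(1.7, (0.0, 0.0, 0.0), (0.0, 0.0, 4.0)))  # False: connectors too far apart
print(can_cluster(1.5, (0.0, 0.0, 0.0), (0.0, 0.0, 2.5)))  # False: score too low
```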
ROCS Start Type (start_type): The type of starting orientations for ROCS calculations.
Type: string
Default: Inertial
Choices: [‘Inertial’, ‘Rocs’, ‘AsIs’, ‘AtAtom’, ‘Random’]
Number of Random Starts (num_rand_starts): If specified, FastROCS will use this number of random starts for each conformer overlay. If unspecified, the default inertial starts will be used.
Type: integer
FastROCS Records Per Shard (fastrocs_records_per_shard): The target number of records in a FastROCS shard. The recommended default is 800,000.
Type: integer
Default: 800000
Output Fields
Overlay Molecule Field (overlay_molecule_field): Field on the output records that will hold the structure of the molecule overlaid by ROCS or FastROCS.
Type: field_parameter::mol
Default: Overlay Molecule
Tanimoto Combo Field (tanimoto_combo_field): Output field with the Tanimoto Combo. This field will only be created if the score type for FastROCS Similarity Type is Tanimoto Combo. The value in this field is a duplicate of the value in Combo Similarity.
Type: field_parameter::float
Default: Tanimoto Combo
Tanimoto Color Field (tanimoto_color_field): Output field with the Color Tanimoto. This field will only be created if the score type for FastROCS Similarity Type is Tanimoto Combo. The value in this field is a duplicate of the value in Color Similarity.
Type: field_parameter::float
Default: Color Tanimoto
Tanimoto Shape Field (tanimoto_shape_field): Output field with the Shape Tanimoto. This field will only be created if the score type for FastROCS Similarity Type is Tanimoto Combo. The value in this field is a duplicate of the value in Shape Similarity.
Type: field_parameter::float
Default: Shape Tanimoto
Tversky Combo Field (tversky_combo_field): Output field with the Tversky Combo. This field will only be created if the score type for FastROCS Similarity Type is Fit Tversky or Ref Tversky. The value in this field is a duplicate of the value in Combo Similarity.
Type: field_parameter::float
Default: Tversky Combo
Tversky Color Field (tversky_color_field): Output field with the Color Tversky. This field will only be created if the score type for FastROCS Similarity Type is Fit Tversky or Ref Tversky. The value in this field is a duplicate of the value in Color Similarity.
Type: field_parameter::float
Default: Color Tversky
Tversky Shape Field (tversky_shape_field): Output field with the Shape Tversky. This field will only be created if the score type for FastROCS Similarity Type is Fit Tversky or Ref Tversky. The value in this field is a duplicate of the value in Shape Similarity.
Type: field_parameter::float
Default: Shape Tversky
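The conditions under which the Tanimoto and Tversky score fields are created can be summarized in a small lookup. The sketch below follows the field descriptions above and uses the default field names. The function name is hypothetical, and the empty result for the remaining similarity types reflects only what these descriptions say about these six fields.

```python
TANIMOTO_FIELDS = ["Tanimoto Combo", "Color Tanimoto", "Shape Tanimoto"]
TVERSKY_FIELDS = ["Tversky Combo", "Color Tversky", "Shape Tversky"]


def score_fields_created(fastrocs_similarity_type):
    """Return which of the six score fields are created for a similarity type."""
    if fastrocs_similarity_type == "Tanimoto Combo":
        return TANIMOTO_FIELDS
    if fastrocs_similarity_type in ("Fit Tversky", "Ref Tversky"):
        return TVERSKY_FIELDS
    # Per the descriptions, none of these six fields is created otherwise.
    return []


print(score_fields_created("Tanimoto Combo"))
# ['Tanimoto Combo', 'Color Tanimoto', 'Shape Tanimoto']
print(score_fields_created("Ref Tversky"))
# ['Tversky Combo', 'Color Tversky', 'Shape Tversky']
```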
Best Query Field (best_query_field): Output field for the query with the highest similarity to the fit molecule.
Type: field_parameter::mol
Default: Query
Best Query ID Field (best_query_id_field): Output field for the ID of the query with the highest similarity to the molecule. This identifier will also appear in the dataset specified by the Output Query Dataset parameter.
Type: field_parameter::int
Default: Query ID