Molecule Similarity Calculation

This cube calculates fingerprint similarity scores between input molecules and one or more query (reference) molecules.

The input molecules are read from the intake port, from the field specified by the Input Molecules Field parameter.

The query molecules are read from the records on the init (initialization) port, from the field specified by the Query Field parameter.

The type of the generated fingerprint is determined by the Fingerprint Type parameter. The similarity measure that is used to calculate the score is determined by the Similarity Measure parameter.
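
As a rough illustration of what a similarity measure computes, the sketch below treats a fingerprint as a set of "on" bit positions and evaluates three of the listed measures using their standard set-based definitions. It is not the cube's implementation: the bit positions are made up, the function names are hypothetical, and Euclid, Manhattan, and Tversky are omitted.

```python
import math

def tanimoto(a: frozenset, b: frozenset) -> float:
    """Shared bits divided by bits set in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def dice(a: frozenset, b: frozenset) -> float:
    """Twice the shared bits divided by the sum of both bit counts."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def cosine(a: frozenset, b: frozenset) -> float:
    """Shared bits divided by the geometric mean of both bit counts."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0

# Hypothetical "on" bit positions; real fingerprints come from the
# selected Fingerprint Type, not from hand-written sets.
query_bits = frozenset({1, 4, 7, 9})
input_bits = frozenset({1, 4, 8})

scores = {
    "Tanimoto": tanimoto(query_bits, input_bits),
    "Dice": dice(query_bits, input_bits),
    "Cosine": cosine(query_bits, input_bits),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The same pair of fingerprints yields different scores under different measures, so scores obtained with one Similarity Measure setting are not directly comparable to scores obtained with another.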

The calculated score is stored in the field specified by Similarity Score Field, and the record is sent to the success port.

Note

This cube generates fingerprints on the fly to calculate the similarity score; only the score is stored on the output record.

Calculation Parameters

  • CPUs (cpu_count) type: integer: The number of CPUs to run this cube with
    Default: 1, Min: 1, Max: 128
  • Cube Metrics (cube_metrics) type: string: Set of metrics to be collected
    Choices: cpu, disk, memory, network
  • Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0, Min: 128.0, Max: 8589934592
  • Fingerprint Type (fingerprint_type) type: string: The fingerprint type generated for similarity calculation.
    Default: Tree
    Choices: Circular, Lingo, MACCS, Path, Tree
  • GPUs (gpu_count) type: integer: The number of GPUs to run this cube with
    Default: 0, Max: 16
  • Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)
    Default: “”
  • Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
  • Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated
    Default: 600, Min: 300
  • Max Rotors (max_rotors) type: integer: Cutoff for the number of rotatable bonds. The cube skips molecules with more rotors than the cutoff.
    Default: 40, Min: 1, Max: 9999
  • Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800, Min: 256.0, Max: 8589934592
  • Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds
    Default: 60
    Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300, Min: 1, Max: 300
  • Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU
    Default: 32
  • Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address
    Default: 64
  • None (similarity_score_cutoff) type: decimal:
    Default: 0.0
  • Similarity Measure (similarity_type) type: string: The similarity measure used for the 2D similarity calculation.
    Default: Tanimoto
    Choices: Cosine, Dice, Euclid, Manhattan, Tanimoto, Tversky
  • Spot policy (spot_policy) type: string: Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
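
The choices and ranges listed above can be checked before a job is submitted. The following is a minimal sketch of such a check using the parameter keys from this page; the `check_overrides` helper and the override dictionary are hypothetical, and how overrides are actually passed to the cube is not covered here.

```python
# Allowed values copied from the parameter list above.
CHOICES = {
    "fingerprint_type": {"Circular", "Lingo", "MACCS", "Path", "Tree"},
    "similarity_type": {"Cosine", "Dice", "Euclid", "Manhattan",
                        "Tanimoto", "Tversky"},
    "spot_policy": {"Allowed", "Preferred", "NotPreferred",
                    "Prohibited", "Required"},
}

# Inclusive (min, max) bounds copied from the parameter list above.
RANGES = {
    "cpu_count": (1, 128),
    "max_rotors": (1, 9999),
    "metric_period": (1, 300),
}

def check_overrides(overrides: dict) -> list:
    """Return a list of human-readable problems; empty if all values pass."""
    problems = []
    for key, value in overrides.items():
        if key in CHOICES and value not in CHOICES[key]:
            problems.append(
                f"{key}: {value!r} is not one of {sorted(CHOICES[key])}"
            )
        if key in RANGES:
            low, high = RANGES[key]
            if not low <= value <= high:
                problems.append(f"{key}: {value} is outside [{low}, {high}]")
    return problems

# Hypothetical overrides for a run using circular fingerprints.
overrides = {
    "fingerprint_type": "Circular",
    "similarity_type": "Tanimoto",
    "max_rotors": 25,
}
problems = check_overrides(overrides)

# An invalid fingerprint name and an out-of-range CPU count are both reported.
bad = check_overrides({"fingerprint_type": "ECFP4", "cpu_count": 0})
```

Keys that appear in neither table, such as instance_tags, are passed through unchecked in this sketch.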

Field parameters

  • Extended Log Field (ext_log_field) type: Field Type: StringVec: Message extended log field
    Default: Extended Log Field
  • Input Molecules Field (in_mol_field) type: Field Type: Chem.Mol: The name of the field on the input records that stores the molecules to be compared to the query molecule. If left blank, the primary molecule field will be used.
  • Query Field (init_mol_field) type: Field Type: Chem.Mol: The name of the field on the initialization record that stores the query molecule. If left blank, the primary molecule field will be used.
  • None (is_query_field) type: Field Type: Bool:
  • Log Field (log_field) type: Field Type: String: The field that stores messages for the floe report
    Default: Log Field
  • None (out_mol_field) type: Field Type: Chem.Mol:
  • Query Molecule Title Field (query_molecule_title_field) type: Field Type: String: The title of the query molecule used to obtain the score.
  • Similarity Score Field (score_field) type: Field Type: Float: Name for the field that stores fingerprint similarity scores.

2D Similarity Parameters

The parameters of the 2D fingerprint similarity calculation.
  • Fingerprint Type (None) type: string: The fingerprint type generated for similarity calculation.
    Default: Tree
    Choices: Circular, Lingo, MACCS, Path, Tree
  • Similarity Measure (None) type: string: The similarity measure used for the 2D similarity calculation.
    Default: Tanimoto
    Choices: Cosine, Dice, Euclid, Manhattan, Tanimoto, Tversky

Hardware Parameters

Machine hardware requirements
  • Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800, Min: 256.0, Max: 8589934592
  • Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address
    Default: 64
  • Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU
    Default: 32
  • Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated
    Default: 600, Min: 300
  • Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0, Min: 128.0, Max: 8589934592
  • GPUs (gpu_count) type: integer: The number of GPUs to run this cube with
    Default: 0, Max: 16
  • CPUs (cpu_count) type: integer: The number of CPUs to run this cube with
    Default: 1, Min: 1, Max: 128
  • Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
  • Spot policy (spot_policy) type: string: Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
  • Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)
    Default: “”

Metrics Parameters

Cube Metric Parameters
  • Metric Period (None) type: decimal: How often to sample metrics, in seconds
    Default: 60
    Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300, Min: 1, Max: 300
  • Cube Metrics (None) type: string: Set of metrics to be collected
    Choices: cpu, disk, memory, network

Parallel Molecule Similarity Calculation

The parallel version adds these extra parameters.

  • Number of messages to distribute at a time (item_count) type: integer: The maximum number of messages to bundle together for a parallel cube.
    Default: 1, Min: 1, Max: 65535
  • Maximum Failures (max_failures) type: integer: The maximum number of times to attempt processing a work item
    Default: 10, Min: 1, Max: 100
  • Autoscale this Cube (autoscale) type: boolean: If True, let Orion manage the parallelism of this Cube
    Default: True
  • Maximum number of Cubes (max_parallel) type: integer: The maximum number of concurrently running copies of this Cube
    Default: 1000, Min: 1
  • Minimum number of Cubes (min_parallel) type: integer: The minimum number of concurrently running copies of this Cube
    Default: 0
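
To make the effect of item_count concrete, the sketch below groups a stream of messages into bundles of at most item_count items. It only illustrates the bundling idea; the `bundle` function is hypothetical, and the way Orion actually distributes work to parallel cubes is not described on this page.

```python
from itertools import islice

def bundle(messages, item_count=1):
    """Yield lists of at most item_count consecutive messages."""
    if not 1 <= item_count <= 65535:
        raise ValueError("item_count must be between 1 and 65535")
    iterator = iter(messages)
    while True:
        chunk = list(islice(iterator, item_count))
        if not chunk:
            return
        yield chunk

# Seven hypothetical messages bundled three at a time; the last bundle
# is smaller because the stream runs out.
bundles = list(bundle(range(7), item_count=3))
print(bundles)
```

With the default item_count of 1, every message forms its own bundle; larger values trade scheduling overhead against coarser work items.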