K-Medoids Clustering

Related Cubes

Parallel K-Medoids Clustering – parallel version of the cube

Calculation Parameters

CPUs (cpu_count) type: integer: The number of CPUs to run this cube with

Default: 1, Min: 1, Max: 128

Cube Metrics (cube_metrics) type: string: Set of metrics to be collected

Choices: cpu, disk, memory, network

Use Diagnostics (debug_mode) type: boolean:

Default: False

Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

Default: 5120.0, Min: 128.0, Max: 8589934592

GPUs (gpu_count) type: integer: The number of GPUs to run this cube with

Default: 0, Max: 16

Medoid Initialization Method (init_kmedoids) type: string: From the scikit-learn-extra docs: Specify the medoid initialization method. ‘random’ selects n_clusters elements from the dataset. ‘heuristic’ picks the n_clusters points with the smallest sum distance to every other point. ‘k-medoids++’ follows an approach based on k-means++ and, in general, gives initial medoids that are more separated than those generated by the other methods. ‘build’ is a greedy initialization of the medoids used in the original PAM algorithm. ‘build’ is often more efficient but slower than the other initializations on big datasets. It is also very non-robust: if there are outliers in the dataset, use another initialization.

Default: heuristic

Choices: random, heuristic, k-medoids++, build
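The ‘heuristic’ and ‘random’ descriptions above can be sketched in a few lines of plain Python. This is a minimal illustration working on a precomputed distance matrix, not the cube's or scikit-learn-extra's implementation; the function names heuristic_init and random_init and the toy matrix DIST are hypothetical.

```python
import random

def heuristic_init(dist, n_clusters):
    """Return the n_clusters indices with the smallest summed distance to all points."""
    totals = [sum(row) for row in dist]
    ranked = sorted(range(len(dist)), key=lambda i: totals[i])
    return sorted(ranked[:n_clusters])

def random_init(dist, n_clusters, random_state=None):
    """Return n_clusters distinct indices drawn at random (seeded for repeatability)."""
    rng = random.Random(random_state)
    return sorted(rng.sample(range(len(dist)), n_clusters))

# Toy symmetric distance matrix (assumed data): points 0-2 form one tight
# group and points 3-4 form another.
DIST = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.8, 0.9],
    [0.2, 0.1, 0.0, 0.7, 0.8],
    [0.9, 0.8, 0.7, 0.0, 0.1],
    [0.8, 0.9, 0.8, 0.1, 0.0],
]

print(heuristic_init(DIST, 2))                # [1, 2]
print(random_init(DIST, 2, random_state=0))   # same seed gives the same pick
```

On this toy matrix the heuristic picks points 1 and 2, which both sit in the same group. That is the kind of poorly separated start that ‘k-medoids++’ is described as avoiding.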

Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)

Default: “”

Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on

None (matrix_input_file) type: file_in:

Maximum K-Medoids Iterations (max_iter) type: integer:

Default: 100000

Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

Default: 1800, Min: 256.0, Max: 8589934592
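As a rough guide for sizing memory_mb, the arithmetic below estimates the footprint of a dense square matrix. It assumes the cube holds the full n × n matrix in memory as 8-byte floats, which this page does not confirm; the helper names and the 200 MiB overhead figure (one reading of "a couple hundred MiB") are illustrative only.

```python
MIB = 1048576  # bytes per MiB, as stated in the parameter description

def dense_matrix_mib(n_items, bytes_per_value=8):
    """Size in MiB of a dense n_items x n_items matrix (assumed 8-byte floats)."""
    return n_items * n_items * bytes_per_value / MIB

def suggested_memory_mb(n_items, overhead_mib=200, bytes_per_value=8):
    """Matrix size plus an assumed fixed overhead allowance."""
    return dense_matrix_mib(n_items, bytes_per_value) + overhead_mib

print(dense_matrix_mib(1024))      # 8.0
print(suggested_memory_mb(1024))   # 208.0
```

Because the matrix grows with the square of the number of items, doubling the item count quadruples the estimate.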

Algorithm (method) type: string: The algorithm to use. ‘alternate’ is faster; ‘pam’ is more accurate.

Default: pam

Choices: alternate, pam
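To give a feel for the ‘alternate’ choice, here is a minimal sketch of the general alternating scheme: assign each point to its nearest medoid, then re-pick each cluster's medoid, and repeat until nothing changes. It is an illustration under stated assumptions, not the cube's implementation; the function name, the toy matrix, and the starting medoids are hypothetical. PAM, by contrast, evaluates swaps between medoids and non-medoids, which costs more per iteration.

```python
def alternate_kmedoids(dist, medoids, max_iter=100):
    """Alternate between assigning points and re-picking each cluster's medoid."""
    medoids = list(medoids)
    n = len(dist)

    def assign(current):
        # Each point joins the cluster of its nearest medoid.
        return [min(range(len(current)), key=lambda k: dist[i][current[k]])
                for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for k, old in enumerate(medoids):
            members = [i for i in range(n) if labels[i] == k]
            if not members:
                # Keep the old medoid if its cluster ended up empty.
                new_medoids.append(old)
                continue
            # New medoid: the member with the smallest summed distance to the others.
            new_medoids.append(min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids
    return medoids, assign(medoids)

# Toy symmetric distance matrix (assumed data) with two obvious groups.
DIST = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.8, 0.9],
    [0.2, 0.1, 0.0, 0.7, 0.8],
    [0.9, 0.8, 0.7, 0.0, 0.1],
    [0.8, 0.9, 0.8, 0.1, 0.0],
]

medoids, labels = alternate_kmedoids(DIST, [1, 2])
print(medoids, labels)  # [1, 3] [0, 0, 0, 1, 1]
```

Starting from two medoids in the same group, the loop moves one of them to the other group within a few iterations. The max_iter argument plays the same role as the cube's Maximum K-Medoids Iterations parameter, capping the number of passes.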

Distance Metric (metric) type: string: What distance metric to use.

Default: precomputed
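In the scikit-learn convention, a ‘precomputed’ metric means the input is already a square matrix of pairwise distances rather than feature vectors. The sketch below shows basic sanity checks for such a matrix and one common way to derive distances from similarity scores in [0, 1]. Whether the cube performs either step is not stated on this page; the helper names and the 1 - similarity conversion are assumptions.

```python
def is_valid_distance_matrix(dist, tol=1e-9):
    """Check that dist is square, symmetric, non-negative, with a zero diagonal."""
    n = len(dist)
    if any(len(row) != n for row in dist):
        return False
    for i in range(n):
        if abs(dist[i][i]) > tol:
            return False
        for j in range(n):
            if dist[i][j] < 0 or abs(dist[i][j] - dist[j][i]) > tol:
                return False
    return True

def similarity_to_distance(sim):
    """Convert similarities in [0, 1] to distances via 1 - s (assumed convention)."""
    return [[1.0 - s for s in row] for row in sim]

SIM = [
    [1.0, 0.25],
    [0.25, 1.0],
]
dist = similarity_to_distance(SIM)
print(dist)                            # [[0.0, 0.75], [0.75, 0.0]]
print(is_valid_distance_matrix(dist))  # True
```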

Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds

Default: 60

Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300, Min: 1, Max: 300

Number of K-Medoids Clusters (n_clusters) type: integer:

Output Similarity Matrix (output_similarity_matrix) type: boolean:

Default: False

Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU

Default: 32

Random State (random_state) type: integer: Specify the random state for the random number generator. Used to initialize medoids when the initialization method is ‘random’.

None (row_label_input_file) type: file_in:

Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address

Default: 64

Similarity Matrix Filename (similarity_matrix_filename) type: string:

Default: clustering_similarity_matrix.txt

Spot policy (spot_policy) type: string: Control cube placement on spot market instances

Default: Prohibited

Choices: Allowed, Preferred, NotPreferred, Prohibited, Required

None (use_matrix_input_file) type: boolean:

Default: False

Field Parameters

Cluster ID Field (cluster_id_field) type: Field Type: String: The name for the field that will contain the unique cluster ID.

Default: Cluster ID

Cluster Method Field (cluster_method_field) type: Field Type: String: Field name for passing the clustering method to the Floe report.

Default: Cluster Method

Cluster Parameters Field (cluster_parameters_field) type: Field Type: String: Field name for passing the cluster parameters to the Floe report.

Default: Parameters

None (coord_list_field) type: Field Type: IntVec:

Default: coord_list_field

Extended Log Field (ext_log_field) type: Field Type: StringVec: Message extended log field

Default: Extended Log Field

None (is_core) type: Field Type: Bool:

Default: is_core

Log Field (log_field) type: Field Type: String: The field that stores messages for the Floe report

Default: Log Field

None (matrix_size_field) type: Field Type: Int:

Default: matrix_size

None (mol_field) type: Field Type: Chem.Mol:

Similarity Score Field (score_field) type: Field Type: Float: Name for the field that stores fingerprint similarity scores.

Default: similarity_score

None (score_list_field) type: Field Type: FloatVec:

Default: score_list_field

None (x_field) type: Field Type: Int:

Default: x

None (y_field) type: Field Type: Int:

Default: y

Hardware Parameters

Machine hardware requirements

Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

Default: 1800, Min: 256.0, Max: 8589934592

Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address

Default: 64

Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU

Default: 32

Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

Default: 5120.0, Min: 128.0, Max: 8589934592

GPUs (gpu_count) type: integer: The number of GPUs to run this cube with

Default: 0, Max: 16

CPUs (cpu_count) type: integer: The number of CPUs to run this cube with

Default: 1, Min: 1, Max: 128

Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on

Spot policy (spot_policy) type: string: Control cube placement on spot market instances

Default: Prohibited

Choices: Allowed, Preferred, NotPreferred, Prohibited, Required

Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)

Default: “”

Metrics Parameters

Cube Metric Parameters

Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds

Default: 60

Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300, Min: 1, Max: 300

Cube Metrics (cube_metrics) type: string: Set of metrics to be collected

Choices: cpu, disk, memory, network

Parallel K-Medoids Clustering

The parallel version adds these extra parameters.

Number of messages to distribute at a time (item_count) type: integer: The maximum number of messages to bundle together for a parallel cube.

Default: 1, Min: 1, Max: 65535

Maximum Failures (max_failures) type: integer: The maximum number of times to attempt processing a work item

Default: 10, Min: 1, Max: 100

Autoscale this Cube (autoscale) type: boolean: If True, let Orion manage the parallelism of this Cube

Default: True

Maximum number of Cubes (max_parallel) type: integer: The maximum number of concurrently running copies of this Cube

Default: 1000, Min: 1

Minimum number of Cubes (min_parallel) type: integer: The minimum number of concurrently running copies of this Cube

Default: 0