Classification Based on String Field
This cube classifies records based on a string field specified by the Input Classification Data Field parameter. Two records belong to the same class if their string data are identical. After classification, the records are sent to the following output ports:
- If the Output Class Members parameter is true, then all records will be sent to the members port with the following data fields: Class ID, Class Size, and Class Member ID.
- If the Output Class Cores parameter is true, then one representative from each class will be sent to the cores port with the Class ID and Class Size data fields.
- If the Output Singletons parameter is true, then classes with only one member will be sent to the singletons port.
If a record is sent to the singletons port, it will be sent to neither the members port nor the cores port.
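The routing rules above can be sketched in plain Python. This is a minimal illustration, not the cube's implementation: records are modeled as dictionaries and ports as lists, and the function name `classify_by_string` is hypothetical. It also assumes that IDs are numbered from 1, that the record with the highest score is chosen as the class representative, and that singleton records carry the class fields; the cube documentation does not state any of these.

```python
from collections import defaultdict


def classify_by_string(records, input_field, score_field=None,
                       output_members=True, output_cores=False,
                       output_singletons=False):
    """Group records by exact string equality and route them to ports.

    Illustrative only: dicts stand in for records, lists for ports.
    """
    # Every record is cached first, because the size of a class is only
    # known once the whole input has been seen.
    classes = defaultdict(list)
    for record in records:
        classes[record[input_field]].append(record)

    ports = {"members": [], "cores": [], "singletons": []}
    # Assumption: class and member IDs are numbered from 1 in input order.
    for class_id, members in enumerate(classes.values(), start=1):
        size = len(members)
        labels = {"Class ID": class_id, "Class Size": size}

        if output_singletons and size == 1:
            # A singleton goes only to the singletons port.
            # Assumption: it still carries the class fields.
            ports["singletons"].append({**members[0], **labels})
            continue

        if output_members:
            for member_id, record in enumerate(members, start=1):
                ports["members"].append(
                    {**record, **labels, "Class Member": member_id})

        if output_cores:
            # Assumption: the highest score marks the representative;
            # without a score field the first member is used.
            if score_field is None:
                core = members[0]
            else:
                core = max(members, key=lambda r: r[score_field])
            ports["cores"].append({**core, **labels})

    return ports
```

Calling the function with `output_singletons=False` sends one-member classes to the members and cores lists along with all other classes, which mirrors the behavior described for the Output Singletons parameter.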
Note
This cube has to cache records; therefore, it is not suitable for splitting large datasets.
Calculation Parameters
- CPUs (cpu_count) type: integer: The number of CPUs to run this cube with. Default: 1, Min: 1, Max: 128
- Cube Metrics (cube_metrics) type: string: Set of metrics to be collected. Choices: cpu, disk, memory, network
- Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 5120.0, Min: 128.0, Max: 8589934592
- GPUs (gpu_count) type: integer: The number of GPUs to run this cube with. Default: 0, Max: 16
- Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated). Default: “”
- Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on.
- Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated. Default: 600, Min: 300
- Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 1800, Min: 256.0, Max: 8589934592
- Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds. Default: 60. Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300. Min: 1, Max: 300
- Output Class Cores (output_cores) type: boolean: If on, then one representative from each class will be sent to the ‘cores’ output dataset. Default: False
- Output Class Members (output_members) type: boolean: If on, then each record with class id will be sent to the ‘members’ dataset. Default: True
- Output Singletons (output_singletons) type: boolean: If on, then singletons will be sent only to the ‘singletons’ output dataset. Otherwise they will be emitted to both the ‘members’ and ‘cores’ output datasets with the other records. Default: False
- Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU. Default: 32
- Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address. Default: 64
- Spot policy (spot_policy) type: string: Control cube placement on spot market instances. Default: Prohibited. Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
Field parameters
- Class ID Field (class_id_field) type: Field Type: Int: The name for the field that will contain the unique class ID. Default: Class ID
- Class Member ID Field (class_member_field) type: Field Type: Int: The name for the field that will contain the unique number ID of the molecule in its class. Default: Class Member
- Class Size Field (class_size_field) type: Field Type: Int: The name for the field that will contain the size of the class the molecule belongs to. Default: Class Size
- Input Classification Data Field (input_field) type: Field Type: String: The name for the input string data field.
- Input Classification Score Field (input_score_field) type: Field Type: Float: The name for the input data field that is used as a score to identify the core.
Hardware Parameters
- Machine hardware requirements
- Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 1800, Min: 256.0, Max: 8589934592
- Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address. Default: 64
- Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU. Default: 32
- Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated. Default: 600, Min: 300
- Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 5120.0, Min: 128.0, Max: 8589934592
- GPUs (gpu_count) type: integer: The number of GPUs to run this cube with. Default: 0, Max: 16
- CPUs (cpu_count) type: integer: The number of CPUs to run this cube with. Default: 1, Min: 1, Max: 128
- Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on.
- Spot policy (spot_policy) type: string: Control cube placement on spot market instances. Default: Prohibited. Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
- Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated). Default: “”
Metrics Parameters
- Cube Metric Parameters
- Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds. Default: 60. Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300. Min: 1, Max: 300
- Cube Metrics (cube_metrics) type: string: Set of metrics to be collected. Choices: cpu, disk, memory, network