Classification Based on String Field

This cube classifies records based on a string field specified by the Input Classification Data Field parameter. Two records belong to the same class if their string data are identical. After classification, the records are sent to the following output ports:

  • If the Output Class Members parameter is true, all records are sent to the members port with the Class ID, Class Size, and Class Member ID data fields.

  • If the Output Class Cores parameter is true, one representative from each class is sent to the cores port with the Class ID and Class Size data fields.

  • If the Output Singletons parameter is true, classes with only one member are sent to the singletons port.

If a record is sent to the singletons port, it is sent to neither the members port nor the cores port.
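
The routing rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the cube's implementation: records are modeled as dictionaries, the function name `classify` is hypothetical, class and member IDs are assumed to start at 1, the core is assumed to be the member with the highest score, and singleton records are assumed to be passed through without class fields. The output field names follow the documented defaults (Class ID, Class Size, Class Member). The sketch also shows why records have to be held until the input ends: the size of a class is not known before then.

```python
from collections import defaultdict


def classify(records, input_field, score_field=None,
             output_members=True, output_cores=False,
             output_singletons=False):
    """Group records by exact string match and route them to ports."""
    # Hold every record first: class sizes are unknown until input ends.
    classes = defaultdict(list)
    for record in records:
        classes[record[input_field]].append(record)

    ports = {"members": [], "cores": [], "singletons": []}
    # ASSUMPTION: class IDs are consecutive integers starting at 1.
    for class_id, members in enumerate(classes.values(), start=1):
        size = len(members)
        if size == 1 and output_singletons:
            # Singletons go only to their own port.
            # ASSUMPTION: they are emitted without class fields.
            ports["singletons"].append(dict(members[0]))
            continue
        if output_members:
            # ASSUMPTION: member IDs count from 1 within each class.
            for member_id, record in enumerate(members, start=1):
                ports["members"].append({**record,
                                         "Class ID": class_id,
                                         "Class Size": size,
                                         "Class Member": member_id})
        if output_cores:
            # ASSUMPTION: the highest score marks the core; without a
            # score field the first record seen stands in for the class.
            if score_field is not None:
                core = max(members, key=lambda r: r[score_field])
            else:
                core = members[0]
            ports["cores"].append({**core,
                                   "Class ID": class_id,
                                   "Class Size": size})
    return ports


# Hypothetical input: three records, two of which share a string value.
records = [
    {"title": "a", "scaffold": "c1ccccc1", "score": 0.2},
    {"title": "b", "scaffold": "c1ccccc1", "score": 0.9},
    {"title": "c", "scaffold": "C1CCCCC1", "score": 0.5},
]
ports = classify(records, "scaffold", score_field="score",
                 output_cores=True, output_singletons=True)
for name, items in ports.items():
    print(name, [r["title"] for r in items])
```

With Output Singletons switched off in this sketch, record "c" would instead appear on both the members and cores ports as a class of size one, matching the description of that parameter.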

Note

This cube must cache records, so it is not suitable for splitting large datasets.

Calculation Parameters

  • CPUs (cpu_count) type: integer: The number of CPUs to run this cube with
    Default: 1, Min: 1, Max: 128
  • Cube Metrics (cube_metrics) type: string: Set of metrics to be collected

    Choices: cpu, disk, memory, network
  • Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0, Min: 128.0, Max: 8589934592
  • GPUs (gpu_count) type: integer: The number of GPUs to run this cube with
    Default: 0, Max: 16
  • Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)
    Default: “”
  • Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
  • Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated
    Default: 600, Min: 300
  • Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800, Min: 256.0, Max: 8589934592
  • Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds
    Default: 60
    Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300, Min: 1, Max: 300
  • Output Class Cores (output_cores) type: boolean: If on, then one representative from each class will be sent to the ‘cores’ output dataset.
    Default: False
  • Output Class Members (output_members) type: boolean: If on, then each record with class id will be sent to the ‘members’ dataset.
    Default: True
  • Output Singletons (output_singletons) type: boolean: If on, then singletons will be sent only to the ‘singletons’ output dataset. Otherwise they will be emitted to both the ‘members’ and ‘cores’ output datasets with the other records.
    Default: False
  • Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU
    Default: 32
  • Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address
    Default: 64
  • Spot policy (spot_policy) type: string: Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
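
Overrides for the parameters above can be sanity-checked against the documented bounds before a job is submitted. The following is a minimal sketch in plain Python: the `LIMITS` table copies a few Min, Max, and Choices values from the list above, while `check_overrides` and the example override dictionary are hypothetical helpers for illustration and are not part of any Orion or floe API.

```python
# Bounds copied from the parameter list above as (min, max, choices).
# None means the documentation states no such bound.
LIMITS = {
    "cpu_count": (1, 128, None),
    "gpu_count": (None, 16, None),
    "memory_mb": (256.0, 8589934592, None),
    "disk_space": (128.0, 8589934592, None),
    "max_backlog_wait": (300, None, None),
    "metric_period": (1, 300, (1, 5, 10, 30, 60, 120, 180, 240, 300)),
    "spot_policy": (None, None, ("Allowed", "Preferred", "NotPreferred",
                                 "Prohibited", "Required")),
}


def check_overrides(overrides):
    """Return a list of problems; the list is empty if all values fit."""
    problems = []
    for name, value in overrides.items():
        if name not in LIMITS:
            continue  # no documented bounds to check against
        low, high, choices = LIMITS[name]
        if choices is not None and value not in choices:
            problems.append(f"{name}={value!r} is not one of {choices}")
        elif low is not None and value < low:
            problems.append(f"{name}={value!r} is below the minimum {low}")
        elif high is not None and value > high:
            problems.append(f"{name}={value!r} is above the maximum {high}")
    return problems


# Hypothetical overrides: two of the four values violate the bounds.
overrides = {
    "cpu_count": 4,
    "memory_mb": 128,
    "spot_policy": "Allowed",
    "metric_period": 45,
}
problems = check_overrides(overrides)
for problem in problems:
    print(problem)
```

Keeping the bounds in one table makes it easy to extend the check to other parameters as needed.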

Field parameters

  • Class ID Field (class_id_field) type: Field Type: Int: The name for the field that will contain the unique class ID.
    Default: Class ID
  • Class Member ID Field (class_member_field) type: Field Type: Int: The name for the field that will contain the unique numeric ID of the molecule within its class.
    Default: Class Member
  • Class Size Field (class_size_field) type: Field Type: Int: The name for the field that will contain the size of the class the molecule belongs to.
    Default: Class Size
  • Input Classification Data Field (input_field) type: Field Type: String: The name for the input string data field.
  • Input Classification Score Field (input_score_field) type: Field Type: Float: The name for the input data field that is used as a score to identify the core.

Hardware Parameters

Machine hardware requirements
  • Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800, Min: 256.0, Max: 8589934592
  • Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address
    Default: 64
  • Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU
    Default: 32
  • Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated
    Default: 600, Min: 300
  • Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0, Min: 128.0, Max: 8589934592
  • GPUs (gpu_count) type: integer: The number of GPUs to run this cube with
    Default: 0, Max: 16
  • CPUs (cpu_count) type: integer: The number of CPUs to run this cube with
    Default: 1, Min: 1, Max: 128
  • Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
  • Spot policy (spot_policy) type: string: Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
  • Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)
    Default: “”

Metrics Parameters

Cube Metric Parameters
  • Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds
    Default: 60
    Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300, Min: 1, Max: 300
  • Cube Metrics (cube_metrics) type: string: Set of metrics to be collected

    Choices: cpu, disk, memory, network