Classification Based on String Field

This cube classifies records based on a string field specified by the Input Classification Data Field parameter. Two records belong to the same class if their string data are identical. After classification, the records are sent to the cores, members, and singletons output ports, as controlled by the Output Class Cores, Output Class Members, and Output Singletons parameters.

If a record is sent to the singletons port, it is sent to neither the members nor the cores port.
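
The grouping rule described above can be illustrated with a minimal Python sketch. It is not the cube's implementation: records are modeled as plain dictionaries, and the function name group_by_string and the field name "data" are hypothetical stand-ins.

```python
from collections import defaultdict


def group_by_string(records, data_field="data"):
    """Group records whose string field values are identical.

    `records` is an iterable of dicts; `data_field` is a hypothetical
    stand-in for the Input Classification Data Field.
    """
    classes = defaultdict(list)
    for record in records:
        classes[record[data_field]].append(record)
    return dict(classes)


records = [
    {"name": "a", "data": "ring"},
    {"name": "b", "data": "chain"},
    {"name": "c", "data": "ring"},
]
classes = group_by_string(records)

# A singleton is a class that contains exactly one record.
singletons = [members[0] for members in classes.values() if len(members) == 1]
```

In a sketch like this, the size of a class is only known once every record has been read, so all records have to be held until the input is exhausted. That is consistent with the caching requirement in the note that follows.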

Note

This cube has to cache records; it is therefore not suitable for splitting large datasets.

Main Parameters

Parameter Name                     Associated Port    Port Type
Class ID Field
Class Member ID Field
Class Size Field
Input Classification Data Field
Output Class Cores
Output Class Members
Output Singletons

Calculation Parameters

  • CPUs (integer) : The number of CPUs to run this cube with
    Default: 1 Min: 1 Max: 128
  • Cube Metrics (string) : Set of metrics to be collected
    Choices: cpu, disk, memory, network
  • Temporary Disk Space (MiB) (decimal) : The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0 Min: 128.0 Max: 8589934592
  • GPUs (integer) : The number of GPUs to run this cube with
    Default: 0 Max: 16
  • Instance Tags (string) : Only run on machines with matching tags (comma separated)
    Default: “”
  • Instance Type (string) : The type of instance that this cube needs to be run on
  • Memory (MiB) (decimal) : The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800 Min: 256.0 Max: 8589934592
  • Metric Period (decimal) : How often to sample metrics, in seconds
    Default: 60 Min: 1 Max: 300
  • Output Class Cores (boolean) : If on, then one representative from each class will be sent to the ‘cores’ output dataset.
    Default: False
  • Output Class Members (boolean) : If on, then each record, annotated with its class ID, will be sent to the ‘members’ dataset.
    Default: True
  • Output Singletons (boolean) : If on, then singletons will be sent only to the ‘singletons’ output dataset. Otherwise they will be emitted to both the ‘members’ and ‘cores’ output datasets with the other records.
    Default: False
  • Spot policy (string) : Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
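
The interaction of the three Output switches can be sketched as follows. This is one reading of the parameter descriptions above, not the cube's code; the function name route and the use of the first record of each class as its representative are assumptions made for illustration.

```python
def route(classes, output_cores=False, output_members=True,
          output_singletons=False):
    """Distribute classified records over the three output datasets.

    `classes` maps a class key to the list of records in that class.
    The keyword defaults mirror the documented parameter defaults.
    """
    ports = {"cores": [], "members": [], "singletons": []}
    for members in classes.values():
        if output_singletons and len(members) == 1:
            # Singletons go only to their own dataset.
            ports["singletons"].extend(members)
            continue
        if output_cores:
            # Assumption: the first record stands in for the class core.
            ports["cores"].append(members[0])
        if output_members:
            ports["members"].extend(members)
    return ports


example = {"ring": ["a", "c"], "chain": ["b"]}
split = route(example, output_cores=True, output_singletons=True)
merged = route(example, output_cores=True, output_singletons=False)
```

With Output Singletons on, the single-record class is kept out of the other two datasets; with it off, that record is treated like any other class and appears in both.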

Field Parameters

  • Class ID Field (Field Type: Int) : The name for the field that will contain the unique class ID.
    Default: Class ID
  • Class Member ID Field (Field Type: Int) : The name for the field that will contain the unique numeric ID of the molecule within its class.
    Default: Class Member
  • Class Size Field (Field Type: Int) : The name for the field that will contain the size of the class the molecule belongs to.
    Default: Class Size
  • Input Classification Data Field (Field Type: String) : The name for the input string data field.
  • Input Classification Score Field (Field Type: Float) : The name for the input data field that is used as a score to identify the core.
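
How the output fields relate to each other can be shown with a small sketch. It is illustrative only: the function name annotate and the field name "score" are hypothetical, IDs are assumed to start at 0, and the record with the highest score is assumed to be the core, which this page does not state.

```python
def annotate(classes, score_field="score", id_field="Class ID",
             member_field="Class Member", size_field="Class Size"):
    """Add class ID, member ID and class size to every record.

    Returns (annotated_records, cores). The default output field names
    mirror the documented defaults.
    """
    annotated, cores = [], []
    for class_id, members in enumerate(classes.values()):
        # Assumption: a higher score is better, so the top-ranked record is the core.
        ranked = sorted(members, key=lambda r: r[score_field], reverse=True)
        class_records = []
        for member_id, record in enumerate(ranked):
            out = dict(record)  # copy so the input record is left unchanged
            out[id_field] = class_id
            out[member_field] = member_id
            out[size_field] = len(ranked)
            class_records.append(out)
        annotated.extend(class_records)
        cores.append(class_records[0])
    return annotated, cores


scored = {
    "ring": [{"name": "a", "score": 0.2}, {"name": "c", "score": 0.9}],
    "chain": [{"name": "b", "score": 0.5}],
}
annotated, cores = annotate(scored)
```

Every record in a class shares the same Class ID and Class Size, while Class Member distinguishes records within the class.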

Hardware Parameters

Machine hardware requirements

  • Memory (MiB) (decimal) : The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800 Min: 256.0 Max: 8589934592
  • Temporary Disk Space (MiB) (decimal) : The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0 Min: 128.0 Max: 8589934592
  • GPUs (integer) : The number of GPUs to run this cube with
    Default: 0 Max: 16
  • CPUs (integer) : The number of CPUs to run this cube with
    Default: 1 Min: 1 Max: 128
  • Instance Type (string) : The type of instance that this cube needs to be run on
  • Spot policy (string) : Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
  • Instance Tags (string) : Only run on machines with matching tags (comma separated)
    Default: “”

Metrics Parameters

Cube Metric Parameters

  • Metric Period (decimal) : How often to sample metrics, in seconds
    Default: 60 Min: 1 Max: 300
  • Cube Metrics (string) : Set of metrics to be collected
    Choices: cpu, disk, memory, network

Tip

filename: snowball/logic/classify.py