Classification Based on String Field¶
This cube classifies records based on a string field specified by the Input Classification Data Field parameter. Two records belong to the same class if their string data are identical. After classification, the records are sent to the following output ports:
- If the Output Class Members parameter is true, all records are sent to the members port with the following data fields: Class ID, Class Size, and Class Member ID.
- If the Output Class Cores parameter is true, one representative from each class is sent to the cores port with the Class ID and Class Size data fields.
- If the Output Singletons parameter is true, classes with only one member are sent to the singletons port.
A record that is sent to the singletons port is sent to neither the members port nor the cores port.
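The routing rules above can be sketched in plain Python. This is only an illustration of the described behavior, not the cube's actual implementation: the function name `classify_by_string`, the use of dictionaries as records, the zero-based numbering of class and member IDs, and the choice of the first record as the class representative are all assumptions made for the example.

```python
from collections import defaultdict


def classify_by_string(records, data_field, output_members=True,
                       output_cores=False, output_singletons=False):
    """Group records by the exact value of data_field and route them to ports."""
    # Every record is cached first: a class size is only known once
    # the whole input has been read.
    classes = defaultdict(list)
    for record in records:
        classes[record[data_field]].append(record)

    ports = {"members": [], "cores": [], "singletons": []}
    # Assumption: class and member IDs are numbered from 0 in input order.
    for class_id, members in enumerate(classes.values()):
        size = len(members)
        annotated = [
            {**rec, "Class ID": class_id, "Class Size": size,
             "Class Member": member_id}
            for member_id, rec in enumerate(members)
        ]
        if size == 1 and output_singletons:
            # Singletons go only to their own port.
            ports["singletons"].extend(annotated)
            continue
        if output_members:
            ports["members"].extend(annotated)
        if output_cores:
            # Assumption: the first record stands in as the representative.
            core = {k: v for k, v in annotated[0].items()
                    if k != "Class Member"}
            ports["cores"].append(core)
    return ports


# Hypothetical input: "scaffold" plays the role of the string data field.
records = [
    {"name": "a", "scaffold": "X"},
    {"name": "b", "scaffold": "Y"},
    {"name": "c", "scaffold": "X"},
]
ports = classify_by_string(records, "scaffold", output_cores=True,
                           output_singletons=True)
```

With these hypothetical inputs, records `a` and `c` form one class of size 2 and land on the members port, while `b` is a singleton and appears only on the singletons port.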
Note
This cube has to cache records; it is therefore not suitable for splitting large datasets.
Main Parameters¶
Parameter Name | Associated Port | Port Type |
---|---|---|
Class ID Field | ||
Class Member ID Field | ||
Class Size Field | ||
Input Classification Data Field | ||
Output Class Cores | ||
Output Class Members | ||
Output Singletons | ||
Parameter Details¶
Calculation Parameters¶
CPUs (integer) : The number of CPUs to run this cube with. Default: 1; Min: 1; Max: 128
Cube Metrics (string) : Set of metrics to be collected. Choices: cpu, disk, memory, network
Temporary Disk Space (MiB) (decimal) : The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 5120.0; Min: 128.0; Max: 8589934592
GPUs (integer) : The number of GPUs to run this cube with. Default: 0; Max: 16
Instance Tags (string) : Only run on machines with matching tags (comma separated). Default: “”
Instance Type (string) : The type of instance that this cube needs to be run on
Memory (MiB) (decimal) : The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 1800; Min: 256.0; Max: 8589934592
Metric Period (decimal) : How often to sample metrics, in seconds. Default: 60; Min: 1; Max: 300
Output Class Cores (boolean) : If on, one representative from each class is sent to the ‘cores’ output dataset. Default: False
Output Class Members (boolean) : If on, each record is sent with its class ID to the ‘members’ output dataset. Default: True
Output Singletons (boolean) : If on, singletons are sent only to the ‘singletons’ output dataset. Otherwise, they are emitted to the ‘members’ and ‘cores’ output datasets with the other records. Default: False
Spot policy (string) : Control cube placement on spot market instances. Default: Prohibited; Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
Field parameters¶
Class ID Field (Field Type: Int) : The name for the field that will contain the unique class ID. Default: Class ID
Class Member ID Field (Field Type: Int) : The name for the field that will contain the unique numeric ID of the molecule within its class. Default: Class Member
Class Size Field (Field Type: Int) : The name for the field that will contain the size of the class the molecule belongs to. Default: Class Size
Input Classification Data Field (Field Type: String) : The name for the input string data field.
Input Classification Score Field (Field Type: Float) : The name for the input data field that is used as a score to identify the core.
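As a rough illustration of how a score field could be used to identify the core of a class, the sketch below picks one representative by score. The documentation does not state whether the highest or the lowest score wins, so the direction is exposed as a hypothetical `higher_is_better` flag; the function name `pick_core` and the dictionary records are likewise assumptions for the example.

```python
def pick_core(members, score_field, higher_is_better=True):
    """Return the member with the best score; ties keep the earliest record."""
    # Assumption: the score direction is not documented, so it is a parameter.
    chooser = max if higher_is_better else min
    return chooser(members, key=lambda rec: rec[score_field])


# Hypothetical class with three members and a float "score" field.
members = [
    {"id": 1, "score": 0.2},
    {"id": 2, "score": 0.9},
    {"id": 3, "score": 0.9},
]
best = pick_core(members, "score")
worst = pick_core(members, "score", higher_is_better=False)
```

Python's `max` and `min` both return the first of several equal candidates, so tied scores resolve to the record that arrived first.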
Hardware Parameters¶
Machine hardware requirements
Memory (MiB) (decimal) : The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 1800; Min: 256.0; Max: 8589934592
Temporary Disk Space (MiB) (decimal) : The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 5120.0; Min: 128.0; Max: 8589934592
GPUs (integer) : The number of GPUs to run this cube with. Default: 0; Max: 16
CPUs (integer) : The number of CPUs to run this cube with. Default: 1; Min: 1; Max: 128
Instance Type (string) : The type of instance that this cube needs to be run on
Spot policy (string) : Control cube placement on spot market instances. Default: Prohibited; Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
Instance Tags (string) : Only run on machines with matching tags (comma separated). Default: “”
Metrics Parameters¶
Cube Metric Parameters
Metric Period (decimal) : How often to sample metrics, in seconds. Default: 60; Min: 1; Max: 300
Cube Metrics (string) : Set of metrics to be collected. Choices: cpu, disk, memory, network
Tip
filename: snowball/logic/classify.py