Datarecord Type Coercion And Field Splitting

Since Datarecord was intended to support an RDBMS as a storage option, a schema CANNOT have different data types in the same field (i.e., column) within a dataset. The job of drconvert, and of the ETL floes built on it, is to inspect the values within an input file (e.g., CSV or SD) and decide the type of each field so that its values are all of a uniform type. Orion follows this same logic when datasets are updated directly.

Note

If the schema of an input file is known and well-defined, you can force drconvert to use a specific schema to parse the file. See the Using an External Schema section below.

In some cases, drconvert can coerce some values in a field to a more general type. For example, if a field contains both integers and floating point numbers, it will coerce all values to floating point, because all numbers can be represented that way. Another example would be a field of numbers that also contains a few inequality values such as "<5" or ">100". Those values can be represented by a Range type, and so can plain numbers, so drconvert will create a field of Range values in which most of the ranges contain only a single value.

There are other cases where drconvert cannot decide on a single type for a field. For example, if a field Foo contains both SMILES strings and numbers, rather than convert everything to a String type, drconvert will split the field into two fields, Foo and Foo_2: one will contain all the SMILES strings, and the other will contain only the numbers.
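As an illustration only, the splitting decision can be sketched in plain Python. The function name and the sparse-column representation below are hypothetical, not drconvert's actual implementation:

```python
# Hypothetical sketch (not drconvert's real code): split a mixed column
# into a numeric field and a string field, keyed by row index.
def split_field(name, values):
    numbers = {}   # rows whose value parses as a number
    strings = {}   # rows kept as strings (e.g. SMILES)
    for i, value in enumerate(values):
        try:
            numbers[i] = float(value)
        except ValueError:
            strings[i] = value
    return {name: strings, name + "_2": numbers}
```

For instance, `split_field("Foo", ["c1ccccc1", "3.5"])` would place the SMILES string under `Foo` and the number under `Foo_2`.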

Drconvert will almost never coerce a number to a string type, because numbers stored as strings can no longer be used for plotting, sorting, or any other numeric operation.

The coercion of values into other types is currently limited to the following.

Original Type      Coerced Type    Example
-----------------  --------------  ------------------------------
Integer            Float           5 => 5.0
Integer or Float   Range           1.23 => [1.23, 1.23]
Integer vector     Float vector    [8, 9, 10] => [8.0, 9.0, 10.0]
Float              Float vector    4.56 => [4.56]
SMILES             String          (see note below)

SMILES values are converted to String only if more than 5% of the values in the field cannot be interpreted as SMILES.
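The promotion rule in the first table row can be sketched in plain Python. This is a hypothetical illustration, not drconvert's actual implementation:

```python
# Hypothetical sketch: a field mixing integers and floats is promoted
# entirely to floats, since every integer can be represented as a float.
def promote_field(values):
    parsed = []
    for value in values:
        try:
            parsed.append(int(value))
        except ValueError:
            parsed.append(float(value))
    if any(isinstance(p, float) for p in parsed):
        return [float(p) for p in parsed]
    return parsed

promote_field(["5", "2.5", "7"])  # every value becomes a float
```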

Identifier Field Splitting Example

You may find it useful to see what drconvert will do using the verbose option:

drconvert -v <input_filename> out.oedb

Below, that output is reformatted as tables.

Consider a CSV file with a column called Id.

Name,Id
one,1; 1-01
two,2
three,3; 4

Instead of creating a single Id field of type String, the parser creates separate String, Integer, and Integer-vector fields.

Name(String)  Id(String)  Id_2(Int)  Id_3(IntVec)
------------  ----------  ---------  ------------
one           1; 1-01
two                       2
three                                [3, 4]
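The per-cell decision in the table above can be sketched as plain Python. The function name and the fallback order are hypothetical, not drconvert's actual code:

```python
# Hypothetical sketch: parse a semicolon-separated cell into an int
# vector, falling back to a plain int or, failing that, a string.
def parse_id(cell):
    parts = [p.strip() for p in cell.split(";")]
    try:
        ints = [int(p) for p in parts]
    except ValueError:
        return cell  # keep as a string (e.g. "1; 1-01")
    return ints if len(ints) > 1 else ints[0]

parse_id("3; 4")     # an int vector
parse_id("2")        # a plain int
parse_id("1; 1-01")  # kept as a string
```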

However, if we add another row whose value looks like an integer vector, then drconvert will coerce the lone integer into a one-element vector.

Name,Id
one,1; 1-01
two,2
three,3; 4
four,4; 5

Name(String)  Id(IntVec)  Id_2(String)
------------  ----------  ------------
one                       1; 1-01
two           [2]
three         [3, 4]
four          [4, 5]
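This scalar-to-vector coercion can also be sketched in plain Python. Again, this is a hypothetical illustration rather than drconvert's implementation:

```python
# Hypothetical sketch: if any value in the field is an int vector,
# wrap lone ints into one-element vectors so the field has one type.
def coerce_to_vectors(values):
    if any(isinstance(v, list) for v in values):
        return [v if isinstance(v, list) else [v] for v in values]
    return values

coerce_to_vectors([2, [3, 4], [4, 5]])  # the lone 2 becomes [2]
```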

If you want only one Id field of type String, you must either modify the data (e.g., prefix the numbers with #) or write a custom script that constructs the fields you want without any guessing by drconvert.

# THE FOLLOWING SCRIPT OR FLOE IS PROVIDED AS AN EXAMPLE AND IS PROPRIETARY
# TO OPENEYE SCIENTIFIC SOFTWARE INC AND IS SUBJECT TO THE FULL PROTECTION
# OF COPYRIGHT AND TRADE SECRET LAW.
#
# AUTHORIZED USERS PURSUANT TO A VALID AND CURRENT LICENSE FROM OPENEYE ARE
# PERMITTED TO CREATE AND USE DERIVATIVE WORKS FROM THIS EXAMPLE SUBJECT TO
# THE TERMS AND CONDITIONS OF A VALID AND CURRENT OPENEYE LICENSE.
#
# ANY SUCH DERIVATIVE WORKS CAN BE USED ONLY DURING THE TERM OF A VALID LICENSE
# AND MUST BE DELETED WITHIN [90] DAYS AFTER TERMINATION OR EXPIRATION OF SUCH
# LICENSE IF ANOTHER LICENSE IS NOT ENTERED INTO FOR THE APPLICABLE SOFTWARE.
#
# ANY MODIFICATION OF THE CODE HEREIN IS AT CUSTOMER’S SOLE AND EXCLUSIVE RISK.
#
# PLEASE CONTACT OPENEYE AT LEGAL@EYESOPEN.COM IF YOU HAVE ANY QUESTIONS ABOUT
# THIS WARNING.

import csv
import sys

# Note: oechem must be imported before OpenEye toolkits
from openeye.oechem import Types, OEField, OERecord, OEWriteRecord, oeofstream

# usage:
# custom_csv_converter.py <input> <output>
# <output> should have a suffix of '.oedb'

ofs = oeofstream(sys.argv[2])

with open(sys.argv[1]) as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row ("Name,Id")
    # Define the fields once; each data row becomes one record
    name = OEField("Name", Types.String)
    identifier = OEField("Id", Types.String)
    for row in reader:
        record = OERecord()
        record.set_value(name, row[0])
        record.set_value(identifier, row[1])
        OEWriteRecord(ofs, record)
        print(record)

Using an External Schema

To read a CSV or molecule file where the schema is known ahead of time, a converter can be supplied with a schema, defined by a single OERecord, to bypass all of the type-guessing and field-splitting. For example, if an input CSV file contains three columns, “A”, “B”, and “C”, and you expect column “A” to contain integers and “B” and “C” to contain floating point numbers, you could create a record which defines that schema:

from drconvert import CSVConverter, CSVConversionOptions

# Note: oechem must be imported before OpenEye toolkits
from openeye.oechem import Types, OEField, OERecord

csv_path = "path/to/a/csv_file.csv"

# Create a record to use as a schema, and add the desired fields
schema = OERecord()
schema.add_field(OEField("A", Types.Int))
schema.add_field(OEField("B", Types.Float))
schema.add_field(OEField("C", Types.Float))

# Pass the schema to create an options object, and give that to the converter
options = CSVConversionOptions(schema=schema)
converter = CSVConverter(csv_path, options=options)
# . . .

Handling Unexpected Values

If the values in the CSV file all have the expected types specified in the schema, the file will be processed and will output records that match the schema. If the CSV file has additional columns not specified in the schema, those columns will be ignored. What happens if a value is encountered that doesn’t fit in the provided schema? For example, what if column “A” contains a value “N/A”?

By default, this will cause an error and the converter will terminate with an exception. However, by passing an error_field value to the CSVConversionOptions object, you can make the converter write any bad values it encounters into a new column and continue processing. We use a CSV file in these examples, but the same idea applies to molecule files, where the error_field would be supplied to a MolConversionOptions object.

from drconvert import CSVConverter, CSVConversionOptions

# Note: oechem must be imported before OpenEye toolkits
from openeye.oechem import Types, OEField, OERecord

csv_path = "path/to/a/csv_file.csv"

schema = OERecord()
schema.add_field(OEField("A", Types.Int))
schema.add_field(OEField("B", Types.Float))
schema.add_field(OEField("C", Types.Float))

# Provide a place to store errors:
options = CSVConversionOptions(schema=schema, error_field="Errors go here")
converter = CSVConverter(csv_path, options=options)
# . . .

The illegal values from a CSV, along with their target column name and type, will be written to the error field in the following form:

Bad values:
'A' (Int) : '25.78'
'B' (Float) : 'Something unexpected'
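A string in that form can be assembled with a few lines of Python. The function below is a hypothetical illustration of the format, not part of the drconvert API:

```python
# Hypothetical sketch: render (column, type, value) triples in the
# "Bad values:" format shown above.
def format_bad_values(bad):
    lines = ["Bad values:"]
    for name, typename, value in bad:
        lines.append("'%s' (%s) : '%s'" % (name, typename, value))
    return "\n".join(lines)
```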

Manually Splitting Columns

Suppose you know that column “A” in the input file contains both numbers and strings, and you’d like to separate them into different fields. This can be done using metadata on the fields in the schema. Putting a Meta.Relations.SplitFrom attribute on a field will tell the converter where to put non-conforming values. In the following example code, any non-integer values found in column “A” will be put into a column named “A_other”.

from drconvert import CSVConverter, CSVConversionOptions

# Note: oechem must be imported before OpenEye toolkits
from openeye.oechem import Meta, Types, OEField, OERecord, OEFieldMeta

csv_path = "path/to/a/csv_file.csv"

schema = OERecord()

# Metadata for a string field to get values that won't fit in column "A"
split_meta = OEFieldMeta(attributes=[(Meta.Relations.SplitFrom, "A")])

schema.add_field(OEField("A", Types.Int))
# Use the metadata here:
schema.add_field(OEField("A_other", Types.String, meta=split_meta))
schema.add_field(OEField("B", Types.Float))
schema.add_field(OEField("C", Types.Float))

# Provide a place to store errors:
options = CSVConversionOptions(schema=schema, error_field="Errors go here")
converter = CSVConverter(csv_path, options=options)
# . . .

Renaming Columns

If you want the output records to have different field names than the input file, you can add metadata to the schema to rename the output fields. The following snippet shows how to rename column “A” to “Alpha”:

# Put metadata on the schema to rename the input field "A" to "Alpha"
rename_meta = OEFieldMeta(attributes=[(Meta.Relations.ReplacedBy, "Alpha")])
schema.add_field(OEField("A", Types.Int, meta=rename_meta))

Metadata on the Schema Fields

With one exception, any metadata on fields in the schema record will be transferred to the output records. The exception is the Meta.Relations.ReplacedBy attribute: when it is used to rename an output field, that attribute is omitted from the output field.