Datarecord Type Coercion And Field Splitting

Since Datarecord was intended to support an RDBMS as a storage option, a schema CANNOT have different data types in the same field (i.e. column) within a dataset. The job of drconvert and the ETL floes built on it is to inspect the values within an input file (e.g. CSV or SD) and decide each type of field so its values are all of a uniform type. Additionally, Orion will follow this same logic when datasets are updated directly.

In some cases, drconvert can force some values in a field to be a more general type. For example, if a field contains both integers and floating point numbers, it will force all values to be floating point because all numbers can be represented that way. Another example would be a field of numbers with a few <5 or >100 values. Those values can be represented by a Range type, and so can regular numbers, so drconvert will create a field of Range values where most of the ranges only have a single value.

There are other cases where drconvert cannot decide on a single type for a field. For example, if the field Foo contains both SMILES strings and numbers, rather than convert everything to a String type, it will split the field into two fields, Foo and Foo_2. One will contain all the SMILES strings and the other will contain only the numbers.

Drconvert will pretty much never force a number to a string type, because once you do that the numbers are no good for plotting, sorting, or anything else.

The coercion of values into other types is currently limited to the following.

Original Type

Coerced Type

Example

Integer

Float

5 => 5.0

Integer or Float

Range

1.23 => [1.23, 1.23]

Integer vector

Float vector

[8, 9, 10] => [8.0, 9.0, 10.0]

Float

Float vector

4.56 => [4.56]

SMILES

String

SMILES are converted to String if more than 5% of the values cannot be interpreted as SMILES.

Identifier Field Splitting Example

You may find it useful to see what drconvert will do using the verbose option:

drconvert -v <input_filename> out.oedb

But we will reformat that output here as a table.

Consider a CSV file with a column called Id.

Name,Id
one,1; 1-01
two,2
three,3; 4

Instead of using a single field for Id with type string, the parser will create string, integer, and integer vector fields.

Name(String)

Id(String)

Id_2(Int)

Id_3(IntVec)

one

1; 1-01

two

2

three

[3, 4]

However, if there we add another row that looks like an integer vector, then drconvert will coerce the integer into a vector.

Name,Id
one,1; 1-01
two,2
three,3; 4
four,4; 5

Name(String)

Id(IntVec)

Id_2(String)

one

1; 1-01

two

[2]

three

[3, 4]

four

[4, 5]

If you want only one Id field of type string, then you will either have to modify the data (e.g. prefix numbers with #) or you can write a custom script that constructs the type of field you want without any guessing by drconvert.

# THE FOLLOWING SCRIPT OR FLOE IS PROVIDED AS AN EXAMPLE AND IS PROPRIETARY
# TO OPENEYE SCIENTIFIC SOFTWARE INC AND IS SUBJECT TO THE FULL PROTECTION
# OF COPYRIGHT AND TRADESECRET LAW.
#
# AUTHORIZED USERS PURSUANT TO A VALID AND CURRENT LICENSE FROM OPENEYE ARE
# PERMITTED TO CREATE AND USE DERIVATIVE WORKS FROM THIS EXAMPLE SUBJECT TO
# THE TERMS AND CONDITIONS OF A VALD AND CURRENT OPENEYE LICENSE.
#
# ANY SUCH DERIVATIVE WORKS CAN BE USED ONLY DURING THE TERM OF A VALID LICENSE
# AND MUST BE DELETED WITHIN [90] DAYS AFTER TERMINATION OR EXPIRATION OF SUCH
# LICENSE IF ANOTHER LICENSE IS NOT ENTERED INTO FOR THE APPLICABLE SOFTWARE.
#
# ANY MODIFICATION OF THE CODE HEREIN IS AT CUSTOMER’S SOLE AND EXCLUSIVE RISK.
#
# PLEASE CONTACT OPENEYE AT LEGAL@EYESOPEN.COM IF YOU HAVE ANY QUESTIONS ABOUT
# THIS WARNING.

import csv
import sys

from openeye.oechem import (
  OEField,
  oeofstream,
  OERecord,
  OEWriteRecord,
  Types
)

# usage:
# custom_csv_converter.py <input> <output>
# <output> should have a suffix of '.oedb'

ofs = oeofstream(sys.argv[2])

with open(sys.argv[1]) as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        record = OERecord()
        name = OEField("Name", Types.String)
        identifier = OEField("Id", Types.String)
        record.set_value(name, row[0])
        record.set_value(identifier, row[1])
        OEWriteRecord(ofs, record)
        print(record)