Quick Tour

Introduction

An OERecord is a data container that holds a variety of strongly-typed, named values, plus associated metadata. It is the primary way data is passed between cubes within a Floe, and is the main mechanism by which data is communicated to and from Orion.

Basic Functionality

Basic use of an OERecord is easy. Let’s create a record, and try a few things. First, we’ll create the record:

>>> from datarecord import OERecord, OEField, Types
>>> record = OERecord()

Next, we’ll create an OEField object to designate the name and type of the value we want to store on the record.

>>> field = OEField('My First Field', Types.String)

This creates a field with the name “My First Field” and a string type. We are now ready to store our value on the record.

>>> record.set_value(field, 'Hello World!')

We can retrieve the value we just set with the same field we used to set the value (or one that looks like it).

>>> record.get_value(field)
'Hello World!'

It isn’t necessary to use the exact same OEField object to retrieve our value. Any OEField object with the same name and type will do.

>>> other_field = OEField('My First Field', Types.String)
>>> record.get_value(other_field)
'Hello World!'

Besides get_value() and set_value(), other basic access to data is provided by the methods has_value() and clear_value(). Similar access to field information on records is provided by get_field(), add_field(), has_field(), and delete_field().

>>> record.has_value(field)
True
>>> record.clear_value(field)  # removes the value, but not the field
>>> record.has_value(field)
False
>>> record.has_field(field)
True
>>> record.get_value(field)  # returns None, because the value was cleared
>>> record.delete_field(field)
>>> record.has_field(field)
False
>>> record.get_value(field)  # also returns None, because the field no longer exists
>>>

What happens if we try to set a numerical value using our ‘My First Field’ field?

>>> record.set_value(field, 12.4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "..../datarecord/datarecord.py", line 433, in set_value
    self._set(field, value, by_value=True)
  File "..../datarecord/datarecord.py", line 469, in _set
    format(value, type_name))
ValueError: Invalid value '12.4' for field type String

Why did this fail? Because our field was created with a Types.String argument, which enforces that only string data can be stored for that field. OERecord is very fussy about types, and for a good reason. One of the most challenging aspects of data handling is knowing the specific types of data you are working with, and being consistent with those types. How many times have you worked with spreadsheet columns that had a mixture of strings and numbers, and how fun is it to sort that out?
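To make the idea concrete, here is a small stand-in written in plain Python. It is only a sketch of the behavior described above, not how datarecord is implemented: ToyField and ToyRecord are hypothetical names, and Python's built-in str and float stand in for Types.String and Types.Float.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyField:
    """Hypothetical stand-in for a field: a name plus an expected Python type."""
    name: str
    type: type


class ToyRecord:
    """Hypothetical stand-in for a record that rejects mistyped values."""

    def __init__(self):
        self._values = {}

    def set_value(self, field, value):
        # Refuse any value whose type does not match the field's declared type.
        if not isinstance(value, field.type):
            raise ValueError("Invalid value {!r} for field type {}".format(
                value, field.type.__name__))
        self._values[field.name] = value

    def get_value(self, field):
        # Fields that were never set yield None.
        return self._values.get(field.name)


record = ToyRecord()
field = ToyField('My First Field', str)
record.set_value(field, 'Hello World!')

# A second field with the same name and type retrieves the same value.
other_field = ToyField('My First Field', str)
retrieved = record.get_value(other_field)

# A float is rejected by the string field, and the stored value is untouched.
try:
    record.set_value(field, 12.4)
    rejected = False
except ValueError:
    rejected = True
```

In this sketch the check happens at write time, so a mistyped value never reaches the stored data.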

To put another value onto our record, we’ll create another OEField object. This time, let’s make it a floating point number.

>>> score_field = OEField('Score', Types.Float)
>>> record.set_value(score_field, 21.4)
>>> record.get_value(score_field)
21.4
>>> record
OERecord (
    My First Field(String) : Hello World!
    Score(Float) : 21.4
    )

We can see the contents of the record by printing it. Notice that when you create an OEField object, there is no argument for an OERecord object. The field objects are completely independent of the records they are used with. This allows us to do something interesting.

property_field = OEField('My Property', Types.Float)
mol_field = OEField('Molecule', Types.Chem.Mol)

def my_function(record):
    mol = record.get_value(mol_field)  # pulls a molecule off of the record
    value = go_calculate_a_property(mol)
    record.set_value(property_field, value)

In this example, there are two OEField objects that are created and reused on any number of records. We don’t need to create field objects for each record – we create them once and reuse them. This pattern helps to create uniform data sets, because every record with a ‘My Property’ field on it will have values of identical, known type. This becomes important in a Floe environment, where data needs to be communicated between processes, and in Orion, where data consistency within columns in a dataset is very useful for all kinds of analyses.

Limitations

There are a few limitations to OERecord you should know about.
  • Fields on a record are unique by name. If you set a value on a record using a field whose name matches an existing field, but whose type is different, you’ll get a warning that the type of an existing field was changed.
  • Only the UTF-8 encoding scheme for strings is supported.
  • Floating point values are stored as 64-bit double-precision numbers.
  • Int types are signed 64-bit values. If you need larger ints, Types.ReallyBigInt should do the trick. It handles integers up to 1023 bits long. That is really big.
  • Records are limited to 140 characters. Just kidding! Records, and the individual data items on them, are essentially unlimited in size. However, keep in mind that records must fit into the memory of the machines they are used on.
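
The numeric limits above can be checked with ordinary Python arithmetic. This sketch uses only the standard library and makes no calls into datarecord. It assumes that "1023 bits" refers to the magnitude of the integer, which the text above does not spell out.

```python
import struct

# Range of a signed 64-bit integer, the stated size of Int values.
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1

# Largest magnitude that fits in 1023 bits (assumption: sign not counted).
BIG_MAX = 2 ** 1023 - 1
big_decimal_digits = len(str(BIG_MAX))

# A double-precision float occupies 8 bytes in standard layout.
double_size = struct.calcsize('<d')

# UTF-8 needs more than one byte for non-ASCII characters such as U+00E9.
encoded = '\u00e9'.encode('utf-8')
```

Here big_decimal_digits comes out to 308, which is where the "10**308" figure for Types.ReallyBigInt comes from.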

Types

We’ve seen string fields, numeric fields, and molecule fields. What other types of data can be stored on an OEField? The following table lists the available types. They live in the class Types.

Types available to OEFields
Type Name                   Description
Types.Int                   Signed integers, up to 8 bytes in length
Types.Float                 Double precision floating point numbers
Types.Bool                  Boolean values
Types.String                String values, encoded as UTF-8
Types.Range                 Range values, used to represent values like ‘<3’ or ‘>=100’
Types.ReallyBigInt          Very large integers, up to 1023 bits (10**308)
Types.Chem.Mol              OEMol objects
Types.Chem.Surface          OESurface objects
Types.Chem.Grid             OEScalarGrid objects
Types.Chem.FingerPrint      OEFingerPrint objects
Types.JSONObject            Python dict objects that can be serialized to JSON
Types.Record                OERecord objects. Wait, what? Yes, you can store OERecords on other OERecords
(lists of the above types)  Types.IntVec, Types.FloatVec, Types.RecordVec, etc.
Custom types                If the above list isn’t enough, you can define your own types to put onto OERecords

Metadata

There are really three types of information that an OEField can have. In addition to a name and a data type, a field can hold metadata, or additional “data about the data”. Metadata can be used to convey additional information about the values in a field. For example, the constant Meta.Hints.Categorical can be used to indicate that the values from a specific field are categorical. Meta.Units.Mass.mg indicates that the values from a field are in milligrams. Other metadata values can be used to express relationships between different fields on a record, display hints for representing values, interpretation information, and other aspects of your data. The available metadata values can be seen in the reference documentation for the Meta class.

The OEField constructor method takes an optional argument, an OEFieldMeta object, which holds two types of metadata. Options are metadata keys without values, and attributes are metadata keys with values. To create an OEField object with metadata, you first construct an OEFieldMeta object, assign some options or attributes to it, then pass it to the OEField constructor. As with data types, metadata values are intended to be uniform for entire columns of data.

>>> from datarecord import OEFieldMeta, Meta
>>> meta = OEFieldMeta()
>>> meta.set_option(Meta.Hints.Chem.InChiKey)
>>> meta.set_option(Meta.Display.Hidden)
>>> inchi_key_field = OEField('InChi Key', Types.String, meta=meta)
>>> record.set_value(inchi_key_field, 'RYYVLZVUVIJVGH-VMIGTVKRSA-N')

In the above example, a String field is created with two pieces of metadata. The first indicates that the value should be interpreted as an InChi key. The second is a hint to a user interface that this column should be hidden by default.

The Meta class is a large, hierarchical collection of constants that can be used in field metadata. You can explore the hierarchy in a Python interpreter.

>>> from datarecord import Meta
>>> Meta.list()
Categories: Annotation, Constraints, Display, Flags, Hints, Limits, Relations, Source, Units
>>> Meta.Units.list()
Categories: Concentration, Energy, Frequency, Length, Mass, Temp, Time, Volume
>>> Meta.Units.Mass.list()
Values: g, kg, mg, ng, ug
>>> Meta.find('kg')
Units.Concentration.mg_per_kg, Units.Mass.kg

Special Handling of Units

As shown above, it is possible to specify the units of a field using an OEFieldMeta object. There is a shortcut for specifying units metadata for an OEField object, using the units named parameter.

>>> field = OEField('Mass', Types.Float, units=Meta.Units.Mass.kg)
>>> field.get_meta()
{"options": ["Units.Mass.kg"]}
>>> # or, even more simply:
>>> field_2 = OEField('Mass', Types.Float, units='kg')
>>> field_2.get_meta()
{"options": ["Units.Mass.kg"]}

Does this mean you can specify anything for the units string? Sadly, no. The string you provide must match one of the constants in the Meta.Units class. Using any unrecognized units will raise an exception.
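
The lookup rule can be pictured with a toy table in plain Python. This is a sketch under stated assumptions, not the datarecord implementation: TOY_UNITS is a hypothetical miniature of the units hierarchy, resolve_units and find are made-up helper names, and ValueError is an assumed exception type (the text above only says that an exception is raised).

```python
# Hypothetical miniature of a units hierarchy; the real Meta.Units is far larger.
TOY_UNITS = {
    'Mass': ['g', 'kg', 'mg', 'ng', 'ug'],
    'Concentration': ['mg_per_kg'],
}


def resolve_units(name):
    """Return the dotted path for an exactly matching unit name, or raise."""
    for category, names in TOY_UNITS.items():
        if name in names:
            return 'Units.{}.{}'.format(category, name)
    # Assumption: the real library may raise a different exception type.
    raise ValueError('Unrecognized units: {!r}'.format(name))


def find(text):
    """Return every dotted path whose unit name contains the given text."""
    return sorted(
        'Units.{}.{}'.format(category, name)
        for category, names in TOY_UNITS.items()
        for name in names
        if text in name
    )


kg_path = resolve_units('kg')   # exact match only
kg_hits = find('kg')            # substring search, like Meta.find('kg')
```

The exact-match rule is what makes the shortcut safe: a misspelled unit fails immediately instead of silently creating a new one.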

Working with Molecules

As we’ve seen, molecules can be stored on records along with other data types. An OERecord may contain one molecule field, many molecule fields, or no molecule fields at all. Molecules don’t get any special treatment – they are just like every other piece of data on a record.

There are many cases where it is convenient to assume that a record has a single molecule of interest, and that the other data on the record is mainly about this molecule. There is a metadata option, Meta.Hints.Chem.PrimaryMol, that can be used to mark a field as the “primary molecule” on a record, but finding this field can be tricky if you’ve received a record and don’t know the name of the special molecule field. There is a subclass of OERecord, called OEMolRecord, which makes this and other handling of molecules easier. OEMolRecord also has methods for dealing with data associated with conformers, atoms, and bonds.

The example below illustrates three different ways to set and get a primary molecule on a record, starting with the most tedious approach.

from datarecord import (OERecord, OEMolRecord, OEField, OEFieldMeta,
                        Types, Meta, OEPrimaryMolField)

# Approach 1: Create a field with the right type and metadata
def approach_1(record):
    meta = OEFieldMeta()
    meta.set_option(Meta.Hints.Chem.PrimaryMol)
    mol_field = OEField('Molecule', Types.Chem.Mol, meta=meta)
    mol = record.get_value(mol_field)
    modify_molecule_somehow(mol)
    record.set_value(mol_field, mol)

# Approach 2: Use the magic OEPrimaryMolField object, which finds the
# "primary molecule" field from metadata
def approach_2(record):
    mol = record.get_value(OEPrimaryMolField())
    modify_molecule_somehow(mol)
    record.set_value(OEPrimaryMolField(), mol)

# Approach 3: Use convenience methods on OEMolRecord. This method takes
# an OEMolRecord instead of an OERecord
def approach_3(mol_record):
    mol = mol_record.get_mol()
    modify_molecule_somehow(mol)
    mol_record.set_mol(mol)

Conformer Data

The OEMolRecord gives convenient access to the primary molecule on a record, but much of the specialization of the molecule record is to provide a way to keep data for conformers.

The general concept is that each conformer can get its own OERecord object, that can be filled with whatever data needs to be associated with that conformer. OEMolRecord has methods to set, get, check for, and delete these special conformer records. The following code illustrates the process of getting a molecule from a record, assigning data to its conformers, and updating the record.

from datarecord import OEMolRecord, OEField, Types

score_field = OEField('Score', Types.Float)

def set_data_for_conformers(mol_record):
    mol = mol_record.get_mol()
    for conf in mol.GetConfs():
        value = calculate_some_score(conf)
        # Get a record containing data for this conformer
        conformer_data = mol_record.get_conf_record(conf)
        # Add some data to that record
        conformer_data.set_value(score_field, value)
        # Put the conformer's record back onto the parent record
        mol_record.set_conf_record(conf, conformer_data)

    mol_record.set_mol(mol)  # THIS IS IMPORTANT. SEE BELOW.

In this example, we stored just one piece of data for each conformer, but the conformer’s record can hold any number of fields.

Note

In the above example, it is imperative that the molecule be set on the record after setting conformer data. The gory details are that two things are happening that you don’t see. First, the get_mol method, like get_value, returns a copy of the molecule from the record. Second, the set_conf_record method secretly marks the conformer objects (on the copied molecule) with bookkeeping information. If the molecule is not re-added to the record, the bookkeeping information is lost, and the conformer data will not be retained on the record.

Another solution is to use the get_mol_reference method, instead of get_mol. This returns the actual molecule from the record and not a copy.

Atom and Bond Data

The good news here is that atom and bond data are handled exactly the same way as conformer data. Substitute “atom” or “bond” for “conf” above, and everything should just work.

Advanced Topics

References vs. Copies

The get_value() and set_value() methods, when used with fields containing object types, such as molecules and records, create copies of the values on the record. This has implications for both performance and behavior in certain cases. Some types of objects are relatively expensive to copy, and when the cost of making a copy is comparable to the run time of a fast calculation, this expense can be a concern.

An alternative to using get_value() is to use get_reference(). Instead of returning an object copy, this method returns a reference. While using references can avoid copy operations, thereby improving performance, there are some caveats to their use. Modifying an object after getting its reference on a record is actually modifying the object on the record. Forgetting this fact can lead to surprises. Doing the following:

molecule = record.get_reference(mol_field)
...
molecule.DeleteConfs()

will leave a molecule on the record without any conformers.
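
The difference between the two access styles can be sketched in plain Python with copy.deepcopy. ToyMol and ToyRecord below are hypothetical stand-ins, not datarecord classes, and the sketch assumes only what is described above: get_value and set_value copy, get_reference does not.

```python
import copy


class ToyMol:
    """Hypothetical stand-in for a molecule holding a list of conformers."""

    def __init__(self, confs):
        self.confs = list(confs)

    def delete_confs(self):
        self.confs.clear()


class ToyRecord:
    """Hypothetical stand-in for a record with copying and non-copying access."""

    def __init__(self):
        self._values = {}

    def set_value(self, name, value):
        self._values[name] = copy.deepcopy(value)  # store a copy

    def get_value(self, name):
        return copy.deepcopy(self._values[name])   # hand back a copy

    def get_reference(self, name):
        return self._values[name]                  # hand back the stored object


record = ToyRecord()
record.set_value('Molecule', ToyMol(['conf0', 'conf1']))

# Editing a copy leaves the record untouched.
mol_copy = record.get_value('Molecule')
mol_copy.delete_confs()
confs_after_copy_edit = len(record.get_value('Molecule').confs)

# Editing through a reference changes the object stored on the record.
mol_ref = record.get_reference('Molecule')
mol_ref.delete_confs()
confs_after_ref_edit = len(record.get_value('Molecule').confs)
```

In the sketch the record still holds two conformers after the copy is edited, and none after the reference is edited. The same reasoning explains why changes made to a copy must be written back with a set call before they appear on the record.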

Examining Fields on Records

Three methods exist for examining fields on a record. Two of them are on OERecord and one on OEField. OERecord.get_field() can be used to retrieve a field object from a record by name. OERecord.get_fields() is a generator method yielding copies of all of the OEField objects from a record. Both methods have an include_meta parameter to indicate whether the field metadata should be included on the returned field objects.

OEField has get_meta and set_meta methods which return or set the field’s metadata in an OEFieldMeta object.

Match Fields

Sometimes it is desirable to locate fields on a record when you know the data type of the field but not the name. A subclass of OEField, called OEMatchField, allows retrieval of data based on its type and its metadata. An OEMatchField acts as a surrogate for the first matching field on a record. The following example accesses data for a String field with the Meta.Flags.ID metadata in two different ways.

from datarecord import OERecord, Types, Meta, OEMatchField, OEFieldMeta

# First the hard way
def get_ID_value(record):
    for field in record.get_fields(include_meta=True):
        if field.get_type() is Types.String:
            meta = field.get_meta()
            if meta.has_option(Meta.Flags.ID):
                return record.get_value(field)
    return None

# Now the easier way
meta = OEFieldMeta(options=Meta.Flags.ID)
id_field = OEMatchField('ID', Types.String, meta=meta)

def get_ID_value(record):
    return record.get_value(id_field)
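
The "first matching field" rule can be illustrated with a toy in plain Python. This is a sketch of the matching idea only: ToyField and first_match are hypothetical names, metadata options are modeled as plain strings, and nothing here reflects how OEMatchField is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyField:
    """Hypothetical stand-in for a field with a type and a set of options."""
    name: str
    type: type
    options: frozenset = field(default_factory=frozenset)


def first_match(fields, wanted_type, wanted_options):
    """Return the first field of the wanted type carrying all wanted options."""
    for candidate in fields:
        if candidate.type is wanted_type and wanted_options <= candidate.options:
            return candidate
    return None


fields = [
    ToyField('Name', str),                                # right type, no ID flag
    ToyField('Score', float, frozenset({'Flags.ID'})),    # ID flag, wrong type
    ToyField('Compound ID', str, frozenset({'Flags.ID'})),
]

match = first_match(fields, str, frozenset({'Flags.ID'}))
no_match = first_match(fields, int, frozenset({'Flags.ID'}))
```

Only 'Compound ID' satisfies both conditions here, so the field's name never has to be known in advance.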

Custom Data Types

There may be cases where you need to put a custom data type onto a record. The preferred way to do this is to create a handler class derived from CustomHandler. The derived class must implement four methods:

  • get_name() returns the name of your custom type.
  • validate() is called by the set_value method, and should return True if the object passed is appropriate for your custom type.
  • serialize() converts your custom object into bytes.
  • deserialize() creates an instance of your custom object from bytes.

Optionally, you can implement a copy() method to return a copy of your object. If you don’t provide an implementation, the record will use a default (and possibly slow) implementation.

Once you have created your custom handler class, you can pass it to the OEField constructor as you would any other type, to create a field capable of storing your custom data type onto a record.

The following code creates a custom type handler for a Python dict object.

from datarecord import OERecord, OEField, Types, CustomHandler
import json

class MyHandler(CustomHandler):
    @staticmethod
    def get_name():
        return 'DictType'

    @staticmethod
    def serialize(thing):
        # This method must convert my object (dict) into bytes
        return json.dumps(thing).encode()

    @staticmethod
    def deserialize(byte_data):
        # Create my object (dict) from bytes
        return json.loads(byte_data.decode())

    @classmethod
    def validate(cls, thing):
        if not isinstance(thing, dict):
            return False
        # Make sure our object will serialize
        try:
            MyHandler.serialize(thing)
            return True
        except (TypeError, ValueError):
            return False

We can now use our custom handler to read and write a dict object:

>>> record = OERecord()
>>> field = OEField('My Custom Field', MyHandler)
>>> test_object = {'question': 'what is your quest?'}
>>> record.set_value(field, test_object)
>>> record.get_value(field)
{'question': 'what is your quest?'}
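
The serialization logic in the handler can be exercised on its own, without a record. The sketch below repeats the three functions as plain module-level functions using only the standard json module. It does not import datarecord, so it says nothing about how CustomHandler calls these methods.

```python
import json


def serialize(thing):
    # Convert a dict into UTF-8 encoded JSON bytes.
    return json.dumps(thing).encode()


def deserialize(byte_data):
    # Rebuild the dict from JSON bytes.
    return json.loads(byte_data.decode())


def validate(thing):
    # Accept only dicts that the json module can actually serialize.
    if not isinstance(thing, dict):
        return False
    try:
        serialize(thing)
        return True
    except (TypeError, ValueError):
        # json.dumps raises TypeError for unsupported objects such as sets.
        return False


original = {'question': 'what is your quest?'}
round_tripped = deserialize(serialize(original))

good = validate(original)
bad_type = validate(['not', 'a', 'dict'])
bad_content = validate({'values': {1, 2, 3}})

# JSON object keys are always strings, so integer keys change type.
int_keys = deserialize(serialize({1: 'one'}))
```

One thing to keep in mind with a JSON-based handler is the last line: a dict with non-string keys passes validation but comes back with string keys.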

Missing Values

By default, OEField objects will not store None values on a record. There are cases, however, where a distinction is made between a value that was never set on a record and a value set to None.

OEField has a keyword argument nullable, that imparts some special behavior to the field.

  • A nullable field can be used to set a None value on a record. (A non-nullable field will raise an exception if you try this.)
  • A nullable field can retrieve a None value from a record.
  • If a null value is written with a nullable field, and read with a non-nullable field, the result will be None for Int, Bool, and ReallyBigInt types or a default value for most other types. See the following table.
  • None values in list types (Types.StringVec, Types.FloatVec, etc.) are treated similarly to scalar types on read. A missing list value will be converted to a default value when read by a non-nullable field.

Values returned by non-nullable fields for null values
Type Name                   Return Value
Types.Int                   None
Types.Float                 NaN (math.nan)
Types.Bool                  None
Types.String                An empty string
Types.Range                 The range [-inf, inf]
Types.ReallyBigInt          None
Types.Chem.Mol              An empty OEMol object
Types.Chem.Surface          An empty OESurface object
Types.Chem.Grid             An empty OEScalarGrid object
Types.Chem.FingerPrint      An empty OEFingerPrint object
Types.Record                An empty OERecord object
(lists of the above types)  An empty list object []

What is the difference between a value that was never set on a field and a value that was set to None? For a nullable field, a record’s get_value() method will return None in either case, but other methods can be used to tell the difference.

Behavior of methods with normal, unset, and null values
Method                    Normal   Unset  Null
Nullable Fields
record.has_value(field)   True     False  True
record.get_value(field)   (Value)  None   None
record.is_na(field)       False    False  True
Non-nullable Fields
record.has_value(field)   True     False  True
record.get_value(field)   (Value)  None   (Default values, see above)
record.is_na(field)       False    False  True
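
The three states in the table (normal, unset, null) can be modeled in plain Python with a sentinel object. This is a toy sketch, not the datarecord implementation: ToyRecord and its nullable arguments are hypothetical, the defaults cover only Float and String from the table above, and ValueError is an assumed exception type.

```python
import math

_NULL = object()  # sentinel meaning "explicitly set to None"

# Defaults for two of the types in the table; other types fall back to None.
DEFAULTS = {float: math.nan, str: ''}


class ToyRecord:
    """Hypothetical stand-in that distinguishes unset values from null values."""

    def __init__(self):
        self._values = {}

    def set_value(self, name, value, nullable=False):
        if value is None:
            if not nullable:
                # Assumption: the real library may raise a different exception.
                raise ValueError('field {!r} is not nullable'.format(name))
            value = _NULL
        self._values[name] = value

    def has_value(self, name):
        return name in self._values

    def is_na(self, name):
        return self._values.get(name) is _NULL

    def get_value(self, name, type_, nullable=False):
        if name not in self._values:
            return None                      # unset
        value = self._values[name]
        if value is _NULL:                   # null
            return None if nullable else DEFAULTS.get(type_)
        return value                         # normal


record = ToyRecord()
record.set_value('Score', None, nullable=True)

unset_state = (record.has_value('Other'), record.is_na('Other'))
null_state = (record.has_value('Score'), record.is_na('Score'))
read_nullable = record.get_value('Score', float, nullable=True)
read_default = record.get_value('Score', float)
```

In the sketch, has_value and is_na tell the unset and null cases apart even though a nullable read returns None for both, and a non-nullable read of the null Float yields NaN.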