Quick Tour
Introduction
An OERecord is a data container that holds a variety of strongly typed, named values, plus associated metadata. It is the primary way data is passed between cubes within a Floe, and is the main mechanism by which data is communicated to and from Orion.
Basic Functionality
Basic use of an OERecord is easy. Let’s create a record and try a few things. First, we’ll create the record:
>>> from datarecord import OERecord, OEField, Types
>>> record = OERecord()
Next, we’ll create an OEField object to designate the name and type of the value we want to store on the record.
>>> field = OEField('My First Field', Types.String)
This creates a field with the name “My First Field” and a string type. We are now ready to store our value on the record.
>>> record.set_value(field, 'Hello World!')
We can retrieve the value we just set with the same field we used to set the value (or one that looks like it).
>>> record.get_value(field)
'Hello World!'
It isn’t necessary to use the exact same OEField object to retrieve our value. Any OEField object with the same name and type will do.
>>> other_field = OEField('My First Field', Types.String)
>>> record.get_value(other_field)
'Hello World!'
Besides datarecord.OERecord.get_value and datarecord.OERecord.set_value, other basic access to data is provided by the methods datarecord.OERecord.has_value and datarecord.OERecord.clear_value. Similar access to field information on records is provided by datarecord.OERecord.get_field, datarecord.OERecord.add_field, datarecord.OERecord.has_field, and datarecord.OERecord.delete_field.
>>> record.has_value(field)
True
>>> record.clear_value(field) # removes the value, but not the field
>>> record.has_value(field)
False
>>> record.has_field(field)
True
>>> record.get_value(field) # returns None, because the value was cleared
>>> record.delete_field(field)
>>> record.has_field(field)
False
>>> record.get_value(field) # also returns None, because the field no longer exists
>>>
What happens if we try to set a numerical value using our ‘My First Field’ field?
>>> record.set_value(field, 12.4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "..../datarecord/datarecord.py", line 433, in set_value
    self._set(field, value, by_value=True)
  File "..../datarecord/datarecord.py", line 469, in _set
    format(value, type_name))
ValueError: Invalid value '12.4' for field type String
Why did this fail? Because our field was created with a Types.String argument, which enforces that only string data can be stored for that field. OERecord is very fussy about types, and for a good reason. One of the most challenging aspects of data handling is knowing the specific types of data you are working with, and being consistent with those types. How many times have you worked with spreadsheet columns that had a mixture of strings and numbers, and how fun is it to sort that out?
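The type check can be pictured with a small stand-in written in plain Python. This is only a sketch of the idea, not how datarecord is implemented: ToyField and ToyRecord are hypothetical names, and Python's built-in str and float stand in for Types.String and Types.Float.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyField:
    """Hypothetical stand-in for OEField: a name plus a Python type."""
    name: str
    py_type: type


class ToyRecord:
    """Hypothetical stand-in for OERecord that checks types on set."""

    def __init__(self):
        self._values = {}  # maps field name -> (field, value)

    def set_value(self, field, value):
        # Reject values whose type does not match the field's declared type
        if not isinstance(value, field.py_type):
            raise ValueError(
                f"Invalid value {value!r} for field type {field.py_type.__name__}")
        self._values[field.name] = (field, value)

    def get_value(self, field):
        # Any field with the same name and type retrieves the value
        stored = self._values.get(field.name)
        if stored is None or stored[0] != field:
            return None
        return stored[1]


toy_record = ToyRecord()
greeting = ToyField('My First Field', str)
toy_record.set_value(greeting, 'Hello World!')

# An equal-looking field works for retrieval
same_greeting = ToyField('My First Field', str)
assert toy_record.get_value(same_greeting) == 'Hello World!'

# A float is rejected by the string field
try:
    toy_record.set_value(greeting, 12.4)
except ValueError as err:
    print(err)
```

Comparing fields by value (name and type) rather than by identity is what lets a second, equal-looking field retrieve the stored value.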
To put another value onto our record, we’ll create another OEField object. This time, let’s make it a floating point number.
>>> score_field = OEField('Score', Types.Float)
>>> record.set_value(score_field, 21.4)
>>> record.get_value(score_field)
21.4
>>> record
OERecord (
My First Field(String) : Hello World!
Score(Float) : 21.4
)
We can see the contents of the record by printing it. Notice that when you create an OEField object, there is no argument for an OERecord object. The field objects are completely independent of the records they are used with. This allows us to do something interesting.
property_field = OEField('My Property', Types.Float)
mol_field = OEField('Molecule', Types.Chem.Mol)

def my_function(record):
    mol = record.get_value(mol_field)  # pulls a molecule off of the record
    value = go_calculate_a_property(mol)
    record.set_value(property_field, value)
In this example, there are two OEField objects that are created and reused on any number of records. We don’t need to create field objects for each record; we create them once and reuse them. This pattern helps to create uniform data sets, because every record with a ‘My Property’ field on it will have values of identical, known type. This becomes important in a Floe environment, where data needs to be communicated between processes, and in Orion, where data consistency within columns in a dataset is very useful for all kinds of analyses.
Limitations
There are a few limitations to OERecord you should know about.
- Fields on a record are unique by name. If you set a value on a record using a field whose name matches an existing field, but whose type is different, you’ll get a warning that the type of an existing field was changed.
- Only the UTF-8 encoding scheme for strings is supported.
- Floating point values are stored as 64-bit double-precision numbers.
- Int types are signed 64-bit values. If you need larger ints, Types.ReallyBigInt should do the trick. It handles integers up to 1023 bits long. That is really big.
- Records are limited to 140 characters. Just kidding! Records, and the individual data items on them, are essentially unlimited in size. However, keep in mind that records must fit into the memory of the machines they are used on.
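The integer limits above can be checked in plain Python before choosing a field type. The helper below is a sketch: choose_int_type is a hypothetical name, and reading “up to 1023 bits” as a limit on the bit length of the value's magnitude is an assumption, not something stated here.

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1


def choose_int_type(value):
    """Return the name of the smallest type assumed to hold value."""
    if INT64_MIN <= value <= INT64_MAX:
        return 'Types.Int'
    # Assumption: the 1023-bit limit applies to the magnitude of the value
    if value.bit_length() <= 1023:
        return 'Types.ReallyBigInt'
    raise OverflowError('value needs more than 1023 bits')


print(choose_int_type(2**63 - 1))  # largest signed 64-bit value
print(choose_int_type(2**63))      # one past the 64-bit range
```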
Types
We’ve seen string fields, numeric fields, and molecule fields. For a full list of available types, see the oechem documentation list of types.
Metadata
There are really three types of information that an OEField can have. In addition to a name and a data type, a field can hold metadata, or additional “data about the data”. Metadata can be used to convey additional information about the values in a field. For example, the constant Meta.Hints.Categorical can be used to indicate that the values from a specific field are categorical. Meta.Units.Mass.mg indicates that the values from a field are in milligrams. Other metadata values can be used to express relationships between different fields on a record, display hints for representing values, interpretation information, and other aspects of your data. The available metadata values can be seen in the reference documentation for the Meta class.
The OEField constructor method takes an optional argument, an OEFieldMeta object, which holds two types of metadata. Options are metadata keys without values, and attributes are metadata keys with values. To create an OEField object with metadata, you first construct an OEFieldMeta object, assign some options or attributes to it, then pass it to the OEField constructor. As with data types, metadata values are intended to be uniform for entire columns of data.
>>> from datarecord import OEFieldMeta, Meta
>>> meta = OEFieldMeta()
>>> meta.set_option(Meta.Hints.Chem.InChiKey)
>>> meta.set_option(Meta.Display.Hidden)
>>> inchi_key_field = OEField('InChi Key', Types.String, meta=meta)
>>> record.set_value(inchi_key_field, 'RYYVLZVUVIJVGH-VMIGTVKRSA-N')
In the above example, a String field is created with two pieces of metadata. The first indicates that the value should be interpreted as an InChi key. The second is a hint to a user interface that this column should be hidden by default.
The Meta class is a large, hierarchical collection of constants that can be used in field metadata. You can explore the hierarchy in a Python interpreter.
>>> from datarecord import Meta
>>> Meta.list()
Categories: Annotation, Constraints, Display, Flags, Hints, Limits, Relations, Source, Units
>>> Meta.Units.list()
Categories: Concentration, Energy, Frequency, Length, Mass, Temp, Time, Volume
>>> Meta.Units.Mass.list()
Values: g, kg, mg, ng, ug
>>> Meta.find('kg')
Units.Concentration.mg_per_kg, Units.Mass.kg
Special Handling of Units
As shown above, it is possible to specify the units of a field using an OEFieldMeta object. There is a shortcut for specifying units metadata for an OEField object, using the units named parameter.
>>> field = OEField('Mass', Types.Float, units=Meta.Units.Mass.kg)
>>> field.get_meta().to_dict()
{"options": ["Units.Mass.kg"]}
>>> # or, even more simply:
>>> field_2 = OEField('Mass', Types.Float, units='kg')
>>> field_2.get_meta().to_dict()
{"options": ["Units.Mass.kg"]}
Does this mean you can specify anything for the units string? Sadly, no. The string you provide must match one of the constants in the Meta.Units class. Using any unrecognized units will raise an exception.
Working with Molecules
As we’ve seen, molecules can be stored on records along with other data types. An OERecord may contain one molecule field, many molecule fields, or no molecule fields at all. Molecules don’t get any special treatment; they are just like every other piece of data on a record.
There are many cases where it is convenient to assume that a record has a single molecule of interest, and that the other data on the record is mainly about this molecule. There is a metadata option, Meta.Hints.Chem.PrimaryMol, that can be used to mark a field as the “primary molecule” on a record, but finding this field can be tricky if you’ve received a record and don’t know the name of the special molecule field. There is a subclass of OERecord, called OEMolRecord, which makes this and other handling of molecules easier. OEMolRecord also has methods for dealing with data associated with conformers, atoms, and bonds.
The example below illustrates three different ways to set and get a primary molecule on a record, starting with the most tedious approach.
from datarecord import (OERecord, OEMolRecord, OEField, OEFieldMeta,
                        Types, Meta, OEPrimaryMolField)

# Approach 1: Create a field with the right type and metadata
def approach_1(record):
    meta = OEFieldMeta()
    meta.set_option(Meta.Hints.Chem.PrimaryMol)
    mol_field = OEField('Molecule', Types.Chem.Mol, meta=meta)
    mol = record.get_value(mol_field)
    modify_molecule_somehow(mol)
    record.set_value(mol_field, mol)

# Approach 2: Use the magic OEPrimaryMolField object, which finds the
# "primary molecule" field from metadata
def approach_2(record):
    mol = record.get_value(OEPrimaryMolField())
    modify_molecule_somehow(mol)
    record.set_value(OEPrimaryMolField(), mol)

# Approach 3: Use convenience methods on OEMolRecord. This method takes
# an OEMolRecord instead of an OERecord
def approach_3(mol_record):
    mol = mol_record.get_mol()
    modify_molecule_somehow(mol)
    mol_record.set_mol(mol)
Conformer Data
The OEMolRecord gives convenient access to the primary molecule on a record, but much of the specialization of the molecule record is to provide a way to keep data for conformers.
The general concept is that each conformer can get its own OERecord object, which can be filled with whatever data needs to be associated with that conformer. OEMolRecord has methods to set, get, check for, and delete these special conformer records. The following code illustrates the process of getting a molecule from a record, assigning data to its conformers, and updating the record.
from datarecord import OEMolRecord, OEField, Types

score_field = OEField('Score', Types.Float)

def set_data_for_conformers(mol_record):
    mol = mol_record.get_mol()
    for conf in mol.GetConfs():
        value = calculate_some_score(conf)
        # Get a record containing data for this conformer
        conformer_data = mol_record.get_conf_record(conf)
        # Add some data to that record
        conformer_data.set_value(score_field, value)
        # Put the conformer's record back onto the parent record
        mol_record.set_conf_record(conf, conformer_data)
    mol_record.set_mol(mol)  # THIS IS IMPORTANT. SEE BELOW.
In this example, we stored just one piece of data for each conformer, but the conformer’s record can hold any number of fields.
Note
In the above example, it is imperative that the molecule be set on the record after setting conformer data. The gory details are that two things are happening that you don’t see. First, the get_mol method, like get_value, returns a copy of the molecule from the record. Second, the set_conf_record method secretly marks the conformer objects (on the copied molecule) with bookkeeping information. If the molecule is not re-added to the record, the bookkeeping information is lost, and the conformer data will not be retained on the record.
Another solution is to use the get_mol_reference method, instead of get_mol. This returns the actual molecule from the record and not a copy.
Atom and Bond Data
The good news here is that atom and bond data are handled exactly the same way as conformer data. Substitute “atom” or “bond” for “conf” above, and everything should just work.
Advanced Topics
References vs. Copies
The datarecord.OERecord.get_value and datarecord.OERecord.set_value methods, when used with fields containing object types, such as molecules and records, create copies of the values on the record. This has implications for both performance and behavior in certain cases. Some types of objects are relatively expensive to copy, and when the cost of making a copy is comparable to the length of a fast calculation, this expense can be a concern.
An alternative to using datarecord.OERecord.get_value is to use datarecord.OERecord.get_reference. Instead of returning an object copy, this method returns a reference. While using references can avoid copy operations, thereby improving performance, there are some caveats to their use. Modifying an object after getting its reference on a record is actually modifying the object on the record. Forgetting this fact can lead to surprises. Doing the following:
molecule = record.get_reference(mol_field)
...
molecule.DeleteConfs()
will leave a molecule on the record without any conformers.
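The difference between the two access styles can be reproduced with ordinary Python objects. The sketch below is not datarecord code: ToyStore is a hypothetical container whose get_value returns a deep copy and whose get_reference returns the stored object itself, mirroring the behavior described above, with a list of strings standing in for a molecule's conformers.

```python
import copy


class ToyStore:
    """Hypothetical container contrasting copy and reference access."""

    def __init__(self):
        self._data = {}

    def set_value(self, name, value):
        # Store a copy so later changes by the caller do not leak in
        self._data[name] = copy.deepcopy(value)

    def get_value(self, name):
        # Return a copy; edits to it do not affect the stored object
        return copy.deepcopy(self._data[name])

    def get_reference(self, name):
        # Return the stored object itself; edits change the store
        return self._data[name]


store = ToyStore()
store.set_value('confs', ['conf_a', 'conf_b'])

copied = store.get_value('confs')
copied.clear()  # only the copy is emptied
assert store.get_value('confs') == ['conf_a', 'conf_b']

referenced = store.get_reference('confs')
referenced.clear()  # the stored list itself is emptied
assert store.get_value('confs') == []
```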
Examining Fields on Records
Three methods exist for examining fields on a record. Two of them are on OERecord and one is on OEField.
datarecord.OERecord.get_field can be used to retrieve a field object from a record by name. datarecord.OERecord.get_fields is a generator method yielding copies of all the OEField objects from a record. Both methods have an include_meta parameter to indicate whether the field metadata should be included on the returned field objects.
OEField has datarecord.OEField.get_meta and datarecord.OEField.set_meta methods, which return or set the field’s metadata in an OEFieldMeta object.
Match Fields
Sometimes it is desirable to locate fields on a record when you know the data type of the field but not the name. A subclass of OEField, called OEMatchField, allows retrieval of data based on its type and its metadata. An OEMatchField acts as a surrogate for the first matching field on a record. The following example accesses data for a String field with the Meta.Flags.ID metadata in two different ways.
from datarecord import OERecord, Meta, Types, OEMatchField, OEFieldMeta

# First the hard way
def get_ID_value(record):
    for field in record.get_fields(include_meta=True):
        if field.get_type() is Types.String:
            meta = field.get_meta()
            if meta.has_option(Meta.Flags.ID):
                return record.get_value(field)
    return None

# Now the easier way
meta = OEFieldMeta(options=Meta.Flags.ID)
id_field = OEMatchField('ID', Types.String, meta=meta)

def get_ID_value(record):
    return record.get_value(id_field)
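Conceptually, a match field selects the first field whose type and metadata satisfy the request. A plain-Python sketch of that selection rule follows. The function first_match and the tuple layout are hypothetical, and the rule that a field matches when it carries at least the requested options is an assumption, not a description of how OEMatchField is implemented.

```python
def first_match(fields, wanted_type, wanted_options):
    """Return the name of the first field matching type and options.

    fields is a list of (name, type_name, options) tuples, where
    options is a set of metadata option names.
    """
    for name, type_name, options in fields:
        # Assumed rule: same type, and all requested options present
        if type_name == wanted_type and wanted_options <= options:
            return name
    return None


toy_fields = [
    ('Score', 'Float', set()),
    ('Title', 'String', set()),
    ('Compound ID', 'String', {'Flags.ID'}),
]

print(first_match(toy_fields, 'String', {'Flags.ID'}))
```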
Custom Data Types
There may be cases where you need to put a custom data type onto a record. The preferred way to do this is to create a handler class derived from CustomHandler. The derived class must implement four methods:
- datarecord.CustomHandler.get_name returns the name of your custom type.
- datarecord.CustomHandler.validate is called by the set_value method, and should return True if the object passed is appropriate for your custom type.
- datarecord.CustomHandler.serialize converts your custom object into bytes.
- datarecord.CustomHandler.deserialize creates an instance of your custom object from bytes.
Optionally, you can implement a datarecord.CustomHandler.copy method to return a copy of your object. If you don’t provide an implementation, the record will use a default (and possibly slow) implementation.
Once you have created your custom handler class, you can pass it to the OEField constructor as you would any other type, to create a field capable of storing your custom data type onto a record.
The following code creates a custom type handler for a Python dict object.
from datarecord import OERecord, OEField, Types, CustomHandler
import json

class MyHandler(CustomHandler):

    @staticmethod
    def get_name():
        return 'DictType'

    @staticmethod
    def serialize(thing):
        # This method must convert my object (dict) into bytes
        return json.dumps(thing).encode()

    @staticmethod
    def deserialize(byte_data):
        # Create my object (dict) from bytes
        return json.loads(byte_data.decode())

    @classmethod
    def validate(cls, thing):
        if not isinstance(thing, dict):
            return False
        # Make sure our object will serialize
        try:
            MyHandler.serialize(thing)
            return True
        except (TypeError, ValueError):
            # json.dumps raises TypeError for values it cannot encode
            return False
We can now use our custom handler to read and write a dict object:
>>> record = OERecord()
>>> field = OEField('My Custom Field', MyHandler)
>>> test_object = {'question': 'what is your quest?'}
>>> record.set_value(field, test_object)
>>> record.get_value(field)
{'question': 'what is your quest?'}
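Because the handler's serialization relies only on the standard json module, its behavior can be checked without a record. The sketch below repeats the same two conversions as free functions (to_bytes and from_bytes are hypothetical names) and shows two cases worth knowing about: json.dumps raises TypeError for values it cannot encode, and non-string dict keys come back as strings.

```python
import json


def to_bytes(thing):
    # Same conversion as the handler's serialize method
    return json.dumps(thing).encode()


def from_bytes(byte_data):
    # Same conversion as the handler's deserialize method
    return json.loads(byte_data.decode())


original = {'question': 'what is your quest?'}
assert from_bytes(to_bytes(original)) == original

# Values json cannot encode (here, a set) raise TypeError
try:
    to_bytes({'tags': {'a', 'b'}})
except TypeError as err:
    print('cannot serialize:', err)

# Integer keys are converted to strings by JSON
assert from_bytes(to_bytes({1: 'one'})) == {'1': 'one'}
```

A dict with non-string keys therefore passes serialization but does not round-trip to an equal object, which may matter if your validate method is meant to guarantee a faithful round trip.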
Missing Values
By default, OEField objects will not store None values on a record. There are cases, however, where a distinction is made between a value that was never set on a record and a value set to None. OEField has a keyword argument, nullable, that imparts some special behavior to the field.
- A nullable field can be used to set a None value on a record. (A non-nullable field will raise an exception if you try this.)
- A nullable field can retrieve a None value from a record.
- If a null value is written with a nullable field, and read with a non-nullable field, the result will be None for Int, Bool, and ReallyBigInt types or a default value for most other types. See the following table.
- None values in list types (Types.StringVec, Types.FloatVec, etc.) are treated similarly to scalar types on read. A missing list value will be converted to a default value when read by a non-nullable field.
Type Name | Return Value |
---|---|
Types.Int | None |
Types.Float | NaN (math.nan) |
Types.Bool | None |
Types.String | An empty string |
Types.Range | The range [-inf, inf] |
Types.ReallyBigInt | None |
Types.Chem.Mol | An empty OEMol object |
Types.Chem.Surface | An empty OESurface object |
Types.Chem.Grid | An empty OEScalarGrid object |
Types.Chem.FingerPrint | An empty OEFingerPrint object |
Types.Record | An empty OERecord object |
(lists of the above types) | An empty list object [] |
What is the difference between a value that was never set on a field and a value that was set to None? For a nullable field, a record’s datarecord.OERecord.get_value method will return None in either case, but other methods can be used to tell the difference.
Method | Normal | Unset | Null |
---|---|---|---|
Nullable Fields ||||
record.has_value(field) | True | False | True |
record.get_value(field) | (Value) | None | None |
record.is_na(field) | False | False | True |
Non-nullable Fields ||||
record.has_value(field) | True | False | True |
record.get_value(field) | (Value) | None | (Default values, see above) |
record.is_na(field) | False | False | True |
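The read rules in the two tables can be summarized in a few lines of plain Python. This is a sketch of the documented behavior for a handful of types, not datarecord code: read_value, NON_NULLABLE_DEFAULTS, and the UNSET sentinel are hypothetical names.

```python
import math

UNSET = object()  # hypothetical marker for "never set"

# Subset of the defaults table, keyed by type name
NON_NULLABLE_DEFAULTS = {
    'Int': lambda: None,
    'Bool': lambda: None,
    'Float': lambda: math.nan,
    'String': lambda: '',
    'StringVec': lambda: [],
}


def read_value(stored, type_name, nullable):
    """Mimic get_value for a stored value that may be UNSET or None."""
    if stored is UNSET:
        return None  # an unset value reads as None either way
    if stored is None and not nullable:
        return NON_NULLABLE_DEFAULTS[type_name]()
    return stored  # a normal value, or None for a nullable field


assert read_value(UNSET, 'String', nullable=False) is None
assert read_value(None, 'String', nullable=True) is None
assert read_value(None, 'String', nullable=False) == ''
assert math.isnan(read_value(None, 'Float', nullable=False))
```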