Tagged Data Manipulation

SD Tagged Data Manipulation

Meta information about a molecule is stored in what is known as ‘tagged data’. The most common example is the data fields found in SD files. Because SD files are a common format for storing data and transferring it from one system to another, OEChem provides several functions to manipulate this data. OEChem also supports round-tripping SD data through the OEBinary (.oeb) and comma-separated-values (.csv) file formats.

A simple class, OESDDataPair, is used to set and retrieve SD data.

Data   Set method              Get method
tag    OESDDataPair.SetTag     OESDDataPair.GetTag
data   OESDDataPair.SetValue   OESDDataPair.GetValue

The following functions provide access to the SD data.

Functions to manipulate SD data
Function          Description
OESetSDData       set a tag and value data pair
OEAddSDData       add a tag and value data pair
OEHasSDData       determine whether a molecule has data with a given tag
OEGetSDData       get the value for the given tag
OEGetSDDataPairs  return an iterator over all the SD data pairs of the molecule
OECopySDData      copy the entire set of SD data from one molecule to another
OEDeleteSDData    delete all SD data items with the given tag
OEClearSDData     clear all SD data from a molecule

Since OESDDataPair stores SD data as a string, numeric values have to be converted to a string with Python’s str() function before they are stored with either the OESetSDData or the OEAddSDData function.

OESetSDData(mol, "number of atoms", str(mol.NumAtoms()))

Similarly, the retrieved string can be converted back to a numeric value with the int() or float() function.

if OEHasSDData(mol, "weight"):
    weight = float(OEGetSDData(mol, "weight"))
    print ("weight=", weight)
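
A stored value is not guaranteed to be numeric, in which case float() raises ValueError. The following is a minimal sketch of a defensive conversion. The helper name sd_value_to_float is hypothetical and not part of OEChem; it works on the plain string that OEGetSDData returns.

```python
def sd_value_to_float(text, default=None):
    """Convert an SD data string to a float, or return `default` if it is not numeric.

    Hypothetical helper (not part of OEChem); `text` stands for the string
    that OEGetSDData would return for a tag.
    """
    try:
        return float(text.strip())
    except ValueError:
        return default


print(sd_value_to_float("78.11"))   # numeric string
print(sd_value_to_float(" 12 "))    # surrounding whitespace is ignored
print(sd_value_to_float("n/a"))     # non-numeric string falls back to the default
```

With OEChem available, it could be called as sd_value_to_float(OEGetSDData(mol, "weight")).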

The following example shows how to manipulate SD tagged data.

Listing 1: SD data manipulation

#!/usr/bin/env python
from __future__ import print_function
from openeye.oechem import *


def DumpSDData(mol):
    print ("SD data of", mol.GetTitle())
    #loop over SD data
    for dp in OEGetSDDataPairs(mol):
        print (dp.GetTag(), ':', dp.GetValue())
    print ()

mol = OEGraphMol()
OESmilesToMol(mol, "c1ccccc1")
mol.SetTitle("benzene")

# set some tagged data
OESetSDData(mol, "color", "brown")
OESetSDData(mol, OESDDataPair("size", "small"))
DumpSDData(mol)

# check for existence of data, then delete it
if OEHasSDData(mol, "size"):
    OEDeleteSDData(mol, "size")
DumpSDData(mol)

# add additional color data
OEAddSDData(mol, "color", "black")
DumpSDData(mol)

# remove all SD data
OEClearSDData(mol)
DumpSDData(mol)

Note

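If data with the same tag already exists, OESetSDData replaces the value of the first SD data item with that tag, whereas OEAddSDData adds a new SD data item with the same tag.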

The output of the preceding program is the following:

SD data of benzene
color : brown
size : small

SD data of benzene
color : brown

SD data of benzene
color : brown
color : black

SD data of benzene
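
The third block of the output shows that adding data can leave two entries with the same tag. The following pure-Python toy model sketches that behavior. It is an illustration only, not the OEChem implementation: the class and method names are hypothetical, and the "set" semantics (overwrite the first matching pair, otherwise append) are an assumption.

```python
class TaggedData(object):
    """Toy ordered list of (tag, value) pairs; not the OEChem implementation."""

    def __init__(self):
        self.pairs = []

    def set_data(self, tag, value):
        # Assumed OESetSDData-like behavior: overwrite the first pair with
        # this tag, or append a new pair if the tag is absent.
        for i, (t, _) in enumerate(self.pairs):
            if t == tag:
                self.pairs[i] = (tag, value)
                return
        self.pairs.append((tag, value))

    def add_data(self, tag, value):
        # OEAddSDData-like behavior: always append, so duplicate tags are allowed.
        self.pairs.append((tag, value))

    def delete_data(self, tag):
        # OEDeleteSDData-like behavior: drop every pair with this tag.
        self.pairs = [(t, v) for (t, v) in self.pairs if t != tag]


data = TaggedData()
data.set_data("color", "brown")
data.add_data("color", "black")   # second "color" entry, as in the output above
data.set_data("color", "red")     # under the assumption, only the first entry changes
print(data.pairs)
```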

Note

Note that SD tagged data is specific to MDL’s SD file format. Any SD data added to a molecule will only be written out to SD files, OEBinary files, or CSV files. The SD data fields will only be filled when reading from SD files that contain SD tagged data, CSV data files, or from OEBinary files previously created to contain this data.

See also

CSV File Format

The CSV (comma-separated-values) file format is a plain-text file format used for data exchange in a wide variety of software. Over the years, the popularity of the format has outstripped the robustness of many of its implementations, leading to subtle differences in how it is implemented by different software packages. This led to the creation and adoption of RFC 4180 as a standard for the CSV format. OEChem supports this standard as the base format for reading and writing molecules (OEMolBase objects) to .csv files. OEChem will read and write .csv files in the following layout:

  1. The first line can optionally contain a header to use as the SD tag.
    • OEChem decides if the first line is a header by trying to interpret the first column of the first line as SMILES. See the OEIFlavor_CSV_Header flavor.
  2. The first column is expected to be a SMILES string representing the molecule.
  3. The second column is the molecule title, as accessed through OEMolBase.GetTitle.
  4. Remaining columns will be read into SD data fields. The data is then accessible through OEGetSDData and OESetSDData.
    • If the file did not contain a header line, the SD tags will be named in the following pattern: OE_CSV_COLUMN_1, OE_CSV_COLUMN_2, ..., OE_CSV_COLUMN_N.
  5. New-line characters are not supported inside a CSV record. They are round-tripped using the pipe character, |, instead.
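
The column layout above can be sketched with Python's standard csv module. This is only an illustration of the layout, not how OEChem reads the file: the function name read_csv_records is hypothetical, a header line is assumed to be present (no header detection is attempted), and pipe-encoded line breaks are left as they are.

```python
import csv
import io


def read_csv_records(text):
    """Parse CSV text laid out as SMILES, title, then data columns.

    Assumes the first line is a header; header detection is not attempted.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    tags = header[2:]                        # header names of the data columns
    records = []
    for row in body:
        records.append({
            "smiles": row[0],                # column 1: SMILES string
            "title": row[1],                 # column 2: molecule title
            "sd": list(zip(tags, row[2:])),  # remaining columns: (tag, value) pairs
        })
    return records


text = 'SMILES,TITLE,phases\nO,water,"gas,liquid,solid"\n'
for record in read_csv_records(text):
    print(record)
```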

Point #5 is a subtlety of meta-data interchange between SD files and CSV files. Many CSV implementations do not support line breaks inside CSV fields: Microsoft Excel does not, and neither does OEChem. However, it is common for an SD data field to span multiple lines, as in the following:

>  <activ_class>
Antiarthritic
Antiinflammatory
Cyclooxygenase-2 Inhibitor

OEChem adopts a solution to this problem that ChemAxon has implemented and documented. The solution is to use the pipe character, |, to represent line breaks inside a data field. For example, the multi-line SD data field shown above would look like the following:

SMILES,TITLE,activ_class
...,...,Antiarthritic|Antiinflammatory|Cyclooxygenase-2 Inhibitor

This raises a different problem for pipe characters inside data fields: how are pipe characters escaped? OEChem works around this problem by applying the same semantics used for CSV parsing to multi-line fields, with the pipe character as the separator and the single quote character as the quoting character. Listing 2 demonstrates an example that writes out a tricky set of data in the OEFormat_CSV format.

Listing 2: Writing out SD data as CSV

#!/usr/bin/env python
from openeye.oechem import *

ofs = oemolostream()
ofs.SetFormat(OEFormat_CSV)

mol = OEGraphMol()
OESmilesToMol(mol, "O")
mol.SetTitle("water")

OESetSDData(mol, "phases", "gas,liquid,solid")
OESetSDData(mol, "uses", "bathing\nwater guns\ntea|earl grey|hot")

OEWriteMolecule(ofs, mol)

The output of Listing 2 will look like the following:

SMILES,TITLE,phases,uses
O,water,"gas,liquid,solid",bathing|water guns|'tea|earl grey|hot'
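
The pipe-and-single-quote scheme can be imitated with Python's standard csv module by using | as the delimiter and ' as the quote character. This is a sketch of the semantics described above, not OEChem's actual code, and the function names encode_multiline and decode_multiline are hypothetical.

```python
import csv
import io


def encode_multiline(value):
    """Join the lines of `value` with '|', single-quoting lines that contain '|'."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="|", quotechar="'", lineterminator="\n")
    writer.writerow(value.split("\n"))
    return buf.getvalue()[:-1]   # drop the trailing line terminator


def decode_multiline(field):
    """Reverse encode_multiline: split on unquoted '|' and rejoin with newlines."""
    reader = csv.reader([field], delimiter="|", quotechar="'")
    return "\n".join(next(reader))


uses = "bathing\nwater guns\ntea|earl grey|hot"
encoded = encode_multiline(uses)
print(encoded)
print(decode_multiline(encoded) == uses)
```

For this input the encoded string matches the last field of the Listing 2 output, and decoding it restores the original three lines.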

PDB Tagged Data Manipulation

The OEPDBDataPair class is used to set and retrieve PDB data pairs.

Data   Set method               Get method
tag    OEPDBDataPair.SetTag     OEPDBDataPair.GetTag
data   OEPDBDataPair.SetValue   OEPDBDataPair.GetValue

If you wish to store a numeric value, use Python’s str() function to convert it to a string, and then use int() or float() on the value when retrieving the data (see the examples in SD Tagged Data Manipulation).

The following functions provide access to the PDB data.

Functions to manipulate PDB data
Function           Description
OESetPDBData       set a tag and value data pair
OEAddPDBData       add a tag and value data pair
OEHasPDBData       determine whether a molecule has data with a given tag
OEGetPDBData       get the value for the given tag
OEGetPDBDataPairs  return an iterator over all the PDB data pairs of the molecule
OECopyPDBData      copy the entire set of PDB data from one molecule to another
OEDeletePDBData    delete all PDB data items with the given tag
OEClearPDBData     clear all PDB data from a molecule

Note

In the case of PDB header items like REMARK, each line is treated as a separate instance. Therefore, these multiple lines have to be added with OEAddPDBData and can be accessed via OEGetPDBDataPairs.
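
Because a tag such as REMARK can occur many times, collecting its values means filtering the data pairs by tag. The sketch below does this for any iterable of objects that provide GetTag() and GetValue(), which is the interface of the pairs returned by OEGetPDBDataPairs. The helper name values_for_tag, the FakePair stand-in class, and the sample values are hypothetical and exist only so that the sketch runs without OEChem.

```python
from collections import namedtuple


def values_for_tag(pairs, tag):
    """Return the values of every data pair whose tag equals `tag`, in order.

    `pairs` is any iterable of objects with GetTag() and GetValue() methods,
    such as the iterator returned by OEGetPDBDataPairs(mol).
    """
    return [dp.GetValue() for dp in pairs if dp.GetTag() == tag]


# Stand-in for OEPDBDataPair so the sketch runs without OEChem installed.
class FakePair(namedtuple("FakePair", "tag value")):
    def GetTag(self):
        return self.tag

    def GetValue(self):
        return self.value


# Made-up sample data for illustration only.
pairs = [FakePair("REMARK", "   1 first line"),
         FakePair("HEADER", "    TOXIN"),
         FakePair("REMARK", "   2 second line")]
print(values_for_tag(pairs, "REMARK"))
```

With OEChem available, the call would be values_for_tag(OEGetPDBDataPairs(mol), "REMARK").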

The following PDB fields are stored as tagged PDB data when the OEIFlavor_PDB_DATA input flavor is set:

Imported PDB data fields
AUTHOR CAVEAT COMPND CRYST1 DBREF
EXPDTA FORMUL HEADER HELIX HET
HETNAM HETSYM JRNL KEYWDS MODRES
MTRIX1 MTRIX2 MTRIX3 OBSLTE ORIGX1
ORIGX2 ORIGX3 REMARK REVDAT SCALE1
SCALE2 SCALE3 SEQRES SEQADV SHEET
SITE SOURCE SPRSDE SSBOND TITLE
TURN        

Warning

The tags of PDB data are always 6 characters long and space-padded (for example, "HELIX " and not "HELIX").
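
A small helper can apply the padding so that short record names are not passed unpadded by mistake. This is a minimal sketch; pdb_tag is a hypothetical convenience function and not part of OEChem.

```python
def pdb_tag(name):
    """Return `name` left-justified and space-padded to the 6-character tag width.

    Hypothetical convenience helper, not part of OEChem.
    """
    if len(name) > 6:
        raise ValueError("PDB tags are at most 6 characters long: %r" % name)
    return name.ljust(6)


print(repr(pdb_tag("HELIX")))    # padded with one trailing space
print(repr(pdb_tag("COMPND")))   # already 6 characters, returned unchanged
```

It could then be used as OEHasPDBData(mol, pdb_tag("HELIX")).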

The following example shows how to manipulate PDB tagged data.

Listing 3: PDB data manipulation

#!/usr/bin/env python
from __future__ import print_function
from openeye.oechem import *
import sys

if len(sys.argv) != 2:
    OEThrow.Usage("%s <pdbfile>" % sys.argv[0])

ifs = oemolistream()
if not ifs.open(sys.argv[1]):
    OEThrow.Fatal("Unable to open %s" % sys.argv[1])

# need to set input flavor to ensure PDB data is stored on molecule
ifs.SetFlavor(OEFormat_PDB, OEIFlavor_Generic_Default |
                            OEIFlavor_PDB_Default | OEIFlavor_PDB_DATA)

mol = OEGraphMol()
while OEReadMolecule(ifs, mol):
    if OEHasPDBData(mol, "COMPND"):
        print ("COMPND:")
        print (OEGetPDBData(mol, "COMPND"))

    if OEHasPDBData(mol, "HELIX "):
        print ("HELIX:")
        print (OEGetPDBData(mol, "HELIX "))

    if OEHasPDBData(mol, "SSBOND"):
        print ("SSBOND:")
        for dp in OEGetPDBDataPairs(mol):
            if dp.GetTag() == "SSBOND":
                print (dp.GetValue())

The output of the preceding program for 1D1H is the following:

COMPND:
    MOL_ID: 1;
HELIX:
   1   1 THR A   11  ASP A   14  5
SSBOND:
   1 CYS A    2    CYS A   16                          1555   1555
   2 CYS A    9    CYS A   21                          1555   1555
   3 CYS A   15    CYS A   28                          1555   1555

Note

Note that PDB tagged data is specific to the PDB file format. Any PDB data added to a molecule will only be written out to PDB files or OEBinary files.

See also

Multi-conformer molecules

See also

For using tagged data with multi-conformer molecules, see Dude, where’s my SD data?.