Tagged Data Manipulation

SD Tagged Data Manipulation

Meta information about a molecule is stored in what is known as ‘tagged data’. The most common example of this is the data fields found in SD files. Since SD files are a common form of data storage and transfer from one system to another, OEChem TK provides several methods to manipulate this data. OEChem TK also supports round-tripping SD data through the OEBinary (.oeb) and comma-separated-values (.csv) file formats.

A simple class, OESDDataPair, is used to set and retrieve SD data.

Data	Set method	Get method
tag	`OESDDataPair::SetTag`	`OESDDataPair::GetTag`
data	`OESDDataPair::SetValue`	`OESDDataPair::GetValue`

The following functions provide access to the SD data.

**Functions to manipulate SD data**
Function	Description
`OEIsSDDataFormat`	determines whether a file format supports SD data
`OESetSDData`	set a tag and value data pair
`OEAddSDData`	add a tag and value data pair
`OEHasSDData`	determine whether a molecule has data with a given tag
`OEGetSDData`	get the value for the given tag
`OEGetSDDataPairs`	return an iterator over all the SD data pairs of the molecule
`OECopySDData`	copy the entire set of SD data from one molecule to another
`OEDeleteSDData`	delete all SD data items with the given tag
`OEClearSDData`	clear all SD data from a molecule

Since OESDDataPair stores SD data as strings, numeric values have to be converted to a string with the OENumberToString function before they are stored with either the OESetSDData or the OEAddSDData function.

const string tag = "number of atoms";
OESetSDData(mol, tag, OENumberToString(mol.NumAtoms()));

Similarly, the retrieved string can be converted to a numeric value with the OEStringToNumber function.

if (OEHasSDData(mol, "weight"))
{
  double weight;
  OEStringToNumber(OEGetSDData(mol, "weight"), weight);
  cout << "weight= " << weight << endl;
}

The following example shows how to manipulate SD tagged data.

Listing 1: SD data manipulation

#include <openeye.h>
#include <iostream>
#include <oechem.h>

using namespace std;
using namespace OESystem;
using namespace OEChem;

void DumpSDData(const OEMolBase& mol)
{
  cout << "SD data of " << mol.GetTitle() << endl;
  for (OEIter<const OESDDataPair> dp = OEGetSDDataPairs(mol); dp; ++dp)
    cout << dp->GetTag() << " : " << dp->GetValue() << endl;
  cout << endl;
}

int main()
{
  OEGraphMol mol;
  OESmilesToMol(mol,"c1ccccc1");
  mol.SetTitle("benzene");

  // set some sd data
  const string tag_color = "color";
  OESetSDData(mol, tag_color, "brown");
  const string tag_size = "size";
  OESetSDData(mol, OESDDataPair(tag_size, "small"));
  DumpSDData(mol);

  // check for existence of data, then delete it
  if (OEHasSDData(mol, tag_size))
    OEDeleteSDData(mol, tag_size);
  DumpSDData(mol);

  // add additional color data
  OEAddSDData(mol, tag_color, "black");
  DumpSDData(mol);

  // remove all SD data
  OEClearSDData(mol);
  DumpSDData(mol);

  return 0;
}

Note

If data with the same tag already exists:

- it is replaced when using the OESetSDData function
- a second entry with the same tag is added when using the OEAddSDData function

The output of the preceding program is the following:

SD data of benzene
color : brown
size : small

SD data of benzene
color : brown

SD data of benzene
color : brown
color : black

SD data of benzene
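
The replace-versus-append behavior described in the note above can be modeled without the toolkit. The following is a minimal stand-alone sketch, not the OEChem TK implementation: TaggedData, SetData, and AddData are hypothetical names, and the choice to overwrite only the first matching entry in place is an assumption made for illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-alone model of tagged data: an ordered list of (tag, value) pairs.
// Hypothetical type for illustration only, not an OEChem TK class.
using TaggedData = std::vector<std::pair<std::string, std::string>>;

// Append a pair unconditionally, so duplicate tags can accumulate
// (models the documented behavior of OEAddSDData).
void AddData(TaggedData& data, const std::string& tag, const std::string& value)
{
  data.emplace_back(tag, value);
}

// Replace the value of the first pair carrying the tag, or append a new pair
// if the tag is absent (models the documented behavior of OESetSDData).
// Assumption: only the first match is overwritten, and it keeps its position.
void SetData(TaggedData& data, const std::string& tag, const std::string& value)
{
  for (std::pair<std::string, std::string>& entry : data)
  {
    if (entry.first == tag)
    {
      entry.second = value;
      return;
    }
  }
  AddData(data, tag, value);
}
```

In this model, calling SetData twice with the tag "color" leaves a single entry, while a following AddData call with the same tag leaves two, which mirrors the "color : brown" and "color : black" lines in the third dump above.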

Note

SD tagged data is specific to MDL’s SD file format. SD data added to a molecule is written out only to SD files, OEBinary files, and CSV files. Likewise, SD data fields are filled only when reading from SD files that contain SD tagged data, from CSV files, or from OEBinary files previously written with this data.

See also

Generic Data chapter

Multi-conformer molecules

See also

For information on using tagged data with multi-conformer molecules, see “Dude, where’s my SD data?”.

CSV File Format

The CSV (comma-separated values) file format is a plain-text format used for data exchange by a wide variety of software. Over the years, the popularity of the format has outstripped the robustness of many of its implementations, leading to subtle differences in how different software packages implement it. This led to the creation and adoption of RFC 4180 as a standard for the CSV format. OEChem TK supports this standard as the base format for reading and writing molecules (OEMolBase objects) to .csv files. OEChem TK reads and writes .csv files in the following layout:

1. The first line can optionally contain a header whose column names are used as the SD tags.
   - OEChem TK decides if the first line is a header by trying to interpret the first column of the first line as SMILES. See the OEIFlavor::CSV::Header flavor.
2. The first column is expected to be a SMILES string representing the molecule.
3. The second column is the molecule title, as accessed through OEMolBase::GetTitle.
4. Remaining columns will be read into SD data fields. The data is then accessible through OEGetSDData and OESetSDData.
   - If the file did not contain a header line, the SD tags will be named in the following pattern: OE_CSV_COLUMN_1, OE_CSV_COLUMN_2, …, OE_CSV_COLUMN_N.
5. New-line characters are not supported inside a CSV record. They are round-tripped using the pipe character, |, instead.

Point #5 is a subtlety of metadata interchange between SD files and CSV files. Many CSV implementations do not support line breaks inside CSV fields: Microsoft Excel does not, and neither does OEChem TK. However, it is common for an SD data field to span multiple lines, as in the following:

>  <activ_class>
Antiarthritic
Antiinflammatory
Cyclooxygenase-2 Inhibitor

OEChem TK adopts the solution to this problem that ChemAxon has implemented and documented: the pipe character, |, represents line breaks inside a data field. For example, the multi-line SD data field above would look like the following:

SMILES,TITLE,activ_class
...,...,Antiarthritic|Antiinflammatory|Cyclooxygenase-2 Inhibitor

This raises a different problem for pipe characters inside data fields: how are pipe characters escaped? OEChem TK works around this problem by applying the same semantics used for CSV parsing to multi-line fields, with the pipe character as the separator and the single-quote character as the quoting character. Listing 2 writes out a tricky set of data in the OEFormat::CSV format.

Listing 2: Writing out SD data as CSV

#include <openeye.h>
#include <oechem.h>

using namespace OEChem;

int main(int, char **)
{
  oemolostream ofs;
  ofs.SetFormat(OEFormat::CSV);

  OEGraphMol mol;
  OESmilesToMol(mol, "O");
  mol.SetTitle("water");

  OESetSDData(mol, "phases", "gas,liquid,solid");
  OESetSDData(mol, "uses", "bathing\nwater guns\ntea|earl grey|hot");

  OEWriteMolecule(ofs, mol);

  return 0;
}

The output of Listing 2 will look like the following:

SMILES,TITLE,phases,uses
O,water,"gas,liquid,solid",bathing|water guns|'tea|earl grey|hot'
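
The encoding of the "uses" field in this output can be reproduced with a few lines of standard C++. The following is a minimal sketch of the pipe convention described above, not OEChem TK's own code: QuoteLine and EncodeMultiline are hypothetical helpers, and doubling an embedded single quote (by analogy with how RFC 4180 doubles double quotes) is an assumption. The outer RFC 4180 quoting that produced "gas,liquid,solid" is not modeled.

```cpp
#include <string>

// Quote one line of a multi-line value if it contains the pipe separator or
// a single quote. Embedded single quotes are doubled (an assumption modeled
// on the RFC 4180 rule for double quotes).
std::string QuoteLine(const std::string& line)
{
  if (line.find_first_of("|'") == std::string::npos)
    return line;
  std::string out = "'";
  for (char c : line)
  {
    if (c == '\'')
      out += '\'';
    out += c;
  }
  out += '\'';
  return out;
}

// Encode a multi-line SD value as one pipe-separated string: each '\n'
// becomes '|', and any line that needs protection is single-quoted.
std::string EncodeMultiline(const std::string& value)
{
  std::string out;
  std::string::size_type start = 0;
  while (true)
  {
    const std::string::size_type end = value.find('\n', start);
    const bool last = (end == std::string::npos);
    out += QuoteLine(value.substr(start, last ? std::string::npos : end - start));
    if (last)
      break;
    out += '|';
    start = end + 1;
  }
  return out;
}
```

Applied to the value stored under "uses" in Listing 2, EncodeMultiline returns bathing|water guns|'tea|earl grey|hot', matching the last field of the output above.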

Code Example

Reordering CSV File OpenEye Python Cookbook recipe

PDB Tagged Data Manipulation

The OEPDBDataPair class is used to set and retrieve PDB data pairs.

Data	Set method	Get method
tag	`OEPDBDataPair::SetTag`	`OEPDBDataPair::GetTag`
data	`OEPDBDataPair::SetValue`	`OEPDBDataPair::GetValue`

If you wish to store a numeric value, use the OENumberToString function to convert it to a string, and then use OEStringToNumber on the value when retrieving the data (see examples in SD Tagged Data Manipulation).

The following functions provide access to the PDB data.

Functions to manipulate PDB data
Function	Description
`OESetPDBData`	set a tag and value data pair
`OEAddPDBData`	add a tag and value data pair
`OEHasPDBData`	determine whether a molecule has data with a given tag
`OEGetPDBData`	get the value for the given tag
`OEGetPDBDataPairs`	return an iterator over all the PDB data pairs of the molecule
`OECopyPDBData`	copy the entire set of PDB data from one molecule to another
`OEDeletePDBData`	delete all PDB data items with the given tag
`OEClearPDBData`	clear all PDB data from a molecule

Note

In the case of PDB header items like REMARK, each line is treated as a separate instance. Therefore, multiple lines have to be added with OEAddPDBData and can be accessed via OEGetPDBDataPairs.

The following PDB fields are stored as tagged PDB data when the OEIFlavor::PDB::DATA input flavor is set:

**Imported PDB data fields**
AUTHOR	CAVEAT	COMPND	CRYST1	DBREF
EXPDTA	FORMUL	HEADER	HELIX	HET
HETNAM	HETSYM	JRNL	KEYWDS	MODRES
MTRIX1	MTRIX2	MTRIX3	OBSLTE	ORIGX1
ORIGX2	ORIGX3	REMARK	REVDAT	SCALE1
SCALE2	SCALE3	SEQRES	SEQADV	SHEET
SITE	SOURCE	SPRSDE	SSBOND	TITLE
TURN

Warning

The tags of PDB data are always 6 characters long and space-padded (for example "HELIX " and not "HELIX" ).
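
Because a lookup with an unpadded tag such as "HELIX" would not match, it can help to normalize record names before querying. The following is a small sketch using only the standard library; PadPDBTag is a hypothetical helper, not part of OEChem TK.

```cpp
#include <string>

// Pad a PDB record name with trailing spaces to the six-character width
// used by PDB data tags, e.g. "HELIX" -> "HELIX ". Input that is already
// six characters or longer is returned unchanged.
// PadPDBTag is a hypothetical helper, not an OEChem TK function.
std::string PadPDBTag(std::string tag)
{
  const std::string::size_type width = 6;
  if (tag.size() < width)
    tag.append(width - tag.size(), ' ');
  return tag;
}
```

The padded result could then be passed as the tag argument of the PDB data functions listed above, for example when the record name comes from user input.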

The following example shows how to manipulate PDB tagged data.

Listing 3: PDB data manipulation

#include <openeye.h>
#include <stdlib.h>
#include <string.h>
#include <iostream>
#include <oesystem.h>
#include <oechem.h>

using namespace std;
using namespace OESystem;
using namespace OEChem;

int main(int argc, char *argv[])
{
  if (argc < 2)
    OEThrow.Usage("%s <pdbfile>", argv[0]);

  oemolistream ifs;
  if (!ifs.open(argv[1]))
    OEThrow.Fatal("Unable to open %s", argv[1]);

  // need to set input flavor to ensure PDB data is stored on molecule
  ifs.SetFlavor(OEFormat::PDB,OEIFlavor::Generic::Default |
                OEIFlavor::PDB::Default |
                OEIFlavor::PDB::DATA);

  OEGraphMol mol;
  while (OEReadMolecule(ifs, mol))
  {
    if (OEHasPDBData(mol, "COMPND"))
    {
      cout << "COMPND:" << endl;
      cout << OEGetPDBData(mol, "COMPND") << endl;
    }
    if (OEHasPDBData(mol, "HELIX "))
    {
      cout << "HELIX:" << endl;
      cout << OEGetPDBData(mol, "HELIX ") << endl;
    }
    if (OEHasPDBData(mol, "SSBOND"))
    {
      cout << "SSBOND:" << endl;
      for (OEIter<const OEPDBDataPair> dp = OEGetPDBDataPairs(mol); dp; ++dp)
      {
        if (!strcmp(dp->GetTag(), "SSBOND"))
          cout << dp->GetValue() << endl;
      }
    }
  }

  return 0;
}

The output of the preceding program for 1D1H is the following:

COMPND:
    MOL_ID: 1;
HELIX:
   1   1 THR A   11  ASP A   14  5
SSBOND:
   1 CYS A    2    CYS A   16                          1555   1555
   2 CYS A    9    CYS A   21                          1555   1555
   3 CYS A   15    CYS A   28                          1555   1555

Note

PDB tagged data is specific to the PDB file format. PDB data added to a molecule is written out only to PDB files and OEBinary files.

See also

Generic Data chapter

CIF File Format

The CIF file format is built around a set of keywords, terminology, and structural rules that make it a robust and customizable format for atomic coordinate and biomolecular structure files. It has given rise to several variants: Small Molecule CIF (smCIF), Macromolecular CIF (mmCIF), Chemical Component CIF (ccCIF), and Structure Factor CIF (sfCIF). All CIF variants follow the same general terminology. Below is a brief introduction to this terminology using an mmCIF example; for a more complete explanation of CIF syntax, see the PDBx/mmCIF User Guide from the wwPDB.

The following example comes from the mmCIF file for the 3TPP PDB code.

data_3TPP
#
_entry.id   3TPP
#
_audit_conform.dict_name       mmcif_pdbx.dic
_audit_conform.dict_version    5.397
_audit_conform.dict_location   http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic
#
loop_
_database_2.database_id
_database_2.database_code
_database_2.pdbx_database_accession
_database_2.pdbx_DOI
PDB   3TPP         pdb_00003tpp 10.2210/pdb3tpp/pdb
RCSB  RCSB067767   ?            ?
WWPDB D_1000067767 ?            ?

Data blocks are used to organize an entire set of related data. Because all the data in this CIF file relate to the 3TPP structure, all data in this example are in the same data block. The start of a data block uses the data_ keyword at the start of the line.

CIF categories are a grouping of related information within the data block and all data within the same category will have the same category name.

In the above example, there are three categories: _entry., _audit_conform., and _database_2. (the trailing period is part of each category name). Toolkit function references to category names require these values to be formatted as a leading underscore, followed by the category name, followed by a trailing period. This category name nomenclature ensures consistency and reduces the potential for misidentifying categories with similar names.

Attributes name the different data types of values saved in the category, and the combination of a category and attribute is called a data item. In the example, the _database_2. category contains four attributes: database_id, database_code, pdbx_database_accession, and pdbx_DOI, and combining a category and an attribute results in a data item: _database_2.database_id.
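
The relationship between a data item, its category name, and its attribute can be sketched as a simple string split. This is an illustration only: SplitDataItem is a hypothetical helper rather than a toolkit function, and it assumes the category name ends at the first period.

```cpp
#include <string>
#include <utility>

// Split a CIF data item such as "_database_2.database_id" into its category
// name ("_database_2.", leading underscore and trailing period kept) and its
// attribute ("database_id"). Returns two empty strings if the text has no
// leading underscore or no period.
// SplitDataItem is a hypothetical helper, not an OEChem TK function.
std::pair<std::string, std::string> SplitDataItem(const std::string& item)
{
  const std::string::size_type dot = item.find('.');
  if (item.empty() || item[0] != '_' || dot == std::string::npos)
    return std::make_pair(std::string(), std::string());
  return std::make_pair(item.substr(0, dot + 1), item.substr(dot + 1));
}
```

For the data item _database_2.database_id, the first element of the result is the category name in the form the toolkit expects, and the second is the attribute.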

Categories come in two types: pair and loop. In a pair category, every data item is paired with a single value; _audit_conform. is a pair type category. In a loop category, the data items are listed once and followed by multiple rows of data, where the number of columns in each row equals the number of data items in the category; _database_2. is a loop type category.

The OECIFData class can be used to access row and attribute (column) data in a category. For example, use the GetRow("_database_2.", 1) method to get the second row in the _database_2. category: "RCSB", "RCSB067767", "?", "?". Use the GetAttributeValues("_database_2.", "database_id") method to get the database_id column values: "PDB", "RCSB", "WWPDB".
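
The row and column views described above can be pictured with a small stand-alone model of a loop category. This sketch does not use or reproduce the OECIFData class: LoopCategory and MakeDatabase2 are hypothetical names, and the member functions only borrow the names GetRow and GetAttributeValues to show the zero-based row indexing and per-attribute column access.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-alone model of a loop category: attribute names plus rows of values.
// LoopCategory is hypothetical and unrelated to the OECIFData internals.
struct LoopCategory
{
  std::vector<std::string> attributes;
  std::vector<std::vector<std::string>> rows;

  // Row access is zero-based, so index 1 is the second row.
  // Throws std::out_of_range for an invalid index.
  const std::vector<std::string>& GetRow(std::size_t index) const
  {
    return rows.at(index);
  }

  // Collect one column; returns an empty vector for an unknown attribute.
  std::vector<std::string> GetAttributeValues(const std::string& attribute) const
  {
    std::vector<std::string> column;
    for (std::size_t col = 0; col < attributes.size(); ++col)
    {
      if (attributes[col] != attribute)
        continue;
      for (const std::vector<std::string>& row : rows)
        column.push_back(row.at(col));
      break;
    }
    return column;
  }
};

// The _database_2. loop from the 3TPP example above.
LoopCategory MakeDatabase2()
{
  LoopCategory cat;
  cat.attributes = {"database_id", "database_code",
                    "pdbx_database_accession", "pdbx_DOI"};
  cat.rows = {{"PDB", "3TPP", "pdb_00003tpp", "10.2210/pdb3tpp/pdb"},
              {"RCSB", "RCSB067767", "?", "?"},
              {"WWPDB", "D_1000067767", "?", "?"}};
  return cat;
}
```

With this model, MakeDatabase2().GetRow(1) holds the RCSB row and GetAttributeValues("database_id") holds the three database identifiers, matching the values quoted in the paragraph above.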

mmCIF Metadata Manipulation

mmCIF metadata is not stored as tagged data in the same way that PDB data is. Instead, the mmCIF metadata is stored as a single string that can be retrieved, manipulated, and set or appended to.

The following functions provide access to the mmCIF data.

Functions to manipulate mmCIF metadata
Function	Description
`OESetMMCIFData`	Sets mmCIF metadata on a molecule, overwriting any existing metadata
`OEAddMMCIFData`	Adds mmCIF metadata on a molecule, appends to any existing metadata
`OEHasMMCIFData`	Determines whether a molecule has mmCIF metadata stored
`OEGetMMCIFData`	Retrieves the mmCIF metadata stored on a molecule
`OECopyMMCIFData`	Copies the mmCIF metadata from one molecule to another
`OEClearMMCIFData`	Clears the mmCIF metadata from a molecule
`OEUpdateMMCIFData`	Updates mmCIF metadata using molecule structural information

When writing mmCIF metadata, the default behavior is to write the header as is, without modification, as long as it is formatted correctly. The only exception is when essential CIF categories are missing from the header; these are perceived at write time or whenever the header data is updated. See the mmCIF Header Perception section below.

For mmCIF, these essential categories are:

_entity.
_entity_poly_seq.
_chem_comp.
_cell.
_symmetry.
_struct_asym.
_struct_conn.

For Chemical Component CIF, these essential categories are:

_chem_comp.
_chem_comp_atom.
_chem_comp_bond.
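
To get a feel for which of the essential mmCIF categories listed above a given header lacks, the header text can be scanned for category names. The following is a rough sketch under stated assumptions, not a description of how OEChem TK decides that a category is missing: FindCategories and MissingEssentialMMCIF are hypothetical helpers, any line that starts with an underscore is treated as a data item, and semicolon-delimited multi-line values are not parsed.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Collect the category names (e.g. "_cell.") that appear in CIF header text.
// Simplification: every line starting with '_' is treated as a data item,
// and the category name ends at the first period before any whitespace.
std::set<std::string> FindCategories(const std::string& header)
{
  std::set<std::string> found;
  std::istringstream lines(header);
  std::string line;
  while (std::getline(lines, line))
  {
    if (line.empty() || line[0] != '_')
      continue;
    const std::string::size_type dot = line.find('.');
    const std::string::size_type space = line.find_first_of(" \t");
    if (dot != std::string::npos && (space == std::string::npos || dot < space))
      found.insert(line.substr(0, dot + 1));
  }
  return found;
}

// Return the essential mmCIF categories (as listed in this section) that do
// not occur in the header, in the order of that list.
std::vector<std::string> MissingEssentialMMCIF(const std::string& header)
{
  static const char* const essential[] = {
      "_entity.", "_entity_poly_seq.", "_chem_comp.", "_cell.",
      "_symmetry.", "_struct_asym.", "_struct_conn."};
  const std::set<std::string> found = FindCategories(header);
  std::vector<std::string> missing;
  for (const char* name : essential)
  {
    if (found.count(name) == 0)
      missing.push_back(name);
  }
  return missing;
}
```

The header string passed in could be the text returned by OEGetMMCIFData; a non-empty result suggests that some categories would be perceived when the molecule is written.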

To synchronize header data with a molecule, or to manually interact with header data, the OECIFData class allows for programmatic access to the header values and includes functionality to update specific sections of the header with a molecule’s structural data. The OEUpdateMMCIFData function can be used to run a one-time update to the header metadata.

mmCIF Header Perception

A molecule can contain information used to perceive data item values in the mmCIF header. This information is coordinated and can be called using the OECIFData class or OEUpdateMMCIFData function. Which data items are perceived during that update is controlled by the OECIFOptions class. This class breaks perception into seven categories: Entity, ChemComp, Cell, Symmetry, Struct, Experimental, and Other. If perception is turned on for any of these categories, category attributes and values can be altered or deleted.

The following table outlines which CIF categories and attributes could be affected by perception during header updates:

OECIFOptions and Corresponding Perceived Categories
OECIFOptions	Related mmCIF Categories
SetPerceiveEntity	_entity.id _entity.type _entity_poly.entity_id _entity_poly.type _entity_poly.pdbx_seq_one_letter_code _entity_poly.pdbx_seq_one_letter_code_can _entity_poly_seq.entity_id _entity_poly_seq.num _entity_poly_seq.mon_id _entity_poly_seq.hetero
SetPerceiveChemComp	_chem_comp.id _chem_comp.type _chem_comp.mon_nstd_flag _chem_comp.name _chem_comp.pdbx_synonyms _chem_comp.formula _chem_comp.formula_weight _chem_comp_atom.comp_id _chem_comp_atom.atom_id _chem_comp_atom.type_symbol _chem_comp_atom.charge _chem_comp_bond.comp_id _chem_comp_bond.atom_id_1 _chem_comp_bond.atom_id_2 _chem_comp_bond.type _chem_comp_bond.aromatic
SetPerceiveStruct	_struct_asym.id _struct_asym.entity_id _struct_conn. _struct. _struct_ref. _struct_ref_seq. _struct_conf. _struct_sheet. _struct_biol. _struct_keywords. _struct_ncs_opr. _pdbx_struct_mod_residues.
SetPerceiveSymmetry	_symmetry.entry_id _symmetry.space_group_name_H-M
SetPerceiveCell	_cell. _atom_sites.
SetPerceiveExperimental	_exptl. _refine. _refine_analysis. _refine_is_restr. _reflns. _reflns_shell. _diffrn. _pdbx_refine_tls.
SetPerceiveOther	_audit_author. _pdbx_database_status. _database_PDB_matrix. _software.

Categories that are bold are primarily defined or affected by molecule structural data and are consequently entirely redefined when updated. The remaining categories are coordinated with other parts of the header data when updating, but primarily contain attributes that are not perceived from molecular structure. The exhaustive list of these coordinated but not perceived categories is outside the scope of this document, and only the most common are listed.

The _entity_poly_seq. category and the associated _entity_poly. attributes allow for microheterogeneity between the actual molecule sequence and the values in these categories. Consequently, these sequences are not a reflection of the molecule, but a reference to the intended polymeric sequence of the molecule. This may result in unexpected behavior during perception if data is missing or inconsistent with the polymeric sequence. When a disagreement between attribute values and molecule sequences is encountered, the disagreeing values may be deleted from the header. For these sequence-based attributes to persist through perception, all existing values must align with the macromolecule's sequence data; otherwise, they are considered undefined and will not be populated. If these attribute values are missing prior to perception, they are considered unknowable and will not be populated.

It is important to note that perception and synchronization only occur with the molecule informing header data. Changes in the header will not alter the molecular structure in any way.