Tagged Data Manipulation
SD Tagged Data Manipulation
Meta information about a molecule is stored in what is known as
‘tagged data’. The most common example of this is the data fields
found in SDF files. Since SD files are a common form of data storage
and transfer from one system to another, OEChem TK provides several
methods to manipulate this data. OEChem TK also supports round-tripping
SD data through the OEBinary (.oeb) and comma-separated-values (.csv)
file formats.
A simple class, OESDDataPair, is used to set and retrieve SD data.
Data | Set method | Get method |
---|---|---|
tag | SetTag | GetTag |
data | SetValue | GetValue |
The following functions provide access to the SD data.
Function | Description |
---|---|
OEIsSDDataFormat | determine whether a file format supports SD data |
OESetSDData | set a tag and value data pair |
OEAddSDData | add a tag and value data pair |
OEHasSDData | determine whether a molecule has data with a given tag |
OEGetSDData | get the value for the given tag |
OEGetSDDataPairs | return an iterator over all the SD data pairs of the molecule |
OECopySDData | copy the entire set of SD data from one molecule to another |
OEDeleteSDData | delete all SD data items with the given tag |
OEClearSDData | clear all SD data from a molecule |
Since OESDDataPair stores SD data as a string, numeric values have to
be converted to a string with the OENumberToString function before
they are stored with either the OESetSDData or the OEAddSDData function.
const string tag = "number of atoms";
OESetSDData(mol, tag, OENumberToString(mol.NumAtoms()));
Similarly, the retrieved string can be converted to a numeric value with the
OEStringToNumber function.
if (OEHasSDData(mol, "weight"))
{
  double weight;
  OEStringToNumber(OEGetSDData(mol, "weight"), weight);
  cout << "weight= " << weight << endl;
}
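The string round trip can also be illustrated without OEChem TK. The following is a minimal standard-library sketch, not the implementation of OENumberToString or OEStringToNumber; the helper names NumberToText and TextToNumber are hypothetical. It shows why it is worth checking that a stored string really parses as a number before using the result.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: format a number as text, as one would before
// storing it as tagged data.
template <typename T>
std::string NumberToText(const T& value)
{
  std::ostringstream out;
  out << value;
  return out.str();
}

// Hypothetical helper: parse text back into a number.
// Returns false, leaving 'value' untouched, unless the whole string
// (ignoring trailing whitespace) is a valid number.
template <typename T>
bool TextToNumber(const std::string& text, T& value)
{
  std::istringstream in(text);
  T parsed;
  if (!(in >> parsed))
    return false;
  in >> std::ws;  // skip trailing whitespace, if any
  if (!in.eof())
    return false;  // leftover characters such as "12abc"
  value = parsed;
  return true;
}
```

With these helpers, `NumberToText(12)` yields `"12"`, and `TextToNumber("12abc", n)` reports failure instead of silently returning 12.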
The following example shows how to manipulate SD tagged data.
Listing 1: SD data manipulation
#include <openeye.h>
#include <iostream>
#include <oechem.h>

using namespace std;
using namespace OESystem;
using namespace OEChem;

void DumpSDData(const OEMolBase& mol)
{
  cout << "SD data of " << mol.GetTitle() << endl;
  for (OEIter<const OESDDataPair> dp = OEGetSDDataPairs(mol); dp; ++dp)
    cout << dp->GetTag() << " : " << dp->GetValue() << endl;
  cout << endl;
}

int main()
{
  OEGraphMol mol;
  OESmilesToMol(mol, "c1ccccc1");
  mol.SetTitle("benzene");

  // set some sd data
  const string tag_color = "color";
  OESetSDData(mol, tag_color, "brown");
  const string tag_size = "size";
  OESetSDData(mol, OESDDataPair(tag_size, "small"));
  DumpSDData(mol);

  // check for existence of data, then delete it
  if (OEHasSDData(mol, tag_size))
    OEDeleteSDData(mol, tag_size);
  DumpSDData(mol);

  // add additional color data
  OEAddSDData(mol, tag_color, "black");
  DumpSDData(mol);

  // remove all SD data
  OEClearSDData(mol);
  DumpSDData(mol);

  return 0;
}
Note
If data with the same tag already exists, the OESetSDData function
replaces it, whereas the OEAddSDData function adds a further entry
with the same tag.
The output of the preceding program is the following:
SD data of benzene
color : brown
size : small
SD data of benzene
color : brown
SD data of benzene
color : brown
color : black
SD data of benzene
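The difference between setting and adding, visible in the two "color" entries above, can be modelled with a plain list of tag and value pairs. This is a toy sketch using only the standard library, not how OEChem TK stores SD data; TaggedData, SetData, and AddData are hypothetical names. Replacing only the first matching entry in SetData is an assumption of the sketch.

```cpp
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a molecule's SD data: an ordered list of (tag, value) pairs.
using TaggedData = std::vector<std::pair<std::string, std::string>>;

// "Add" semantics: always append, even if the tag is already present.
void AddData(TaggedData& data, const std::string& tag, const std::string& value)
{
  data.emplace_back(tag, value);
}

// "Set" semantics: replace the value of an existing tag, otherwise append.
// Assumption: only the first entry with a matching tag is replaced.
void SetData(TaggedData& data, const std::string& tag, const std::string& value)
{
  for (auto& entry : data)
  {
    if (entry.first == tag)
    {
      entry.second = value;
      return;
    }
  }
  data.emplace_back(tag, value);
}
```

Calling `SetData(d, "color", "brown")` followed by `AddData(d, "color", "black")` leaves two "color" entries, while a second `SetData` call on an existing tag keeps the entry count unchanged.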
Note
Note that SD tagged data is specific to MDL’s SD file format. Any SD data added to a molecule will only be written out to SD files, OEBinary files, or CSV files. The SD data fields will only be filled when reading from SD files that contain SD tagged data, CSV data files, or from OEBinary files previously created to contain this data.
See also
Generic Data chapter
CSV File Format
The CSV (comma-separated-values) file format is a common file format
used for data exchange in a wide variety of software. Over the years,
the popularity of the format has outstripped the robustness of many
of its implementations, leading to subtle differences in how it is
implemented between different software packages. This has led to the
creation and adoption of RFC 4180 as a standard for the CSV
format. OEChem TK supports this standard as the base format for reading
and writing molecules (OEMolBase objects) to .csv files. OEChem TK will
read and write .csv files in the following layout:
1. The first line can optionally contain a header to use as the SD tag. OEChem TK decides if the first line is a header by trying to interpret the first column of the first line as SMILES. See the OEIFlavor::CSV::Header flavor.
2. The first column is expected to be a SMILES string representing the molecule.
3. The second column is the molecule title, as accessed through OEMolBase::GetTitle.
4. Remaining columns will be read into SD data fields. The data is then accessible through OEGetSDData and OESetSDData. If the file did not contain a header line, the SD tags will be named in the following pattern: OE_CSV_COLUMN_1, OE_CSV_COLUMN_2, …, OE_CSV_COLUMN_N.
5. New-line characters are not supported inside a CSV record. They are round-tripped using the pipe character, |, instead.
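The column layout above, combined with the quoting rules of RFC 4180, can be sketched as follows. This is an illustration using only the standard library, not OEChem TK's writer; QuoteCsvField and BuildCsvRecord are hypothetical names. The quoting rule is the generic RFC 4180 one (quote fields containing a comma, a double quote, or a line break, and double any embedded double quotes).

```cpp
#include <string>
#include <vector>

// Quote a single field following RFC 4180: wrap it in double quotes if it
// contains a comma, a double quote, or a line break, doubling embedded quotes.
std::string QuoteCsvField(const std::string& field)
{
  if (field.find_first_of(",\"\r\n") == std::string::npos)
    return field;
  std::string quoted = "\"";
  for (char c : field)
  {
    if (c == '"')
      quoted += '"';  // double the embedded quote
    quoted += c;
  }
  quoted += '"';
  return quoted;
}

// Lay out one record: SMILES first, title second, data columns after that.
std::string BuildCsvRecord(const std::string& smiles, const std::string& title,
                           const std::vector<std::string>& values)
{
  std::string record = QuoteCsvField(smiles) + "," + QuoteCsvField(title);
  for (const std::string& value : values)
    record += "," + QuoteCsvField(value);
  return record;
}
```

A data value that itself contains commas, such as `gas,liquid,solid`, therefore ends up as a single double-quoted column in the record.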
Point #5 is a subtlety of metadata interchange between SD files and CSV files. Many CSV implementations do not support line breaks inside CSV fields: Microsoft Excel does not, and neither does OEChem TK. However, it is common for a data field in an SD file to span multiple lines, as in the following:
> <activ_class>
Antiarthritic
Antiinflammatory
Cyclooxygenase-2 Inhibitor
OEChem TK adopts the solution to this problem that ChemAxon has
implemented and documented: the pipe character, |, represents line
breaks inside a data field. For example, the previous multi-line SD
data field would look like the following:
SMILES,TITLE,activ_class
...,...,Antiarthritic|Antiinflammatory|Cyclooxygenase-2 Inhibitor
This raises a different problem for pipe characters inside data
fields: how are pipe characters escaped? OEChem TK works around this
problem by reusing the semantics of CSV parsing to parse multi-line
fields, with the pipe character as the separator and the single quote
character as the quoting character. Listing 2 demonstrates an example
that writes out a tricky set of data to the OEFormat::CSV format.
Listing 2: Writing out SD data as CSV
#include <openeye.h>
#include <oechem.h>

using namespace OEChem;

int main(int, char **)
{
  oemolostream ofs;
  ofs.SetFormat(OEFormat::CSV);

  OEGraphMol mol;
  OESmilesToMol(mol, "O");
  mol.SetTitle("water");

  OESetSDData(mol, "phases", "gas,liquid,solid");
  OESetSDData(mol, "uses", "bathing\nwater guns\ntea|earl grey|hot");

  OEWriteMolecule(ofs, mol);

  return 0;
}
The output of Listing 2 will look like the following:
SMILES,TITLE,phases,uses
O,water,"gas,liquid,solid",bathing|water guns|'tea|earl grey|hot'
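The encoding of the "uses" field above can be reproduced with a few lines of standard C++. This is a minimal sketch of the rule as described in this section, not OEChem TK's implementation; QuoteLine and EncodeMultiline are hypothetical names. Doubling embedded single quotes, by analogy with how RFC 4180 treats double quotes, is an assumption.

```cpp
#include <string>

// Hypothetical helper: quote one line of a multi-line value if needed.
// Lines containing a pipe or a single quote are wrapped in single quotes.
// Assumption: embedded single quotes are doubled, mirroring RFC 4180.
std::string QuoteLine(const std::string& line)
{
  if (line.find_first_of("|'") == std::string::npos)
    return line;
  std::string quoted = "'";
  for (char c : line)
  {
    if (c == '\'')
      quoted += '\'';  // double the embedded quote
    quoted += c;
  }
  quoted += '\'';
  return quoted;
}

// Hypothetical helper: turn a multi-line value into a single-line field by
// joining the (possibly quoted) lines with the pipe character.
std::string EncodeMultiline(const std::string& value)
{
  std::string encoded;
  std::string::size_type start = 0;
  while (true)
  {
    const std::string::size_type end = value.find('\n', start);
    const bool last = (end == std::string::npos);
    encoded += QuoteLine(value.substr(start, last ? std::string::npos : end - start));
    if (last)
      break;
    encoded += '|';  // a line break becomes a pipe
    start = end + 1;
  }
  return encoded;
}
```

Applied to the value `"bathing\nwater guns\ntea|earl grey|hot"`, the sketch produces the same text as the last column of the output shown above.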
Code Example
Reordering CSV File OpenEye Python Cookbook recipe
PDB Tagged Data Manipulation
The OEPDBDataPair class is used to set and retrieve PDB data pairs.
Data | Set method | Get method |
---|---|---|
tag | SetTag | GetTag |
data | SetValue | GetValue |
If you wish to store a numeric value, use the OENumberToString
function to convert it to a string, and then use OEStringToNumber
on the value when retrieving the data (see the examples in SD Tagged Data Manipulation).
The following functions provide access to the PDB data.
Function | Description |
---|---|
OESetPDBData | set a tag and value data pair |
OEAddPDBData | add a tag and value data pair |
OEHasPDBData | determine whether a molecule has data with a given tag |
OEGetPDBData | get the value for the given tag |
OEGetPDBDataPairs | return an iterator over all the PDB data pairs of the molecule |
OECopyPDBData | copy the entire set of PDB data from one molecule to another |
OEDeletePDBData | delete all PDB data items with the given tag |
OEClearPDBData | clear all PDB data from a molecule |
Note
In the case of PDB header items like REMARK, each line is treated as a
separate instance. Therefore, these multiple lines have to be added
with OEAddPDBData and can be accessed via OEGetPDBDataPairs.
The following PDB fields are stored as tagged PDB data when the
OEIFlavor::PDB::DATA input flavor is set:
Warning
The tags of PDB data are always 6 characters long and space-padded
(for example "HELIX " and not "HELIX").
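One way to avoid mistyping a padded tag is to build it from the bare record name. The helper below is a small standard-library sketch; PadPDBTag is a hypothetical name and not part of OEChem TK.

```cpp
#include <string>

// Hypothetical helper: pad a PDB record name with trailing spaces to six
// characters, e.g. "HELIX" becomes "HELIX ". Names that already have six
// or more characters are returned unchanged.
std::string PadPDBTag(const std::string& name)
{
  std::string tag = name;
  if (tag.size() < 6)
    tag.append(6 - tag.size(), ' ');
  return tag;
}
```

The padded result can then be passed wherever a PDB data tag is expected, so that "HELIX" and "SSBOND" are handled uniformly.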
The following example shows how to manipulate PDB tagged data.
Listing 3: PDB data manipulation
#include <openeye.h>
#include <stdlib.h>
#include <string.h>
#include <iostream>
#include <oesystem.h>
#include <oechem.h>

using namespace std;
using namespace OESystem;
using namespace OEChem;

int main(int argc, char *argv[])
{
  if (argc < 2)
    OEThrow.Usage("%s <pdbfile>", argv[0]);

  oemolistream ifs;
  if (!ifs.open(argv[1]))
    OEThrow.Fatal("Unable to open %s", argv[1]);

  // need to set input flavor to ensure PDB data is stored on molecule
  ifs.SetFlavor(OEFormat::PDB, OEIFlavor::Generic::Default |
                               OEIFlavor::PDB::Default |
                               OEIFlavor::PDB::DATA);

  OEGraphMol mol;
  while (OEReadMolecule(ifs, mol))
  {
    if (OEHasPDBData(mol, "COMPND"))
    {
      cout << "COMPND:" << endl;
      cout << OEGetPDBData(mol, "COMPND") << endl;
    }
    // tags are space-padded to 6 characters
    if (OEHasPDBData(mol, "HELIX "))
    {
      cout << "HELIX:" << endl;
      cout << OEGetPDBData(mol, "HELIX ") << endl;
    }
    if (OEHasPDBData(mol, "SSBOND"))
    {
      cout << "SSBOND:" << endl;
      // each SSBOND line is a separate data pair, so iterate over all pairs
      for (OEIter<const OEPDBDataPair> dp = OEGetPDBDataPairs(mol); dp; ++dp)
      {
        if (!strcmp(dp->GetTag(), "SSBOND"))
          cout << dp->GetValue() << endl;
      }
    }
  }
  return 0;
}
The output of the preceding program for 1D1H is the following:
COMPND:
MOL_ID: 1;
HELIX:
1 1 THR A 11 ASP A 14 5
SSBOND:
1 CYS A 2 CYS A 16 1555 1555
2 CYS A 9 CYS A 21 1555 1555
3 CYS A 15 CYS A 28 1555 1555
Note
Note that PDB tagged data is specific to the PDB file format. Any PDB data added to a molecule will only be written out to PDB files or OEBinary files.
See also
Generic Data chapter
Multi-conformer molecules
See also
For using tag data with multi-conformer molecules, see Dude, where’s my SD data?.
MMCIF Metadata Manipulation
MMCIF metadata is not stored as tagged data in the same way as PDB data. Instead, the MMCIF metadata is stored as a single string that can be retrieved, manipulated, and then set or appended.
The following functions provide access to the MMCIF data.
Function | Description |
---|---|
OESetMMCIFData | set MMCIF metadata on a molecule, overwriting any existing metadata |
OEAddMMCIFData | add MMCIF metadata to a molecule, appending to any existing metadata |
OEHasMMCIFData | determine whether a molecule has MMCIF metadata |
OEGetMMCIFData | get the MMCIF metadata stored on a molecule |
OECopyMMCIFData | copy the MMCIF metadata from one molecule to another |
OEClearMMCIFData | clear the MMCIF metadata from a molecule |
See also
Generic Data chapter