Molecular Database Handling
The OEMolDatabase class provides a fundamentally different abstraction
over a molecule file than the combination of oemolistream and
OEReadMolecule. The central underlying principle utilized by this
class is that many operations can be performed on a molecular file
without requiring the overhead of fully parsing the molecule record
into an OEMolBase object. Instead, we can think of a molecular file as
a database that can be manipulated with much cheaper operations than
OEReadMolecule and OEWriteMolecule.
Opening and Reading
OEMolDatabase objects provide the ability to access any molecule in a
molecular database file in constant time, O(1). This is accomplished
by paying the overhead of scanning the file during the
OEMolDatabase::Open call. However, OEMolDatabase::Open is designed to
operate extremely fast on any molecule file format OEChem TK supports.
OEMolDatabase::Open is usually limited by disk bandwidth rather than
by parsing and perception, as OEReadMolecule and OEWriteMolecule are.
After a database file is opened, the memory overhead of OEMolDatabase
is minimal since no molecules are stored in memory. Instead, the
OEMolDatabase only stores an 8-byte file offset for each molecule
record in the file. Listing 1 demonstrates how to use this feature to
retrieve the “Nth” molecule from a molecule file using the
OEMolDatabase::GetMolecule method.
Listing 1: Retrieving the Nth molecule in a file
#include <openeye.h>
#include <oesystem.h>
#include <oechem.h>

using namespace OESystem;
using namespace OEChem;

int main(int argc, char *argv[])
{
  if (argc != 4)
    OEThrow.Usage("%s <input> <output> <index>", argv[0]);

  // Open scans the file once to locate every molecule record.
  OEMolDatabase moldb;
  if (!moldb.Open(argv[1]))
    OEThrow.Fatal("Unable to open %s for reading", argv[1]);

  oemolostream ofs;
  if (!ofs.open(argv[2]))
    OEThrow.Fatal("Unable to open %s for writing", argv[2]);

  unsigned int idx;
  if (!OEStringToNumber(argv[3], idx))
    OEThrow.Fatal("Unable to parse molecule index from: %s", argv[3]);

  // Only the requested record is parsed into a molecule object.
  OEMol mol;
  if (!moldb.GetMolecule(mol, idx))
    OEThrow.Fatal("Unable to read a molecule from index %u", idx);

  OEWriteMolecule(ofs, mol);

  return 0;
}
Listing 1 checks the return value of OEMolDatabase::GetMolecule for
false, which indicates that the molecule record at that position in
the file does not contain a valid molecule. For example, molecules
without any atoms are valid records in .sdf files.
Note
OEMolDatabase::Open is still an O(N) operation, since it must
determine the position of each molecule record in the file. However,
this method is significantly cheaper than using OEReadMolecule because
it is limited by hard disk bandwidth rather than processing speed. The
OEMolDatabase::Open method can also be sped up by creating an
associated .idx file as described in the Index Files section.
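The trade-off described above, a single O(N) scan at open time in exchange for O(1) access afterwards, can be illustrated without OEChem TK. The following is a minimal sketch that uses only the C++ standard library. It assumes an SD file whose records each end with a `$$$$` line, and the function names `ScanRecordOffsets` and `ReadRecordAt` are hypothetical. It is not a description of how OEMolDatabase is implemented internally.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not an OEChem function): one linear pass over the
// file that records the byte offset at which each record starts.
std::vector<std::uint64_t> ScanRecordOffsets(const std::string &path)
{
  std::vector<std::uint64_t> offsets;
  std::ifstream in(path, std::ios::binary);
  std::string line;
  std::uint64_t pos = 0;
  bool atRecordStart = true;
  while (std::getline(in, line))
  {
    if (atRecordStart)
    {
      offsets.push_back(pos); // 8 bytes stored per record
      atRecordStart = false;
    }
    pos += line.size() + 1; // assumes every line ends with '\n'
    if (line.compare(0, 4, "$$$$") == 0)
      atRecordStart = true; // the next line begins a new record
  }
  return offsets;
}

// Hypothetical helper: seek directly to record idx and return its raw text.
// Returns an empty string when idx is out of range.
std::string ReadRecordAt(const std::string &path,
                         const std::vector<std::uint64_t> &offsets,
                         std::size_t idx)
{
  if (idx >= offsets.size())
    return std::string();
  std::ifstream in(path, std::ios::binary);
  in.seekg(static_cast<std::streamoff>(offsets[idx]));
  std::string record, line;
  while (std::getline(in, line))
  {
    record += line;
    record += '\n';
    if (line.compare(0, 4, "$$$$") == 0)
      break; // end of this record
  }
  return record;
}
```

The scan never interprets atoms or bonds, which is why its cost is dominated by reading bytes from disk rather than by parsing.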
Direct Data Access
OEMolDatabase achieves much of its speed by treating molecules as
chunks of bytes instead of OEMolBase objects. This abstraction is
leaked a little bit by giving users the ability to access the raw
bytes of the molecule record as well, through the
OEMolDatabase::WriteMolecule method, which takes an oemolostream. For
example, the user could pass this method an oemolostream that has been
opened with oemolostream::openstring in order to dump the desired
bytes to an in-memory buffer. Listing 2 demonstrates how to use this
feature to retrieve a subset of molecules from a database file,
similar to how the LIMIT and OFFSET keywords work in an SQL query.
Listing 2: Retrieving a subset of a file
#include <openeye.h>
#include <oesystem.h>
#include <oechem.h>

using namespace OESystem;
using namespace OEChem;

int main(int argc, char *argv[])
{
  if (argc != 5)
    OEThrow.Usage("%s <input> <output> <offset> <limit>", argv[0]);

  OEMolDatabase moldb;
  if (!moldb.Open(argv[1]))
    OEThrow.Fatal("Unable to open %s for reading", argv[1]);

  // Raw records are copied byte for byte, so the formats must match.
  if (moldb.GetFormat() != OEGetFileType(OEGetFileExtension(argv[2])))
    OEThrow.Fatal("Output format does not match input format: %s != %s",
                  OEGetFileExtension(argv[1]),
                  OEGetFileExtension(argv[2]));

  oemolostream ofs;
  if (!ofs.open(argv[2]))
    OEThrow.Fatal("Unable to open %s for writing", argv[2]);

  unsigned int offset;
  if (!OEStringToNumber(argv[3], offset))
    OEThrow.Fatal("Unable to parse offset from: %s", argv[3]);

  unsigned int limit;
  if (!OEStringToNumber(argv[4], limit))
    OEThrow.Fatal("Unable to parse limit from: %s", argv[4]);

  // Write records [offset, offset + limit) without parsing them.
  const auto maxIdx = offset + limit;
  for (auto idx = offset; idx < maxIdx; ++idx)
    moldb.WriteMolecule(ofs, idx);

  return 0;
}
Note
The oemolostream must be set up to write the exact same file format as
the file the OEMolDatabase was opened on. If file format conversion is
also desired, the user should use OEMolDatabase::GetMolecule to read
the molecule into an OEMolBase and then write it with OEWriteMolecule.
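Listing 2 computes `offset + limit` directly, so a request that runs past the end of the database (or overflows `unsigned int`) is not clamped. A small helper can make the LIMIT/OFFSET semantics explicit. This is a sketch only: `ClampRange` is a hypothetical name, and `count` stands in for the number of records, which in Listing 2 would presumably come from `moldb.GetMaxMolIdx()`.

```cpp
#include <algorithm>
#include <utility>

// Hypothetical helper: returns the half-open index range [first, last)
// selected by OFFSET/LIMIT semantics over a database of `count` records.
// The result never exceeds `count` and cannot overflow.
std::pair<unsigned int, unsigned int>
ClampRange(unsigned int offset, unsigned int limit, unsigned int count)
{
  const unsigned int first = std::min(offset, count);
  // count - first is the number of records remaining after `first`
  const unsigned int last = first + std::min(limit, count - first);
  return std::make_pair(first, last);
}
```

The loop in Listing 2 could then iterate from `first` to `last` instead of from `offset` to `offset + limit`.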
Title Access
Molecule meta-data is often useful for manipulating databases
regardless of the molecule connection table. For this reason,
OEMolDatabase provides access to the molecule title through the
OEMolDatabase::GetTitle method. OEMolDatabase::GetTitle returns the
same string that would be returned by OEMolBase::GetTitle if the
molecule were read in with OEReadMolecule. The difference is that
OEMolDatabase::GetTitle is more efficient because it parses only the
title from the molecule record and skips the rest of the bytes in the
record. Listing 3 demonstrates how to use OEMolDatabase::GetTitle to
implement a more efficient version of the molextract example.
Listing 3: Extract molecules by title
#include <openeye.h>
#include <oeplatform.h>
#include <oesystem.h>
#include <oechem.h>

using namespace OEPlatform;
using namespace OESystem;
using namespace OEChem;

int main(int argc, char *argv[])
{
  if (argc != 4)
    OEThrow.Usage("%s <input> <output> <title>", argv[0]);

  OEMolDatabase moldb;
  if (!moldb.Open(argv[1]))
    OEThrow.Fatal("Unable to open %s for reading", argv[1]);

  // Raw records are copied byte for byte, so the formats must match.
  if (moldb.GetFormat() != OEGetFileType(OEGetFileExtension(argv[2])))
    OEThrow.Fatal("Output format does not match input format: %s != %s",
                  OEGetFileExtension(argv[1]),
                  OEGetFileExtension(argv[2]));

  oemolostream ofs;
  if (!ofs.open(argv[2]))
    OEThrow.Fatal("Unable to open %s for writing", argv[2]);

  const char *title = argv[3];

  // Only the title of each record is parsed; matches are copied as raw bytes.
  for (auto idx = 0u; idx < moldb.GetMaxMolIdx(); ++idx)
  {
    if (moldb.GetTitle(idx) == title)
      moldb.WriteMolecule(ofs, idx);
  }

  return 0;
}
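Listing 3 performs one linear pass per title. When many titles need to be looked up against the same database, it may be worth building a title-to-index table once. The sketch below is independent of OEChem TK: `BuildTitleIndex` and `TitleIndex` are hypothetical names, and the title source is passed in as a callable so the same code can be tried on plain strings.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Maps a title to every record index that carries it (titles may repeat).
using TitleIndex = std::unordered_map<std::string, std::vector<unsigned int>>;

// Hypothetical helper: calls getTitle(idx) for idx in [0, count) and
// groups the indices by the returned title.
template <typename TitleFn>
TitleIndex BuildTitleIndex(unsigned int count, TitleFn getTitle)
{
  TitleIndex index;
  for (unsigned int idx = 0; idx < count; ++idx)
    index[getTitle(idx)].push_back(idx);
  return index;
}
```

With an open database, the call would presumably look like `BuildTitleIndex(moldb.GetMaxMolIdx(), [&](unsigned int i) { return moldb.GetTitle(i); })`, after which each lookup is a hash-table probe rather than a scan.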
Note
Multi-conformer .oeb files can have multiple titles per molecule
record. The top-level OEMCMolBase can have a title, and each
OEConfBase can have a title as well. OEMolDatabase::GetTitle will only
return the title of the top-level OEMCMolBase object and makes no
attempt to search for a title among the conformer data. In practice,
this is fine since OMEGA will leave the OEMCMolBase title the same as
in the input file, and append warts to the individual conformer
titles.
Index Files
The speed of OEMolDatabase::Open is limited by how fast data can be
read from disk. For this reason, file position offsets can be
precomputed and stored in a parallel .idx file. OEMolDatabase::Open
will automatically detect the presence of this file based upon the
name of the file being opened and use those file offsets instead of
scanning. For example, if the database file is called
“my_corporate_conformers.oeb”, OEMolDatabase::Open will look for a
file named “my_corporate_conformers.oeb.idx” to open as an index file.
If an index file cannot be located, a full file scan will occur
instead. For files written once and read many times, it can be highly
beneficial to create a parallel index file with the
OECreateMolDatabaseIdx function.
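The idea of a parallel index file can be sketched with the standard library alone: persist the precomputed offsets next to the data file, and fall back to a full scan only when they cannot be loaded. Everything here is an assumption made for illustration. The names `IdxFileName`, `SaveOffsets`, and `LoadOffsets` are hypothetical, and the on-disk layout (raw host-endian 64-bit integers) is invented for this sketch and is not the format OEChem TK uses for its .idx files.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Naming rule used by this sketch: append ".idx" to the data file name.
std::string IdxFileName(const std::string &dataFile)
{
  return dataFile + ".idx";
}

// Write the offsets as raw 8-byte integers (illustrative format only).
bool SaveOffsets(const std::string &dataFile,
                 const std::vector<std::uint64_t> &offsets)
{
  std::ofstream out(IdxFileName(dataFile), std::ios::binary | std::ios::trunc);
  if (!out)
    return false;
  if (!offsets.empty())
    out.write(reinterpret_cast<const char *>(offsets.data()),
              static_cast<std::streamsize>(offsets.size() * sizeof(std::uint64_t)));
  return static_cast<bool>(out);
}

// Load the offsets back. Returns false if the index file is missing or
// malformed, in which case the caller would fall back to a full scan.
bool LoadOffsets(const std::string &dataFile,
                 std::vector<std::uint64_t> &offsets)
{
  std::ifstream in(IdxFileName(dataFile), std::ios::binary | std::ios::ate);
  if (!in)
    return false;
  const std::streamoff bytes = in.tellg();
  if (bytes < 0 || static_cast<std::uint64_t>(bytes) % sizeof(std::uint64_t) != 0)
    return false;
  offsets.resize(static_cast<std::size_t>(bytes) / sizeof(std::uint64_t));
  in.seekg(0);
  if (!offsets.empty())
    in.read(reinterpret_cast<char *>(offsets.data()),
            static_cast<std::streamsize>(bytes));
  return static_cast<bool>(in);
}
```

Loading such a file costs one small sequential read, whereas a scan must read the whole data file, which is where the benefit for write-once, read-many files comes from.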
Note
OEMolDatabase::Save will automatically create a .idx file parallel to
the file being saved. This behavior can be modified by the
OEMolDatabaseSaveOptions::SetWriteIdx method on the
OEMolDatabaseSaveOptions options class.
See also
OEGetMolDatabaseIdxFileName for the default way index file names are
created.
Database Generic Data
OEMolDatabase inherits from OEBase, allowing it to contain and
round-trip generic data as described in the Generic Data chapter. This
data is only written to and read from .oeb files. It is stored in the
.oeb file as an OEHeader record at the beginning of the file.
OEMolDatabase::Save will write this record back out to the .oeb file
so that it can be read by a subsequent OEMolDatabase::Open operation.
Caveats
Warning
OEMolDatabase will make an uncompressed copy of a molecular database
file when opened on a GZipped .gz file. The temporary file will be
deleted upon the destruction of the OEMolDatabase object. However, it
is still recommended not to use .gz files with OEMolDatabase if it can
be avoided. If a different directory is desired for the uncompressed
file, alter the environment variables used by the OEFileTempPath
function.
The largest caveat when working with OEMolDatabase objects is that they require a file on disk to provide storage. This allows the object to have a very small in-memory footprint at the expense of higher latency access to individual molecule records. The OEMolDatabase object makes no attempt to cache molecules that may be frequently read, instead leaving this up to the user at a higher level, or up to the operating system to cache frequently accessed disk pages.
This reliance on a file on disk is why .gz files are not well
supported. The OEMolDatabase needs to be able to read a molecule
record by seeking to a particular location in the file. The benefit of
this design is that multi-threaded applications can efficiently call
const OEMolDatabase methods like OEMolDatabase::GetMolecule without
any synchronization overhead.
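The reason seek-based reads need no locking can be shown with a small standard-library sketch: the offset table is immutable after it is built, and every read uses its own stream, so threads share no mutable state. The type `OffsetTable` and the functions `ReadRecordBytes` and `ReadAllParallel` are hypothetical stand-ins, not OEChem TK classes, and the sketch assumes the table carries one extra trailing entry holding the file size.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for an opened database: immutable once constructed.
struct OffsetTable
{
  std::string path;
  std::vector<std::uint64_t> offsets; // record starts, then the file size
};

// Read-only access: opens a private stream and seeks to the record.
// Precondition: idx + 1 < table.offsets.size().
std::string ReadRecordBytes(const OffsetTable &table, std::size_t idx)
{
  const std::uint64_t begin = table.offsets[idx];
  const std::uint64_t end = table.offsets[idx + 1];
  std::string bytes(static_cast<std::size_t>(end - begin), '\0');
  std::ifstream in(table.path, std::ios::binary);
  in.seekg(static_cast<std::streamoff>(begin));
  if (!bytes.empty())
    in.read(&bytes[0], static_cast<std::streamsize>(bytes.size()));
  return bytes;
}

// Read every record with several worker threads and no locks: each
// worker writes only to its own slots of the output vector.
std::vector<std::string> ReadAllParallel(const OffsetTable &table,
                                         unsigned int nthreads)
{
  if (table.offsets.size() < 2)
    return std::vector<std::string>();
  const std::size_t count = table.offsets.size() - 1;
  const unsigned int n = std::max(1u, nthreads);
  std::vector<std::string> out(count);
  std::vector<std::thread> workers;
  for (unsigned int t = 0; t < n; ++t)
    workers.emplace_back([&table, &out, count, t, n]() {
      for (std::size_t idx = t; idx < count; idx += n)
        out[idx] = ReadRecordBytes(table, idx);
    });
  for (auto &w : workers)
    w.join();
  return out;
}
```

A compressed stream cannot be entered at an arbitrary byte offset in this way, which is consistent with the uncompressed temporary copy described in the warning above.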