Molecular Database Handling
The OEMolDatabase class provides a fundamentally different abstraction over a molecule file than the combination of oemolistream and OEReadMolecule. The central principle behind this class is that many operations can be performed on a molecular file without the overhead of fully parsing each molecule record into an OEMolBase object. Instead, a molecular file can be treated as a database that is manipulated with much cheaper operations than OEReadMolecule and OEWriteMolecule.
Opening and Reading
OEMolDatabase objects provide the ability to access any molecule in a molecular database file in constant time, O(1). This is accomplished by paying the overhead of scanning the file during the OEMolDatabase.Open call. However, OEMolDatabase.Open is designed to be extremely fast on any molecule file format OEChem TK supports. It is usually limited by disk bandwidth rather than by parsing and perception, as OEReadMolecule and OEWriteMolecule are.
After a database file is opened, the memory overhead of OEMolDatabase is minimal since no molecules are stored in memory. Instead, the OEMolDatabase only stores an 8-byte file offset for each molecule record in the file. Listing 1 demonstrates how to use this feature to retrieve the “Nth” molecule from a molecule file using the OEMolDatabase.GetMolecule method.
Listing 1: Retrieving the Nth molecule in a file
    from openeye import oechem
    import sys

    if len(sys.argv) != 4:
        oechem.OEThrow.Usage("%s <input> <output> <index>" % sys.argv[0])

    # Opening the database scans the file and records each record's offset
    moldb = oechem.OEMolDatabase()
    if not moldb.Open(sys.argv[1]):
        oechem.OEThrow.Fatal("Unable to open %s for reading" % sys.argv[1])

    ofs = oechem.oemolostream()
    if not ofs.open(sys.argv[2]):
        oechem.OEThrow.Fatal("Unable to open %s for writing" % sys.argv[2])

    # Only the requested record is parsed into a molecule object
    idx = int(sys.argv[3])
    mol = oechem.OEMol()
    if not moldb.GetMolecule(mol, idx):
        oechem.OEThrow.Fatal("Unable to read a molecule from index %u" % idx)
    oechem.OEWriteMolecule(ofs, mol)
Listing 1 checks the return value of OEMolDatabase.GetMolecule for false, which indicates that the molecule record at that position in the file does not contain a valid molecule. For example, a record without any atoms is a valid record in an .sdf file, even though it does not describe a usable molecule.
OEMolDatabase.Open is still an O(N) operation, because it must determine the position of each molecule record in the file. However, this method is significantly cheaper than using OEReadMolecule, since it is limited by hard disk bandwidth rather than processing speed. The OEMolDatabase.Open method can also be sped up by creating an associated .idx file as described in the Index Files section.
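The scan-once, seek-later idea can be illustrated without OEChem TK at all. The following is a minimal pure-Python sketch, not the toolkit's implementation: it assumes an SD-style file in which every record ends with a `$$$$` line, and the helper names `scan_record_offsets` and `read_record` are hypothetical.

```python
import io
import struct


def scan_record_offsets(stream):
    """One O(N) pass: remember the byte offset where each record starts."""
    offsets = []
    pos = stream.tell()
    at_record_start = True
    for line in iter(stream.readline, b""):
        if at_record_start:
            offsets.append(pos)
            at_record_start = False
        pos += len(line)
        # SD files terminate every record with a '$$$$' line
        if line.rstrip(b"\r\n") == b"$$$$":
            at_record_start = True
    return offsets


def read_record(stream, offsets, idx):
    """O(1) lookup: seek straight to record idx and return its raw bytes."""
    stream.seek(offsets[idx])
    if idx + 1 < len(offsets):
        return stream.read(offsets[idx + 1] - offsets[idx])
    return stream.read()  # last record runs to the end of the file


# Three tiny fake records standing in for a molecule file
sample = io.BytesIO(
    b"mol_a\nbody\n$$$$\n"
    b"mol_b\nbody\n$$$$\n"
    b"mol_c\nbody\n$$$$\n"
)
offsets = scan_record_offsets(sample)
third = read_record(sample, offsets, 2)

# Storing each offset as an unsigned 64-bit integer costs 8 bytes per record
index_bytes = len(offsets) * struct.calcsize("<Q")
```

At 8 bytes per record, an index over ten million records would occupy roughly 80 MB under this scheme, while the records themselves stay on disk.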
Direct Data Access
OEMolDatabase achieves much of its speed by treating molecules as chunks of bytes instead of OEMolBase objects. This abstraction is deliberately leaky: users can also access the raw bytes of a molecule record through the OEMolDatabase.WriteMolecule method, which takes an oemolostream. For example, the user could pass this method an oemolostream that has been opened with oemolostream.openstring in order to dump the desired bytes to an in-memory buffer. Listing 2 demonstrates how to use this feature to retrieve a subset of molecules from a database file, similar to how the LIMIT and OFFSET keywords work in an SQL query.
Listing 2: Retrieving a subset of a file
    from openeye import oechem
    import sys

    if len(sys.argv) != 5:
        oechem.OEThrow.Usage("%s <input> <output> <offset> <limit>" % sys.argv[0])

    moldb = oechem.OEMolDatabase()
    if not moldb.Open(sys.argv[1]):
        oechem.OEThrow.Fatal("Unable to open %s" % sys.argv[1])

    # Raw records are copied byte for byte, so both files must share a format
    if moldb.GetFormat() != oechem.OEGetFileType(oechem.OEGetFileExtension(sys.argv[2])):
        oechem.OEThrow.Fatal("Output format does not match input format: %s != %s" %
                             (oechem.OEGetFileExtension(sys.argv[1]),
                              oechem.OEGetFileExtension(sys.argv[2])))

    ofs = oechem.oemolostream()
    if not ofs.open(sys.argv[2]):
        oechem.OEThrow.Fatal("Unable to open %s for writing" % sys.argv[2])

    offset = int(sys.argv[3])
    limit = int(sys.argv[4])

    # Write records offset .. offset + limit - 1 without parsing them
    maxIdx = offset + limit
    for idx in range(offset, maxIdx):
        moldb.WriteMolecule(ofs, idx)
The oemolostream must be set up to write output in exactly the same file format as the file the OEMolDatabase was opened on. If file format conversion is also desired during the read operation, the user should use OEMolDatabase.GetMolecule to read the molecule into an OEMolBase and then use OEWriteMolecule.
Molecule metadata is often useful for manipulating databases regardless of the molecule connection table. For this reason, OEMolDatabase provides access to the molecule title through the OEMolDatabase.GetTitle method. OEMolDatabase.GetTitle returns the same string that would be returned by OEMolBase.GetTitle if the molecule were read in with OEReadMolecule. The difference is that OEMolDatabase.GetTitle is more efficient because it parses only the title from the molecule record and skips the rest of the bytes in the record. Listing 3 demonstrates how to use OEMolDatabase.GetTitle to implement a more efficient version of the molextract example.
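The reason for the matching-format requirement becomes clear when the copy is viewed as a plain byte transfer. The sketch below is a pure-Python illustration under the assumption that record start offsets are already known; `copy_records` is a hypothetical helper, not an OEChem TK function.

```python
import io


def copy_records(src, offsets, total_size, dst, offset, limit):
    """Copy up to `limit` records starting at record `offset`, byte for byte."""
    bounds = list(offsets) + [total_size]
    # Clamp the range so a large limit never runs past the last record
    last = min(offset + limit, len(offsets))
    copied = 0
    for idx in range(offset, last):
        src.seek(bounds[idx])
        dst.write(src.read(bounds[idx + 1] - bounds[idx]))
        copied += 1
    return copied


data = (
    b"mol_a\nbody\n$$$$\n"
    b"mol_b\nbody\n$$$$\n"
    b"mol_c\nbody\n$$$$\n"
)
src = io.BytesIO(data)
dst = io.BytesIO()

# Analogous to SQL "OFFSET 1 LIMIT 5": only two records remain to copy
n_copied = copy_records(src, [0, 16, 32], len(data), dst, offset=1, limit=5)
```

Because nothing is parsed or re-serialized, the destination receives exactly the bytes of the source records, which is only meaningful when both files use the same format.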
Listing 3: Extract molecules by title
    from openeye import oechem
    import sys

    if len(sys.argv) != 4:
        oechem.OEThrow.Usage("%s <input> <output> <title>" % sys.argv[0])

    moldb = oechem.OEMolDatabase()
    if not moldb.Open(sys.argv[1]):
        oechem.OEThrow.Fatal("Unable to open %s" % sys.argv[1])

    # Raw records are copied byte for byte, so both files must share a format
    if moldb.GetFormat() != oechem.OEGetFileType(oechem.OEGetFileExtension(sys.argv[2])):
        oechem.OEThrow.Fatal("Output format does not match input format: %s != %s" %
                             (oechem.OEGetFileExtension(sys.argv[1]),
                              oechem.OEGetFileExtension(sys.argv[2])))

    ofs = oechem.oemolostream()
    if not ofs.open(sys.argv[2]):
        oechem.OEThrow.Fatal("Unable to open %s for writing" % sys.argv[2])

    title = sys.argv[3]

    # Only the title of each record is parsed; matches are copied as raw bytes
    for idx in range(moldb.GetMaxMolIdx()):
        if moldb.GetTitle(idx) == title:
            moldb.WriteMolecule(ofs, idx)
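To see why a title-only lookup is cheap, consider the SD format, where the title is the first line of each record. The following pure-Python sketch assumes known record offsets and uses hypothetical helpers `get_title` and `find_by_title`; it illustrates the idea rather than reproducing how OEMolDatabase.GetTitle is implemented.

```python
import io


def get_title(stream, offsets, idx):
    """Read only the first line of record idx; the rest is never touched."""
    stream.seek(offsets[idx])
    return stream.readline().rstrip(b"\r\n").decode("utf-8")


def find_by_title(stream, offsets, title):
    """Return the indices of all records whose title matches exactly."""
    return [i for i in range(len(offsets)) if get_title(stream, offsets, i) == title]


records = [
    b"aspirin\nbody\n$$$$\n",
    b"caffeine\nbody\n$$$$\n",
    b"aspirin\nbody\n$$$$\n",
]

# Build the offset table from the known record lengths
offsets = []
pos = 0
for rec in records:
    offsets.append(pos)
    pos += len(rec)

stream = io.BytesIO(b"".join(records))
hits = find_by_title(stream, offsets, "aspirin")
```

Each lookup costs one seek and one short read, independent of how large the record body is.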
Multi-conformer .oeb files can have multiple titles per molecule record: the top-level OEMCMolBase can have a title, and each OEConfBase can have its own title as well. OEMolDatabase.GetTitle will only return the title of the top-level OEMCMolBase object and makes no attempt to search for a title among the conformer data. In practice, this is fine since OMEGA leaves the OEMCMolBase title the same as in the input file and appends warts to the individual conformer titles.
Index Files
The speed of OEMolDatabase.Open is limited by how fast data can be read from disk. For this reason, file position offsets can be precomputed and stored in a parallel .idx file. OEMolDatabase.Open will automatically detect the presence of this file, based on the name of the file being opened, and use those file offsets instead. For example, if the database file is called “my_corporate_conformers.oeb”, OEMolDatabase.Open will look for a file named “my_corporate_conformers.oeb.idx” to open as an index file. If an index file cannot be located, a full file scan will occur instead. For files that are written once and read many times, it can be highly beneficial to create a parallel index file with the OECreateMolDatabaseIdx function.
OEMolDatabase.Save will automatically create a .idx file parallel to the file being saved. This behavior can be modified by the OEMolDatabaseSaveOptions.SetWriteIdx method on the OEMolDatabaseSaveOptions options class.
See also: OEGetMolDatabaseIdxFileName for the default way index file names are created.
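The sidecar-index pattern described in this section can be sketched in pure Python. The on-disk layout used here (packed little-endian 64-bit offsets) is an assumption made purely for illustration and is not the actual .idx format; `scan_offsets`, `write_index`, and `load_offsets` are hypothetical helpers. Only the naming rule, the database file name plus ".idx", comes from the text above.

```python
import os
import struct
import tempfile


def scan_offsets(path):
    """Full O(N) scan of an SD-style file for record start offsets."""
    offsets, pos, at_start = [], 0, True
    with open(path, "rb") as f:
        for line in f:
            if at_start:
                offsets.append(pos)
                at_start = False
            pos += len(line)
            if line.rstrip(b"\r\n") == b"$$$$":
                at_start = True
    return offsets


def write_index(path, offsets):
    """Store the offsets next to the data file as <path>.idx (made-up layout)."""
    with open(path + ".idx", "wb") as f:
        f.write(struct.pack("<%dQ" % len(offsets), *offsets))


def load_offsets(path):
    """Prefer the sidecar index when it exists, otherwise fall back to a scan."""
    idx_path = path + ".idx"
    if os.path.exists(idx_path):
        with open(idx_path, "rb") as f:
            raw = f.read()
        return list(struct.unpack("<%dQ" % (len(raw) // 8), raw)), "index"
    return scan_offsets(path), "scan"


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "demo.sdf")
    with open(db_path, "wb") as f:
        f.write(b"mol_a\nbody\n$$$$\nmol_b\nbody\n$$$$\n")

    first, first_source = load_offsets(db_path)    # no index yet: full scan
    write_index(db_path, first)
    second, second_source = load_offsets(db_path)  # index found: no scan
```

A sketch like this does nothing to detect an index that has gone stale after the data file changed, which is one reason the pattern suits files that are written once and read many times.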
Database Generic Data
OEMolDatabase inherits from OEBase, allowing it to contain and round-trip generic data as described in the Generic Data chapter. This data is only written to and read from .oeb files. It is stored in the .oeb file as an OEHeader record at the beginning of the file. OEMolDatabase.Save will write this record back out to the .oeb file so that it can be read by a subsequent OEMolDatabase.Open operation.
OEMolDatabase will make an uncompressed copy of a molecular database file when opened on a gzipped (.gz) file. The temporary file will be deleted upon destruction of the OEMolDatabase object. However, it is still recommended not to use .gz files with OEMolDatabase if it can be avoided. If a different directory is desired for the uncompressed file, alter the environment variables used by the OEFileTempPath function.
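The decompress-to-a-temporary-file approach can be mimicked with the Python standard library. This is only a sketch of the general pattern: the `UncompressedCopy` class is hypothetical, and its cleanup happens when the `with` block exits rather than at object destruction.

```python
import gzip
import os
import shutil
import tempfile


class UncompressedCopy:
    """Decompress a .gz file to a temporary file and delete it on exit."""

    def __init__(self, gz_path, tmp_dir=None):
        self.gz_path = gz_path
        self.tmp_dir = tmp_dir  # None lets tempfile pick its default directory
        self.path = None

    def __enter__(self):
        fd, self.path = tempfile.mkstemp(suffix=".sdf", dir=self.tmp_dir)
        with os.fdopen(fd, "wb") as dst, gzip.open(self.gz_path, "rb") as src:
            shutil.copyfileobj(src, dst)
        return self.path

    def __exit__(self, exc_type, exc, tb):
        os.remove(self.path)
        return False  # never swallow exceptions


payload = b"mol_a\nbody\n$$$$\nmol_b\nbody\n$$$$\n"

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "demo.sdf.gz")
    with gzip.open(gz_path, "wb") as f:
        f.write(payload)

    with UncompressedCopy(gz_path, tmp_dir=tmp) as plain_path:
        # The uncompressed copy supports cheap seeks to any record
        with open(plain_path, "rb") as f:
            f.seek(16)
            second_record = f.read()

    copy_removed = not os.path.exists(plain_path)
```

The cost of the pattern is visible here: the whole file is decompressed and written to disk before the first record can be read, which is why plain files are preferable when random access is the goal.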
The largest caveat when working with OEMolDatabase objects is that they require a file on disk to provide storage. This allows the object to have a very small in-memory footprint at the expense of higher latency access to individual molecule records. The OEMolDatabase object makes no attempt to cache molecules that may be frequently read, instead leaving this up to the user at a higher level, or up to the operating system to cache frequently accessed disk pages.
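A user-level cache of the kind mentioned above might look like the following sketch. It wraps a seek-and-read function in `functools.lru_cache`; the `CachedRecords` class and its `disk_reads` counter are hypothetical names introduced for illustration, and the in-memory stream stands in for a file on disk.

```python
import functools
import io


class CachedRecords:
    """Keep the most recently used raw records in memory."""

    def __init__(self, stream, offsets, total_size, maxsize=128):
        self._stream = stream
        self._bounds = list(offsets) + [total_size]
        self.disk_reads = 0
        # Wrap the uncached reader; repeated indices are served from memory
        self.get = functools.lru_cache(maxsize=maxsize)(self._read)

    def _read(self, idx):
        self.disk_reads += 1
        self._stream.seek(self._bounds[idx])
        return self._stream.read(self._bounds[idx + 1] - self._bounds[idx])


data = b"mol_a\nbody\n$$$$\nmol_b\nbody\n$$$$\n"
cache = CachedRecords(io.BytesIO(data), [0, 16], len(data))

first_hit = cache.get(1)   # goes to the underlying stream
second_hit = cache.get(1)  # served from the cache
other = cache.get(0)       # a different record, so another read
```

The `maxsize` bound keeps the memory footprint predictable, trading a fixed amount of memory for lower latency on frequently requested records.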
This reliance on disk storage is why .gz files are not well supported: OEMolDatabase needs to be able to read a molecule record by seeking to a particular location in the file, which a gzip stream cannot do without first being decompressed. The upside of this design is that multi-threaded applications can efficiently call const OEMolDatabase methods such as OEMolDatabase.GetMolecule without any synchronization overhead.
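The lock-free reading pattern can be illustrated with a read-only offset table shared between threads. In this pure-Python sketch, which is an analogy rather than a description of OEMolDatabase internals, each call opens its own file handle so that no seek position is shared; `read_record` is a hypothetical helper.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_record(path, bounds, idx):
    """Read one raw record using a private file handle."""
    with open(path, "rb") as f:
        f.seek(bounds[idx])
        return f.read(bounds[idx + 1] - bounds[idx])


# Fifty small fake records and their cumulative byte boundaries
records = [b"mol_%d\nbody\n$$$$\n" % i for i in range(50)]
bounds = [0]
for rec in records:
    bounds.append(bounds[-1] + len(rec))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.sdf")
    with open(path, "wb") as f:
        f.write(b"".join(records))

    # The offset table is never modified after construction, so the worker
    # threads can share it without any locking
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(
            pool.map(lambda i: read_record(path, bounds, i), range(len(records)))
        )
```

Since the only shared state is immutable, the threads never contend on a lock, and the result order from `pool.map` matches the order of the requested indices.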