Molecule Reading and Writing
Molecule Streams
The previous section demonstrated reading and writing SMILES strings from standard input. This required the programmer to perform the file I/O explicitly. While this may be reasonable for SMILES strings, which can be read line by line with std::getline, it isn’t suitable for more complex file formats. To ease this task, OEChem TK provides the oemolstream abstraction. The classes oemolistream and oemolostream allow input and output of molecules using the OEReadMolecule and OEWriteMolecule functions, respectively.
Listing 1: High-level Molecule I/O using molstreams
#include <openeye.h>
#include <oechem.h>

using namespace OEChem;

int main()
{
    // Default-constructed streams read from stdin and write to stdout.
    oemolistream ifs;
    oemolostream ofs;
    OEGraphMol mol;
    while (OEReadMolecule(ifs, mol))
        OEWriteMolecule(ofs, mol);
    return 0;
}
Listing 1 will read molecules from stdin in SMILES format and write them to stdout in absolute SMILES format.
Notice that in this example, the following is done automatically for you by OEReadMolecule:
- The OEMolBase::Clear method is called to reset the molecule.
- OEAssignAromaticFlags is called to normalize aromaticity.
- OEMolBase::operator bool is called to test the validity of the molecule.
Note
OEReadMolecule will skip any invalid molecules automatically.
Since the same OEGraphMol instance is reused for every molecule in the stream, special care must be taken when loading an entire file into memory. Listing 2 reads molecules into memory, making sure that a copy is made when each molecule is inserted into the container. This is useful if the molecules need to be accessed repeatedly.
Listing 2: Reading molecules into memory
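#include <openeye.h>
#include <oechem.h>
#include <vector>

using namespace OEChem;

int main(int argc, char *argv[])
{
    // Guard against a missing filename before touching argv[1].
    if (argc != 2)
        return 1;

    oemolistream ifs(argv[1]);
    OEGraphMol mol;
    std::vector<OEGraphMol> vecMols;
    // mol is reused on every iteration, so store a copy of it.
    while (OEReadMolecule(ifs, mol))
        vecMols.push_back(OEGraphMol(mol));
    return 0;
}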
File Formats
In addition to SMILES strings, OEChem TK is able to read numerous other molecular file formats, including MDL SD files, Tripos MOL2 files and PDB files. The format of an input file or stream may be associated with an oemolistream using the SetFormat method, and may be retrieved with GetFormat. The file format is represented by an unsigned integer, which should be one of the constants defined in the OEFormat namespace. A value of OEFormat::UNDEFINED means that there is no file format associated with the oemolstream.
Note
The default format associated with an oemolstream is OEFormat::SMI.
Listing 3 demonstrates how to use oemolstreams to convert MDL SD files into Tripos MOL2 files.
Listing 3: Explicitly setting file formats
#include <openeye.h>
#include <oechem.h>

using namespace OEChem;

int main()
{
    oemolistream ifs;
    oemolostream ofs;
    ifs.SetFormat(OEFormat::SDF);
    ofs.SetFormat(OEFormat::MOL2);
    OEGraphMol mol;
    while (OEReadMolecule(ifs, mol))
        OEWriteMolecule(ofs, mol);
    return 0;
}
| File Format | Description | Read | Write |
|---|---|---|---|
| | | Yes | Yes |
| | ChemDraw CDX | Yes | Yes |
| | Crystallographic Information File | Yes | Yes |
| | Comma separated values | Yes | Yes |
| | Chemaxon Extended SMILES | Yes | Yes |
| | FASTA protein sequence | Yes | Yes |
| | IUPAC InChI | No | Yes |
| | IUPAC InChI Key | No | Yes |
| | | Yes | Yes |
| | MDL Mol File | Yes | Yes |
| | Molecular Formula (Hill order) | No | Yes |
| | Macromolecular Crystallographic Information File | Yes | Yes |
| | Macromodel | Yes | Yes |
| OEFormat::MOL2 | Tripos Sybyl mol2 file | Yes | Yes |
| | Sybyl mol2 with explicit hydrogens | Yes | Yes |
| | MOPAC file format(s) | No | Yes |
| OEFormat::OEB | OpenEye OEBinary | Yes | Yes |
| OEFormat::OEZ | Zstd Compressed OpenEye OEBinary | Yes | Yes |
| OEFormat::PDB | Protein Databank PDB file | Yes | Yes |
| | MDL RDF File | Yes | No |
| OEFormat::SDF | MDL SD File | Yes | Yes |
| | MDL ISIS Sketch File | Yes | No |
| | Tripos Sybyl Line Notation | No | Yes |
| | | Yes | Yes |
| | | Yes | Yes |
| | XMol XYZ format | Yes | Yes |
File Input and Output
In addition to standard in and standard out, OEChem TK’s oemolstreams also support reading from and writing to files. To open a file, use the open method with the required filename. For input oemolistreams, if the file does not exist, the open fails and returns false. For output oemolostreams, the output file is created if it did not previously exist and overwritten if it did. If a filename is passed to the open method, the format and gz compression flags of the oemolistream or oemolostream instance are set based on the file extension of the filename. If no arguments are passed to open, an oemolistream will use standard in and an oemolostream will use standard out; the format will be set to OEFormat::Default and gz compression will be disabled.
Much like regular file I/O, oemolstreams can be closed after use, using the close method. oemolstreams are automatically closed by their destructors.
Listing 4: Reading and writing molecule files
#include <openeye.h>
#include <oesystem.h>
#include <oechem.h>

using namespace OESystem;
using namespace OEChem;

int main()
{
    oemolistream ifs;
    oemolostream ofs;
    if (!ifs.open("input.sdf"))
        OEThrow.Fatal("Unable to open 'input.sdf'");
    if (!ofs.open("output.mol2"))
        OEThrow.Fatal("Unable to create 'output.mol2'");
    OEGraphMol mol;
    while (OEReadMolecule(ifs, mol))
        OEWriteMolecule(ofs, mol);
    return 0;
}
One convenient feature of the open method of oemolstreams is that it sets the file format and gz compression associated with the stream from the file extension of the filename used as an argument. Listing 4 converts the file ‘input.sdf’ in MDL SD file format into the file ‘output.mol2’ in Tripos MOL2 file format. This behavior can be overridden by calling SetFormat after the open but before the first molecule is read from or written to the stream.
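The extension handling described above can be pictured with a small standard-library sketch. It is not OEChem TK's implementation: the `StreamSpec` type and the `ParseFilename` function are hypothetical names introduced only for illustration, and the sketch assumes that a trailing `.gz` is stripped first and that the remaining suffix names the format.

```cpp
#include <string>

// Hypothetical result type (not an OEChem TK type): the format suffix
// without the dot, and whether a trailing ".gz" was present.
struct StreamSpec {
    std::string formatExt;  // e.g. "sdf"; empty if no extension was found
    bool gzipped;
};

// Split a filename such as "foo.sdf.gz" into its format suffix and gz flag.
StreamSpec ParseFilename(const std::string& filename)
{
    std::string name = filename;
    StreamSpec spec{"", false};

    // Strip a trailing ".gz" and remember that it was there.
    const std::string gz = ".gz";
    if (name.size() >= gz.size() &&
        name.compare(name.size() - gz.size(), gz.size(), gz) == 0) {
        spec.gzipped = true;
        name.erase(name.size() - gz.size());
    }

    // Only accept a dot that appears after the last path separator.
    const std::string::size_type slash = name.find_last_of("/\\");
    const std::string::size_type dot = name.rfind('.');
    if (dot != std::string::npos &&
        (slash == std::string::npos || dot > slash))
        spec.formatExt = name.substr(dot + 1);
    return spec;
}
```

Under these assumptions, `ParseFilename("input.sdf")` yields the suffix `sdf` with compression off, while `ParseFilename("foo.sdf.gz")` yields `sdf` with compression on. Mapping the suffix to an OEFormat constant is left to the toolkit.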
String Input and Output
The oemolistream and oemolostream classes provide the ability to read and write memory buffers instead of files. This is accomplished through the methods oemolistream::openstring and oemolostream::GetString. Listing 5 below demonstrates how to read and write molecules from memory buffers.
Listing 5: Reading and writing molecules from memory buffers
#include <openeye.h>
#include <oeplatform.h>
#include <oesystem.h>
#include <oechem.h>
#include <string>
#include <vector>

using namespace std;
using namespace OEPlatform;
using namespace OESystem;
using namespace OEChem;

int main()
{
    const char *smiles = "CCO\n" "c1cnccc1\n";

    // Read SMILES from an in-memory buffer.
    oemolistream ims;
    ims.SetFormat(OEFormat::SMI);
    ims.openstring(smiles);

    vector<OEGraphMol> mols;
    OEGraphMol imol;
    while (OEReadMolecule(ims, imol))
        mols.push_back(imol);

    // Write the molecules to an in-memory SD buffer.
    oemolostream oms;
    oms.SetFormat(OEFormat::SDF);
    oms.openstring();

    for (auto& omol : mols)
        OEWriteMolecule(oms, omol);

    const string molstring = oms.GetString();
    oeout << "MOL string" << oeendl;
    oeout << molstring;
    return 0;
}
Compressed Input and Output
For any of the molecular file formats supported by OEChem TK, it is often convenient to read and write compressed files or strings. Molecule streams support gzipped input and output via the zlib library. The .gz suffix on any filename used to open a stream is recognized and the stream is read or written in compressed format. This mechanism does not interfere with the format perception. For instance, foo.sdf.gz is recognized as a gzipped file in MDL’s SD format.
Listing 6 below converts all of the molecules in a gzipped SD format file into a gzipped OEBinary file.
Listing 6: Reading and writing compressed molecule files
#include <openeye.h>
#include <oechem.h>

using namespace OEChem;

int main()
{
    oemolistream ifs("input.sdf.gz");
    oemolostream ofs("output.oeb.gz");
    OEGraphMol mol;
    while (OEReadMolecule(ifs, mol))
        OEWriteMolecule(ofs, mol);
    return 0;
}
OEChem TK’s binary file format, OEFormat::OEB, is well suited to storing multi-conformer molecules and is our preferred way of doing so. For multi-conformer datasets of more than a million molecules, the generated files can easily become large enough that transferring them between computers or across networks is slow. One way to handle this issue is to compress the files with gzip before transfer. As the above example shows, OEChem TK seamlessly handles gzipped molecule files with the gz extension.
Alternatively, OEChem TK also provides a native compressed file format, called OEFormat::OEZ, with the oez file extension. This file format currently only supports multi-conformer molecules derived from the OEMCMolBase class.
One of the main differences between gzipped OEFormat::OEB files and OEFormat::OEZ files is that gzip compresses a molecular file as a whole, whereas in an OEFormat::OEZ file each molecule is individually compressed using the Zstandard compression algorithm. This difference makes OEFormat::OEZ well suited for use with OEMolDatabase, which is designed to provide fast read-only random access to molecular files. When an OEMolDatabase opens a gzipped molecule file, the whole file must be uncompressed to access any molecule in the dataset. With an OEFormat::OEZ file, accessing a specific molecule only triggers decompression of the associated data chunk.
The table below lists the file sizes of the same dataset of 1M molecules in uncompressed and compressed formats with different numbers of conformations per molecule.
| Dataset | oeb | oeb.gz | oez |
|---|---|---|---|
| 1M molecules with 10 conformations | 5.4GB | 1.9GB | 2.2GB |
| 1M molecules with 50 conformations | 22GB | 5.7GB | 5.5GB |
| 1M molecules with 200 conformations | 69GB | 16GB | 13GB |
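To make the table easier to compare, the sizes can be turned into compression ratios. The sketch below simply copies the numbers from the table. `SizeRow`, `TableRows`, `CompressionRatio` and `OezSmallerThanGzip` are hypothetical helper names and are not part of OEChem TK.

```cpp
#include <vector>

// One row of the size table above, in gigabytes.
struct SizeRow {
    int conformers;   // conformations per molecule
    double oebGB;     // uncompressed oeb
    double oebGzGB;   // gzipped oeb
    double oezGB;     // oez
};

// The three rows, copied verbatim from the table.
std::vector<SizeRow> TableRows()
{
    return {
        {10, 5.4, 1.9, 2.2},
        {50, 22.0, 5.7, 5.5},
        {200, 69.0, 16.0, 13.0},
    };
}

// How many times smaller the compressed file is than the uncompressed one.
double CompressionRatio(double uncompressedGB, double compressedGB)
{
    return uncompressedGB / compressedGB;
}

// True when the oez file is smaller than the gzipped oeb file.
bool OezSmallerThanGzip(const SizeRow& row)
{
    return row.oezGB < row.oebGzGB;
}
```

With these numbers, the 200-conformer dataset shrinks by a factor of roughly 4.3 with gzip and roughly 5.3 with oez, whereas for the 10-conformer dataset gzip gives the smaller file (about 2.8 versus 2.5). In this table, the advantage of oez grows with the number of conformations per molecule.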
Note
The OEFormat::OEZ file format was introduced in the 2019.Apr toolkits. OpenEye applications or toolkits released before April 2019 will not be able to read or write this file format.
Command Line Format Control
Using the methods outlined above, it is possible to allow the stream format to be controlled from the command line. OEChem TK’s oemolstreams control the format by interpreting the input and output file names.
Listing 7: Controlling File Format from the Command Line
#include <openeye.h>
#include <oesystem.h>
#include <oechem.h>

using namespace OESystem;
using namespace OEChem;

int main(int argc, char *argv[])
{
    if (argc != 3)
        OEThrow.Usage("%s <input> <output>", argv[0]);

    oemolistream ifs;
    oemolostream ofs;
    if (!ifs.open(argv[1]))
        OEThrow.Fatal("Unable to open %s", argv[1]);
    if (!ofs.open(argv[2]))
        OEThrow.Fatal("Unable to create %s", argv[2]);

    OEGraphMol mol;
    while (OEReadMolecule(ifs, mol))
        OEWriteMolecule(ofs, mol);
    return 0;
}
Listing 7 is an example of using command-line arguments to allow OEChem TK programs to support many file formats at run-time. For instance, if Listing 7 is a program called convert:
prompt> convert file1.sdf file1.smi
This command will convert file1.sdf from SD format to SMILES format. A first extension of this idea allows access to standard in and standard out via the “-” filename. For instance:
prompt> convert file2.mol2 -
This command will read file2.mol2 in MOL2 format and write the molecules to standard out in SMILES, the default format.
Thus if you have another program, GetFromDatabase, which retrieves molecules from a database and writes them in SMILES format, you can chain it with any OEChem TK program. Using your operating system’s redirection operators (e.g. the Unix pipe “|” or redirect “>”), you can move molecules directly from GetFromDatabase to convert without a temporary file:
prompt> GetFromDatabase | convert - file3.sdf
This command will take the SMILES format output from GetFromDatabase and generate an SD format file.
However, to make this concept of using standard in and standard out for piping data really useful, one needs to be able to control the format of standard in and standard out in the same way as for temporary files. To facilitate this, oemolstreams interpret filenames that consist ONLY of a format extension as format control for standard in and standard out.
Listing 8: Controlling standard in and standard out File Format
#include <openeye.h>
#include <oechem.h>

using namespace OEChem;

int main()
{
    // A bare extension selects standard in/out with that format.
    oemolistream ifs(".sdf");
    oemolostream ofs(".mol2");
    OEGraphMol mol;
    while (OEReadMolecule(ifs, mol))
        OEWriteMolecule(ofs, mol);
    return 0;
}
Listing 8 is an example of opening an oemolistream to standard in with the format set to SDF. The output oemolostream is opened to standard out with the format set to MOL2. This is exactly equivalent to Listing 3. However, it demonstrates the extensibility of controlling the format of standard in and standard out from the command line.
Note
This prevents you from naming files as pure extensions: .mol2, .sdf, etc.
Now, using our program convert from Listing 7, it is possible to do the following:
prompt> convert .sdf .mol2
This command opens standard in with the SDF format and opens standard out with the MOL2 format.
Now there is complete format control of standard in and standard out from the command line. If we have a program GenerateStructures, which only writes the MOL2 format and another program GenerateData, which only reads the SD format, we can use them from the command line with any OEChem TK program which uses command-line arguments for file specification:
prompt> GenerateStructures | convert .mol2 .sdf | GenerateData
This command demonstrates how any OEChem TK program with command-line file specification can be used to pipe formatted input and output.
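The three kinds of command-line argument discussed in this section (“-”, a pure extension such as .sdf, and an ordinary filename) can be summarized in a small decision function. This is only a sketch of the rules as described in the text, not OEChem TK code: `Target`, `ResolvedArgument` and `ResolveArgument` are hypothetical names, and compressed suffixes such as `.gz` are deliberately ignored here.

```cpp
#include <string>

// Where the stream should be attached (hypothetical enum for this sketch).
enum class Target { StandardStream, File };

struct ResolvedArgument {
    Target target;
    std::string formatExt;  // empty means "use the default format"
};

// Classify a command-line argument according to the rules in the text.
ResolvedArgument ResolveArgument(const std::string& arg)
{
    // "-" selects standard in/out with the default format.
    if (arg == "-")
        return {Target::StandardStream, ""};

    // A name that is ONLY an extension, e.g. ".sdf", selects standard
    // in/out with that format.
    const bool pureExtension =
        arg.size() > 1 && arg[0] == '.' &&
        arg.find_first_of("./\\", 1) == std::string::npos;
    if (pureExtension)
        return {Target::StandardStream, arg.substr(1)};

    // Anything else is a file; its format comes from the last extension.
    const std::string::size_type dot = arg.rfind('.');
    if (dot == std::string::npos)
        return {Target::File, ""};
    return {Target::File, arg.substr(dot + 1)};
}
```

Under these assumptions, `convert .mol2 .sdf` resolves both arguments to the standard streams with the `mol2` and `sdf` suffixes, while `convert file1.sdf -` resolves to a file with suffix `sdf` and to standard out with the default format.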
Flavored Input and Output
The general goal of the oemolstream input and output classes in OEChem TK is to provide the user with transparent access to the very complex task of reading and writing molecules in a wide variety of formats. However, occasionally, a programmer may want to tweak the behavior of specific readers or writers without abandoning the oemolstreams to use the low-level writers (such as OEWriteMDLFile). For these instances, oemolstreams provide the oemolstreambase::SetFlavor and oemolstreambase::GetFlavor methods.
The oemolstreambase::SetFlavor method takes two unsigned integer arguments: the first is the format for which the flavor is being specified and the second is the flavor itself. The formats are specified as discussed in File Formats. The input flavors are specified in the OEIFlavor namespace and the output flavors are specified in the OEOFlavor namespace. Unlike the formats, the flavors are bitmasks and may be binary OR’d together. Under the OEIFlavor and OEOFlavor namespaces, there is a namespace for each format as well as a Generic namespace (OEIFlavor::Generic and OEOFlavor::Generic). The OEOFlavor::Generic namespace is used to control aromaticity perception and other properties common to all formats. To completely specify a flavor, one would typically binary-OR an OEOFlavor::Generic flag and a format-specific flag and pass the resultant value to oemolstreambase::SetFlavor.
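Because flavors are bitmasks, combining and inspecting them follows the usual bit-flag pattern. The sketch below illustrates that pattern with made-up constants. The `SketchIFlavor` namespace and its bit values are assumptions for illustration only and do not reflect the actual values in OEIFlavor or OEOFlavor; `HasFlag` and `WithoutFlag` are hypothetical helpers.

```cpp
// Hypothetical flag values laid out like the nested flavor namespaces.
// The real OEChem TK constants have their own values.
namespace SketchIFlavor {
namespace Generic {
const unsigned int Default = 0x0001;
}
namespace PDB {
const unsigned int Default = 0x0100;
const unsigned int TER = 0x0200;
}
}

// True when every bit of `flag` is set in `flavor`.
bool HasFlag(unsigned int flavor, unsigned int flag)
{
    return (flavor & flag) == flag;
}

// Return `flavor` with the bits of `flag` switched off.
unsigned int WithoutFlag(unsigned int flavor, unsigned int flag)
{
    return flavor & ~flag;
}
```

A complete flavor is then built with binary OR, for example `SketchIFlavor::Generic::Default | SketchIFlavor::PDB::TER`, and individual options can be tested or removed without disturbing the other bits.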
The default behavior for the PDB reader is that TER specifies the termination of a disconnected fragment within the same molecule while END specifies the termination of a connection table. However, some users may want to have the reader split PDB input files into different molecules every time a TER appears. The following code is an example of changing the PDB reader flavor.
Listing 9: Changing oemolistream Reader Flavor
#include <openeye.h>
#include <oechem.h>

using namespace OEChem;

int main()
{
    OEGraphMol mol;
    oemolistream ifs("input.pdb");
    oemolostream ofs("output.mol2");

    // Keep the defaults and additionally split molecules at TER records.
    const unsigned int flavor = OEIFlavor::Generic::Default |
                                OEIFlavor::PDB::Default |
                                OEIFlavor::PDB::TER;
    ifs.SetFlavor(OEFormat::PDB, flavor);

    while (OEReadMolecule(ifs, mol))
        OEWriteMolecule(ofs, mol);
    return 0;
}
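The difference between the two TER behaviors can be shown with a toy model that groups record lines into molecules. It is not the OEChem TK PDB parser: `Molecule` and `SplitPdbLines` are hypothetical names, lines are treated as opaque strings, and only the leading record keyword is examined.

```cpp
#include <string>
#include <vector>

// A "molecule" in this toy model is just the list of its record lines.
using Molecule = std::vector<std::string>;

// Group lines into molecules. END always closes a molecule; TER closes one
// only when splitOnTer is true, otherwise it merely separates fragments.
std::vector<Molecule> SplitPdbLines(const std::vector<std::string>& lines,
                                    bool splitOnTer)
{
    std::vector<Molecule> molecules;
    Molecule current;
    for (const std::string& line : lines) {
        // The record keyword is the text before the first space.
        const std::string keyword = line.substr(0, line.find(' '));
        const bool isEnd = (keyword == "END");
        const bool isTer = (keyword == "TER");
        if (isEnd || (isTer && splitOnTer)) {
            if (!current.empty()) {
                molecules.push_back(current);
                current.clear();
            }
        } else if (!isTer) {
            current.push_back(line);  // terminator lines are not stored
        }
    }
    if (!current.empty())
        molecules.push_back(current);
    return molecules;
}
```

For the input lines `ATOM A`, `TER`, `ATOM B`, `END`, `ATOM C`, `END`, the default mode produces two molecules (the first containing both fragments), while the split-on-TER mode produces three.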
Similar low-level control can be exerted over both input and output oemolstreams using the powerful oemolstreambase::SetFlavor method. See the API documentation for the associated low-level routine and namespace for details on the effects of specific flavor flags.
File Format | Low-Level Reader | Input Flavors |
---|---|---|
N/A ( |
||
File Format | Low-Level Writer | Output Flavors |
---|---|---|
N/A ( |
||
N/A ( |
||
Note
Low-level readers will not call Clear on the molecule before reading the molecule into it. This is useful for concatenating multiple molecules together.
Molecular Property Preservation
Warning
OEWriteMolecule will change your molecule!
The high-level OEWriteMolecule function standardizes the molecule according to the output type for uniformity.
For writing molecules without changing them, there are two options. If you would like the data to appear in the file exactly as it is in the molecule (perhaps Tripos atom names in a .pdb format), then you should use a low-level writer. On the other hand, if you would like to write a standardized molecule (e.g. Tripos atom types in a MOL2 file), but don’t want your molecule changed, you can use OEWriteConstMolecule. The OEWriteConstMolecule function makes a copy of the input molecule before performing the necessary perceptions and standardizations.
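The copy-then-standardize idea can be illustrated with a toy type that stands in for a molecule. Nothing here is OEChem TK code: `ToyMol`, `Standardize`, `Serialize`, `WriteMutating` and `WriteConst` are hypothetical, and upper-casing atom names is only a placeholder for the real perceptions and standardizations.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Toy stand-in for a molecule: just a list of atom names.
struct ToyMol {
    std::vector<std::string> atomNames;
};

// Placeholder "standardization": upper-case every atom name in place.
void Standardize(ToyMol& mol)
{
    for (std::string& name : mol.atomNames)
        for (char& c : name)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Join the atom names with single spaces.
std::string Serialize(const ToyMol& mol)
{
    std::string out;
    for (const std::string& name : mol.atomNames) {
        if (!out.empty())
            out += ' ';
        out += name;
    }
    return out;
}

// Analogous to a writer that standardizes its argument in place.
std::string WriteMutating(ToyMol& mol)
{
    Standardize(mol);
    return Serialize(mol);
}

// Analogous to a const writer: standardize a copy, leave the input alone.
std::string WriteConst(const ToyMol& mol)
{
    ToyMol copy = mol;
    return WriteMutating(copy);
}
```

Both functions produce the same output text, but only `WriteMutating` leaves its argument changed afterwards. The price of `WriteConst` is one extra copy per call.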