Molecule Reading and Writing

Molecule Streams

The previous section demonstrated reading and writing SMILES strings from standard input. This required the programmer to perform the file I/O explicitly. Whilst this may be reasonable for SMILES strings, which can be read one line at a time, it isn’t suitable for more complex file formats. To ease this task, OEChem TK provides the oemolstream abstraction. The classes oemolistream and oemolostream allow input and output of molecules using the OEReadMolecule and OEWriteMolecule functions, respectively.

Listing 1: High-level Molecule I/O using molstreams

from openeye import oechem

ifs = oechem.oemolistream()
ofs = oechem.oemolostream()

mol = oechem.OEGraphMol()

while oechem.OEReadMolecule(ifs, mol):
    oechem.OEWriteMolecule(ofs, mol)

Listing 1 will read molecules from stdin in SMILES format and write them to stdout in absolute SMILES format. Notice that in this example, the following is done automatically for you by OEReadMolecule:

Note

OEReadMolecule will skip any invalid molecules automatically.

OEChem TK provides the following more Python-like generator methods to read molecules out of an oemolistream if the user prefers. It is important to realize that the same OEGraphMol instance is reused for every molecule being read from the oemolistream.

Generator methods for reading molecules

from openeye import oechem

ifs = oechem.oemolistream()
ofs = oechem.oemolostream()

for mol in ifs.GetOEGraphMols():
    oechem.OEWriteMolecule(ofs, mol)

Since the same OEGraphMol instance is being reused for every molecule in the stream, special precautions must be taken to load up an entire file into memory. Listing 2 will read molecules into memory by making sure there is a copy made when inserting the molecule into the container. This is useful if the molecules need to be repeatedly accessed.

Listing 2: Reading molecules into memory

from openeye import oechem

ifs = oechem.oemolistream()
mollist = []

for mol in ifs.GetOEGraphMols():
    mollist.append(oechem.OEGraphMol(mol))
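
The aliasing problem that Listing 2 guards against can be illustrated without OEChem TK. The sketch below is an analogy only: FakeMol and read_reusing_one_instance are hypothetical stand-ins, not OEChem TK classes or functions. It shows what a list looks like when a reused object is stored directly, and what it looks like when a copy is stored instead.

```python
import copy


class FakeMol:
    """Hypothetical stand-in for a molecule object; not an OEChem TK class."""

    def __init__(self):
        self.title = ""


def read_reusing_one_instance(titles):
    """Yield the SAME object every time, overwriting its contents first."""
    mol = FakeMol()
    for title in titles:
        mol.title = title
        yield mol


titles = ["ethanol", "pyridine", "benzene"]

# Without a copy, every list entry refers to the one reused object.
aliased = [mol for mol in read_reusing_one_instance(titles)]

# With a copy, each list entry is an independent object.
copied = [copy.copy(mol) for mol in read_reusing_one_instance(titles)]

aliased_titles = [m.title for m in aliased]
copied_titles = [m.title for m in copied]
```

In the aliased list every entry reports the title of the last record read, which is why Listing 2 constructs a new OEGraphMol from each molecule before appending it.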

File Formats

In addition to SMILES strings, OEChem TK is able to read numerous other molecular file formats, including MDL SD files, Tripos MOL2 files and PDB files. The format of an input file or stream may be associated with an oemolistream using the SetFormat method, and may be retrieved with GetFormat. The file format is represented by an unsigned integer, which should be one of the constants defined in the OEFormat namespace. A value of OEFormat_UNDEFINED means that there is no file format associated with the oemolstream.

Note

The default format associated with an oemolstream is OEFormat_SMI.

Listing 3 demonstrates how to use oemolstreams to convert MDL SD files into Tripos MOL2 files.

Listing 3: Explicitly setting file formats

from openeye import oechem

ifs = oechem.oemolistream()
ofs = oechem.oemolostream()

ifs.SetFormat(oechem.OEFormat_SDF)
ofs.SetFormat(oechem.OEFormat_MOL2)

for mol in ifs.GetOEGraphMols():
    oechem.OEWriteMolecule(ofs, mol)

OEChem TK file format support

File Format        Description                                       Read  Write
OEFormat_CAN       Canonical SMILES                                  Yes   Yes
OEFormat_CDX       ChemDraw CDX                                      Yes   Yes
OEFormat_CIF       Crystallographic Information File                 Yes   Yes
OEFormat_CSV       Comma separated values                            Yes   Yes
OEFormat_CXSMILES  Chemaxon Extended SMILES                          Yes   Yes
OEFormat_FASTA     FASTA protein sequence                            Yes   Yes
OEFormat_INCHI     IUPAC InChI                                       No    Yes
OEFormat_INCHIKEY  IUPAC InChI Key                                   No    Yes
OEFormat_ISM       Canonical isomeric SMILES                         Yes   Yes
OEFormat_MDL       MDL Mol File                                      Yes   Yes
OEFormat_MF        Molecular Formula (Hill order)                    No    Yes
OEFormat_MMCIF     Macromolecular Crystallographic Information File  Yes   Yes
OEFormat_MMOD      Macromodel                                        Yes   Yes
OEFormat_MOL2      Tripos Sybyl mol2 file                            Yes   Yes
OEFormat_MOL2H     Sybyl mol2 with explicit hydrogens                Yes   Yes
OEFormat_MOPAC     MOPAC file format(s)                              No    Yes
OEFormat_OEB       OpenEye OEBinary                                  Yes   Yes
OEFormat_OEZ       Zstd Compressed OpenEye OEBinary                  Yes   Yes
OEFormat_PDB       Protein Databank PDB file                         Yes   Yes
OEFormat_RDF       MDL RDF File                                      Yes   No
OEFormat_SDF       MDL SD File                                       Yes   Yes
OEFormat_SKC       MDL ISIS Sketch File                              Yes   No
OEFormat_SLN       Tripos Sybyl Line Notation                        No    Yes
OEFormat_SMI       Canonical isomeric SMILES                         Yes   Yes
OEFormat_USM       SMILES                                            Yes   Yes
OEFormat_XYZ       XMol XYZ format                                   Yes   Yes

File Input and Output

In addition to standard in and standard out, OEChem TK’s oemolstreams also support reading from and writing to files. To open a file, use the open method with the required filename. For input oemolistreams, if the file does not exist, open fails and returns False. For output oemolostreams, the output file is created if it did not previously exist and overwritten if it did. If a filename is passed to the open method, the format and gz compression flags of the oemolistream or oemolostream instance are set based on the file extension of the filename. If no arguments are passed to open, an oemolistream will use standard in and an oemolostream will use standard out; the format will be set to OEFormat_Default and gz compression will be disabled. Much like regular file I/O, oemolstreams can be closed after use with the close method. oemolstreams are automatically closed by their destructors.

Listing 4: Reading and writing molecule files

from openeye import oechem

ifs = oechem.oemolistream()
ofs = oechem.oemolostream()

if ifs.open("input.sdf"):
    if ofs.open("output.mol2"):
        for mol in ifs.GetOEGraphMols():
            oechem.OEWriteMolecule(ofs, mol)
    else:
        oechem.OEThrow.Fatal("Unable to create 'output.mol2'")
else:
    oechem.OEThrow.Fatal("Unable to open 'input.sdf'")

One convenient feature of the open method of oemolstreams is that it sets the file format and gz compression associated with the stream from the file extension of the filename used as an argument. Listing 4 converts the file ‘input.sdf’ in MDL SD file format into the file ‘output.mol2’ in Tripos MOL2 file format. This behavior can be overridden by calling SetFormat after the open but before the first molecule is read from or written to the stream.

String Input and Output

The oemolistream and oemolostream classes provide the ability to read and write memory buffers instead of files.

This is accomplished through the oemolistream.openstring and oemolostream.GetString methods.

Listing 5 demonstrates how to read and write molecules from memory buffers.

Listing 5: Reading and writing molecules from memory buffers

from openeye import oechem


smiles = '''\
CCO
c1cnccc1'''

ims = oechem.oemolistream()
ims.SetFormat(oechem.OEFormat_SMI)
ims.openstring(smiles)

mols = []
for mol in ims.GetOEMols():
    # Store a copy of each molecule read from the string.
    mols.append(oechem.OEMol(mol))

oms = oechem.oemolostream()
oms.SetFormat(oechem.OEFormat_SDF)
oms.openstring()

for mol in mols:
    oechem.OEWriteMolecule(oms, mol)

molfile = oms.GetString()
print("MOL string\n", molfile.decode('UTF-8'))

Compressed Input and Output

For any of the molecular file formats supported by OEChem TK, it is often convenient to read and write compressed files or strings. Molecule streams support gzipped input and output via the zlib library. The .gz suffix on any filename used to open a stream is recognized and the stream is read or written in compressed format. This mechanism does not interfere with the format perception. For instance, foo.sdf.gz is recognized as a gzipped file in MDL’s SD format.

Listing 6 converts all of the molecules in a gzipped SD format file into a gzipped OEBinary file.

Listing 6: Reading and writing compressed molecule files

from openeye import oechem

ifs = oechem.oemolistream("input.sdf.gz")
ofs = oechem.oemolostream("output.oeb.gz")

for mol in ifs.GetOEGraphMols():
    oechem.OEWriteMolecule(ofs, mol)

OEChem TK’s binary file format, OEFormat_OEB, is well suited to storing multi-conformer molecules and is our preferred way of doing so. For multi-conformer datasets of more than 1M molecules, the generated files can easily reach a size at which transferring them between computers or across networks is slow. One way to handle this issue is to compress the files with gzip before transfer. As the above example shows, OEChem TK seamlessly handles gzipped molecule files with the gz extension. Alternatively, OEChem TK also provides a native compressed file format, called OEFormat_OEZ, with the oez file extension. This file format currently only supports multi-conformer molecules derived from the OEMCMolBase class.

One of the main differences between gzipped OEFormat_OEB files and OEFormat_OEZ files is that gzip compresses a molecular file as a whole, whereas in an OEFormat_OEZ file each molecule is individually compressed using the Zstandard compression algorithm.

This difference makes OEFormat_OEZ work well with OEMolDatabase, which is designed to provide fast read-only random access to molecular files. When an OEMolDatabase opens a gzipped molecule file, the whole file has to be uncompressed to access any molecule in the dataset. When working with an OEFormat_OEZ file, accessing a specific molecule only triggers decompression of the associated data chunk.
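
The difference between whole-file and per-molecule compression can be sketched with the Python standard library. This is an analogy under stated assumptions, not the OEZ file layout: zlib stands in for Zstandard, the byte strings in records are made-up placeholders for serialized molecules, and the offsets index and the fetch helper are hypothetical.

```python
import zlib

# Made-up records standing in for serialized molecules.
records = [("molecule-%d " % i).encode("ascii") * 50 for i in range(5)]

# Strategy 1: compress everything as one stream, as gzip does for a file.
# Any read has to start decompressing from the beginning of the stream.
whole = zlib.compress(b"".join(records))

# Strategy 2: compress each record on its own and remember where each
# compressed chunk starts and how long it is.
chunks = [zlib.compress(record) for record in records]
offsets = []
position = 0
for chunk in chunks:
    offsets.append((position, len(chunk)))
    position += len(chunk)
per_record = b"".join(chunks)


def fetch(index):
    """Decompress only the chunk that holds the requested record."""
    start, length = offsets[index]
    return zlib.decompress(per_record[start:start + length])


third = fetch(2)
```

With the second strategy, fetching one record touches only its own chunk, which is the property that makes per-molecule compression a good match for random access.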

The table below lists the file sizes of the same dataset of 1M molecules in uncompressed and compressed formats with different numbers of conformations per molecule.

Dataset                              oeb     oeb.gz  oez
1M molecules with 10 conformations   5.4GB   1.9GB   2.2GB
1M molecules with 50 conformations   22GB    5.7GB   5.5GB
1M molecules with 200 conformations  69GB    16GB    13GB
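
To make the sizes in the table above easier to compare, the following short calculation derives the compression ratio (uncompressed size divided by compressed size) for each row. It uses only the figures from the table; the dictionary layout and variable names are illustrative.

```python
# File sizes in GB, copied from the table above and keyed by the
# number of conformations per molecule.
sizes_gb = {
    10: {"oeb": 5.4, "oeb.gz": 1.9, "oez": 2.2},
    50: {"oeb": 22.0, "oeb.gz": 5.7, "oez": 5.5},
    200: {"oeb": 69.0, "oeb.gz": 16.0, "oez": 13.0},
}

# Compression ratio = uncompressed size / compressed size.
ratios = {}
for confs, row in sizes_gb.items():
    ratios[confs] = {
        "oeb.gz": round(row["oeb"] / row["oeb.gz"], 1),
        "oez": round(row["oeb"] / row["oez"], 1),
    }

for confs in sorted(ratios):
    print(confs, "conformations:", ratios[confs])
```

In this dataset the gzipped file is the smaller of the two at 10 conformations per molecule, while the oez file is smaller at 50 and 200 conformations.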

Note

The OEFormat_OEZ file format was introduced in the 2019.Apr toolkits. OpenEye applications or toolkits released before Apr/2019 will not be able to read or write this new file format.

Command Line Format Control

Using the methods outlined above, it is possible to allow the stream format to be controlled from the command line. OEChem TK’s oemolstreams control the format by interpreting the input and output file names.

Listing 7: Controlling File Format from the Command Line

from openeye import oechem
import sys

if len(sys.argv) != 3:
    oechem.OEThrow.Usage("%s <input> <output>" % sys.argv[0])

ifs = oechem.oemolistream()
ofs = oechem.oemolostream()

if not ifs.open(sys.argv[1]):
    oechem.OEThrow.Fatal("Unable to open %s" % sys.argv[1])

if not ofs.open(sys.argv[2]):
    oechem.OEThrow.Fatal("Unable to create %s" % sys.argv[2])

for mol in ifs.GetOEGraphMols():
    oechem.OEWriteMolecule(ofs, mol)

Listing 7 is an example of using command-line arguments to allow OEChem TK programs to support many file formats at run-time. For instance, if Listing 7 is a program called convert:

prompt> convert file1.sdf file1.smi

This command will convert file1.sdf from SD format to SMILES format. A first extension of this idea allows access to standard in and standard out via the “-” filename. For instance:

prompt> convert file2.mol2 -

This command will read file2.mol2 in MOL2 format and write the molecules to standard out in SMILES, the default format.

Thus if you have another program, GetFromDatabase, which retrieves molecules from a database and writes them in SMILES format, you can chain it with any OEChem TK program. Using your operating system’s redirection commands (e.g., Unix pipe “|” or redirect “>”) you can move molecules directly from GetFromDatabase to convert without a temporary file:

prompt> GetFromDatabase | convert - file3.sdf

This command will take the SMILES format output from GetFromDatabase and generate an SD format file.

However, to make this concept of using standard in and standard out for piping data really useful, one needs to be able to control the format of standard in and standard out similarly to the way it would be controlled for temporary files. To facilitate this, oemolstreams interpret filenames which are ONLY format extensions to indicate format control for standard in and standard out.

Listing 8: Controlling standard in and standard out File Format

from openeye import oechem

ifs = oechem.oemolistream(".sdf")
ofs = oechem.oemolostream(".mol2")

for mol in ifs.GetOEGraphMols():
    oechem.OEWriteMolecule(ofs, mol)

Listing 8 is an example of opening an oemolistream to standard in with the format set to SDF. The output oemolostream is opened to standard out with the format set to MOL2. This is exactly equivalent to Listing 3. However, it demonstrates the extensibility of controlling the format of standard in and standard out from the command line.

Note

This prevents you from naming files as pure extensions: .mol2, .sdf, etc.

Now, using our program convert from Listing 7, it is possible to do the following:

prompt> convert .sdf .mol2

This command opens standard in with the SDF format and opens standard out with the MOL2 format.

Now there is complete format control of standard in and standard out from the command line. If we have a program GenerateStructures, which only writes the MOL2 format and another program GenerateData, which only reads the SD format, we can use them from the command line with any OEChem TK program which uses command-line arguments for file specification:

prompt> GenerateStructures | convert .mol2 .sdf | GenerateData

This command demonstrates how any OEChem TK program with command-line file specification can be used to pipe formatted input and output.
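
The naming conventions used in this section (“-” for a standard stream, a bare extension such as .sdf, and an optional .gz suffix) can be summarized in a few lines of plain Python. The sketch below is only a rough illustration of the idea: interpret_stream_name is a hypothetical helper, not an OEChem TK function, and it makes no claim to reproduce how oemolstreams actually parse filenames.

```python
import os


def interpret_stream_name(name):
    """Return (target, extension, gzipped) for a stream name.

    Hypothetical helper for illustration only; not part of OEChem TK.
    target is "std" for standard in/out and "file" for a real file.
    extension is None when the name carries no format extension.
    """
    gzipped = name.endswith(".gz")
    if gzipped:
        name = name[:-len(".gz")]

    if name == "-":
        # "-" selects standard in or standard out with the default format.
        return ("std", None, gzipped)

    base = os.path.basename(name)
    if base.startswith(".") and base.count(".") == 1:
        # A bare extension such as ".sdf": standard stream, explicit format.
        return ("std", base[1:], gzipped)

    _, ext = os.path.splitext(base)
    return ("file", ext[1:] if ext else None, gzipped)


examples = ["file1.sdf", "-", ".mol2", "foo.sdf.gz"]
interpreted = {name: interpret_stream_name(name) for name in examples}
```

The bare-extension case has to be handled before the ordinary one, which also shows why a real file cannot be named with a pure extension.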

Flavored Input and Output

The general goal of the oemolstream input and output classes in OEChem TK is to provide the user with transparent access to the very complex task of reading and writing molecules in a wide variety of formats. However, occasionally, a programmer may want to tweak the behavior of specific readers or writers without abandoning the oemolstreams to use the low level writers (such as OEWriteMDLFile). For these instances, oemolstreams provide the oemolstreambase.SetFlavor and oemolstreambase.GetFlavor methods.

The oemolstreambase.SetFlavor method takes two unsigned integer arguments: the first is the format for which the flavor is being specified and the second is the flavor itself. The formats are specified as discussed in File Formats. The input flavors are specified in the OEIFlavor namespace and the output flavors are specified in the OEOFlavor namespace. Unlike the formats, the flavors are bitmasks and may be binary OR’d together. Under the OEIFlavor and OEOFlavor namespaces, there is a namespace for each format as well as a generic namespace (OEIFlavor_Generic and OEOFlavor_Generic, respectively). The generic namespace is used to control aromaticity perception and other properties common to all formats. To completely specify a flavor, one would typically binary-OR a generic flag and a format-specific flag and pass the resultant value to oemolstreambase.SetFlavor.
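
Because flavors are bitmasks, they can be combined, tested and trimmed with ordinary integer operators. The sketch below shows the arithmetic with made-up constants; the names and values are hypothetical placeholders and are not the real OEChem TK flavor constants, which should always be taken from the oechem module.

```python
# Made-up flag values for illustration only. Each flag occupies one bit.
HYPOTHETICAL_GENERIC_DEFAULT = 0x1
HYPOTHETICAL_PDB_DEFAULT = 0x2
HYPOTHETICAL_PDB_TER = 0x4

# Combine a generic flag with format-specific flags using binary OR.
flavor = (HYPOTHETICAL_GENERIC_DEFAULT
          | HYPOTHETICAL_PDB_DEFAULT
          | HYPOTHETICAL_PDB_TER)

# Test whether a particular bit is set using binary AND.
ter_is_set = bool(flavor & HYPOTHETICAL_PDB_TER)

# Switch a single bit off again while leaving the others untouched.
flavor_without_ter = flavor & ~HYPOTHETICAL_PDB_TER
```

Building on top of the default flags, rather than passing a single flag on its own, keeps the default behavior in place and changes only the bits of interest.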

The default behavior for the PDB reader is that TER specifies the termination of a disconnected fragment within the same molecule while END specifies the termination of a connection table. However, some users may want to have the reader split PDB input files into different molecules every time a TER appears.

The following code is an example of changing the PDB reader flavor.

Listing 9: Changing oemolistream Reader Flavor

from openeye import oechem

ifs = oechem.oemolistream('input.pdb')
ofs = oechem.oemolostream('output.mol2')

flavor = oechem.OEIFlavor_Generic_Default | oechem.OEIFlavor_PDB_Default | oechem.OEIFlavor_PDB_TER
ifs.SetFlavor(oechem.OEFormat_PDB, flavor)

for mol in ifs.GetOEGraphMols():
    oechem.OEWriteMolecule(ofs, mol)

Similar low-level control can be exerted over both input and output oemolstreams using the powerful oemolstreambase.SetFlavor method. See the API documentation for the associated low-level routine and namespace for details on the effects of specific flavor flags.

OEChem TK low-level readers and input flavors

File Format        Low-Level Reader            Input Flavors
OEFormat_CAN       OEParseSmiles               OEIFlavor_CAN
OEFormat_CDX       OEReadCDXFile               OEIFlavor_CDX
OEFormat_CIF       OEReadCIFFile               OEIFlavor_CIF
OEFormat_CSV       OEReadCSVFile               OEIFlavor_CSV
OEFormat_CXSMILES                              OEIFlavor_CXSMILES
OEFormat_FASTA     OEReadFASTAFile             OEIFlavor_FASTA
OEFormat_ISM       OEParseSmiles               OEIFlavor_ISM
OEFormat_MDL       OEReadMDLFile               OEIFlavor_MDL
OEFormat_MF
OEFormat_MMCIF     OEReadCIFFile               OEIFlavor_MMCIF
OEFormat_MMOD      OEReadMacroModelFile        OEIFlavor_MMOD
OEFormat_MOL2      OEReadMol2File              OEIFlavor_MOL2
OEFormat_MOL2H     OEReadMol2File              OEIFlavor_MOL2H
OEFormat_MOPAC
OEFormat_OEB       OEReadOEBFile
OEFormat_OEZ       N/A (OEReadMolecule works)
OEFormat_PDB       OEReadPDBFile               OEIFlavor_PDB
OEFormat_SDF       OEReadMDLFile               OEIFlavor_SDF
OEFormat_SKC       OEReadSketchFile            OEIFlavor_SKC
OEFormat_SLN
OEFormat_SMI       OEParseSmiles               OEIFlavor_SMI
OEFormat_XYZ       OEReadXYZFile               OEIFlavor_XYZ

OEChem TK low-level writers and output flavors

File Format        Low-Level Writer             Output Flavors
OEFormat_CAN       OECreateCanSmiString         OEOFlavor_CAN
OEFormat_CIF       OEWriteCIFFile               OEOFlavor_CIF
OEFormat_CDX       OEWriteCDXFile               OEOFlavor_CDX
OEFormat_CSV       OEWriteCSVFile               OEOFlavor_CSV
OEFormat_CXSMILES                               OEOFlavor_CXSMILES
OEFormat_FASTA     OEWriteFASTAFile             OEOFlavor_FASTA
OEFormat_ISM       OECreateIsoSmiString         OEOFlavor_ISM
OEFormat_MDL       OEWriteMDLFile               OEOFlavor_MDL
OEFormat_MF        OEMolecularFormula           OEOFlavor_MF
OEFormat_MMCIF     OEWriteCIFFile               OEOFlavor_MMCIF
OEFormat_MMOD      OEWriteMacroModelFile        OEOFlavor_MMOD
OEFormat_MOL2      OEWriteMol2File              OEOFlavor_MOL2
OEFormat_MOL2H     OEWriteMol2File              OEOFlavor_MOL2H
OEFormat_MOPAC     OEWriteMOPACInputFile        OEOFlavor_MOPAC
OEFormat_OEB       N/A (OEWriteMolecule works)
OEFormat_OEZ       N/A (OEWriteMolecule works)
OEFormat_PDB       OEWritePDBFile               OEOFlavor_PDB
OEFormat_SDF       OEWriteMDLFile               OEOFlavor_SDF
OEFormat_SKC
OEFormat_SLN       OECreateSlnString            OEOFlavor_SLN
OEFormat_SMI       OECreateSmiString            OEOFlavor_SMI
OEFormat_XYZ       OEWriteXYZFile               OEOFlavor_XYZ
OEFormat_INCHI     OECreateInChI                OEOFlavor_INCHI
OEFormat_INCHIKEY  OECreateInChIKey             OEOFlavor_INCHIKEY

Note

Low-level readers will not call Clear on the molecule before reading the molecule into it. This is useful for concatenating multiple molecules together.

Molecular Property Preservation

Warning

OEWriteMolecule will change your molecule!

The high-level OEWriteMolecule function standardizes the molecule according to the output type for uniformity. For writing molecules without changing them, there are two options. If you would like the data to appear in the file exactly as it is in the molecule (perhaps Tripos atom names in a PDB file), then you should use a low-level writer. On the other hand, if you would like to write a standardized molecule (e.g. Tripos atom types in a MOL2 file), but don’t want your molecule changed, you can use OEWriteConstMolecule. The OEWriteConstMolecule function makes a copy of the input molecule before performing the necessary perceptions and standardizations.
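
The copy-before-standardize pattern described above can be sketched in plain Python. Everything in this example is a hypothetical stand-in: the dictionary plays the role of a molecule, upper-casing atom names plays the role of standardization, and the three functions are illustrative and are not OEChem TK functions.

```python
import copy


def standardize_for_output(mol):
    """Hypothetical in-place standardization: upper-case the atom names."""
    mol["atom_names"] = [name.upper() for name in mol["atom_names"]]


def write_molecule(mol, sink):
    """Standardize the caller's molecule in place, then write it."""
    standardize_for_output(mol)
    sink.append(" ".join(mol["atom_names"]))


def write_const_molecule(mol, sink):
    """Work on a private copy so the caller's molecule is left untouched."""
    write_molecule(copy.deepcopy(mol), sink)


original = {"atom_names": ["ca", "cb"]}
written = []

write_const_molecule(original, written)
after_const = list(original["atom_names"])

write_molecule(original, written)
after_plain = list(original["atom_names"])
```

Both calls produce the same standardized output, but only the second one alters the caller's data. The price of the first is one extra copy per molecule written.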