Molecule Reading and Writing

Molecule Streams

The previous section demonstrated reading and writing SMILES strings from a file. This required the programmer to perform the file I/O explicitly. Whilst this may be reasonable for SMILES strings, which can be read via STL’s getline, it isn’t suitable for more complex file formats. To ease this task, OEChem provides the oemolstream abstraction. The classes oemolistream and oemolostream allow input and output of molecules using the OEReadMolecule and OEWriteMolecule functions, respectively.

Warning

The oemolstreambase.close method must be explicitly called on all molecule streams in Java. Java does not guarantee that finalizers are ever called. There is also a delay before a finalizer is called, possibly causing the operating system to run out of resources, e.g., file handles.
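One common way to honor this requirement is to put the close call in a finally block, so that it runs even when reading or writing throws. The sketch below is plain Java and does not use OEChem: FakeStream is a hypothetical stand-in for a molecule stream, and only its close method matters here.

```java
// Hypothetical stand-in for a molecule stream; only the close() contract matters here.
class CloseInFinally {
    static final class FakeStream {
        boolean closed = false;
        void close() { closed = true; }
    }

    // Runs the work against the stream and closes it even if the work throws.
    static void process(FakeStream s, Runnable work) {
        try {
            work.run();
        } finally {
            s.close();  // explicit close, never left to a finalizer
        }
    }

    public static void main(String[] args) {
        FakeStream ok = new FakeStream();
        process(ok, () -> {});
        System.out.println("closed after normal run: " + ok.closed);

        FakeStream failing = new FakeStream();
        try {
            process(failing, () -> { throw new IllegalStateException("read failed"); });
        } catch (IllegalStateException e) {
            System.out.println("closed after failure: " + failing.closed);
        }
    }
}
```

The same try/finally shape can be wrapped around the read and write loops in the listings that follow.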

Listing 1: High-level Molecule I/O using molstreams

package openeye.docexamples.oechem;

import openeye.oechem.*;

public class oemolstream {
    public static void main(String argv[]) {
        oemolistream ifs = new oemolistream();
        oemolostream ofs = new oemolostream();

        OEGraphMol mol = new OEGraphMol();
        while (oechem.OEReadMolecule(ifs, mol))
            oechem.OEWriteMolecule(ofs, mol);
        ofs.close();
        ifs.close();
    }
}

Listing 1 will read molecules from stdin in SMILES format and write them to stdout in absolute SMILES format. Notice that in this example, the following is done automatically for you by OEReadMolecule:

Note

OEReadMolecule will skip any invalid molecules automatically.

Since the same OEGraphMol instance is being reused for every molecule in the stream, special precautions must be taken to load up an entire file into memory. Listing 2 will read molecules into memory by making sure there is a copy made when inserting the molecule into the container. This is useful if the molecules need to be repeatedly accessed.

Listing 2: Reading molecules into memory

package openeye.docexamples.oechem;

import java.util.ArrayList;
import openeye.oechem.*;

public class ReadMolIntoMemory {
    public static void main(String argv[]) {
        oemolistream ifs = new oemolistream();
        ArrayList<OEGraphMol> mollist = new ArrayList<OEGraphMol>();

        // read molecules and put copies into list
        OEGraphMol mol = new OEGraphMol();
        while (oechem.OEReadMolecule(ifs, mol)) {
            System.out.println(mol.GetTitle());
            mollist.add(new OEGraphMol(mol));
        }
        ifs.close();
        System.out.println(mollist.size()+" molecules loaded.");
    }
}
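Why the copy matters can be seen without OEChem at all. The following plain-Java sketch uses a hypothetical Record class as a stand-in for a reusable molecule object. Storing the same instance repeatedly leaves every list entry pointing at the last value read, whereas storing copies preserves each one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a mutable, reusable molecule object (not an OEChem class).
class ReuseVersusCopy {
    static final class Record {
        String title;
        Record(String title) { this.title = title; }
        Record(Record other) { this.title = other.title; }  // copy constructor
    }

    // Stores the same instance each time: every list entry aliases one object.
    static List<Record> loadWithoutCopy(String[] titles) {
        List<Record> out = new ArrayList<>();
        Record rec = new Record("");
        for (String t : titles) {
            rec.title = t;  // simulates the reader overwriting the reused object
            out.add(rec);
        }
        return out;
    }

    // Stores a fresh copy each time, as Listing 2 does with new OEGraphMol(mol).
    static List<Record> loadWithCopy(String[] titles) {
        List<Record> out = new ArrayList<>();
        Record rec = new Record("");
        for (String t : titles) {
            rec.title = t;
            out.add(new Record(rec));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] titles = {"aspirin", "caffeine"};
        System.out.println(loadWithoutCopy(titles).get(0).title);  // prints "caffeine"
        System.out.println(loadWithCopy(titles).get(0).title);     // prints "aspirin"
    }
}
```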

File Formats

In addition to SMILES strings, OEChem is able to read numerous other molecular file formats, including MDL SD files, Tripos MOL2 files and PDB files. The format of an input file or stream may be associated with an oemolistream using the SetFormat method, and may be retrieved with GetFormat. The file format is represented by an unsigned integer, which should be one of the constants defined in the OEFormat namespace. A value of OEFormat.UNDEFINED means that there is no file format associated with the oemolstream.

Note

The default format associated with an oemolstream is OEFormat.SMI.

Listing 3 demonstrates how to use oemolstreams to convert MDL SD files into Tripos MOL2 files.

Listing 3: Explicitly setting file formats

package openeye.docexamples.oechem;

import openeye.oechem.*;

public class ReadWriteSettingFormats {
    public static void main(String argv[]) {
        oemolistream ifs = new oemolistream();
        oemolostream ofs = new oemolostream();

        ifs.SetFormat(OEFormat.SDF);
        ofs.SetFormat(OEFormat.MOL2);

        OEGraphMol mol = new OEGraphMol();
        while (oechem.OEReadMolecule(ifs, mol)) {
            oechem.OEWriteMolecule(ofs, mol);
        }
        ofs.close();
        ifs.close();
    }
}

OEChem file format support

File Format        Description                         Read  Write
OEFormat.CAN       Canonical SMILES                    Yes   Yes
OEFormat.CDX       ChemDraw CDX                        Yes   Yes
OEFormat.CSV       Comma separated values              Yes   Yes
OEFormat.FASTA     FASTA protein sequence              Yes   Yes
OEFormat.INCHI     IUPAC InChI                         No    Yes
OEFormat.INCHIKEY  IUPAC InChI Key                     No    Yes
OEFormat.ISM       Canonical isomeric SMILES           Yes   Yes
OEFormat.MDL       MDL Mol File                        Yes   Yes
OEFormat.MF        Molecular Formula (Hill order)      No    Yes
OEFormat.MMOD      Macromodel                          Yes   Yes
OEFormat.MOL2      Tripos Sybyl mol2 file              Yes   Yes
OEFormat.MOL2H     Sybyl mol2 with explicit hydrogens  Yes   Yes
OEFormat.MOPAC     MOPAC file format(s)                No    Yes
OEFormat.OEB       OpenEye OEBinary                    Yes   Yes
OEFormat.PDB       Protein Databank PDB file           Yes   Yes
OEFormat.RDF       MDL RDF File                        Yes   No
OEFormat.SDF       MDL SD File                         Yes   Yes
OEFormat.SKC       MDL ISIS Sketch File                Yes   No
OEFormat.SLN       Tripos Sybyl Line Notation          No    Yes
OEFormat.SMI       Canonical isomeric SMILES           Yes   Yes
OEFormat.USM       SMILES                              Yes   Yes
OEFormat.XYZ       XMol XYZ format                     Yes   Yes

File Input and Output

In addition to standard in and standard out, OEChem’s oemolstreams also support reading from and writing to files. To open a file, use the open method with the required filename. For input oemolistreams, open fails and returns false if the file does not exist. For output oemolostreams, the output file is created if it did not previously exist and overwritten if it did. If no arguments are passed to open, an oemolistream will use standard in, and an oemolostream will use standard out. Much like regular file I/O, oemolstreams should be closed after use with the close method. As the warning at the start of this section explains, Java does not guarantee that finalizers run, so close must be called explicitly rather than relying on automatic cleanup.

Listing 4: Reading and writing molecule files

package openeye.docexamples.oechem;

import openeye.oechem.*;

public class ReadWriteToFiles {
    public static void main(String argv[]) {
        oemolistream ifs = new oemolistream();
        oemolostream ofs = new oemolostream();

        if (!ifs.open("input.sdf"))
            oechem.OEThrow.Fatal("Unable to open 'input.sdf'");

        if (!ofs.open("output.mol2"))
            oechem.OEThrow.Fatal("Unable to create 'output.mol2'");

        OEGraphMol mol = new OEGraphMol();
        while (oechem.OEReadMolecule(ifs, mol)) 
            oechem.OEWriteMolecule(ofs, mol);
        ofs.close();
        ifs.close();
    }
}

One convenient feature of the open method of oemolstreams is that it sets the file format associated with the stream from the file extension of the filename used as an argument. Listing 4 converts the file ‘input.sdf’ in MDL SD file format into the file ‘output.mol2’ in Tripos MOL2 file format. This behavior can be overridden by calling SetFormat after open but before the first molecule is read from or written to the stream.

String Input and Output

The oemolistream and oemolostream provide the ability to read and write memory buffers instead of files. This is accomplished through the openstring methods, oemolistream.openstring and oemolostream.openstring.

Compressed Input and Output

For any of the molecular file formats supported by OEChem, it is often convenient to read and write compressed files or strings. Molecule streams support gzipped input and output via the zlib library. The “.gz” suffix on any filename used to open a stream is recognized and the stream is read or written in compressed format. This mechanism does not interfere with the format perception. For instance, “fn.sdf.gz” is recognized as a gzipped file in MDL’s SD format.

Listing 5: Reading and writing compressed molecule files

package openeye.docexamples.oechem;

import openeye.oechem.*;

public class ReadWriteCompressedFiles {
    public static void main(String argv[]) {
        oemolistream ifs = new oemolistream("input.sdf.gz");
        oemolostream ofs = new oemolostream("output.oeb.gz");

        OEGraphMol mol = new OEGraphMol();
        while (oechem.OEReadMolecule(ifs, mol))
            oechem.OEWriteMolecule(ofs, mol);
        ofs.close();
        ifs.close();
    }
}

Listing 5 converts all of the molecules in a gzipped SD format file into a gzipped OEBinary version 2 format file.
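The compression layer is independent of the molecule format carried inside it. As an illustration that uses only the Java standard library (java.util.zip, not OEChem), the sketch below gzips a short SMILES record in memory and reads it back unchanged. The class and method names are hypothetical. This is not how OEChem implements its gzip support; it only shows what the “.gz” wrapper does to the bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRoundTrip {
    // Compresses text into gzip bytes, as a ".gz" file on disk would hold them.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Decompresses gzip bytes back into the original text.
    static String gunzip(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String record = "c1ccccc1 benzene\n";
        byte[] packed = gzip(record);
        System.out.println("round trip intact: " + gunzip(packed).equals(record));
    }
}
```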

Command Line Format Control

Using the methods outlined above, it is possible to allow the stream format to be controlled from the command line. OEChem’s oemolstreams control the format by interpreting the input and output file names.

Listing 6: Controlling File Format from the Command Line

package openeye.docexamples.oechem;

import openeye.oechem.*;

public class ReadWriteCommandLineArgs {
    public static void main(String argv[]) {
        if (argv.length != 2)
            oechem.OEThrow.Usage("ReadWriteCommandLineArgs <input> <output>");

        oemolistream ifs = new oemolistream();
        oemolostream ofs = new oemolostream();

        if (!ifs.open(argv[0]))
            oechem.OEThrow.Fatal("Unable to open " + argv[0]);

        if (!ofs.open(argv[1]))
            oechem.OEThrow.Fatal("Unable to create " + argv[1]);

        OEGraphMol mol = new OEGraphMol();
        while (oechem.OEReadMolecule(ifs, mol)) 
            oechem.OEWriteMolecule(ofs, mol);
        ofs.close();
        ifs.close();
    }
}

Listing 6 is an example of using command-line arguments to allow OEChem programs to support many file formats at run-time. For instance, if Listing 6 is a program called convert:

prompt> convert file1.sdf file1.smi

This command will convert file1.sdf from SD format to SMILES format. A first extension of this idea allows access to standard in and standard out via the “-” filename. For instance:

prompt> convert file2.mol2 -

This command will read file2.mol2 in MOL2 format and write the molecules to standard out in SMILES, the default format.

Thus, if you have another program, GetFromDatabase, which retrieves molecules from a database and writes them in SMILES format, you can chain it with any OEChem program. Using your operating system’s redirection facilities (e.g., a Unix pipe “|” or redirect “>”), you can move molecules directly from GetFromDatabase to convert without a temporary file:

prompt> GetFromDatabase | convert - file3.sdf

This command will take the SMILES format output from GetFromDatabase and generate an SD format file.

However, to make this concept of using standard in and standard out for piping data really useful, one needs to be able to control the format of standard in and standard out similarly to the way it would be controlled for temporary files. To facilitate this, oemolstreams interpret filenames which are ONLY format extensions to indicate format control for standard in and standard out.

Listing 7: Controlling standard in and standard out File Format

package openeye.docexamples.oechem;

import openeye.oechem.*;

public class ReadWriteStdinStdoutFormatC {
    public static void main(String argv[]) {
        oemolistream ifs = new oemolistream(".sdf");
        oemolostream ofs = new oemolostream(".mol2");

        OEGraphMol mol = new OEGraphMol();
        while (oechem.OEReadMolecule(ifs, mol)) {
            oechem.OEWriteMolecule(ofs, mol);
        }
        ofs.close();
        ifs.close();
    }
}

Listing 7 is an example of opening an oemolistream to standard in with the format set to SDF. The output oemolostream is opened to standard out with the format set to MOL2. This is exactly equivalent to Listing 3. However, it demonstrates the extensibility of controlling the format of standard in and standard out from the command line.

Note

This prevents you from naming files as pure extensions: “.mol2”, “.sdf”, etc.

Now, using our program convert from Listing 6 it is possible to do the following:

prompt> convert .sdf .mol2

This command opens standard in with the SDF format and opens standard out with the MOL2 format.

Now there is complete format control of standard in and standard out from the command line. If we have a program GenerateStructures, which only writes the MOL2 format and another program GenerateData, which only reads the SD format, we can use them from the command line with any OEChem program which uses command-line arguments for file specification:

prompt> GenerateStructures | convert .mol2 .sdf | GenerateData

This command demonstrates how any OEChem program with command-line file specification can be used to pipe formatted input and output.
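The filename conventions described in this section can be summarized in a small plain-Java model. This is a sketch only and not OEChem code: the StreamSpec class and its parse method are hypothetical, it assumes bare file names without directory components, and its handling of a “.gz” suffix on a pure extension (for example “.sdf.gz”) is an assumption of the sketch rather than documented behavior.

```java
// Hypothetical model of the filename rules described above; not OEChem code.
class StreamSpec {
    final boolean standardStream;  // true: standard in / standard out
    final String extension;        // format hint such as "sdf"; "" if none
    final boolean gzipped;         // true when the name ends in ".gz"

    StreamSpec(boolean standardStream, String extension, boolean gzipped) {
        this.standardStream = standardStream;
        this.extension = extension;
        this.gzipped = gzipped;
    }

    static StreamSpec parse(String name) {
        if (name.equals("-")) {
            // "-" selects a standard stream in the default (SMILES) format.
            return new StreamSpec(true, "smi", false);
        }
        boolean gz = name.endsWith(".gz");
        String base = gz ? name.substring(0, name.length() - 3) : name;
        int dot = base.lastIndexOf('.');
        String ext = dot >= 0 ? base.substring(dot + 1) : "";
        // A name that is only an extension, such as ".sdf", selects a standard stream.
        boolean pureExtension = dot == 0 && !ext.isEmpty();
        return new StreamSpec(pureExtension, ext, gz);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"-", ".sdf", "file1.smi", "fn.sdf.gz"}) {
            StreamSpec s = parse(name);
            System.out.println(name + " -> standard=" + s.standardStream
                    + " ext=" + s.extension + " gz=" + s.gzipped);
        }
    }
}
```

Under these rules “.sdf” and “-” both resolve to a standard stream, “file1.smi” resolves to a file in SMILES format, and “fn.sdf.gz” resolves to a gzipped file in SD format.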

Flavored Input and Output

The general goal of the oemolstream input and output classes in OEChem is to provide the user with transparent access to the very complex task of reading and writing molecules in a wide variety of formats. Occasionally, however, a programmer may want to tweak the behavior of specific readers or writers without abandoning the oemolstreams in favor of the low-level writers (such as OEWriteMDLFile). For these instances, oemolstreams provide the oemolstreambase.SetFlavor and oemolstreambase.GetFlavor methods.

The oemolstreambase.SetFlavor method takes two unsigned integer arguments: the first is the format for which the flavor is being specified and the second is the flavor itself. The formats are specified as discussed in File Formats. The input flavors are specified in the OEIFlavor namespace and the output flavors are specified in the OEOFlavor namespace. Unlike the formats, the flavors are bitmasks and may be binary OR’d together. Under the OEIFlavor and OEOFlavor namespaces, there is a namespace for each format as well as a Generic namespace (for example, OEIFlavor.Generic). The OEOFlavor.Generic namespace is used to control aromaticity perception and other properties common to all formats. To completely specify a flavor, one would typically binary-OR a Generic flag (OEIFlavor.Generic for input, OEOFlavor.Generic for output) and a format-specific flag and pass the resultant value to oemolstreambase.SetFlavor.
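The bitmask arithmetic involved can be sketched in plain Java. The constant names and values below are invented for illustration only; the real flags live in the OEIFlavor and OEOFlavor namespaces and their numeric values are not assumed here.

```java
// Hypothetical flag values for illustration only; real flags live in OEIFlavor / OEOFlavor.
class FlavorBits {
    static final int GENERIC_DEFAULT = 0x0001;  // stands in for a Generic flag
    static final int FORMAT_DEFAULT  = 0x0100;  // stands in for a format-specific default
    static final int FORMAT_TER      = 0x0200;  // stands in for one optional format flag

    // True when every bit of flag is present in flavor.
    static boolean isSet(int flavor, int flag) {
        return (flavor & flag) == flag;
    }

    // Returns flavor with the bits of flag cleared.
    static int without(int flavor, int flag) {
        return flavor & ~flag;
    }

    public static void main(String[] args) {
        // Combine a generic flag with format-specific flags using binary OR.
        int flavor = GENERIC_DEFAULT | FORMAT_DEFAULT | FORMAT_TER;
        System.out.println("TER set: " + isSet(flavor, FORMAT_TER));
        System.out.println("TER set after clearing: " + isSet(without(flavor, FORMAT_TER), FORMAT_TER));
    }
}
```

Because each flag occupies its own bit, flags can be combined, tested and removed independently of one another.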

The default behavior for the PDB reader is that TER specifies the termination of a disconnected fragment within the same molecule while END specifies the termination of a connection table. However, some users may want to have the reader split PDB input files into different molecules every time a TER appears.

The following code is an example of changing the PDB reader flavor.

Listing 8: Changing oemolistream Reader Flavor

package openeye.docexamples.oechem;

import openeye.oechem.*;

public class ReadWriteFlavors {
    public static void main(String argv[]) {
        oemolistream ifs = new oemolistream("input.pdb");
        oemolostream ofs = new oemolostream("output.mol2");
        int flavor = 
                OEIFlavor.Generic.Default|OEIFlavor.PDB.Default|OEIFlavor.PDB.TER;
        ifs.SetFlavor(OEFormat.PDB, flavor);

        OEGraphMol mol = new OEGraphMol();
        while (oechem.OEReadMolecule(ifs, mol)) {
            oechem.OEWriteMolecule(ofs, mol);
        }
        ofs.close();
        ifs.close();
    }
}

Similar low-level control can be exerted over both input and output oemolstreams using the powerful oemolstreambase.SetFlavor method. See the API documentation for the associated low-level routine and namespace for details on the effects of specific flavor flags.

OEChem low-level readers and input flavors

File Format        Low-Level Reader      Input Flavors
OEFormat.CAN       OEParseSmiles         OEIFlavor.CAN
OEFormat.CDX       OEReadCDXFile         OEIFlavor.CDX
OEFormat.CSV       OEReadCSVFile         OEIFlavor.CSV
OEFormat.FASTA     OEReadFASTAFile       OEIFlavor.FASTA
OEFormat.ISM       OEParseSmiles         OEIFlavor.ISM
OEFormat.MDL       OEReadMDLFile         OEIFlavor.MDL
OEFormat.MF
OEFormat.MMOD      OEReadMacroModelFile  OEIFlavor.MMOD
OEFormat.MOL2      OEReadMol2File        OEIFlavor.MOL2
OEFormat.MOL2H     OEReadMol2File        OEIFlavor.MOL2H
OEFormat.MOPAC
OEFormat.OEB       OEReadOEBFile
OEFormat.PDB       OEReadPDBFile         OEIFlavor.PDB
OEFormat.SDF       OEReadMDLFile         OEIFlavor.SDF
OEFormat.SKC       OEReadSketchFile      OEIFlavor.SKC
OEFormat.SLN
OEFormat.SMI       OEParseSmiles         OEIFlavor.SMI
OEFormat.XYZ       OEReadXYZFile         OEIFlavor.XYZ

OEChem low-level writers and output flavors

File Format        Low-Level Writer             Output Flavors
OEFormat.CAN       OECreateCanSmiString         OEOFlavor.CAN
OEFormat.CDX       OEWriteCDXFile               OEOFlavor.CDX
OEFormat.CSV       OEWriteCSVFile               OEOFlavor.CSV
OEFormat.FASTA     OEWriteFASTAFile             OEOFlavor.FASTA
OEFormat.ISM       OECreateIsoSmiString         OEOFlavor.ISM
OEFormat.MDL       OEWriteMDLFile               OEOFlavor.MDL
OEFormat.MF        OEMolecularFormula           OEOFlavor.MF
OEFormat.MMOD      OEWriteMacroModelFile        OEOFlavor.MMOD
OEFormat.MOL2      OEWriteMol2File              OEOFlavor.MOL2
OEFormat.MOL2H     OEWriteMol2File              OEOFlavor.MOL2H
OEFormat.MOPAC     OEWriteMOPACInputFile        OEOFlavor.MOPAC
OEFormat.OEB       N/A (OEWriteMolecule works)
OEFormat.PDB       OEWritePDBFile               OEOFlavor.PDB
OEFormat.SDF       OEWriteMDLFile               OEOFlavor.SDF
OEFormat.SKC
OEFormat.SLN       OECreateSlnString            OEOFlavor.SLN
OEFormat.SMI       OECreateSmiString            OEOFlavor.SMI
OEFormat.XYZ       OEWriteXYZFile               OEOFlavor.XYZ
OEFormat.INCHI     OECreateInChI                OEOFlavor.INCHI
OEFormat.INCHIKEY  OECreateInChIKey             OEOFlavor.INCHIKEY

Note

Low-level readers will not call Clear on the molecule before reading the molecule into it. This is useful for concatenating multiple molecules together.

Molecular Property Preservation

Warning

OEWriteMolecule will change your molecule!

The high-level OEWriteMolecule function standardizes the molecule according to the output type for uniformity. For writing molecules without changing them, there are two options. If you would like the data to appear in the file exactly as it is in the molecule (perhaps Tripos atom names in a .pdb file), then you should use a low-level writer. On the other hand, if you would like to write a standardized molecule (e.g., Tripos atom types in a MOL2 file), but don’t want your molecule changed, you can use OEWriteConstMolecule.