Database Preparation

BROOD includes a default database of approximately six million molecular fragments that are carefully selected to be medicinally relevant. While this database is quite thorough, we recognize that some users may want to supplement the default database with fragments generated from their own proprietary molecule collections. To facilitate this process, the BROOD distribution includes two programs, CHOMP and BROODDBMERGE.

CHOMP allows users to fragment molecules, filter the fragments, generate 3D conformations, organize and index the fragments for rapid searching, and write a BROOD database. This process can be carried out from start to finish using a BROOD license and does not require access to OMEGA, FILTER, MolProp, or OEChem licenses. BROODDBMERGE, a utility that allows users to merge their own corporate databases with the OpenEye default database, will be covered in detail in the following chapter.

Theory

BROOD’s database generation program, CHOMP, fragments molecules by identifying critical bonds that can be broken. CHOMP includes three sets of bond-breaking patterns: RLF (ring, linkers & functional groups), RECAP rules [Lewell-1998], and ALL (indicating breaking all non-ring, non-resonance single bonds). By default, CHOMP breaks all bonds identified using any of these methods. Users can also specify a SMARTS file of their own bond identifiers. This file should include a series of SMARTS patterns (one per line) that each define two atoms on opposite ends of the bond to be broken. For example, a line with the SMARTS “-” will cause all single bonds to be broken, while a line with the SMARTS “[R]-[!R]” will cause all bonds between ring atoms and non-ring atoms to be broken and “[#6]-!@[!#6!#1]” will cause CHOMP to break every single non-ring bond between a carbon atom and a heteroatom.
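As a minimal sketch of the file format described above, the following Python snippet writes a bond-pattern file with one SMARTS per line. The function name write_smarts_file and the constant PATTERNS are illustrative and not part of CHOMP; the two patterns are the ring/non-ring and carbon-heteroatom examples from the text.

```python
from pathlib import Path


def write_smarts_file(path, patterns):
    """Write one SMARTS bond pattern per line and return the text written."""
    text = "\n".join(patterns) + "\n"
    Path(path).write_text(text)
    return text


# Example patterns taken from the paragraph above.
PATTERNS = [
    "[R]-[!R]",         # bonds between ring atoms and non-ring atoms
    "[#6]-!@[!#6!#1]",  # non-ring single bonds between carbon and a heteroatom
]
```

Calling write_smarts_file("bondids", PATTERNS) would produce a two-line file whose name can then be given to CHOMP with the -smarts flag.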

The RLF chemical heuristics seek to break compounds into three types of primary fragments: contiguous ring systems, functional groups, and linkers. Contiguous ring systems include any set of atoms that are bonded together by at least one ring bond. Thus, fused rings and spiro rings are included as a single ring system, but biphenyl is broken into two ring systems. Functional groups are defined as any collection of bonded atoms including one or more heteroatoms or unsaturated carbons separated by at most a single fully saturated carbon atom. The linkers are the remaining saturated carbon skeletons. It should be noted that linkers, like functional groups and ring systems, can be terminal (i.e., degree 1).

CHOMP systematically identifies all the molecular fragments that can be generated by breaking one or more of the specified bonds. CHOMP first eliminates fragments that have more than 15 heavy atoms or that have more than three attachment points. CHOMP also filters the fragments based on commonly used molecular filters, eliminating unstable, reactive, or toxic functional groups. As a final fragment generation step, CHOMP caps open attachment valences with hydrogens: for every fragment with two attachment points, it generates two single-attachment fragments; for every fragment with three attachment points, it generates three fragments with two attachment points and three fragments with a single attachment point. CHOMP then eliminates any duplicate fragments.
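The capping step can be pictured as choosing which attachment points to keep while the rest are capped with hydrogen. The sketch below counts the resulting variants; it assumes each capped fragment corresponds to a non-empty proper subset of the original attachment points, and the function names are hypothetical rather than anything CHOMP exposes.

```python
from itertools import combinations


def capped_variants(attachment_points):
    """Return every smaller attachment-point set left after capping.

    Each variant keeps a non-empty proper subset of the original
    attachment points; the remaining points are capped with hydrogen.
    """
    points = tuple(attachment_points)
    variants = []
    for keep in range(len(points) - 1, 0, -1):
        variants.extend(combinations(points, keep))
    return variants


def count_by_degree(attachment_points):
    """Map the remaining attachment-point count to the number of variants."""
    counts = {}
    for variant in capped_variants(attachment_points):
        counts[len(variant)] = counts.get(len(variant), 0) + 1
    return counts
```

For a fragment with attachment points ("a", "b", "c"), count_by_degree reports three variants with two attachment points and three with one; for ("a", "b") it reports two single-attachment variants.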

CHOMP offers additional options for duplicate removal. To eliminate a set of potential fragments, such as those from a previous proprietary fragment database, you can specify the file containing those fragments with the -userUnique flag.

In many instances, users already have sets of fragments generated by their own means. CHOMP will read these fragments directly using the -userFrags flag. This flag can be used to generate a BROOD database using only these fragments or in conjunction with fragments generated using the molecular fragmentation process discussed above. In either case, CHOMP will process the user-defined fragments first, in the order they are read from the -userFrags file. This feature is important for use in conjunction with BROOD’s -quickLook flag because those fragments will be searched first.

This first portion of the CHOMP algorithm involves only graph processing and can proceed very quickly. Please note, however, that when processing molecular databases of more than a few million compounds, CHOMP will use significant amounts of memory. Memory usage grows fastest early in the execution, when many new fragments are being identified, and the growth slows somewhat as the algorithm progresses. For this reason, we recommend about 1 GB of memory for every million molecules you wish to process. If necessary, you can break a large molecular database into several groups, run each individually, and save their intermediate results before recombining them prior to the final stages of the CHOMP algorithm.
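The memory guideline above lends itself to a quick back-of-the-envelope calculation. The following sketch estimates the memory for a collection and how many chunks to split it into for a given amount of available memory; the helper names are hypothetical, and the rule of thumb of 1 GB per million molecules is the only figure taken from the text.

```python
import math

GB_PER_MILLION_MOLECULES = 1.0  # rule of thumb quoted in the text


def estimated_memory_gb(n_molecules):
    """Rough memory estimate for fragmenting n_molecules in a single run."""
    return GB_PER_MILLION_MOLECULES * n_molecules / 1_000_000


def chunks_needed(n_molecules, available_gb):
    """Number of roughly equal chunks so that each fits in available_gb."""
    if available_gb <= 0:
        raise ValueError("available_gb must be positive")
    return max(1, math.ceil(estimated_memory_gb(n_molecules) / available_gb))
```

For example, 20 million molecules on a machine with 8 GB available would suggest three chunks under this estimate.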

To write intermediate files, whether for the purpose of breaking a database into chunks or simply to investigate the fragments CHOMP is generating, set the -omega flag to False. This will cause CHOMP to fragment the molecules but not generate a 3D database. The 2D output of this calculation is written to the name specified with the -out flag plus the .ism suffix, and it contains all the fragments processed so far. These molecules can be examined, filtered, sorted, joined with additional SMILES, and otherwise manipulated. The same molecules can then be passed with the -userFrags flag to generate a BROOD database. Other than the changes made by manipulating this intermediate output file, a database generated by stopping at this intermediate step will be the same as one generated by executing CHOMP with the -omega flag set to True from the start.
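One way to picture the chunked workflow is as a list of command lines: a fragmentation-only pass per chunk, followed by a single database build from the combined fragments. The sketch below only assembles argument lists and does not run anything. The chunk names and function names are hypothetical; the flags are the ones described in this chapter. Joining the intermediate .ism files into one file is left to the user, and how duplicates across chunks are handled is not addressed here.

```python
def fragment_only_command(molecule_file, out_name):
    """Fragmentation pass only; with -omega false CHOMP writes <out_name>.ism."""
    return ["chomp", "-in", molecule_file, "-omega", "false", "-out", out_name]


def build_database_command(fragment_file, db_dir):
    """Conformer generation and database writing from pregenerated fragments."""
    return ["chomp", "-userFrags", fragment_file, "-out", db_dir]


def chunked_plan(chunk_files, merged_ism, db_dir):
    """Return (commands, intermediate .ism names) for a chunked run.

    The caller is expected to join the intermediate files into merged_ism
    before running the final command.
    """
    commands = []
    intermediates = []
    for i, chunk in enumerate(chunk_files, start=1):
        out_name = f"chunk{i}"
        commands.append(fragment_only_command(chunk, out_name))
        intermediates.append(out_name + ".ism")
    commands.append(build_database_command(merged_ism, db_dir))
    return commands, intermediates
```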

The next stage of the CHOMP algorithm is to generate 3D conformers using a specialized version of the OMEGA algorithm (operated here with only a BROOD license). CHOMP’s version of OMEGA has a modified TORLIB in order to sample more generously around the attachment points. It also differs from default OMEGA parameters in having tighter constraints for removal of RMSD duplicates (as the measure is very sensitive to size) and in scaling the MAXCONFS and EWINDOW parameters to the size of the fragments. Finally, CHOMP writes the fragments into a BROOD database. To accomplish this, CHOMP carries out several precalculations, including generating physical properties and adding color atoms. CHOMP also segregates the fragment conformers into groups that are likely to match the same types of queries. All these processes help BROOD search the databases more quickly.

Like the OMEGA application, the OMEGA algorithm inside CHOMP can take up to one or occasionally two seconds per fragment. For this reason, CHOMP can be run in multi-processor mode using MPI. For more details on the use of CHOMP with MPI, refer to the Open MPI section of the installation manual.

As an alternative to using OMEGA to generate conformers, CHOMP can extract the fragment conformers from a small-molecule structural database. When a small-molecule structural file is passed to CHOMP using the -readConfs flag, it signals that CHOMP should read the conformers rather than generating them. For every fragment it identifies, CHOMP searches the structural file and finds all instances of the fragment. For each instance, the coordinates are extracted and added as a conformer of the fragment. Finally, duplicate conformers for each fragment are removed. In this manner, fragment conformers from any source can be included in the BROOD database.

BROOD databases are currently in the format of a directory or folder that contains many files. The databases can be easily compressed and uncompressed with standard compression algorithms. They can also be moved, but not easily renamed. To rename the BROOD database, you must also rename all the files contained within it. Consistency is required to prevent confusion between the name of the database and the files being searched under that name.

Of the files in the database directory, only one is human-readable. This is the .info file. If you open the .info file with a text reader, you will see some information about the database, its version, and a manifest of the files that are supposed to be in the database. Whenever BROOD or BROODDBMERGE (see Database merging) reads a database, it checks this manifest to ensure the database has not been corrupted.

After generating a database, you may want to combine the new database with other databases, such as BROOD’s default database. BROODDBMERGE accomplishes this task using three required parameters: -in1, -in2, and -out. It combines the two databases in the directories specified by -in1 and -in2 and writes a new database to the directory specified by -out. By default, BROODDBMERGE will not overwrite a database. For more information about merging databases, please see the next chapter.

Using CHOMP

Getting started

At its simplest, CHOMP takes molecules as input and fragments them to generate BROOD databases: simply specify the molecules you want to use to generate fragments using the -in flag and the output database you want to create with the -out flag.

By default, CHOMP assumes you want to create a database from your proprietary collection in order to augment the default database. Unlike the previous version, CHOMP no longer automatically removes fragments that duplicate those in another database. To remove such duplicates, specify the database of fragments you want to exclude with the -userUnique flag.

The output from CHOMP will be a directory filled with a BROOD database. To search the database with BROOD, specify the directory name with BROOD’s -db flag.

While the basic use of CHOMP is straightforward, there are many options for controlling the CHOMP process and filtering the fragments. The full interface for CHOMP, including advanced features, will be discussed in detail below.

Help

Executing CHOMP with no arguments will result in:

prompt> chomp
Chemical Heuristic for Optimal Molecular Pieces (CHOMP).
CHOMP version 3.0.0.1, 20150619
  OEChem version 2.0.1, 20150619
  Platform: osx-10.8-g++4.2-x64
  OpenEye Scientific Software, Inc.

         Single processor
         MPI Multiprocessor

=======================================

No arguments specified on the command line
Required parameters:
    -out : Output fragment database name
For more help type:
  chomp --help

A description of the command line interface can be obtained by executing CHOMP with the --help option.

prompt> chomp --help

will generate the following output:

Simple parameter list
  Execute Options
    -param : A parameter file

  Chomp
    Input
      -in : Input molecule filename

    Output
      -out : Output fragment database name

Additional help functions:
  chomp --help simple      : Get a list of simple parameters (as seen above)
  chomp --help all         : Get a complete list of parameters
  chomp --help defaults    : List the defaults for all parameters
  chomp --help <parameter> : Get detailed help on a parameter
  chomp --help html        : Create an html help file for this program
  chomp --help versions    : List the toolkits and versions used in the application

Simple help

If you want to see all of the basic command-line options, use the --help simple flag.

prompt> chomp --help simple

will generate the following output:

Chomp
  -in : Input molecule filename
  -out : Output fragment database name

This represents the simplest set of parameters and is the best place to start to learn to use CHOMP.

Complete help

If you want to see all the command-line options, use the --help all flag.

prompt> chomp --help all

will generate the following output:

Complete parameter list
  Execute Options
    -param : A parameter file
    -mpi_np : Number of MPI processes to launch.
    -mpi_hostfile : Path to hostfile to be used for launching MPI processes.

  Chomp
    Input
      -in : Input molecule filename
      -userFrags : User fragments after chomp ready for OMEGA & db prep
      -param : Control parameter file

    Output
      -out : Output fragment database name
      -prefix : Prefix for generic output files

    Options
      -dots : Write dots to the screen to follow progress
      -userUnique : User database for duplicate removal
      -smarts : SMARTS file for bonds to break (recap, both, rlf, all or file)
      -capAttach : Use the input fragment coordinates
      -flipper : Use flipper to generate up to 2^N unique samples of the
                 isomers. Set to 0 for no flipper.
      -forceFlip : Flip all stereocenters, not just those that are unspecified
      -flipN : Flip chiral nitrogens.
      -omega : Build multi-conformers in the database
      -readConfs : File of 3D molecules that can be used as a source of
                   fragment conformers
      -primaryFrag : Only return the primary fragments created & don't carry
                     out fragment combinations

    Fragment selection
      -filter : Apply filter file (true, false, or filter.txt)
      -minFrequency : Only accept fragments with freq >= minFrequency
      -minDegree : Only enumerate fragments with degree > N
      -maxDegree : Only enumerate fragments with degree <= N
      -maxMolWt : Only enumerate fragments <= maxMolWt a.u.
      -minHvy : Only enumerate fragments with > minHvy heavy atoms
      -maxHvy : Only enumerate fragments with <= maxHvy heavy atoms
      -maxChiral : Only enumerate fragments with <= maxChiral chiral centers

The defaults for each command-line parameter can be examined with the --help defaults flag.

Command-line parameters

Execute parameters

-param

This flag specifies a parameter file that contains all the command-line parameters in a simple text file. The parameter file is automatically written to -prefix.param with every execution, providing a record of the input that was used. This can be used to rerun the exact same process and can also be altered by hand to modify a prior execution. If a parameter is set both in the param file and on the command line, the command line setting takes precedence. More information is available in the section on the Parameter Files below.
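The precedence rule above (command line over parameter file) can be illustrated with a small dictionary merge. This is only a sketch of the rule, not of CHOMP's parser: the function name is hypothetical, and the settings are shown as already-parsed dictionaries because the parameter file syntax is not described in this section.

```python
def effective_parameters(param_file_settings, command_line_settings):
    """Combine settings so that command-line values override file values."""
    merged = dict(param_file_settings)    # start from the parameter file
    merged.update(command_line_settings)  # command line takes precedence
    return merged
```

For instance, if the parameter file sets -maxHvy to 15 and the command line sets it to 12, the merged settings contain 12, while any parameter given only in the file is kept unchanged.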

-mpi_np

This flag sets the number of MPI processes and invokes CHOMP with multi-processing on the current machine. We recommend specifying at most one more process than the number of cores available on your machine. For instance, if your machine has eight cores, specify at most nine MPI processes.
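The recommendation above amounts to a one-line calculation. The sketch below applies it to a given core count, falling back to Python's os.cpu_count() when none is supplied; note that os.cpu_count() reports logical CPUs, which may exceed the number of physical cores, so treating it as the core count is an assumption. The function name is hypothetical.

```python
import os


def max_recommended_mpi_processes(n_cores=None):
    """Upper bound suggested in the text: number of cores plus one."""
    if n_cores is None:
        # Logical CPU count; may differ from the physical core count.
        n_cores = os.cpu_count() or 1
    return n_cores + 1
```

With eight cores this returns nine, matching the example in the text.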

-mpi_hostfile

This flag invokes CHOMP with multi-processing on the current machine or on multiple machines. Use this flag to specify a file that indicates which machines should run the MPI master and slaves as well as the number of processes to run on each machine. For additional information on MPI and the format of the host file, please see the section on MPI under installation.

Required parameters

-in (-i)

This flag specifies the input molecule file that contains the database of molecules you want to break up into a fragment database. If you have pregenerated fragments rather than whole molecules, please see the -userFrags flag (see below).

-out (-o)

This flag specifies the location in which to create the BROOD database. CHOMP will create a new directory with the specified name and fill it with a series of files.

Input parameters

-userFrags

This option is similar to the -in flag except that it is for passing pregenerated fragments rather than for passing whole molecules. Using this flag skips the initial cutting apart stage; fragments are passed directly into the conformer generation and database construction phase. Fragments passed using this flag will be added to the database in the order in which they are read (no resorting occurs). It is important to note that the -in and -userFrags flags can be used together.

-userUnique

This flag allows you to specify a database you would like to use for duplicate removal. The fragments in the specified database are treated as already known; any fragment generated in the current run that matches one of them is skipped rather than incorporated into the new database.

Output parameters

-prefix

This string flag determines the prefix of the info, log, report, param, and output files. For example, if -prefix is set to foo, then the output files will include foo.info, foo.log, foo.report, and foo.param. [default = chomp]

-dots

When this flag is set, the program will write a series of dots (.) to the terminal (stdout/cout) to track the progress of the program. The dots are written in two phases: first as the molecules are cut apart and organized and second as the conformers are generated and the database is written. [default = true]

Control parameters

-smarts

This flag specifies the SMARTS file for user-defined fragmentation methods. This flag is optional, as CHOMP contains a default fragmentation algorithm. The SMARTS strings should identify two atoms that are joined by a bond that the user wants to be broken during fragmentation. Each SMARTS string should be placed on its own line in the -smarts file. SMARTS specified with this flag replace rather than augment the default fragmentation rules.

-capAttach

When a fragment is generated with multiple attachment points, this flag controls whether all the related fragments can be generated by capping one or more of the attachment points with a hydrogen. This can generate many more fragments, but is quite efficient. [default = true]

-flipper

This flag controls whether the flipper algorithm is used to generate the 2^N specific isomers by enumerating the stereo on unspecified stereocenters. The flipper flag takes an integer indicating the N in the 2^N isomers that can be generated from N stereocenters. If the stereocenters in the molecule exceed N, then 2^N random unique isomers are sampled from the available isomers. By default, no more than 32 isomers will be generated from any molecule. [default = 5]
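The relationship between the -flipper value and the number of isomers can be sketched as a small function. It assumes, as the description above suggests, that a molecule with n enumerated stereocenters yields at most 2^min(n, N) isomers and that a value of 0 disables enumeration; the function name is hypothetical.

```python
def max_isomers(n_stereocenters, flipper=5):
    """Upper bound on isomers enumerated for one molecule.

    n_stereocenters counts the centers being enumerated (the unspecified
    ones, or all of them when -forceFlip is set).
    """
    if flipper <= 0:
        return 1  # flipper disabled: only the input isomer
    return 2 ** min(n_stereocenters, flipper)
```

With the default of 5, a molecule with three unspecified stereocenters gives up to 8 isomers, and a molecule with eight gives at most 32.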

-forceFlip

The -forceFlip flag causes all stereocenters in a ligand to be enumerated, even if they had specified stereo on the input structure. While this can generate fragments that are not specifically represented in the input molecules, it is also a means of exploring more of the chemical fragment space. [default = false]

-flipN

This flag controls whether or not nitrogen atoms with potential stereochemistry are enumerated. Some nitrogen atoms have real stereochemistry, while most others are easily invertible in solution. In fragment replacement, specific static structures of invertible nitrogens are often compared to carbon-based structures that are not invertible. It can therefore be useful to enumerate the nitrogen structures in the database even though they are interconvertible in solution. [default = true]

-omega

This flag controls whether conformer generation using the OMEGA algorithm is applied to the fragments. This flag has two important repercussions. First, by turning it off, the process can be stopped after the relatively quick fragmentation phase. This allows users to examine the fragments that are automatically written to an intermediate SMILES file before paying the high cost of generating conformers. Second, you can choose to replace the OMEGA algorithm with sampling conformers from known structures using the -readConfs flag.

-readConfs

This flag allows users to specify a 3D structure file from which the database generation algorithm will sample conformers. For each fragment generated in the cutting algorithm, all examples of the fragment’s conformers that are present in the molecule file specified by this flag will be added as conformers and saved in the BROOD database. Identical conformers are recognized and duplicates are removed. If you have access to a large database of small-molecule crystal structures, this flag can be used to sample conformers from that database rather than using the OMEGA algorithm.

Note

The input molecules used to generate the fragments (from the -in or -userFrags flags) do not need to be the same as the input molecules specified with this flag.

-primaryFrag

This flag indicates that CHOMP should only process the smallest possible fragments with the given fragmentation pattern. It is primarily for fragmentation applications unrelated to BROOD.

Fragment selection parameters

-filter

This flag specifies whether the default filter is applied to the fragments (values True and False). It can also be given the name of a filter file that describes a non-default filter to apply. For additional information on filtering, please see the documentation for the FILTER product.

-minFrequency

During the breaking of input molecules into fragments, the number of source molecules from which a fragment can be extracted is tracked. This flag indicates the minimum number of source molecules that are required in order for the fragment to be retained. The frequency in this case is normalized as a percentile where 99 indicates the most common fragments and 0 indicates the least common fragments. By default, all fragments are retained. [default = 0]

-maxDegree
-minDegree

While the fragments in the default BROOD database contain between one and three attachment points, this is a heuristic choice based on search efficiency and not a requirement. The -minDegree and -maxDegree flags specify the range of acceptable attachment points for fragments generated by CHOMP. [default = 1,3]

-maxMolWt

This is a simple flag that allows the user to specify the maximum molecular weight of any fragment generated by CHOMP. [default = 350.0]

-maxHvy
-minHvy

These flags specify the range of heavy atoms allowed in fragments generated by the CHOMP algorithm. [default = 0,15]

-maxChiral

This flag allows you to specify the maximum number of chiral centers (both atom and bond centers) in a fragment. While there are many drugs and other useful molecules with a large number of stereocenters, they are often added for a specific purpose and should not be suggested lightly in a design setting. For this reason, fragments with many chiral centers may be undesirable in some databases. This flag allows the user to efficiently eliminate these fragments. [default = 3]
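The fragment selection defaults listed in this section can be collected into a single acceptance check. The sketch below is illustrative only: the class and field names are hypothetical, the default values are those quoted above, and treating the lower bounds as inclusive is an assumption (the help text words -minDegree and -minHvy as strict inequalities, while the prose describes fragments with between one and three attachment points). The percentile-based -minFrequency test is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentLimits:
    """Default selection limits quoted in this section."""
    min_degree: int = 1
    max_degree: int = 3
    max_mol_wt: float = 350.0
    min_hvy: int = 0
    max_hvy: int = 15
    max_chiral: int = 3

    def accepts(self, degree, mol_wt, heavy_atoms, chiral_centers):
        """True if a fragment falls within every limit (bounds inclusive)."""
        return (
            self.min_degree <= degree <= self.max_degree
            and mol_wt <= self.max_mol_wt
            and self.min_hvy <= heavy_atoms <= self.max_hvy
            and chiral_centers <= self.max_chiral
        )
```

For example, FragmentLimits().accepts(2, 180.0, 12, 1) is true, while a fragment with four attachment points or sixteen heavy atoms is rejected under the defaults.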

Example executions

This section gives a series of examples of CHOMP command-line executions. Unlike BROOD, CHOMP does not have a GUI and must be accessed from the command line. Each example is followed by a brief description of its behavior.

prompt> chomp -in mymolecules.smi -out myfragDB
prompt> chomp -i mymolecules.smi -o myfragDB
prompt> chomp mymolecules.smi myfragDB

All three of these command lines specify exactly the same thing. In each case, CHOMP will read the molecules in mymolecules.smi and write the BROOD database in the directory myfragDB. This is the most basic and most common CHOMP execution. Since the -dots flag defaults to True, dots will be written to std::cout to indicate the progress of the molecular fragmentation as the job progresses. The myfragDB directory will be filled with just over 40 files that contain all the information critical for efficient 3D BROOD searches.

prompt> chomp -param oldrun.param

This execution of CHOMP reads the command-line parameters from the file oldrun.param. Every time CHOMP is executed, a file called chomp.param is written that records the command-line parameters used. This is useful for recalling what was used in a specific execution or for repeating a previous calculation.

prompt> chomp -userFrags myFrags.sdf -out newDB

This reads the molecular fragments from myFrags.sdf and, bypassing the fragmentation phase, generates 3D conformers and related precalculated parameters, and writes the database into multiple files in the newDB directory.

prompt> chomp -in myMolecules.sdf -userUnique oldDB -out newDB

This execution of CHOMP reads and fragments the molecules in myMolecules.sdf and builds the database as usual, except that every fragment generated will be compared to the fragments in the BROOD database oldDB (generated earlier). Fragments that exist in the old database will not be included in newDB.

prompt> chomp -in myMolecules.sdf -omega false -out newDB

This reads and fragments the molecules from myMolecules.sdf, then writes all of the fragments (including source information) to the SMILES file newDB.ism. It will not generate conformers nor will it write a BROOD database. The newly generated SMILES file is available for examination and editing. After this partial run, the database preparation can be completed by passing the SMILES newDB.ism into CHOMP using the -userFrags parameter.

prompt> chomp -in myMolecules.sdf -readConfs xtals.sdf -out newDB

This execution of CHOMP reads and fragments the molecules from myMolecules.sdf. Then, rather than using OMEGA to generate conformers for the database, CHOMP uses the conformers read from the structures in xtals.sdf to generate conformers for the fragments. These conformers will be written into the new BROOD database newDB.

prompt> chomp -smarts bondids mymolecules.smi myfrags.oeb.gz
prompt> chomp mymolecules.smi myfrags.oeb.gz -smarts bondids

This example demonstrates two important principles. The first is that when the -in and -out files are given as keyless arguments, they must be the final two arguments on the command line. The first of these two command lines will therefore work, but the second will result in an error.

The second principle involves use of non-default fragmentation patterns. In this example, CHOMP uses the SMARTS patterns in the file bondids to generate fragments rather than the default fragmentation scheme.

prompt> chomp -mpi_np 16 -in mymolecules.sdf -out newDB

In this execution, mymolecules.sdf is read, the molecules are fragmented, 3D conformers are generated, and the database is written to the directory newDB. The fragmentation and the more expensive conformer generation will be carried out on 16 local processors. On modern multicore workstations, this parameter is often a highly efficient means to execute CHOMP.