Fragment and bioisosteric replacement has a long history (see [Chen-2003]). Most medicinal chemists are well versed in standard sets of bioisosteric fragments. Likewise, there is a long history of computational approaches to fragment replacement (see [Verloop-1987] and [Bartlett-1994]). There have been several attempts to examine sets of known active compounds to empirically identify bioisosteric fragments (see [Ujvary-2003] and [Sheridan-2002]). While this is an interesting exercise, it has two drawbacks. First, it can only identify bioisosteric fragment pairs that are already known. While these provide interesting study, they are often already familiar to experienced medicinal chemists and modelers. Second, it identifies many incidental rather than meaningful fragment pairs. These arise because two molecules that bind to the same site do not necessarily differ only by a bioisosteric replacement. For instance, chemists may make analogs of a compound by replacing an N-methyl group with an N-benzyl group in order to identify new binding pockets. However, just because both of these compounds are bioactive does not mean that methyl and benzyl are similar fragments (though they would be identified as such by some methods). While one may apply various heuristics, such as size, to avoid this problem, we hope to explore methods that are more robust.
An alternative approach has been to use an algorithm that would predict whether two fragments are similar in relevant ways. Several groups including [Bartlett-1994], [Verloop-1987] and [Willett-2001] have developed methods in this area. Here we seek to capitalize on and extend the ideas developed by these workers.
Fragment similarity searching
BROOD allows users to enter a single query fragment and search a very large database of known molecular fragments in order to identify fragments that are similar. Each database fragment is compared to the query fragment in 3D with regard to shape, chemistry, electrostatics, and geometric presentation of attachment vectors. The fragments that are most similar to the query fragment will appear in a hitlist.
All similar fragments in a BROOD hitlist are organized into clusters. The clusters are organized so that molecules with the same ring structures and core framework (reduced-graph) are placed in the same cluster. The first cluster in BROOD hitlists is always the molecules that are similar to the query, sharing the same core atomic framework. While some of these analogs may be obvious, they often include alternative interesting chemistries. The remaining clusters are each organized around a unique core atomic framework. Each cluster is represented by its best scoring member and the clusters themselves are ranked by the score of their best member.
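The clustering and ranking rules described above can be sketched as follows. This is a minimal illustration, not BROOD's implementation: the `(name, framework, score)` tuples, the string framework keys, and the function name `cluster_hits` are all hypothetical, and higher scores are assumed to be better.

```python
from collections import defaultdict


def cluster_hits(hits, query_framework):
    """Group (name, framework, score) hits by framework and rank the clusters.

    Assumptions (not from the BROOD documentation): the framework is given
    as a precomputed key, and a higher score means a better match.
    """
    groups = defaultdict(list)
    for name, framework, score in hits:
        groups[framework].append((name, score))

    clusters = []
    for framework, members in groups.items():
        # The best-scoring member comes first and represents the cluster.
        members.sort(key=lambda m: m[1], reverse=True)
        clusters.append({"framework": framework, "members": members})

    # The cluster sharing the query framework goes first (False sorts before
    # True); the remaining clusters are ordered by their best member's score.
    clusters.sort(
        key=lambda c: (c["framework"] != query_framework, -c["members"][0][1])
    )
    return clusters


hits = [
    ("frag_a", "benzene", 1.40),
    ("frag_b", "pyridine", 1.75),
    ("frag_c", "benzene", 1.62),
    ("frag_d", "thiophene", 1.55),
]
ranked = cluster_hits(hits, query_framework="benzene")
```

Here the benzene cluster is listed first even though the pyridine cluster contains the highest-scoring fragment, mirroring the rule that the query's own framework always leads the hitlist.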
Clash detection and selectivity
BROOD allows users to specify protein structures for the purpose of testing whether newly constructed analogs fit into the active site. When the BROOD query ligand is taken from a co-crystal structure, BROOD builds the newly created analog molecules in the same shape and orientation as the query. If the crystallographic ligand was originally in an active site and the protein is passed to BROOD as the bump protein, the new analogs will be built in poses that are also in the bump protein’s active site. When a bump protein is passed, BROOD checks for clashes between the bump protein and each analog. By default, if any ligand heavy atom is less than 2.25 Angstroms from a protein heavy atom, the analog is removed from the hitlist.
Users can also specify another protein for selectivity testing (referred to here as the selectivity protein). In order for an analog to remain in the final hitlist, it must clash with the selectivity protein. The bump protein and the selectivity protein should be aligned and the BROOD query should be based on a molecule that fits within the bump protein active site. In this case, analogs in the final BROOD hitlist will also fit into the bump protein’s active site, while they will also have clashes with the selectivity protein. This combination is a simple model for analogs that have a chance to be active against the bump protein, but have a very low chance of being active against the selectivity protein. This model assumes the analog molecules bind to the bump protein in a manner that is similar to the original ligand and that the most favorable pose for the ligand class in the selectivity protein is similar to that in the bump protein.
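The bump and selectivity tests can be illustrated with a small sketch. It assumes heavy-atom coordinates are already available as `(x, y, z)` tuples in Angstroms and uses a brute-force distance check. The function names are hypothetical, and BROOD's actual clash detection may differ in detail.

```python
import math

CLASH_CUTOFF = 2.25  # Angstroms; the default heavy-atom cutoff quoted in the text


def has_clash(ligand_xyz, protein_xyz, cutoff=CLASH_CUTOFF):
    """Return True if any ligand heavy atom is closer than `cutoff`
    to any protein heavy atom (brute-force pairwise check)."""
    return any(
        math.dist(lig_atom, prot_atom) < cutoff
        for lig_atom in ligand_xyz
        for prot_atom in protein_xyz
    )


def keep_analog(ligand_xyz, bump_xyz, selectivity_xyz=None):
    """Keep an analog only if it fits the bump protein and, when a
    selectivity protein is supplied, clashes with that protein."""
    if has_clash(ligand_xyz, bump_xyz):
        return False  # does not fit the target active site
    if selectivity_xyz is not None and not has_clash(ligand_xyz, selectivity_xyz):
        return False  # would also fit the off-target protein
    return True


# Toy coordinates, for illustration only.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
bump_protein = [(0.0, 4.0, 0.0)]
selectivity_protein = [(1.5, 2.0, 0.0)]
kept = keep_analog(ligand, bump_protein, selectivity_protein)
```

Both proteins must share one coordinate frame for this comparison to be meaningful, which is why the bump and selectivity proteins need to be aligned beforehand.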
BROOD’s output includes newly constructed analog molecules that are intended to have a similar 3D shape to the query molecule. These new analogs are constructed partially from the original molecule and partially from new fragments. When these new molecules are generated and built into a conformation that has good shape and chemistry overlap with the query molecule, some strain may be introduced. To produce high-quality results, it is essential that the analogs are optimized while maintaining the query shape and that little strain is introduced in the process.
The BROOD search process guarantees that each of the molecular fragments alone is in a low energy state. After the fragments are joined, this may no longer be true. For every BROOD analog in the final hitlist, two optimizations are carried out to determine the local strain introduced by maintaining a shape similar to the original query. In the first optimization, the ligand is allowed to relax into a local minimum. In the second optimization, the ligand atoms are only allowed to move a fraction of an Angstrom, keeping the same overall shape of the molecule. In both calculations, the OEMMFF [Brood-Halgren-1996-1], [Brood-Halgren-1996-2], [Brood-Halgren-1996-3], [Brood-Halgren-1996-4], [Brood-Halgren-1996-5], [Brood-Halgren-1999-1], [Brood-Halgren-1999-2] potential is used with a Sheffield solvation function [Brood-Grant-2007]. The local strain energy is the difference in ligand energy between the two calculations. By default, the maximum strain for any successful BROOD analog is limited to 6.5 kcal/mol.
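As a worked illustration of the strain filter, the sketch below takes the two minimized energies as plain numbers in kcal/mol. The sign convention (shape-restrained energy minus fully relaxed energy) and the inclusive comparison against the limit are assumptions made here; the function names are hypothetical.

```python
MAX_STRAIN = 6.5  # kcal/mol; default limit quoted in the text


def local_strain(e_restrained, e_relaxed):
    """Local strain, assumed here to be the energy of the shape-restrained
    minimum minus the energy of the freely relaxed local minimum."""
    return e_restrained - e_relaxed


def passes_strain_filter(e_restrained, e_relaxed, max_strain=MAX_STRAIN):
    """Keep an analog whose local strain does not exceed the limit
    (the inclusive comparison is an assumption)."""
    return local_strain(e_restrained, e_relaxed) <= max_strain


# Example energies in kcal/mol, chosen only for illustration.
strain = local_strain(e_restrained=-40.0, e_relaxed=-45.5)
accepted = passes_strain_filter(-40.0, -45.5)
rejected = passes_strain_filter(-40.0, -50.0)
```

Because both energies come from the same force field and solvation model, their absolute values cancel and only the difference matters.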
One approach to lead identification and development is based on the identification and expansion of physically very small molecule inhibitors that are commonly termed “fragments” (see [Hajduk-2007]). Fragments in this sense are molecules with few atoms and should not be confused with the term fragment used elsewhere in this document that refers to part of a molecule. Nevertheless, the fragment replacement algorithm in BROOD can be useful in the modeling of fragment-based design. In fragment-based design, one strategy is to combine two non-overlapping inhibitory fragments to form a single molecule. While this is sometimes empirically done with a series of flexible linear linkers, the linkers can also be modeled. In the BROOD GUI, it is possible to load two fragments and use the -linkOnly search to identify potential linkers that can join the fragments with an energetically favorable, medicinally relevant linker in a low-energy conformation. This search can be carried out in a protein’s active site, taking account of the need for the linker to fit into the active site as well as join the fragments. For more information on this application, please see the tutorials section.
One of the first applications of BROOD many users want to explore is the replacement of a flexible portion of a molecule with a more rigid fragment that fills the same space. BROOD excels at this application. This type of exercise can be considered a local cyclization. In some molecules, rather than a local cyclization, some chemists prefer to design a bridge between two portions of a molecule that do not have a local link. This is another task where BROOD can be quite useful. Long-distance cyclization, like fragment joining, is about finding a chemical fragment that can bridge two moieties given a particular 3D orientation. Use the -linkOnly option for long-distance cyclization.
In the early 1980s, Bertz first published a measure of the complexity of molecules and asserted that his calculated complexity could be related to synthetic ease ([Bertz-1981], [Bertz-1982]). Bertz built complexity terms that are similar to a Shannon entropy ([Shannon-1949]) but with regard to the elements in a molecule and the diversity of small fragments that make up the structure of the molecule. While the actual synthetic accessibility can be heavily influenced by the availability of complex synthetic building blocks and advances in stereo synthetic methods, molecular complexity remains a useful tool for prioritizing which compounds chemists should look at first when primary modeling methods don’t readily distinguish between them.
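To make the entropy analogy concrete, the following sketch computes a Shannon entropy over the element composition of a molecule. It illustrates only the general idea of an entropy-style elemental term; it is not Bertz's exact formula and not the normalized complexity implemented in BROOD. The input format (a list of element symbols) is an assumption.

```python
import math
from collections import Counter


def elemental_entropy(elements):
    """Shannon entropy (in bits per atom) of an element composition.

    `elements` is a list of element symbols, one per atom. A molecule built
    from a single element scores 0; a more even mix of more elements
    scores higher.
    """
    counts = Counter(elements)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


homogeneous = elemental_entropy(["C", "C", "C", "C"])
mixed = elemental_entropy(["C", "C", "N", "O"])
```

Multiplying the per-atom entropy H by the atom count E gives the total information content E log2 E − Σ E_i log2 E_i, where E_i is the number of atoms of element i. That total grows with both molecule size and element diversity.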
In 2007, Boda and coworkers extended Bertz’s idea and compared it to experimental chemists’ predictions of synthetic accessibility ([Boda-2007]).
Boda made the significant advance of adding stereo complexity to Bertz’s elemental, graph, and ring complexity. We noted in the paper that nearly all the signal was generated by the molecular complexity and stereo complexity. In fact, each of these measures, without the linear fitting presented in the paper, correlated with chemists’ predictions of synthetic accessibility as well as the chemists’ predictions correlated with one another. Thus, in BROOD, we have implemented a molecular complexity that is a normalized sum of the graph and size complexity, elemental complexity, and stereo complexity. The normalized complexity score starts at zero for the simplest, smallest molecule and grows to values that generally don’t exceed 1.0 for medicinally relevant small molecules.
BROOD uses molecular complexity to sort analog molecules in the final hitlist that have very similar shape and color scores.
Aromatic fragment comparison
This is an example of applying BROOD as a technology to explore fragment similarity outside the direct context of a specific drug-discovery project. Tu and coworkers at Pfizer explored the chemical space of aromatic ring systems ([Tu-2012]). In their work, they used BROOD at the core of the NEAT (Novel and Electronically Equivalent Aromatic Templates) tool. This tool uses a combination of high-level QM-derived partial charges along with BROOD’s electrostatic similarity calculation to explore potential aromatic system replacements. It allows medicinal chemists to explore the large space of complex aromatic ring systems for replacements that are both electronically and sterically analogous. We present this example as an illustration of the more diverse applications of BROOD’s fragment-matching technology.
Shape-based belief model
It is a well-known concept in medicinal chemistry that compounds with greater similarity have a higher probability of having shared properties. It is by this premise that project teams seek to explore chemical space around a lead compound in order to discover new active molecules. Based on this concept, Muchmore and colleagues at Abbott attempted to quantify the relationship between various measures of ligand similarity and binding activity ([Muchmore-2008]). A large number of ligands with activities measured across a large number and wide variety of targets were used to generate the probability that two molecules with a given similarity would have activity within one log unit of one another (the p(Active)). Several well-regarded ligand similarity techniques showed sigmoidal curves: at very low similarity, the probability of shared activity was related to the prevalence of inhibitors in the underlying data, whereas, as two molecules approached very high similarity, the probability of similar activity approached something like 35-55%. While perhaps surprising at first, both of these results are sensible. It is well known that while similar molecules are much more likely to have shared activity than two random molecules, similarity is by no means a guarantee of shared activity.
As reported in their paper, one of the similarity techniques with the highest maximum probability was the combo score (shape + color score) from the OpenEye tool ROCS. Since the publication of Muchmore’s paper, ROCS has made the critical change from color score to color Tanimoto. This change resulted in an even higher maximum probability of 0.512 ([Brown-2008]) for shape + color Tanimoto. BROOD uses the curve refit to the Abbott data to convert the overall shape + color Tanimoto similarity between the analog molecule and the query molecule into a p(Active) value: the probability, according to Abbott’s belief model, that the analog compound will have activity within one log unit of the query compound.
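The shape of such a similarity-to-probability mapping can be sketched with a generic logistic curve. Only the upper plateau of 0.512 comes from the text. The baseline, midpoint, and steepness below are placeholder values chosen purely for illustration, and the logistic form itself is an assumption: the actual refit curve used by BROOD is not reproduced here.

```python
import math

P_MAX = 0.512   # maximum probability quoted in the text
P_BASE = 0.05   # hypothetical baseline at very low similarity
MIDPOINT = 1.2  # hypothetical inflection point on the 0-2 combo scale
STEEPNESS = 6.0  # hypothetical slope parameter


def p_active(tanimoto_combo):
    """Map a shape + color Tanimoto (0 to 2) onto a sigmoidal probability.

    Illustrative logistic form only; all parameters except P_MAX are
    placeholders rather than fitted values.
    """
    logistic = 1.0 / (1.0 + math.exp(-STEEPNESS * (tanimoto_combo - MIDPOINT)))
    return P_BASE + (P_MAX - P_BASE) * logistic


low, mid, high = p_active(0.0), p_active(1.2), p_active(2.0)
```

The curve stays near the baseline for dissimilar molecules and approaches, but never exceeds, the plateau. This reflects the point made above that even near-identical shape and chemistry does not guarantee shared activity.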
Abbott bioavailability score
In 2005, Martin published a predictive model for the probability that a compound will have bioavailability (f) > 10% in rats ([Martin-2005]). Despite attempts to identify a useful model using straightforward linear combinations of simple parameters, such as logP, logD, donors, acceptors, PSA, or flexibility, Martin noted that different predictive models were required for different ionization states. Anions require a bioavailability model that depends strongly on PSA. By contrast, neutral species, cations, and zwitterions require a model based on the rule-of-five. The combination of these models provides a single model for bioavailability in rats.
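The two-branch structure of the model can be sketched as a simple lookup. The probability values and PSA cutoffs below are those commonly cited for this score; they are not given in this document and should be verified against [Martin-2005] before use. The function name, the string charge classes, and the boolean rule-of-five input are hypothetical conveniences.

```python
def abbott_bioavailability_score(charge_class, psa, passes_rule_of_five):
    """Return an estimated probability of rat bioavailability above 10%.

    charge_class: "anion", "neutral", "cation", or "zwitterion".
    psa: polar surface area in square Angstroms (used for anions only).
    passes_rule_of_five: precomputed by the caller (used for non-anions).
    Numeric values are commonly cited figures, not taken from this document.
    """
    if charge_class == "anion":
        # Anions: the score depends on polar surface area.
        if psa < 75.0:
            return 0.85
        if psa <= 150.0:
            return 0.56
        return 0.11
    # Neutral species, cations, and zwitterions: rule-of-five based.
    return 0.55 if passes_rule_of_five else 0.17


score_anion = abbott_bioavailability_score("anion", 100.0, True)
score_neutral = abbott_bioavailability_score("neutral", 100.0, True)
```

The same PSA produces different scores depending on ionization state, which is the reason a single linear model was found to be inadequate.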
An essential feature of these methods is the generation of a database of potential fragments. While it may be tempting to generate fragments de novo, these approaches often generate unrealistic chemical fragments. Particularly for a method that is related to a common medicinal chemistry technique, we feel it is important to propose known fragments.
The initial database that comes with BROOD is derived from the ChEMBL 20 database. The compounds are fragmented, resulting in approximately 11 million unique molecular fragments after standard property filtering. The fragments are prioritized according to their medicinal and geometric relevance to fragment replacement, and a final collection of approximately six million fragments is retained for the database.
Users may also provide their own fragment database for searching. These fragment databases can be prepared from molecule collections using the CHOMP program. CHOMP breaks the molecules into fragments, filters the fragments, enumerates undefined stereochemistry, tracks molecules from which the fragments came, and identifies the unique collection of fragments. Once this set is generated, CHOMP generates or reads multiconformer representations for each fragment. The conformers can either be generated by OMEGA technology within CHOMP or extracted from small-molecule crystal structure databases passed into CHOMP. As a final step in database generation, CHOMP precalculates physical and geometric properties, organizes the fragments for efficient retrieval, and writes a database format that is optimized for efficient BROOD searching.
BROOD’s database generation program, CHOMP, fragments molecules by identifying critical bonds that can be broken. CHOMP includes three sets of bond-breaking patterns: RLF (ring, linkers & functional groups), RECAP rules [Lewell-1998], and ALL (indicating breaking all non-ring, non-resonance single bonds). By default, CHOMP breaks all bonds identified using any of these methods. Users can also specify a SMARTS file of their own bond identifiers. This file should include a series of SMARTS patterns (one per line) that each define two atoms on opposite ends of the bond to be broken. For example, a line with the SMARTS “*-*” will cause all single bonds to be broken, a line with the SMARTS “[R]-[!R]” will cause all single bonds between ring atoms and non-ring atoms to be broken, and “[#6]-!@[!#6!#1]” will cause CHOMP to break every non-ring single bond between a carbon atom and a heteroatom.
The RLF chemical heuristics seek to break compounds into three types of primary fragments: contiguous ring systems, functional groups, and linkers. Contiguous ring systems include any set of atoms that are bonded together by at least one ring bond. Thus, fused rings and spiro rings are included as a single ring system, but biphenyl is broken into two ring systems. Functional groups are defined as any collection of bonded atoms including one or more heteroatoms or unsaturated carbons separated by at most a single fully saturated carbon atom. The linkers are the remaining saturated carbon skeletons. It should be noted that linkers, like functional groups and ring systems, can be terminal (i.e., degree 1).
CHOMP systematically identifies all the molecular fragments that can be generated by breaking one or more of the specified bonds. CHOMP first eliminates fragments that have more than 15 heavy atoms or that have more than three attachment points. CHOMP also filters the fragments based on commonly used molecular filters, eliminating unstable, reactive, or toxic functional groups. As a final fragment generation step, for every fragment with two attachment points, CHOMP generates two single attachment fragments; for every fragment with three attachment points, CHOMP generates six fragments with two attachment points and three fragments with single attachment points by capping the open attachment valence with hydrogens. CHOMP then eliminates any duplicate fragments.
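The size filter and the hydrogen-capping step can be sketched as below. The limits of 15 heavy atoms and three attachment points come from the text. Everything else is an assumption: fragments are reduced to a heavy-atom count and a tuple of attachment labels, and the retained attachment points are treated as ordered. The ordering is a guess made only because it reproduces the counts stated above (two, six, and three); the document does not say how CHOMP arrives at them.

```python
from itertools import permutations

MAX_HEAVY_ATOMS = 15  # limit quoted in the text
MAX_ATTACHMENTS = 3   # limit quoted in the text


def passes_size_filter(n_heavy_atoms, n_attachments):
    """Reject fragments that are too large or have too many attachment points."""
    return n_heavy_atoms <= MAX_HEAVY_ATOMS and n_attachments <= MAX_ATTACHMENTS


def capped_variants(attachment_labels):
    """List the attachment points left open after capping the others with hydrogen.

    Each variant is a tuple of the labels that remain open. Treating the
    retained points as ordered is an assumption (see the lead-in).
    """
    variants = []
    for n_open in range(1, len(attachment_labels)):
        variants.extend(permutations(attachment_labels, n_open))
    return variants


two_point = capped_variants(("a", "b"))
three_point = capped_variants(("a", "b", "c"))
```

Under this assumption, a two-point fragment yields two single-attachment variants, and a three-point fragment yields six two-attachment and three single-attachment variants. The duplicate removal that follows would then collapse any variants that are chemically identical.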
CHOMP offers additional options for duplicate removal. To eliminate a set of potential fragments, such as those from a previous proprietary fragment database, you can specify the file containing those fragments with the -userUnique flag.
In many instances, users already have sets of fragments generated by their own means. CHOMP will read these fragments directly using the -userFrags flag. This flag can be used to generate a BROOD database using only these fragments or in conjunction with fragments generated using the molecular fragmentation process discussed above. In either case, BROOD will process the user-defined fragments first and process them in the order they are read from the -userFrags file. This feature is important for use in conjunction with BROOD’s -quickLook flag because those fragments will be searched first.
This first portion of the CHOMP algorithm involves only graph processing and can proceed very quickly. Please note, however, that when processing molecular databases of more than a few million compounds, CHOMP will use significant amounts of memory. Memory usage grows fastest early in the execution, when many new fragments are being identified, and the growth slows somewhat as the algorithm progresses. For this reason, about 1 GB of memory for every million molecules you wish to process is recommended. If necessary, you can break a large molecular database into several groups, run each individually, and save their intermediate results before recombining them prior to the final stages of the CHOMP algorithm.
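The memory rule of thumb translates into a quick planning calculation. The sketch below applies the figure of roughly 1 GB per million molecules to estimate how many groups a large input would need on a given machine. The function names are hypothetical, and the estimate is only as good as the rule of thumb itself.

```python
import math

GB_PER_MILLION_MOLECULES = 1.0  # rule of thumb quoted in the text


def estimated_memory_gb(n_molecules):
    """Rough memory estimate for the fragmentation stage."""
    return n_molecules / 1_000_000 * GB_PER_MILLION_MOLECULES


def chunks_needed(n_molecules, available_gb):
    """Number of groups to split the input into so that each group's
    estimated memory use fits within `available_gb`."""
    return max(1, math.ceil(estimated_memory_gb(n_molecules) / available_gb))


memory = estimated_memory_gb(5_000_000)
n_chunks = chunks_needed(20_000_000, available_gb=8)
```

For example, a 20-million-compound collection on a machine with 8 GB available would be split into three groups, each run separately before the intermediate results are recombined.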
To write intermediate files, whether for the purpose of breaking a database into chunks or simply to investigate the fragments CHOMP is generating, a user can set the -omega flag to False. This will cause CHOMP to fragment the molecules but not generate a 3D database. The 2D output of this calculation is written to the file specified with the -out flag, but with the ISM suffix, and contains all the fragments processed so far. These molecules can be examined, filtered, sorted, joined with additional SMILES, and otherwise manipulated. The same molecules can be used with the -userFrags flag to generate a BROOD database. Other than the changes made by manipulating this intermediate output file, the database generated by stopping at this intermediate step will be the same as one generated by executing CHOMP with the -omega flag set to True from the start.
The next stage of the CHOMP algorithm is to generate 3D conformers using a specialized version of the OMEGA algorithm (operated here with only a BROOD license). CHOMP’s version of OMEGA has a modified TORLIB in order to sample more generously around the attachment points. It also differs from default OMEGA parameters in having tighter constraints for removal of RMSD duplicates (as the measure is very sensitive to size) and for having scaled MAXCONFS and EWINDOW parameters based on the size of the fragments. Finally, CHOMP writes the fragments into a BROOD database. To accomplish this, CHOMP carries out several precalculations, including generating physical properties and adding color atoms. CHOMP also segregates the fragment conformers into groups that are likely to match the same types of queries. All these processes help BROOD search the databases more quickly.
Like the OMEGA application, the OMEGA algorithm inside CHOMP can take up to one or occasionally two seconds per fragment. For this reason, CHOMP can be run in multi-processor mode using MPI.
As an alternative to using OMEGA to generate conformers, CHOMP can extract the fragment conformers from a small-molecule structural database. When a small-molecule structural file is passed to CHOMP using the -readConfs flag, it signals that CHOMP should read the conformers rather than generating them. For every fragment identified by CHOMP, it searches the structural file and identifies all instances of the fragment. For each instance, the coordinates are extracted and added as a conformer of the fragment. Finally, duplicate conformers for each fragment are removed. In this manner, fragment conformers from any source can be included in the BROOD database.
BROOD databases are currently in the format of a directory or folder that contains many files. The databases can be easily compressed and uncompressed with standard compression algorithms. They can also be moved, but not easily renamed. To rename the BROOD database, you must also rename all the files contained within it. Consistency is required to prevent confusion between the name of the database and the files being searched under that name.
Of the files in the database directory, only one is human-readable. This is the .info file. If you open the .info file with a text editor, you will see some information about the database, its version, and a manifest of the files that are supposed to be in the database. Whenever BROOD or BroodDBMerge (see BroodDBMerge) reads a database, it checks this manifest to ensure the database has not been corrupted.
After generating a database, you may want to combine the new database with other databases, such as BROOD’s default database. BroodDBMerge accomplishes this task using three required parameters: -in1, -in2, and -out. It combines the two databases in the directories specified by the -in1 and -in2 flags and writes a new database to the directory specified by the -out flag. By default, BroodDBMerge will not overwrite a database. For more information about merging databases, please see the next chapter.