BROOD Database Generation (CHOMP)

CHOMP offers additional options for duplicate removal. To eliminate a set of potential fragments, such as those from a previous proprietary fragment database, you can specify the file containing those fragments with the -userUnique flag.

In many instances, users already have sets of fragments generated by their own means. CHOMP reads these fragments directly when they are passed with the -userFrags flag. The flag can be used to build a BROOD database from these fragments alone or in combination with fragments produced by the molecular fragmentation process discussed above. In either case, the user-defined fragments are processed first, in the order in which they are read from the -userFrags file. This ordering matters when the database is later searched with the brood -quickLook flag, because those fragments will be searched first.

This first portion of the CHOMP algorithm involves only graph processing and proceeds very quickly. Note, however, that when processing molecular databases of more than a few million compounds, CHOMP uses significant amounts of memory. Memory consumption grows fastest early in the execution, when many new fragments are being identified, and grows more slowly as the algorithm progresses. For this reason, about 1 GB of memory per million molecules to be processed is recommended. If necessary, you can break a large molecular database into several groups, run each group individually, and save the intermediate results before recombining them prior to the final stages of the CHOMP algorithm.
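The chunking strategy described above can be sketched in a few lines of Python. This is only an illustration and not part of CHOMP: the function names, the one-molecule-per-line input format, and the chunk file naming are assumptions made for the example. The memory figure is the rule of thumb quoted above, not a measured value.

```python
from pathlib import Path


def estimate_memory_gb(n_molecules, gb_per_million=1.0):
    """Rule-of-thumb estimate: about 1 GB per million molecules."""
    return n_molecules / 1_000_000 * gb_per_million


def max_molecules_for_budget(budget_gb, gb_per_million=1.0):
    """Largest chunk size that fits the given memory budget."""
    return int(budget_gb / gb_per_million * 1_000_000)


def split_smiles_file(path, max_molecules_per_chunk, out_dir):
    """Split a one-molecule-per-line file into chunk files.

    Hypothetical helper: blank lines are skipped, and chunk files are
    named <stem>_chunkNNN.smi inside out_dir. Returns the chunk paths.
    """
    if max_molecules_per_chunk < 1:
        raise ValueError("max_molecules_per_chunk must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(path).stem
    chunk_paths = []
    out = None
    count = 0
    with open(path) as src:
        for line in src:
            if not line.strip():
                continue
            # Start a new chunk file when none is open or the current one is full.
            if out is None or count == max_molecules_per_chunk:
                if out is not None:
                    out.close()
                chunk_path = out_dir / f"{stem}_chunk{len(chunk_paths):03d}.smi"
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w")
                count = 0
            out.write(line if line.endswith("\n") else line + "\n")
            count += 1
    if out is not None:
        out.close()
    return chunk_paths
```

With a 4 GB budget, for instance, `max_molecules_for_budget(4)` suggests chunks of at most four million molecules, each of which would then be run through CHOMP separately.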

To write intermediate files, whether to break a database into chunks or simply to inspect the fragments CHOMP is generating, set the -omega flag to False. CHOMP will then fragment the molecules but will not generate a 3D database. The 2D output of this calculation is written to the file named by the -out flag, but with the ISM suffix, and contains all the fragments processed so far. These molecules can be examined, filtered, sorted, joined with additional SMILES, and otherwise manipulated. The same molecules can then be passed with the -userFrags flag to generate a BROOD database. Apart from any changes made by manipulating this intermediate output file, a database generated by stopping at this intermediate step will be the same as one generated by running CHOMP with the -omega flag set to True from the start.
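Recombining the intermediate files from several chunked runs might look like the following minimal sketch. It assumes each intermediate file holds one SMILES per line, optionally followed by a title, and that identical fragments are written as identical strings. The source does not state that CHOMP guarantees this, so treat the exact-match duplicate check as an assumption. The function name `merge_fragment_files` is hypothetical.

```python
def merge_fragment_files(paths, out_path):
    """Concatenate fragment files, keeping the first occurrence of each SMILES.

    Duplicates are detected by exact string match on the first
    whitespace-separated token of each line (assumed to be the SMILES).
    Input order is preserved. Returns the number of fragments written.
    """
    seen = set()
    kept = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as src:
                for line in src:
                    fields = line.split()
                    if not fields:
                        continue  # skip blank lines
                    smiles = fields[0]
                    if smiles in seen:
                        continue  # already written from an earlier line or file
                    seen.add(smiles)
                    out.write(line if line.endswith("\n") else line + "\n")
                    kept += 1
    return kept
```

The merged file can then be supplied through the -userFrags flag for the final database generation. Because input order is preserved, the order of the file list determines the order in which fragments are read.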

Like the OMEGA application, the OMEGA algorithm inside CHOMP can take up to one second, or occasionally two, per fragment. Because this cost accumulates over large fragment sets, CHOMP can be run in multi-processor mode using MPI.
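To see why parallel execution matters, a rough wall-clock estimate can be computed from the per-fragment timing above. The sketch below assumes perfect scaling across processes, which real MPI runs will not fully achieve. The fragment count and process count in the usage note are hypothetical.

```python
def estimate_conformer_hours(n_fragments, seconds_per_fragment=1.0, n_processes=1):
    """Approximate wall-clock hours for conformer generation.

    Assumes every fragment costs seconds_per_fragment and that the work
    divides evenly across n_processes (an idealization).
    """
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    total_seconds = n_fragments * seconds_per_fragment
    return total_seconds / n_processes / 3600.0
```

For one million fragments at one second each, this gives roughly 278 hours on a single process and a little over 4 hours on 64 processes under the ideal-scaling assumption.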

As an alternative to using OMEGA to generate conformers, CHOMP can extract the fragment conformers from a small-molecule structural database. Passing a small-molecule structural file to CHOMP with the -readConfs flag signals that CHOMP should read the conformers rather than generate them. For every fragment it identifies, CHOMP searches the structural file for all instances of that fragment. For each instance, the coordinates are extracted and added as a conformer of the fragment. Finally, duplicate conformers for each fragment are removed. In this manner, fragment conformers from any source can be included in the BROOD database.
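The final duplicate-removal step can be illustrated with a simple geometric comparison. The criterion CHOMP actually uses is not described here, so the following is a stand-in technique, not CHOMP's method: two conformers are treated as duplicates when their interatomic distances agree within a tolerance. This comparison is unaffected by rotation and translation, which suits coordinates extracted from different molecules, but it cannot distinguish mirror images. It also assumes every conformer lists the fragment's atoms in the same order. The function names and the default tolerance are hypothetical.

```python
import math
from itertools import combinations


def distance_profile(coords):
    """All interatomic distances, in a fixed atom-pair order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def profile_rms(p, q):
    """Root-mean-square difference between two distance profiles."""
    if not p:
        return 0.0
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)) / len(p))


def remove_duplicate_conformers(conformers, tolerance=0.1):
    """Keep the first conformer of each group of near-identical geometries.

    conformers: list of conformers, each a list of (x, y, z) tuples with
    atoms in the same order. tolerance is in the coordinate units.
    """
    kept = []
    kept_profiles = []
    for coords in conformers:
        profile = distance_profile(coords)
        # Keep this conformer only if it differs from every one kept so far.
        if all(profile_rms(profile, k) > tolerance for k in kept_profiles):
            kept.append(coords)
            kept_profiles.append(profile)
    return kept
```

A translated copy of a conformer is recognized as a duplicate, while a genuinely different geometry is retained.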