BROOD Database Generation (CHOMP)
CHOMP offers additional options for duplicate removal. To eliminate a set of
potential fragments, such as those from a previous proprietary fragment
database, you can specify the file containing those fragments with the
-userUnique flag.
In many instances, users already have sets of fragments generated by their own
means. CHOMP will read these fragments directly when they are supplied with the
-userFrags flag. This flag can be used to generate a BROOD database using only
these fragments or in conjunction with fragments generated using the molecular
fragmentation process discussed above. In either case, BROOD will process the
user-defined fragments first, in the order they are read from the -userFrags
file. This feature is important for use in conjunction with BROOD’s
brood -quickLook flag because those fragments will be searched first.
This first portion of the CHOMP algorithm involves only graph processing and can proceed very quickly. Please note, however, that when processing molecular databases of more than a few million compounds, CHOMP will use significant amounts of memory. Memory usage grows most rapidly early in the execution, when many new fragments are being identified, and grows more slowly as the algorithm progresses. For this reason, about 1 GB of memory for every million molecules you wish to process is recommended. If necessary, you can break a large molecular database into several groups, run each individually, and save their intermediate results before recombining them prior to the final stages of the CHOMP algorithm.
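The memory guideline above can be turned into a quick planning calculation. The following is a minimal sketch, not part of CHOMP: it applies the stated rule of thumb (about 1 GB per million molecules) to estimate how many groups a database might need to be split into. The function names, the database size, and the memory budget are hypothetical and chosen only for illustration.

```python
import math

# Rule of thumb quoted in the text: about 1 GB per million molecules.
GB_PER_MILLION_MOLECULES = 1.0


def estimated_memory_gb(n_molecules: int) -> float:
    """Approximate memory needed to process n_molecules in a single run."""
    return n_molecules / 1_000_000 * GB_PER_MILLION_MOLECULES


def chunks_needed(n_molecules: int, available_gb: float) -> int:
    """Smallest number of equal-sized groups that each fit in available_gb."""
    if available_gb <= 0:
        raise ValueError("available_gb must be positive")
    return max(1, math.ceil(estimated_memory_gb(n_molecules) / available_gb))


if __name__ == "__main__":
    n = 12_000_000  # hypothetical database size
    ram = 4.0       # hypothetical memory budget in GB
    print(f"{estimated_memory_gb(n):.1f} GB estimated, "
          f"{chunks_needed(n, ram)} group(s) suggested")
```

Because the guideline is approximate and memory use peaks early in the run, it may be prudent to leave some headroom rather than sizing groups to exactly fill the available memory.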
To write intermediate files, whether for the purpose of breaking a database
into chunks or simply to investigate the fragments CHOMP is generating, a user
can set the -omega flag to False. This will cause CHOMP to fragment the
molecules but not generate a 3D database. The 2D output of this calculation is
written to the file specified with the -out flag, but with the ISM suffix, and
contains all the fragments processed so far. These molecules can be examined,
filtered, sorted, joined with additional SMILES, and otherwise manipulated. The
same molecules can then be used with the -userFrags flag to generate a BROOD
database. Other than the changes made by manipulating this intermediate output
file, a database generated by stopping at this intermediate step will be the
same as one generated by executing CHOMP with the -omega flag set to True from
the start.
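As an illustration of joining intermediate output from several chunks, here is a minimal sketch that merges the lines of several fragment files while keeping only the first occurrence of each SMILES string. It is not CHOMP's own duplicate removal. It assumes one fragment per line, with the SMILES as the first whitespace-separated token and an optional title after it, and it assumes identical fragments are written with identical SMILES strings. The function name and the example fragments are hypothetical.

```python
from typing import Iterable, List


def merge_fragment_lines(groups: Iterable[Iterable[str]]) -> List[str]:
    """Merge groups of SMILES lines, keeping the first copy of each SMILES.

    Order is preserved: lines from earlier groups come first, and within a
    group the original line order is kept.
    """
    seen = set()
    merged = []
    for lines in groups:
        for raw in lines:
            line = raw.strip()
            if not line:
                continue  # skip blank lines
            smiles = line.split()[0]  # first token is the SMILES string
            if smiles not in seen:
                seen.add(smiles)
                merged.append(line)
    return merged


if __name__ == "__main__":
    chunk1 = ["c1ccccc1 frag_a", "C1CCCCC1 frag_b"]   # hypothetical fragments
    chunk2 = ["C1CCCCC1 frag_b_again", "c1ccncc1 frag_c"]
    for line in merge_fragment_lines([chunk1, chunk2]):
        print(line)
```

Open file objects can be passed as the groups, since iterating over a text file yields its lines. Preserving order matters here because fragments supplied through -userFrags are processed in the order they are read.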
Like the OMEGA application, the OMEGA algorithm inside CHOMP can take up to one or occasionally two seconds per fragment. For this reason, CHOMP can be run in multi-processor mode using MPI.
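To get a feel for why parallel execution helps, the per-fragment timing above can be converted into a rough run-time estimate. This is a back-of-the-envelope sketch only: it assumes ideal linear scaling across processes, which real MPI runs will not achieve exactly, and the function name and fragment count are hypothetical.

```python
def conformer_generation_hours(n_fragments: int,
                               seconds_per_fragment: float = 1.0,
                               n_processes: int = 1) -> float:
    """Rough wall-clock estimate in hours, assuming ideal linear scaling."""
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    return n_fragments * seconds_per_fragment / n_processes / 3600.0


if __name__ == "__main__":
    n = 360_000  # hypothetical number of fragments
    for procs in (1, 10):
        hours = conformer_generation_hours(n, 1.0, procs)
        print(f"{procs} process(es): about {hours:.1f} hours")
```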
As an alternative to using OMEGA to generate conformers, CHOMP can extract the fragment conformers from a small-molecule structural database. When a small-molecule structural file is passed to CHOMP using the -readConfs flag, it signals that CHOMP should read the conformers rather than generating them. For every fragment identified by CHOMP, it searches the structural file and identifies all instances of the fragment. For each instance, the coordinates are extracted and added as a conformer of the fragment. Finally, duplicate conformers for each fragment are removed. In this manner, fragment conformers from any source can be included in the BROOD database.
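The final duplicate-removal step can be pictured with a simple greedy filter. The text does not state which criterion CHOMP uses, so the sketch below is an assumption for illustration only: it keeps a conformer when its RMSD to every previously kept conformer exceeds a threshold. It also assumes that all conformers list their atoms in the same order and already share a common coordinate frame. Coordinates extracted from different molecules would need to be superimposed first, which this sketch does not do. The function names and the 0.1 threshold are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coords = Sequence[Tuple[float, float, float]]


def rmsd(a: Coords, b: Coords) -> float:
    """Root-mean-square deviation between two equally ordered atom lists.

    No superposition is performed; both sets must share one frame.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("conformers must have the same non-zero atom count")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(a, b):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(a))


def unique_conformers(conformers: Sequence[Coords],
                      threshold: float = 0.1) -> List[Coords]:
    """Greedily keep conformers farther than threshold from all kept ones."""
    kept: List[Coords] = []
    for conf in conformers:
        if all(rmsd(conf, other) > threshold for other in kept):
            kept.append(conf)
    return kept


if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 0.0), (1.0, 0.01, 0.0)]  # nearly identical to a
    c = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]   # clearly different
    print(len(unique_conformers([a, b, c])))
```

The greedy approach depends on input order: the first conformer of any near-duplicate group is the one retained.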