Introduction
ROCS X is a large-scale virtual screening tool that can build and search libraries containing more than a trillion products. Its main feature is a 3D similarity search that finds the library molecules most similar to a query. It also provides a 2D substructure search that finds exact substructure matches in the library. Currently, the 3D similarity search supports molecule queries for ligand-based screening.
There are two main steps when running ROCS X: preparing the library and searching the library. This introduction covers the key concepts and technologies used in each part.
Preparing the Library
ROCS X achieves its scale by using combinatorial libraries for storing and searching product molecules. A combinatorial library with one trillion products does not explicitly store the data for a trillion molecules, which would be prohibitively expensive. Rather, it stores the data for components (reagents, in this context) that can be combined to form the trillion products. The total number of components in the reagent space is much smaller than the total number of products. For example, in a reaction where one million “star” reagents can combine with one million “triangle” reagents to form one trillion “star–triangle” products, the total number of reagents (two million) is 500,000 times smaller than the total number of products.
Figure 1. Reagent space versus product space in a combinatorial library.
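The arithmetic behind the star and triangle example can be checked in a few lines. This is only an illustration of the counts quoted above; the variable names are made up for the sketch and are not part of ROCS X.

```python
# Reagent space vs. product space for a two-component reaction.
# The counts are the ones used in the example above; names are illustrative.
n_star = 1_000_000      # "star" reagents
n_triangle = 1_000_000  # "triangle" reagents

n_reagents = n_star + n_triangle   # what the library stores explicitly
n_products = n_star * n_triangle   # what the library represents implicitly

ratio = n_products // n_reagents   # how much smaller the reagent space is

print(f"reagents stored:      {n_reagents:,}")
print(f"products represented: {n_products:,}")
print(f"size ratio:           {ratio:,}")
```

The stored data grows with the sum of the reagent counts, while the represented products grow with their product, which is why the gap widens so quickly as reagent sets get larger.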
Search algorithms can also be accelerated with combinatorial libraries. For example, the ROCS X - 2D Substructure Search Floe works by piecing together partial hits in the reagent space to find full substructure hits in the product space. The algorithm achieves its scale because it searches millions of molecules in the reagent space, which is much faster than searching one trillion molecules!
Figure 2. Schematic for 2D substructure search.
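The idea of piecing together partial hits can be illustrated with a deliberately simplified string analogy. In the sketch below, reagents are plain strings, a "product" is the concatenation of one star string and one triangle string, and a "substructure" is a substring. This is an assumption made purely for illustration: the actual floe matches molecular graphs, and nothing here reflects its implementation. The function name `reagent_space_hits` is hypothetical.

```python
def reagent_space_hits(query, stars, triangles):
    """Find (star, triangle) pairs whose concatenation contains `query`,
    using only per-reagent checks (string analogy, not real chemistry)."""
    hits = set()

    # Case 1: the whole query sits inside a single reagent, so every
    # partner reagent yields a hit.
    hits.update((s, t) for s in stars if query in s for t in triangles)
    hits.update((s, t) for t in triangles if query in t for s in stars)

    # Case 2: the query spans the junction. Split it at every position and
    # pair stars that end with the left part with triangles that start
    # with the right part (the "partial hits").
    for k in range(1, len(query)):
        left, right = query[:k], query[k:]
        star_partials = [s for s in stars if s.endswith(left)]
        triangle_partials = [t for t in triangles if t.startswith(right)]
        hits.update((s, t) for s in star_partials for t in triangle_partials)

    return hits


stars = ["CCO", "CCN", "NCC", "OC"]
triangles = ["CCl", "OCC", "NC", "CO"]
hits = reagent_space_hits("CN", stars, triangles)

# Brute-force check over the (tiny) product space, for comparison only.
brute_force = {(s, t) for s in stars for t in triangles if "CN" in s + t}
print(sorted(hits) == sorted(brute_force))
```

Each reagent is inspected a handful of times per query split, so the matching work scales with the number of reagents rather than the number of products; in this toy only the final pairing step touches product pairs, and only those that are actual hits.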
A ROCS 3D similarity search requires 3D conformers, and generating them accounts for the bulk of the cost of preparing a 3D library. Here again, combinatorial libraries greatly reduce the cost, because conformers need to be generated only for millions of synthons (virtual molecule components) rather than for trillions of products.
Figure 3 shows a schematic for how a ROCS X 3D Library is prepared starting from vendor building blocks. Notably, the ROCS X 3D Library stores two sets of conformers: one for the synthons and one for a sample of products. Both of these sets will be used to initialize the model for the 3D search. In addition to the documentation in this package, the tutorials in the Generative Design Hit to Lead Floes package provide detailed information about each step.
Figure 3. Schematic for preparing a ROCS X 3D library.
Note
The ROCS X - 2D Substructure Search Floe does not use the ROCS X 3D Library. Instead, it uses the 2D Synthon Library that is processed to build the ROCS X 3D Library. See the Run a 2D Substructure Search on a ROCS X Library How-To Guide.
Searching the Library
The ROCS X 3D search uses reinforcement learning and Thompson sampling in a multi-armed bandit framework. The framework casts the search as a series of decisions and uses their outcomes to sample the best hits from a trillion-scale product space. “Good” decisions (those that result in high-scoring, 3D-similar products) are rewarded so that they are made more frequently; “bad” decisions are penalized so that they are avoided. In this way, the reinforcement learning technique automatically balances the explore-exploit tradeoff in the decision space.
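For readers unfamiliar with Thompson sampling, the following is a minimal, generic sketch of the technique on a textbook Bernoulli bandit. It is not the ROCS X model: the arms, the hidden success rates in `true_rates`, and the Beta(1, 1) priors are all assumptions chosen to keep the example small.

```python
import random


def thompson_sampling(true_rates, n_trials, seed=0):
    """Generic Beta-Bernoulli Thompson sampling; returns pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [1] * n_arms  # Beta prior: alpha = 1 for every arm
    failures = [1] * n_arms   # Beta prior: beta = 1 for every arm
    pulls = [0] * n_arms

    for _ in range(n_trials):
        # Draw one plausible success rate per arm from its current posterior.
        draws = [rng.betavariate(successes[i], failures[i]) for i in range(n_arms)]
        # Act greedily on the draws: uncertain arms sometimes draw high
        # (exploration), well-known good arms usually draw high (exploitation).
        arm = max(range(n_arms), key=draws.__getitem__)

        # Observe a reward from the hidden rate and update the posterior.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1

    return pulls


pulls = thompson_sampling(true_rates=[0.1, 0.5, 0.9], n_trials=2000)
print(pulls)
```

No explicit exploration schedule is needed: because each decision is made on a random draw from the posterior, arms with little data are still tried occasionally, and trials concentrate on the best arm as evidence accumulates.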
The search is carried out in two stages (Figure 4): a stage for initializing the bandit model (setting up the decisions) and a stage for sampling from the model.
Figure 4. Schematic for searching a ROCS X 3D library.
Initializing the bandit model uses both sets of conformers prepared in the ROCS X 3D Library (see Figure 3). First, a FastROCS™ search is performed with the product sample conformers against the query to obtain a hit list of top-scoring products from the initial sample. This hit list is used to “seed” the model (alternatively, a user can supply their own hit list from an external search to either replace or supplement the initial sample hit list). The products in the hit list are fragmented to construct “shape arms,” which form the decisions for the model.
Figure 5. Creating shape arms for the bandit model from an initial product sample.
Next, the shape arms are used as queries in another FastROCS search with the synthon conformers. The result of this search is a ranked hit list of synthons for each shape arm. Synthons at the top of the list are more 3D similar to the shape arm core and will be chosen earlier when the shape arm is selected.
Figure 6. FastROCS search to populate shape arms with synthons.
After the model is initialized, the 3D search is performed by running sampling trials on the model. In a sampling trial, two shape arms are selected. Synthons are chosen from the selected shape arms (starting from the highest-scoring synthons) and are coupled together to form a product. The product is evaluated by a ROCS overlay, and the shape arms are rewarded according to the ROCS score. In this way, the model dynamically learns which decisions result in high-scoring products; it also learns to stay away from decisions that result in low-scoring products. What the model has learned shapes the decisions taken in later sampling trials, which in turn determine which products are sampled and scored. The amount of sampling is controlled by the Number of Sampling Trials set for the search and is necessarily much smaller than the total number of products in the 3D library. With sufficient sampling, the algorithm recovers the majority of the top-scoring 3D-similar hits and doesn’t waste time evaluating low-scoring products that won’t appear on the hit lists.
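The loop described above can be sketched as a toy simulation. Everything in it is a stand-in: `ShapeArm`, `toy_score`, and `run_trials` are hypothetical names, synthons are `(name, quality)` tuples, the score is a simple product of qualities in place of a ROCS overlay, and the Beta posterior with fractional updates is one assumed way to reward an arm. The documentation does not specify how ROCS X represents or updates its arms.

```python
import random


class ShapeArm:
    """Hypothetical shape arm: ranked synthons plus a Beta posterior."""

    def __init__(self, name, ranked_synthons):
        self.name = name
        self.synthons = ranked_synthons  # best-ranked synthon first
        self.cursor = 0                  # next synthon to hand out
        self.alpha, self.beta = 1.0, 1.0

    def draw(self, rng):
        return rng.betavariate(self.alpha, self.beta)

    def next_synthon(self):
        # Walk down the ranked list, wrapping around if it is exhausted.
        synthon = self.synthons[self.cursor % len(self.synthons)]
        self.cursor += 1
        return synthon

    def reward(self, score):
        # Fractional update for a score in [0, 1] (an assumed scheme).
        self.alpha += score
        self.beta += 1.0 - score


def make_arm(name, quality, n=200):
    # Toy synthons whose quality decays down the ranked list.
    return ShapeArm(name, [(f"{name}-{i}", quality * (1 - i / (2 * n))) for i in range(n)])


def toy_score(synthon_a, synthon_b):
    """Placeholder for a ROCS overlay score, kept within [0, 1]."""
    return synthon_a[1] * synthon_b[1]


def run_trials(arms_a, arms_b, n_trials, seed=0):
    rng = random.Random(seed)
    scored = {}
    for _ in range(n_trials):
        # Select two shape arms by Thompson sampling, one from each side.
        arm_a = max(arms_a, key=lambda arm: arm.draw(rng))
        arm_b = max(arms_b, key=lambda arm: arm.draw(rng))
        # Couple the next-best synthon from each arm into a product.
        syn_a, syn_b = arm_a.next_synthon(), arm_b.next_synthon()
        score = toy_score(syn_a, syn_b)
        # Reward both arms according to the product's score.
        arm_a.reward(score)
        arm_b.reward(score)
        scored[(syn_a[0], syn_b[0])] = score
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


arms_a = [make_arm("a_good", 0.9), make_arm("a_poor", 0.1)]
arms_b = [make_arm("b_good", 0.8), make_arm("b_poor", 0.1)]
hits = run_trials(arms_a, arms_b, n_trials=300)
print(hits[:3])
print({arm.name: arm.cursor for arm in arms_a + arms_b})
```

In this toy, 300 trials are run against a product space of 160,000 pairs, and the cursor counts show that most trials go to the arms whose synthons yield high-scoring products, which mirrors the behavior described in the paragraph above.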
Costs
The payoff for using the technologies described above (combinatorial libraries, FastROCS for model initialization, reinforcement learning, Thompson sampling, etc.) is that ROCS X can search trillion-scale libraries at a remarkably low cost. Back-of-the-envelope calculations show that a brute-force search of a trillion-scale library with conventional methods is prohibitively expensive. By contrast, preparing and searching a trillion-scale 3D library with ROCS X can be done at a cost low enough to fit most virtual screening budgets.
Figure 7. Comparing the cost of conventional methods versus ROCS X.