FILTER provides a range of properties to predictors to be used as molecular filters. Molecular properties are distinct physical properties that can be measured such as the following:
There have been several attempts at developing fast-approximate QSAR models for bioavailability that have been published. The first of these was Lipinski’s work ([Lipinski-1997]) and it has been followed by work at Pharmacopia [Egan-2000], Abbott [Martin-2005], and GSK [Veber-2002]. The simplest and probably most trusted is the work of LogP and PSA used by Egan. The most recent work by Martin, the Abbott Bioavailability Score (ABS) appears to be a refinement of the first generation models and is designed specifically to categorize a molecule’s probability of having a bioavailability>10% in rats.
The Pharmacokinetic Predictors section in the Filter Files chapter.
There are a number of important structural and chemical features of molecules that it is desirable to limit for virtual screening. The following simple measures are provided as filterable properties:
It also includes slightly more complex algorithms for hydrogen-bond donors and acceptors as well as chiral centers.
The Basic Properties section in the Filter Files chapter.
Functional group removal remains at the heart of the filter algorithm. Functional groups fall into several categories including:
While reasonable defaults are provided for each functional group in all of the above categories, it is unusual to find an experienced computational or medicinal chemist who agrees with all of the default values. Examining the filter file at least once is strongly encouraged. Developing a custom filter to fit a desired design is also strongly encouraged. The filter files were designed as basic guides.
The filter files contain only functional group names. If there is any confusion regarding the definitions, please refer to the Functional Group Rules section for complete descriptions.
Years ago, many high-throughput assays could be interfered with by colored molecules. While assay technology has continued to advance and often this is not a problem, it is recognized that most molecules that are dyes are not the type of molecules that are commonly carried forward in lead-development projects. Therefore, a pattern-based filter for dye molecules is included. While these patterns occasionally identify molecules that are considered acceptable by some users, in general, they identify molecules that the majority of chemists would rather not see at the top of their virtual-screening hitlists.
The XLogP algorithm ([Wang-1997-2]) is provided because its atom-type contribution allows calculation of the XLogP contribution of any fragment in a molecule and allows minimal corrections in a simple additive form to calculate the LogP of any molecule made from combinations of fragments. Further, although the method contains many many free parameters, its simple linear form allows for ready interpretation of the model and most of the parameters in the model make rational sense.
Unfortunately, the original algorithm is difficult to implement as published. First, the internal-hydrogen bond term was calculated using a single 3D conformation. It was found that this was both arbitrary and unnecessary. This arbitrary 3D calculation has been replaced with a 2D approach to recognize common internal-hydrogen bonds. In tests, this 2D method worked comparably to the published 3D algorithm. Next, the training set had a few subtle atom-type inconsistencies.
Both of these problems were corrected and refit to the original XLogP training data. This implementation gives results that are quite similar to the original XLogP algorithm, so it is called OEXLogP to distinguish it from the original method.
The work of [Yalkowsky-1980] at Arizona has resulted in what is now called the “generalized solvation equation.” It states that the solubility of a compound can be broken into two steps, first the melting for the pure solid to pure liquid and second, the phase transfer from pure liquid into water. For many small organic molecules, this second step is somewhat related to LogP.
Because of this relation we choose to explore the use of the XLogP atom-types in solubility prediction. The expectation was that this might provide an approximate though robust and fast method for calculating solubility. We fit the XLogP atom-types to a training set of nearly 1000 public solubilities. From this we derived a linear model for solubility. The model is extremely fast and is useful for classifying compounds as insoluble, poorly soluble, slightly soluble, moderately, soluble or very soluble. The model is notable for the difficulty it has predicting solubilities for compounds with ionizable groups. Further, it is not suitable for the PK predictions that come late in a project. However, it is useful for eliminating compounds with severe solubility problems early in the virtual-screening process.
The solubility parameter in the Filter Files chapter.
Topological polar-surface area (TPSA) is based on the algorithm developed by Ertl et al [Ertl-2000]. In Ertl’s publication, use of TPSA both with and without accounting for phosphorus and sulfur surface-area is reported. However, evidence shows that in most PK applications one is better off not counting the contributions of phosphorus and sulfur atoms toward the total TPSA for a molecule. This implementation of TPSA allows either inclusion or exclusion of phosphorus and sulfur surface area with the default being to not include it.
Including phosphorus and sulfur surface area:
TPSA values are mildly sensitive to the protonation state of a molecule. If the pKaNorm parameter is false, the TPSA value is calculated using the input structure and if pKaNorm is true, the TPSA is calculated using the pKa normalized molecular structure.
The work of Lipinski ([Lipinski-1997]) introduced the application of simple filter-like rules to roughly predict late-stage PK properties, in particular oral bioavailability. Unfortunately, Lipinski’s “Rule-of-Five” has come into the common vernacular to such a large degree that some of the specific details are often lost in the commotion. There are two critical examples of this. First, Lipinski used “violation of 2 rules” to categorize compounds. In the subsequent analysis, a significant difference in the two populations of molecules was detected, however, little analysis of the importance of a single violation was done. However, many now consider a single violation to be bad and two violations to be worse. At this writing, there is no known evidence to support this. Second, Lipinski used well-codified and well-understood yet imprecise definitions of “hydrogen-bond donors” and “hydrogen-bond acceptors” in his classification model. While this makes the algorithm quite understandable and easy to implement, it sometimes causes confusion with those who prefer more refined definitions of hydrogen-bond donors and acceptors.
To address the first problem, users are allowed to set the number of Lipinski failures required to reject a molecule. In keeping with the original publication, the default value is 2. To address the second problem, two kinds of hydrogen-bond donors and acceptors are calculated. In the Lipinski calculation, the published definitions are used (donor count is the number nitrogen or oxygen atoms with at least one hydrogen attached and acceptors are the number of nitrogens and oxygens). For calculation of the number of donors and acceptors in the molecule for the sake of chemical properties, a more complex algorithmic approach is taken. This approach identifies the donors and acceptors outlined in the work of Mills and Dean ([MillsDean-1996]) and also in the book by Jeffrey ([Jeffrey-1997]).
The Shoichet lab ([McGovern-2003], [Seidler-2003]) has demonstrated the importance of small-molecule aggregation in medium and high-throughput assays. Because these small-molecule aggregates can sequester some proteins, they give the appearance of being active inhibitors. There are now several hundred published aggregators in addition to a published QSAR model for predicting aggregation propensity. The user has the ability to eliminate any of the known aggregators as well as the ability to eliminate compounds that are predicted to be aggregators using the QSAR model. In OpenEye’s experience, the published QSAR model for predicting aggregators is quite aggressive. It occasionally identifies compounds that are known to be genuine small-molecule inhibitors of a specific protein. While in most cases this model can be useful, until more definitive work is published, we feel you should gain some experience with the model and judge its performance for yourself.
More recent work in industry indicates that in some cases, aggregation properties are specific to the particular experimental conditions being used. Thus we recommended caution in the interpretation of these predictions. Nevertheless, aggregation remains an important issue in HTS hit follow-up and, short of experimental validation, this flag may be the best-available.
The aggregators parameter in the Filter Files chapter.
The PAINS filters ([Baell-2010]) are a set of functional group filters created to identify and filter promiscuous (non-specific) actives across a number of screening types and targets. The original PAINS paper includes a set of functional group prefilters, created via two iterations of their work, and three sets of structure class filters, called filter sets A, B, and C, which were used to identify and remove promiscuous actives. The OpenEye PAINS filter set is a combination of all four filter sets (the functional group prefilter set plus the three structural class sets) into a single filter set.
There are two distinct use cases for the PAINS filters. The first is the typical prefiltering of structures prior to screening or compound selection to remove undesirable molecules. One can simply eliminate any molecules which fail the PAINS filtering.
The second use case, which may ultimately be more interesting, is to use the PAINS filters retrospectively on a set of data which has been filtered (and tested) using another methodology. In this scenario the PAINS filters can be used to identify potentially problematic reactive functionality in otherwise reasonable, drug-like structures. By using the table output mode and identifying matches for the A, B, and C filter sets, one can identify fairly specific chemical functionalities that have been known to result in non-specific activity. This can be useful information for either prioritizing further lead development or for modifying lead structures to mitigate adverse reactivity.
There are approximately 650 SMARTS rules which have been converted from the SLN query strings provided in the supplemental information for the original paper. For the functional group filters, the authors do not provide identifying names for each individual filter, so our PAINS rules are named based on the comments used to describe each functional group class, prefixed with the string pains_fg_[classname]. For the structural class filters, the authors provided unique RegID values for each filter. For those cases our rules are named based on the PAINS group and RegID as pains_[group]_[RegID].
Note that in certain cases it isn’t possible to exactly represent the SLN queries given in the paper with a single SMARTS, so in those cases the rules have been split out into multiple SMARTS patterns and are named pains_[group]_[RegID]1, pains_[group]_[RegID]2, etc.
Unlike the other filter types, the PAINS filter consists only of functional group filters; no physical property filters are included. It may be desirable to combine the PAINS filter with a separate user-defined set of physical property filters. The section filter file describes creating additional filter rules.