Filtering Theory

Introduction

Filtering attempts to eliminate inappropriate or undesirable compounds from a large set before they are used in modeling studies. The goal is to remove every compound that should not be suggested to a medicinal chemist as a potential hit. The right criteria are case dependent: they vary with the ease of the assay, the intended target, the personal biases of the modeler and medicinal chemist, the strengths of the company, and so on.

To match this need, the default filter encapsulates many of the standard filtering principles, such as removal of unstable, reactive, and toxic moieties. In addition, MolProp TK allows the customization of the filtering criteria to fit specific needs.

The criteria for passing or failing a given molecule fall into three categories.

Physical properties
- Molecular weight
- Topological polar surface area (TPSA)
- logP
- Bioavailability
Atomic and functional group content
- Absolute and relative content of heteroatoms
- Limits on a very wide variety of functional groups
Molecular graph topology
- Number and size of ring systems
- Flexibility of the molecule
- Size and shape of non-ring chains

All of the data generated in filtering molecules can be written to a tab-separated file for easy import into a spreadsheet. This functionality allows for combining the values dynamically for a variety of purposes, including, but not limited to, determining which filter values best fit each project’s needs.

History

When OpenEye’s work on filtering technology began in 2000, the technology was designed simply to remove compounds with reactive or otherwise undesirable functional groups. Over the years, the understanding of lead-like and drug-like compound selection has advanced. In addition, since the publication of Lipinski’s “Rule of 5” [Lipinski-1997], more and more pharmacokinetic properties have been pushed earlier into the virtual screening process.

In addition to providing basic functional group selection, the technology is a one-stop database preparation tool aimed at generating databases suitable for high-throughput virtual screening.

Cheminformatics quality-control
- Valence-state validation
- Aromaticity perception
- Implicit hydrogen perception
- Bond-order perception
Database preparation
- Setting pKa states
- Applying normalizations (tautomers and dative or hypervalent states)
Compound selection
- Physical properties (see above)
- Assay counter-indicators (aggregators and dyes)
- Promiscuous actives ([Baell-2010])
- PK ([Martin-2005], [Veber-2002], [Egan-2000], [Lipinski-1997])

Finally, it should be pointed out that in the virtual screening world, time is of the essence. Algorithms for preliminary database preparation should not take large amounts of time. Because of this, all the calculations included are 2D or graph-based algorithms. While this does occasionally limit the technology, it allows for the delivery of a product that is appropriate for the task of virtual-screening database preparation.

The Rant!

Nearly every computational tool used in early drug discovery yields statistically predictive, rather than absolutely definitive, results. In nearly every case, prudence demands that one consider the causes of false positives and false negatives and attempt to optimize the area under the receiver operating characteristic (ROC) curve for the tool. However, there are well-known methods for improving statistical predictions of this nature that are independent of the absolute false-positive and false-negative rates. These methods include filtering the population to which a test will be applied. By applying a test to smaller populations that contain only molecules appropriate for the specific application at hand, the negative impact of the false-positive rate on the predictive results can be dramatically reduced.

A familiar example from the medical world will serve to illustrate this principle. Assume we have a test for the presence of the new foo virus that has an exceptional ROC curve, with false-positive and false-negative rates of 1/1,000 each. Let us assume that the foo syndrome, caused by the foo virus, affects 1 person in 20,000. If we gave this test to 100,000 people from the general population, we would expect 5 to actually have the foo syndrome. With this test, there is only about a 0.5% chance that any of them would not be detected (that is, be a false negative). However, we would expect about 100 false-positive test results. Thus, of the 105 total positive test results, only 4.8% would actually have the foo syndrome (positive predictive value = 4.8%).

Confusion table for the **unfiltered** *foo virus* test (prevalence 1 in 20,000)
	Actual Positive	Actual Negative
Predicted Positive	True Positives = 5	False Positives = 100	Positive Predictive Value = 4.8%
Predicted Negative	False Negatives = 0	True Negatives = 99,895

Alternatively, we could start by using very simple screening before applying the test. We first eliminate people who do not have any risk factors for contracting the foo virus. Next we may eliminate people whose blood is incompatible with the test for the foo virus. Further, we may want to eliminate people who acknowledge that they will refuse treatment for the foo virus even if we determine that they do have it. By these admittedly simple screens, we apply the test for the foo virus to a much smaller group with a decidedly higher prevalence of the virus. For instance, after the filtering, we may be left with a group of only 1,000 people who have a 1 in 200 chance of having the syndrome. Now, we still have the same 5 people who actually have the disease, but we only expect 1 false positive test. Suddenly, there are 6 total positive tests, and 83% of them actually have the syndrome! This is reflected in a much more reasonable (83%) positive predictive value.

Confusion table for the **filtered** *foo virus* test (prevalence 1 in 200)
	Actual Positive	Actual Negative
Predicted Positive	True Positives = 5	False Positives = 1	Positive Predictive Value = 83%
Predicted Negative	False Negatives = 0	True Negatives = 994
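
The two foo virus tables can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name and its parameters are not part of MolProp TK, and it reports expected (fractional) counts, which the tables above round to whole people.

```python
def confusion_counts(population, prevalence, false_positive_rate, false_negative_rate):
    """Expected confusion-table counts and PPV for a binary test."""
    actual_pos = population * prevalence
    actual_neg = population - actual_pos
    fn = actual_pos * false_negative_rate   # infected people the test misses
    tp = actual_pos - fn                    # infected people the test catches
    fp = actual_neg * false_positive_rate   # healthy people flagged by the test
    tn = actual_neg - fp                    # healthy people correctly cleared
    ppv = tp / (tp + fp)                    # fraction of positive results that are real
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "PPV": ppv}


# General population: 100,000 people, prevalence 1 in 20,000.
unfiltered = confusion_counts(100_000, 1 / 20_000, 1 / 1_000, 1 / 1_000)
# Pre-screened group: 1,000 people, prevalence 1 in 200.
filtered = confusion_counts(1_000, 1 / 200, 1 / 1_000, 1 / 1_000)

for label, table in (("unfiltered", unfiltered), ("filtered", filtered)):
    print(label, {name: round(value, 3) for name, value in table.items()})
```

The test itself is identical in both calls; only the population changes, and that alone moves the positive predictive value from roughly 5% to roughly 83%.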

Bringing the discussion back to drug design, if we have a ligand-based design tool such as ROCS, we can imagine that it may have a false-positive rate as low as 1 in 10,000. For this exercise, let’s assume no false negatives. When using it to find 50 inhibitors in a database of 2.5 million available compounds, we would identify 300 potential inhibitors, and 5 out of every 6 of these would be false positives (positive predictive value of 17%)! If we first run FILTER and eliminate 65% of the 2.5 million compounds without losing any of the true inhibitors, this leaves us with 875,000 compounds to push through ROCS. There will be about 88 false positives to go with the 50 true positives, and the positive predictive value will more than double (to 36%) with relatively little work.

Confusion table for the **unfiltered** ROCS virtual screen
	Actual Positive	Actual Negative
Predicted Positive	True Positives = 50	False Positives = 250	Positive Predictive Value = 17%
Predicted Negative	False Negatives = 0	True Negatives = 2,499,700

Confusion table for the **filtered** ROCS virtual screen
	Actual Positive	Actual Negative
Predicted Positive	True Positives = 50	False Positives = 88	Positive Predictive Value = 36%
Predicted Negative	False Negatives = 0	True Negatives = 874,862
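
The same calculation can be extended to ask how the positive predictive value responds to more or less aggressive filtering. This is a minimal sketch under two assumptions: no false negatives, as in the text, and a filter that removes only inactive compounds, so all 50 actives survive. The function name and the 90% case are hypothetical and are not taken from ROCS or FILTER.

```python
def screen_ppv(n_compounds, n_actives, false_positive_rate, fraction_removed):
    """PPV of a screen run after a filter removes a fraction of the database.

    Assumes no false negatives and that the filter removes only inactives.
    """
    remaining = n_compounds * (1.0 - fraction_removed)
    inactives = remaining - n_actives
    false_positives = inactives * false_positive_rate
    return n_actives / (n_actives + false_positives)


# 2.5 million compounds, 50 true inhibitors, false-positive rate of 1 in 10,000.
for fraction in (0.0, 0.65, 0.90):
    ppv = screen_ppv(2_500_000, 50, 1e-4, fraction)
    print(f"{fraction:.0%} removed: PPV = {ppv:.1%}")
```

In practice a filter will occasionally discard a true active, so these values are an upper bound on the benefit.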

Filtering Principles

The same principle of increasing positive predictive value by removing obvious true negatives applies to screening for lead candidates, regardless of whether it is virtual screening or high-throughput screening. While both are reasonable screens, each can be plagued by very low positive-predictive values (despite low false-positive rates), particularly when applied to all available compounds, or large virtual libraries. Simple filtering techniques focus the set of compounds passed on to more computationally intensive screening methods.

The first approach to consider is filtering based on functional groups. Generally speaking, there are toxic and reactive functional groups that you simply do not want to consider (alkyl bromides, metals, etc.). There are also functional groups that are not strictly forbidden but are not desired in large quantities. For instance, para-fluorobenzene and trifluoromethyl groups have specific purposes, but heavily fluorinated molecules can be eliminated.

Beyond simple functional group filtering, you can consider both simple and complex physical properties that characterize the kinds of compounds you would like to keep and those you would like to eliminate. These properties attempt to capture aspects of “drug-likeness”, such as bioavailability, solubility, toxicity, and synthetic accessibility, even before the primary high-throughput or virtual screening, which is geared primarily toward detecting potency alone. The best known of the physical property filters is Lipinski’s “rule-of-five”, which focuses on bioavailability [Lipinski-1997]. However, many other physical properties, such as solubility, atomic content, ring structures, and surface area ratios, can also be considered. MolProp TK provides algorithms for calculating many of these properties and for applying them as filters based on literature studies.

Finally, you should eliminate the types of compounds that can be troublesome at later stages. For instance, Shoichet’s aggregating compounds often produce false positives that can waste enormous resources if they are identified as hits by virtual or high-throughput screening [McGovern-2003] [Seidler-2003]. Similarly, dyes can appear to be inhibitors by interfering with colorimetric or fluorometric assays or by binding non-specifically to the target protein.

Variations of Filters

Different types of filters are appropriate under different circumstances. Very early in a project, when little or no SAR is available, very strict drug-like filters can be applied. This prevents a project team from spending chemistry resources pursuing difficult compounds that may not be modifiable to introduce appropriate properties. However, when considering compounds for purchase for HTS, different filters can be applied. Oprea et al. pointed out that the best molecules for initial HTS are smaller and less functionalised than drugs, but with some activity [Oprea-2000]. Therefore, strict lead-like filters can be applied to ensure that hits identified from HTS have sufficient “room” for elaboration into (usually larger and more highly functionalised) leads. Later, when SAR suggests that particular compounds or series may yield valuable information, filtering criteria can be loosened, because the secondary screens being applied (QSAR models, similarity to known actives) are effective in detecting useful compounds. Reflecting back on the medical analogy, this is the case where an improved primary screen with a dramatically lower false-positive rate (say 1 in 100,000) can be safely applied to a larger population without terrible effects on the positive predictive value.
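
The closing point about an improved primary screen can be checked directly from the rates, without counting people. The sketch below applies Bayes' rule to the foo virus numbers used earlier in this chapter. The function name is illustrative, and keeping the false-negative rate at 1 in 1,000 for the improved test is an assumption, since the text specifies only the new false-positive rate.

```python
def ppv_from_rates(prevalence, false_positive_rate, false_negative_rate=0.0):
    """Positive predictive value from prevalence and error rates (Bayes' rule)."""
    sensitivity = 1.0 - false_negative_rate
    true_pos = sensitivity * prevalence                   # P(positive and infected)
    false_pos = false_positive_rate * (1.0 - prevalence)  # P(positive and healthy)
    return true_pos / (true_pos + false_pos)


general_prevalence = 1 / 20_000  # unfiltered general population

original_test = ppv_from_rates(general_prevalence, 1 / 1_000, 1 / 1_000)
improved_test = ppv_from_rates(general_prevalence, 1 / 100_000, 1 / 1_000)

print(f"original test, unfiltered population: PPV = {original_test:.1%}")
print(f"improved test, unfiltered population: PPV = {improved_test:.1%}")
```

With these numbers, the improved test on the unfiltered population reaches about the same positive predictive value (roughly 83%) that the original test achieved only after filtering, which is why a better secondary screen permits looser filters.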

MolProp TK provides the following filters:

BlockBuster: The BlockBuster filter is based on 141 best-selling, non-antibiotic prescription drugs. We designed the physical property portion of the filter so that it passes all 141 of these compounds, and its physical property values are quite good. However, the functional group filters are probably too restrictive, because 141 compounds are not sufficient to span all acceptable functionality.
Drug: The original Drug filter ([Oprea-2000]) is provided as well. However, experience has shown it to be too restrictive, and the BlockBuster filter was developed in response to complaints about this.
Fragment
Lead: The Lead filter corresponds to the [Oprea-2000] lead-like filters useful for preparing HTS screening databases.
PAINS: The PAINS filter is based on [Baell-2010], which describes work to identify and filter promiscuous (non-specific) actives across a number of screening types and targets. Unlike the other filter types, the PAINS filter consists only of functional group filters; no physical property filters are included. In practice, it may be desirable to combine the PAINS filter with a separate user-defined set of physical property filters.

Hint

We recommend the BlockBuster filter, the default, for most purposes. If your project is unusual, or you are unsatisfied with the results, we recommend you review the filter file for your specific filtering needs.

If you decide to modify the filter file, the depictions found in the Functional Group Rules section can be particularly helpful in determining what functional groups are indicated by each name in the file.

Accumulation of Rules

FILTER contains numerous rules that judge the quality of molecules on many different facets. When examined individually, each of these rules seems quite reasonable and even profitable. However, when each molecule is tested against hundreds of individual filters, the fraction of molecules that pass all the filters can be surprisingly small. Sometimes less than 50% of the compounds in a vendor database pass the filters. If this is unacceptable, we recommend you examine the predicted aggregator, solubility, and Veber filters. In our experience, these are the most common failures. The best way to investigate failures is to look at the filter log.

The BlockBuster filter was adjusted to demonstrate this in a tangible way. Rather than spanning the entire observed range of each property, each limit was set to cover only the 2.5th to the 97.5th percentile. Both the original BlockBuster filter and the adjusted filter have limits in reasonable ranges. For instance, the full range of molecular weight for the BlockBuster filter spans 130 to 781, while the 2.5th percentile is 145 and the 97.5th percentile is 570. The remarkable result is that when the reduced filter is used, only 75 of the 141 original molecules pass the filter! This demonstrates how slight changes to many filters can lead to a significant reduction in the number of compounds that pass all of the filters.
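
A rough way to see why this happens is to treat each rule as an independent test that passes 95% of molecules, so that the overall pass rate is the product of the individual rates. This is only an idealized sketch: real molecular properties are correlated, the number of properties in the BlockBuster filter is not stated here, and the function name is illustrative.

```python
import math


def fraction_passing_all(per_rule_pass_rate, n_rules):
    """Fraction of molecules passing every rule, assuming independent rules."""
    return per_rule_pass_rate ** n_rules


# Each rule keeps the central 95% (2.5th to 97.5th percentile) of molecules.
for n_rules in (1, 5, 10, 12, 20):
    print(f"{n_rules:2d} rules: {fraction_passing_all(0.95, n_rules):.1%} pass")

# How many independent 95% rules would give the observed 75 of 141 survivors?
observed_fraction = 75 / 141
equivalent_rules = math.log(observed_fraction) / math.log(0.95)
print(f"observed {observed_fraction:.1%} corresponds to about {equivalent_rules:.1f} independent rules")
```

Under this simplification, roughly a dozen independent 95% rules are enough to cut the passing set nearly in half, which is consistent in magnitude with the 75 of 141 result above.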

Taking the opposite approach of allowing everything to pass can be equally futile. To demonstrate this, a filter was designed around the “small molecule drug” file available from the DrugBank website. In order to pass every one of these molecules, including some that would not be acceptable for modern project work, many of the individual filters must be set to unreasonable values. For instance, the molecular weight range is 30 to 1500 and the heteroatom count range is 1 to 60!

Hint

OpenEye cannot magically divine the needs of every project. You should personally inspect the filter file. The depictions found in the Functional Group Rules section can be particularly helpful in discerning what functional groups are desirable and undesirable.