Filtering Theory

Introduction

Filtering attempts to eliminate inappropriate or undesirable compounds from a large set before they are used in modeling studies. The goal is to remove all of the compounds that should not be suggested to a medicinal chemist as a potential hit. This exercise is inherently case dependent: it varies with the ease of the assay, the intended target, the personal biases of the modeler and medicinal chemist, the strengths of the company, and so on.

To match this need, the default filter encapsulates many of the standard filtering principles, such as removal of unstable, reactive, and toxic moieties. In addition, FILTER allows the customization of the filtering criteria to fit specific needs.

The criteria for passing or failing a given molecule fall into three categories.

  • Physical properties

    • Molecular weight

    • Topological polar surface area (TPSA)

    • logP

    • Bioavailability

  • Atomic and functional group content

    • Absolute and relative content of heteroatoms

    • Limits on a very wide variety of functional groups

  • Molecular graph topology

    • Number and size of ring systems

    • Flexibility of the molecule

    • Size and shape of non-ring chains

All of the data generated in filtering molecules can be written to a tab-separated file for easy import into a spreadsheet. This functionality allows for combining the values dynamically for a variety of purposes, including, but not limited to, determining which filter values best fit each project’s needs.
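
For example, the report can be pulled straight into Python for this kind of dynamic combination; a minimal sketch (the file name and column names here are hypothetical, so check the header of your own report):

    import pandas as pd

    df = pd.read_csv("filter_report.txt", sep="\t")
    # Combine values dynamically, e.g. an ad hoc molecular-weight/LogP cut.
    passing = df[(df["MOL_WT"] <= 500) & (df["XLOGP"] <= 5.0)]
    print(len(passing), "of", len(df), "compounds pass this ad hoc cut")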

History

When OpenEye’s work on filtering technology began in 2000, it was designed simply to remove compounds with reactive or otherwise undesirable functional groups. Over the years, the understanding of lead-like and drug-like compound selection has advanced. In addition, with the publication of Lipinski’s “Rule of 5” [Lipinski-1997], more and more pharmacokinetic properties have been pushed earlier into the virtual screening process.

In addition to providing basic functional group selection, the technology is a one-stop database preparation tool aimed at generating databases suitable for high-throughput virtual screening.

  • Cheminformatics quality-control

    • Valence-state validation

    • Aromaticity perception

    • Implicit hydrogen perception

    • Bond-order perception

  • Database preparation

    • Setting pKa states

    • Applying normalizations (tautomers & dative or hypervalent states)

  • Compound selection

Finally, it should be pointed out that in the virtual screening world, time is of the essence. Algorithms for preliminary database preparation should not take large amounts of time. Because of this, all the calculations included are 2D or graph-based algorithms. While this does occasionally limit the technology, it allows for the delivery of a product that is appropriate for the task of virtual-screening database preparation.

The Rant!

Nearly every computational tool used in early drug discovery yields statistically predictive, rather than absolutely definitive, results. In nearly every case, prudence demands that one consider the causes of false positives and false negatives and attempt to optimize the area under the receiver operating characteristic (ROC) curve for the computational tool. However, there are well known methods for improving statistical predictions of this nature that are independent of the absolute false-positive and false-negative rates. These methods include filtering the population to which a test will be applied. By applying a test to a smaller population that contains only molecules appropriate for the specific application at hand, the negative impact of the false-positive rate on the predictive results can be dramatically reduced.

A familiar example from the medical world illustrates this principle. Assume we have a test for the presence of the new foo virus with an exceptional ROC curve: false-positive and false-negative rates of 1/1,000 each. Let us assume that the foo syndrome, caused by the foo virus, affects 1 person in 20,000. If we gave this test to 100,000 people from the general population, we would expect 5 to actually have the foo syndrome. With a false-negative rate of 1/1,000, there is only about a 0.5% chance that any of the 5 would be missed (i.e., be a false negative). However, we would expect about 100 false-positive test results. Thus, of the 105 total positive test results, only 4.8% would actually have the foo syndrome (positive predictive value = 4.8%).

Confusion table for the unfiltered foo virus test (prevalence 1 in 20,000)

                        Actual Positive       Actual Negative
  Predicted Positive    True Positives = 5    False Positives = 100
  Predicted Negative    False Negatives = 0   True Negatives = 99,895

  Positive Predictive Value = 5 / 105 = 4.8%

Alternatively, we could apply some very simple screening before administering the test. First, we eliminate people who do not have any risk factors for contracting the foo virus. Next, we might eliminate people whose blood is incompatible with the test for the foo virus. Further, we might eliminate people who acknowledge that they would refuse treatment even if we determined that they had the virus. With these admittedly simple screens, we apply the test to a much smaller group with a decidedly higher prevalence of the virus. For instance, after the filtering we may be left with a group of only 1,000 people who have a 1 in 200 chance of having the syndrome. We still have the same 5 people who actually have the disease, but we now expect only 1 false-positive test. Suddenly there are 6 total positive tests, and 83% of them actually have the syndrome: a far more reasonable positive predictive value.

Confusion table for the filtered foo virus test (prevalence 1 in 200)

                        Actual Positive       Actual Negative
  Predicted Positive    True Positives = 5    False Positives = 1
  Predicted Negative    False Negatives = 0   True Negatives = 994

  Positive Predictive Value = 5 / 6 = 83%
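
The effect of prevalence on positive predictive value is simple arithmetic; the following minimal Python sketch reproduces the tables above, and the same function reproduces the virtual-screening tables below:

    def positive_predictive_value(population, prevalence, fp_rate, fn_rate=0.0):
        """Expected PPV for a screen applied to a given population."""
        actual_pos = population * prevalence
        actual_neg = population - actual_pos
        tp = actual_pos * (1.0 - fn_rate)  # expected true positives
        fp = actual_neg * fp_rate          # expected false positives
        return tp / (tp + fp)

    print(positive_predictive_value(100_000, 1 / 20_000, 1 / 1_000))  # ~0.048
    print(positive_predictive_value(1_000, 1 / 200, 1 / 1_000))       # ~0.83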

Bringing the discussion back to drug design: if we have a ligand-based design tool such as ROCS, we can imagine that its ROC curve may have a false-positive rate as low as 1 in 10,000. For this exercise, let's assume no false negatives. Using that tool to identify 50 inhibitors from a database of 2.5 million available compounds, we would identify 300 potential inhibitors, and 5 out of every 6 of these would be false positives (positive predictive value of 17%)! If we first run filter and eliminate 65% of the 2.5 million compounds, this leaves us with 875,000 compounds to push through ROCS. There will be about 88 false positives to go with the 50 true positives, and the positive predictive value will increase more than two-fold with relatively little work.

Confusion table for the unfiltered ROCS virtual screen

                        Actual Positive        Actual Negative
  Predicted Positive    True Positives = 50    False Positives = 250
  Predicted Negative    False Negatives = 0    True Negatives = 2,499,700

  Positive Predictive Value = 50 / 300 = 17%

Confusion table for the filtered ROCS virtual screen

                        Actual Positive        Actual Negative
  Predicted Positive    True Positives = 50    False Positives = 88
  Predicted Negative    False Negatives = 0    True Negatives = 874,862

  Positive Predictive Value = 50 / 138 = 36%

Filtering Principles

The same principle of increasing positive predictive value by removing obvious true negatives applies to screening for lead candidates, whether the screen is virtual or high-throughput. Both are reasonable screens, but each can be plagued by very low positive predictive values (despite low false-positive rates), particularly when applied to all available compounds or to large virtual libraries. Simple filtering techniques focus the set of compounds passed on to more computationally intensive screening methods.

The first approach to consider is filtering based on functional groups. Generally speaking, there are toxic and reactive functional groups that you simply do not want to consider (alkyl bromides, metals, etc.). There are also functional groups that are not strictly forbidden but are not desired in large quantities. For instance, para-fluorobenzene or trifluoromethyl groups have specific purposes, but heavily fluorinated molecules can be eliminated.

Beyond simple functional-group filtering, you can consider both simple and complex physical properties that characterize the kinds of compounds you would like to keep and those you would like to eliminate. These properties attempt to capture aspects of “drug-likeness”, such as bioavailability, solubility, toxicity, and synthetic accessibility, even before the primary high-throughput or virtual screen, which is geared primarily toward detecting potency alone. The best known of the physical-property filters is Lipinski's “rule-of-five”, which focuses on bioavailability [Lipinski-1997]. However, many other physical properties, such as solubility, atomic content, ring structures, and surface-area ratios, can also be considered. FILTER provides algorithms for calculating many of these properties and applies filters based on literature studies.
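
As a rough illustration of property-based filtering, here is a sketch using the open-source RDKit rather than FILTER itself; the thresholds are examples only (the molecular-weight window borrows the BlockBuster range discussed later in this chapter):

    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors, rdMolDescriptors

    def passes_property_cuts(smiles):
        """Illustrative drug-like property window; thresholds are examples only."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return False
        return (130.0 <= Descriptors.MolWt(mol) <= 781.0    # example MW window
                and Crippen.MolLogP(mol) <= 5.0             # example LogP cap
                and rdMolDescriptors.CalcTPSA(mol) <= 150.0)  # example TPSA cap

    print(passes_property_cuts("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> True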

Finally, you should eliminate the types of compounds that can be troublesome at later stages. For instance, Shoichet's aggregating compounds often produce false positives that can waste enormous resources if identified by virtual or high-throughput screening [McGovern-2003], [Seidler-2003]. Similarly, dyes can appear to be inhibitors by interfering with colorimetric or fluorometric assays or by binding non-specifically to the target protein.

Variations of Filters

Different types of filters are appropriate under different circumstances. Very early in a project, when little or no SAR is available, very strict drug-like filters can be applied. This prevents a project team from spending chemistry resources pursuing difficult compounds that may not be modifiable to introduce appropriate properties. When considering compounds for purchase for HTS, however, different filters can be applied. Oprea et al. pointed out that the best molecules for initial HTS are smaller and less functionalised than drugs, but with some activity [Oprea-2000]. Therefore, strict lead-like filters can be applied to ensure that hits identified from HTS have sufficient “room” for elaboration into (usually larger and more highly functionalised) leads. When SAR suggests that particular compounds or series may yield valuable information, filtering criteria can be loosened, because the secondary screens being applied (QSAR models, similarity to known actives) are effective at detecting useful compounds. Reflecting back on the medical analogy, this is the case where an improved primary screen with a dramatically improved false-positive rate (say 1 in 100,000) can safely be applied to a larger population without terrible effects on the positive predictive value.

FILTER provides the following filters:

  • BlockBuster: The BlockBuster filter is based on 141 best-selling, non-antibiotic, prescription drugs. We designed the physical-property portion of the filter so that it passes all 141 compounds, and the resulting property ranges are quite good. However, the functional-group filters are probably too restrictive, because 141 compounds are not sufficient to span all acceptable functionality.

  • Drug: The original Drug filter ([Oprea-2000]) is provided as well. However, experience has shown it to be too restrictive; the BlockBuster filter was developed in response.

  • Fragment

  • Lead: The Lead filter corresponds to the [Oprea-2000] lead-like filters useful for preparing HTS screening databases.

  • PAINS: The PAINS filter is based on [Baell-2010], which describes work to identify and filter promiscuous (non-specific) actives across a number of screening types and targets. Unlike the other filter types, the PAINS filter consists only of functional group filters; no physical property filters are included. In practice it may be desirable to combine the PAINS filter with a separate user-defined set of physical property filters.

Hint

We recommend the BlockBuster filter, the default, for most purposes. If your project is unusual, or you are unsatisfied with the results, we recommend reviewing the filter file for your specific filtering needs.

If you decide to modify the filter file the depictions found in the Functional Group Rules section can be particularly helpful in determining what functional groups are indicated by each name in the file.

Accumulation of Rules

FILTER contains numerous rules that judge the quality of molecules on many different facets. Examined individually, each of these rules seems quite reasonable and even profitable. However, when each molecule is tested against hundreds of individual filters, the fraction of molecules that pass all of them can be surprisingly small: sometimes fewer than 50% of the compounds in a vendor database pass. If this is unacceptable, we recommend examining the predicted-aggregator, solubility, and Veber filters; in our experience, these are the most common failures. The best way to investigate failures is to look at the filter log.

To demonstrate this in a tangible way, the BlockBuster filter was adjusted so that each property range covered the 2.5th through the 97.5th percentile rather than the entire observed range. Both the original and the adjusted ranges are individually reasonable. For instance, the full molecular-weight range for the BlockBuster filter spans 130 to 781, while the 2.5th and 97.5th percentiles are 145 and 570. The remarkable result is that when the reduced filter is used, only 75 of the 141 original molecules pass! This demonstrates how slight changes to many filters can lead to a significant reduction in the number of compounds that pass all of them.

Taking the opposite approach of allowing everything to pass can be equally futile. To demonstrate this, a filter was designed around the “small molecule drug” file available from the DrugBank website. In order to pass every one of these molecules, including some that would not be acceptable for modern project work, many of the individual filters must be set to unreasonable values. For instance, the molecular weight range is 30 to 1500 and the hetero-atom count range is 1 to 60!

Hint

OpenEye cannot magically divine the needs of every project. You should personally inspect the filter file. The depictions found in the Functional Group Rules section can be particularly helpful in discerning which functional groups are desirable and undesirable.

Filter Preprocessing

Before any of the molecular property filters are applied, a preprocessing step occurs that can alter the molecule significantly to fit the criteria needed for most modeling applications. This preprocessing is a precisely defined series of stages applied to the molecule in the following order:

  1. Metal Removal

  2. Salt Removal

  3. Canonicalization

  4. pKa Normalization

  5. Normalization

  6. Reagent Selection

  7. Type Checking

  8. MMFF94 Atom Type Checking

Metal Removal

Metal removal is the first stage of element-based filtering. This stage removes specified metal complexes from the molecule; it does not reject a molecule for having a metal complex. This allows the filter to treat atoms in the counter-ion portion of a molecule separately from the atoms in the primary portion of the molecular record.

For instance, organic molecules complexed with silver can be eliminated based on the metal chelate even though the organic portion itself is acceptable, while at the same time a sulfate counter-ion can be removed from another molecule before it leads to elimination of an acceptable cationic parent.

Salt Removal

This step deletes all atoms that are not part of the largest connected component of a compound. This effectively eliminates all non-covalently bound portions of the compound.
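
As a sketch of the idea, here is a largest-fragment salt stripper written with the open-source RDKit rather than FILTER itself (tie-breaking between equally sized fragments is left unspecified here):

    from rdkit import Chem

    def strip_to_largest_fragment(mol):
        """Keep only the largest connected component (a simple salt stripper)."""
        frags = Chem.GetMolFrags(mol, asMols=True, sanitizeFrags=False)
        return max(frags, key=lambda f: f.GetNumAtoms())

    mol = Chem.MolFromSmiles("CC(=O)[O-].[Na+]")  # sodium acetate
    print(Chem.MolToSmiles(strip_to_largest_fragment(mol)))  # CC(=O)[O-]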

Canonicalization

This step canonicalizes the atom and bond order of the parts of the molecule that are left after the previous removal steps. This is necessary to avoid different atom orderings producing slightly different normalizations in the following normalization steps.

pKa Normalization

pKa Normalization uses a rule-based system to set the ionization state of input molecules. If pKa normalization is turned on, the molecule is set to its most energetically favorable ionization state for pH=7.4. The rule-based nature of this calculation allows it to be very fast. Further, despite being rule-based, this approach takes into account many secondary charge interactions.

While more advanced levels of theory can be found for predicting ionization states, this method is very well suited to virtual-screening database preparation. However, this may not be appropriate for hit-to-lead or lead optimization.
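
To make the rule-based idea concrete, here is a toy two-rule ionizer written with the open-source RDKit; this is not FILTER's implementation, and a production system needs many more rules plus handling of secondary charge interactions:

    from rdkit import Chem

    def naive_ph74(mol):
        """Toy rule-based ionizer: two of the many rules a real system needs."""
        rw = Chem.RWMol(mol)
        # Deprotonate carboxylic acids (pKa ~ 4-5, anionic at pH 7.4).
        for c, o_dbl, o_h in rw.GetSubstructMatches(
                Chem.MolFromSmarts("[CX3](=O)[OX2H1]")):
            atom = rw.GetAtomWithIdx(o_h)
            atom.SetFormalCharge(-1)
            atom.SetNumExplicitHs(0)
        # Protonate primary aliphatic amines (pKa ~ 10, cationic at pH 7.4).
        for (n,) in rw.GetSubstructMatches(
                Chem.MolFromSmarts("[NX3;H2;!$(NC=O);!$(Nc)]")):
            atom = rw.GetAtomWithIdx(n)
            atom.SetFormalCharge(1)
            atom.SetNumExplicitHs(3)
        mol = rw.GetMol()
        Chem.SanitizeMol(mol)
        return mol

    mol = Chem.MolFromSmiles("NCC(=O)O")  # glycine
    print(Chem.MolToSmiles(naive_ph74(mol)))  # -> [NH3+]CC(=O)[O-]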

Normalization

In addition to pKa normalization, FILTER allows any number of additional molecular normalizations. Since normalizations are usually specific to a particular company or site, FILTER provides the ability for users to input normalizations, such as the nitro tautomer state, but does not provide default implementations.

Reagent Selection

Reagent selection for small linear-library synthesis or large combinatorial-library synthesis is still a necessary task at many pharmaceutical companies. A user hoping to identify a set of acyl-halide reagents can specify a selection parameter requiring that each compound contain exactly one acyl halide. In addition, the user might modify the filter to exclude functional groups (such as primary amines) that may be acceptable for typical lead-like molecules but are not acceptable for the specific reagent in mind.

Therefore, the selection parameter is the reverse of a filtering parameter: the molecule must include the given substructure in order to pass the filter.
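
The acyl-halide example is easy to sketch with substructure counting; shown here with the open-source RDKit as an illustration of the concept, not of FILTER's select syntax:

    from rdkit import Chem

    ACYL_HALIDE = Chem.MolFromSmarts("[CX3](=O)[F,Cl,Br,I]")

    def select_reagent(smiles):
        """Selection: keep a compound only if it has exactly one acyl halide."""
        mol = Chem.MolFromSmiles(smiles)
        return mol is not None and len(mol.GetSubstructMatches(ACYL_HALIDE)) == 1

    print(select_reagent("CC(=O)Cl"))         # acetyl chloride -> True
    print(select_reagent("ClC(=O)CC(=O)Cl"))  # two acyl halides -> False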

See also

The select parameter in filter files.

Type Checking

This checks the valence state and formal charge of the entire molecule. The check identifies molecules that are poorly specified, or represent nonsensical chemical states, often from corrupt input data. For example, an oxygen with eight hydrogens attached or a carbon with a +9 formal charge would be rejected.
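
The open-source RDKit applies an analogous check during parsing and sanitization, which illustrates the behavior (FILTER performs its own validation):

    from rdkit import Chem

    # Sanitization rejects nonsensical valence states, e.g. a five-bond carbon.
    bad = Chem.MolFromSmiles("C(C)(C)(C)(C)C")
    print(bad)  # None: the record would be rejected rather than passed along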

MMFF94 Atom Type Checking

This checks that all the atoms of the molecule have valid MMFF94 atom type assignments. The check identifies molecules that will fail downstream processing that depends on MMFF94 atom types (e.g. Omega).
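
A comparable check is available in the open-source RDKit, shown here as an illustration; FILTER performs its own MMFF94 typing:

    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.AddHs(Chem.MolFromSmiles("Oc1ccccc1"))  # phenol, explicit H
    print(AllChem.MMFFHasAllMoleculeParams(mol))       # True: every atom typed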

Molecular Properties and Predictors

FILTER provides a range of properties and predictors to be used as molecular filters. Molecular properties are distinct physical properties that can be measured directly; predictors are fast models of quantities, such as bioavailability, that cannot be.

Several fast, approximate QSAR models for bioavailability have been published. The first was Lipinski's work ([Lipinski-1997]), and it has been followed by work at Pharmacopeia [Egan-2000], Abbott [Martin-2005], and GSK [Veber-2002]. The simplest, and probably the most trusted, is Egan's model based on LogP and PSA. The most recent work by Martin, the Abbott Bioavailability Score (ABS), appears to be a refinement of the first-generation models and is designed specifically to categorize a molecule's probability of having a bioavailability greater than 10% in rats.
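
For orientation, a commonly quoted rectangular simplification of Egan's LogP/PSA model can be sketched with the open-source RDKit; the bounds below are approximations from the literature, not Egan's fitted ellipse, and Crippen LogP stands in for the ALogP98 descriptor used in the original work:

    from rdkit import Chem
    from rdkit.Chem import Crippen, rdMolDescriptors

    def egan_like_pass(smiles):
        """Rectangular simplification of an Egan-style LogP/PSA check."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return False
        logp = Crippen.MolLogP(mol)            # stand-in for ALogP98
        tpsa = rdMolDescriptors.CalcTPSA(mol)
        return -1.0 <= logp <= 5.88 and tpsa <= 131.6

    print(egan_like_pass("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> True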

See also

The Pharmacokinetic Predictors section in the Filter Files chapter.

Structural and Chemical Features

There are a number of important structural and chemical features of molecules that it is desirable to limit for virtual screening. The following simple measures are provided as filterable properties:

  • Molecular weight

  • Ring count

  • Ring-system size

  • Size of non-ring structures

  • Length of unbranched chains

  • Hetero-atom fraction

  • Halide fraction

  • Formal charges

  • Rotatable bonds

It also includes slightly more complex algorithms for hydrogen-bond donors and acceptors as well as chiral centers.
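
Many of these measures are easy to reproduce for sanity checking; a sketch with the open-source RDKit, whose definitions of rings, rotatable bonds, and donors/acceptors may differ in detail from FILTER's:

    from rdkit import Chem
    from rdkit.Chem import Descriptors, rdMolDescriptors

    mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")  # acetaminophen
    print("MolWt:      ", Descriptors.MolWt(mol))
    print("Rings:      ", rdMolDescriptors.CalcNumRings(mol))
    print("RotBonds:   ", rdMolDescriptors.CalcNumRotatableBonds(mol))
    print("HBD/HBA:    ", rdMolDescriptors.CalcNumHBD(mol),
          rdMolDescriptors.CalcNumHBA(mol))
    hetero = sum(a.GetAtomicNum() not in (1, 6) for a in mol.GetAtoms())
    print("Hetero frac:", hetero / mol.GetNumHeavyAtoms())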

See also

The Basic Properties section in the Filter Files chapter.

Functional Groups

Functional group removal remains at the heart of the filter algorithm. Functional groups fall into several categories including:

  • Reactive or labile groups

  • Undesirable groups

  • Generally acceptable groups

  • Protecting groups

  • User-derived groups

While reasonable defaults are provided for each functional group in all of the above categories, it is unusual to find an experienced computational or medicinal chemist who agrees with all of the default values. You are therefore strongly encouraged to examine the filter file at least once, and to develop a custom filter that fits your design goals. The filter files were designed as basic guides.

See also

The filter files contain only functional group names. If there is any confusion regarding the definitions, please refer to the Functional Group Rules section for complete descriptions.

Dyes

Years ago, colored molecules could interfere with many high-throughput assays. While assay technology has advanced to the point that this is often no longer a problem, most molecules that are dyes are not the type of molecule commonly carried forward in lead-development projects. Therefore, a pattern-based filter for dye molecules is included. While these patterns occasionally identify molecules that some users consider acceptable, in general they identify molecules that the majority of chemists would rather not see at the top of their virtual screening hit lists.

LogP

The XLOGP algorithm ([Wang-R-1997]) is provided because its atom-type contributions allow calculation of the XLogP contribution of any fragment in a molecule, and allow minimal corrections in a simple additive form to calculate the LogP of any molecule made from combinations of fragments. Further, although the method contains many free parameters, its simple linear form allows ready interpretation of the model, and most of the parameters make rational sense.

Unfortunately, the original algorithm is difficult to implement as published. First, the internal hydrogen bond term was calculated using a single 3D conformation. It was found that this was both arbitrary and unnecessary. This arbitrary 3D calculation has been replaced with a 2D approach to recognize common internal hydrogen bonds. In tests, this 2D method worked comparably to the published 3D algorithm. Next, the training set had a few subtle atom type inconsistencies.

Both of these problems were corrected and refit to the original XLOGP training data. This implementation gives results that are quite similar to the original XLOGP algorithm, so it is called OEXLogP to distinguish it from the original method.
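
The additive form is easy to picture; in the sketch below the atom types and contribution values are invented for illustration and are not XLOGP's fitted parameters:

    # LogP ~ sum over atoms of a per-type contribution, plus correction terms.
    # These contributions are hypothetical, chosen only to show the form.
    TYPE_CONTRIB = {"C.aromatic": 0.30, "C.sp3": 0.13, "O.hydroxyl": -0.46}

    def additive_logp(atom_types, corrections=0.0):
        """LogP as a simple sum of per-atom-type contributions."""
        return sum(TYPE_CONTRIB[t] for t in atom_types) + corrections

    # A toy "phenol-like" fragment: six aromatic carbons plus one hydroxyl.
    print(additive_logp(["C.aromatic"] * 6 + ["O.hydroxyl"]))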

LogS

The work of [Yalkowsky-1980] at Arizona resulted in what is now called the “generalized solvation equation.” It states that the solubility of a compound can be broken into two steps: first, the melting of the pure solid to the pure liquid, and second, the phase transfer from the pure liquid into water. For many small organic molecules, this second step is somewhat related to LogP.

Because of this relation, we chose to explore the use of the XLogP atom types in solubility prediction, expecting that they might provide an approximate though robust and fast method for calculating solubility. We fit the XLogP atom types to a training set of nearly 1,000 public solubilities and derived a linear model. The model is extremely fast and is useful for classifying compounds as insoluble, poorly soluble, slightly soluble, moderately soluble, or very soluble. It is notable for the difficulty it has predicting solubilities for compounds with ionizable groups, and it is not suitable for the PK predictions that come late in a project. However, it is useful for eliminating compounds with severe solubility problems early in the virtual-screening process.

See also

The solubility parameter in the Filter Files chapter.

Polar Surface Area

Topological polar surface area (TPSA) is based on the algorithm developed by Ertl et al. [Ertl-2000]. In Ertl's publication, the use of TPSA both with and without accounting for phosphorus and sulfur surface area is reported. However, evidence shows that in most PK applications one is better off not counting the contributions of phosphorus and sulfur atoms toward the total TPSA. This implementation therefore allows either inclusion or exclusion of phosphorus and sulfur surface area, with the default being to exclude it.
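
The same choice is exposed, for example, in recent versions of the open-source RDKit's implementation of Ertl's scheme (shown for illustration; FILTER has its own switch):

    from rdkit import Chem
    from rdkit.Chem import rdMolDescriptors

    mol = Chem.MolFromSmiles("CS(=O)(=O)c1ccc(N)cc1")  # a methylsulfonyl aniline
    print(rdMolDescriptors.CalcTPSA(mol))                     # S excluded (default)
    print(rdMolDescriptors.CalcTPSA(mol, includeSandP=True))  # S included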

Warning

TPSA values are mildly sensitive to the protonation state of a molecule. If the pKaNorm parameter is false, the TPSA value is calculated using the input structure; if pKaNorm is true, it is calculated using the pKa-normalized molecular structure.

Lipinski and Hydrogen-bonds

The work of Lipinski ([Lipinski-1997]) introduced the application of simple filter-like rules to roughly predict late-stage PK properties, in particular oral bioavailability. Unfortunately, Lipinski's “Rule-of-Five” has entered the common vernacular to such a degree that some of the specific details are often lost in the commotion. There are two critical examples of this. First, Lipinski used violation of two rules to categorize compounds. The subsequent analysis detected a significant difference between the two populations of molecules, but little analysis of the importance of a single violation was done. Nevertheless, many now consider a single violation bad and two violations worse; at this writing, there is no known evidence to support this. Second, Lipinski used well-codified and well-understood yet imprecise definitions of “hydrogen-bond donors” and “hydrogen-bond acceptors” in his classification model. While this makes the algorithm quite understandable and easy to implement, it sometimes causes confusion among those who prefer more refined definitions of hydrogen-bond donors and acceptors.

To address the first problem, users are allowed to set the number of Lipinski failures required to reject a molecule. In keeping with the original publication, the default value is 2. To address the second problem, two kinds of hydrogen-bond donors and acceptors are calculated. The Lipinski calculation uses the published definitions: the donor count is the number of nitrogen or oxygen atoms with at least one hydrogen attached, and the acceptor count is the number of nitrogen and oxygen atoms. For the chemical-property counts of donors and acceptors, a more complex algorithmic approach is taken, identifying the donors and acceptors outlined in the work of Mills and Dean ([Mills-Dean-1996]) and in the book by Jeffrey ([Jeffrey-1997]).
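
Using the published definitions, the violation count is straightforward to reproduce; a sketch with the open-source RDKit, with Crippen LogP standing in for the CLogP of the original paper:

    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors

    def lipinski_violations(smiles):
        """Rule-of-five violations using Lipinski's published N/O definitions."""
        mol = Chem.MolFromSmiles(smiles)
        donors = sum(1 for a in mol.GetAtoms()
                     if a.GetAtomicNum() in (7, 8) and a.GetTotalNumHs() > 0)
        acceptors = sum(1 for a in mol.GetAtoms() if a.GetAtomicNum() in (7, 8))
        return sum([Descriptors.MolWt(mol) > 500, Crippen.MolLogP(mol) > 5.0,
                    donors > 5, acceptors > 10])

    # Default rejection threshold is 2 violations, as in the original paper.
    print(lipinski_violations("CC(=O)Oc1ccccc1C(=O)O") >= 2)  # aspirin -> False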

See also

The Lipinski violations and hydrogen-bond acceptors sections in the Filter Files chapter.

Aggregators

The Shoichet lab ([McGovern-2003], [Seidler-2003]) has demonstrated the importance of small-molecule aggregation in medium and high-throughput assays. Because these small-molecule aggregates can sequester some proteins, they give the appearance of being active inhibitors. There are now several hundred published aggregators in addition to a published QSAR model for predicting aggregation propensity. The user has the ability to eliminate any of the known aggregators as well as the ability to eliminate compounds that are predicted to be aggregators using the QSAR model. In OpenEye’s experience, the published QSAR model for predicting aggregators is quite aggressive. It occasionally identifies compounds that are known to be genuine small-molecule inhibitors of a specific protein. While in most cases this model can be useful, until more definitive work is published, we feel you should gain some experience with the model and judge its performance for yourself.

More recent work in industry indicates that in some cases aggregation properties are specific to the particular experimental conditions being used, so we recommend caution in interpreting these predictions. Nevertheless, aggregation remains an important issue in HTS hit follow-up and, short of experimental validation, this flag may be the best available option.

See also

The aggregators parameter in the Filter Files chapter.

PAINS

The PAINS filters ([Baell-2010]) are a set of functional group filters created to identify and filter promiscuous (non-specific) actives across a number of screening types and targets. The original PAINS paper includes a set of functional group prefilters, created via two iterations of their work, and three sets of structure class filters, called filter sets A, B, and C, which were used to identify and remove promiscuous actives. The OpenEye PAINS filter set is a combination of all four filter sets (the functional group prefilter set plus the three structural class sets) into a single filter set.

There are two distinct use cases for the PAINS filters. The first is the typical prefiltering of structures prior to screening or compound selection to remove undesirable molecules. One can simply eliminate any molecules which fail the PAINS filtering.

The second use case, which may ultimately be more interesting, is to use the PAINS filters retrospectively on a set of data which has been filtered (and tested) using another methodology. In this scenario the PAINS filters can be used to identify potentially problematic reactive functionality in otherwise reasonable, drug-like structures. By using the table output mode and identifying matches for the A, B, and C filter sets, one can identify fairly specific chemical functionalities that have been known to result in non-specific activity. This can be useful information for either prioritizing further lead development or for modifying lead structures to mitigate adverse reactivity.

There are approximately 650 SMARTS rules, converted from the SLN query strings provided in the supplementary information of the original paper. For the functional-group filters, the authors do not provide identifying names for each individual filter, so our PAINS rules are named after the comment used to describe each functional-group class, in the form pains_fg_[classname]. For the structural class filters, the authors provided unique RegID values, so those rules are named after the PAINS group and RegID as pains_[group]_[RegID].

Note that in certain cases it isn’t possible to exactly represent the SLN queries given in the paper with a single SMARTS, so in those cases the rules have been split out into multiple SMARTS patterns and are named pains_[group]_[RegID]1, pains_[group]_[RegID]2, etc.
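
Applying named substructure rules of this kind is straightforward; in this sketch (using the open-source RDKit) the two rules and their names are invented stand-ins, not actual PAINS definitions:

    from rdkit import Chem

    # Illustrative named SMARTS rules; real names follow the pains_fg_[classname]
    # and pains_[group]_[RegID] conventions described above.
    RULES = [("toy_quinone", Chem.MolFromSmarts("O=C1C=CC(=O)C=C1")),
             ("toy_catechol", Chem.MolFromSmarts("c1ccc(O)c(O)c1"))]

    def pains_like_hits(smiles):
        """Return the names of every rule the molecule matches."""
        mol = Chem.MolFromSmiles(smiles)
        return [name for name, patt in RULES if mol.HasSubstructMatch(patt)]

    print(pains_like_hits("Oc1ccccc1O"))  # ['toy_catechol']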

Hint

Unlike the other filter types, the PAINS filter consists only of functional group filters; no physical property filters are included. It may be desirable to combine the PAINS filter with a separate user-defined set of physical property filters. The Filter Files section describes how to create additional filter rules.