What do we mean by shape? The word is often used without consideration
of precise meaning but in this document we shall be very clear as to
the definition of shape. Two entities will have the same shape if
their volumes exactly correspond. The more the volumes differ, the
more the shapes will differ. We will give a precise mathematical
exposition below, but it is worth noting even at this most basic level
shape is defined as a relative quantity, depending on references to
other shapes. In this we differ from approaches that attempt to
provide absolute, canonical, *shapes* by which to categorize
molecules.

What do we mean by *volume*? A volume is any scalar field. This means a
function that has a single number, or *scalar*, value at each point in
space. The *special case* for the common understanding of volume is a
specific scalar field that has a value of one inside an object and zero
outside. The volume of a scalar field is:

\[V \mbox{(volume)} = \int f(x,y,z) dv\]

The volume function, **f**, is also referred to as the *characteristic*
function. When the characteristic function corresponds to the common
definition of a volume field this integral corresponds to what is commonly
expected by volume. However, we are not restricted to such simple functions
and can still calculate a **V**. In general the volume of a scalar field is a
*contraction* of the information represented by that characteristic
function. It is more precisely referred to as the zeroth-order contraction,
or *moment*. We will discuss other moments and their uses later, but one
immediate observation is that two objects can not have the same shape if their
volumes are not the same. The converse is obviously not true. Rather, two
objects can have the same volume and not have the same shape. Volume is
typical, therefore, of most contractions of information.

We can now write down a precise definition of shape similarity. Consider the integral:

\[S_1 = \int |f(x,y,z) - g(x,y,z)| dV\]

where **f** and **g** are different characteristic functions. If this
integral is zero then **f** and **g** are actually the same function and
therefore correspond to the same shape. The larger the integral, the
more different the shapes defined by **f** and **g**. It defines a metric
quantity between the two fields **f** and **g**. The word *metric* is
used loosely to mean *shape*, but here we mean the precise mathematical
definition: *i.e.* a distance that is 1) always positive, 2) zero if
and only if two entities are identical and 3) that obeys the triangle
inequality. The triangle inequality states that if entity A is
distance x from entity B and B is distance y from entity C then the
distance between A and C is bounded by ***|x-y|*** and ***|x+y|***. The type
of comparison shown in S_{1} is referred to as an L_{1}
metric. Another metric is the S_{2} metric:

\[S_2=\sqrt{\int [f(x,y,z)-g(x,y,z)]^2 dV}\]

Multiplying the terms in the integral out gives:

\[S_2^2 = \int f(x,y,z)^2dV + \int g(x,y,z)^2dV - 2\int f(x,y,z)g(x,y,z)dV\]

This is the fundamental equation for shape comparison. We rewrite it as:

\[S_{f,g} = I_f + I_g - 2O_{f,g}\]

The *I* terms are the self-volume overlaps of each entity (for our purposes -
molecule), while the *O* term is the overlap between the two functions. They
constitute the three terms we need to compare the shapes of two
fields. The *I* terms are independent of orientation but not *O*. Finding the
orientation that maximizes *O*, and hence minimizes *S_{f,g}*, is equivalent to
finding the best overlay between the two objects (a quantity that has its own,
distinct metric properties). We also note here that the quantity referred to
as a Tanimoto coefficient may be derived by recombining *I*‘s and *O* so:

\[Tanimoto_{f,g} = \frac{O_{f,g}}{I_f+I_g-O_{f,g}} \label{Tanimoto}\]

Tanimoto coefficients will be familiar to those who use them for bitvector fingerprint comparison. An alternative measure is the Tversky coefficient, also mostly used for similarity between bitvector fingerprints. Similarly to the Tanimoto coefficient above, we can define a shape Tversky measure. The base equation for the Tversky coefficient is:

\[Tversky_{f,g} = \frac{O_{f,g}}{O_{f,g}+\alpha I_f+\beta I_g} \label{tversky}\]

Normally, *alpha + beta = 1*, and for our current use, *alpha* is chosen to
be 0.95. Since this introduces an asymmetry, the Tversky calculation depends
on which molecule’s self-overlap has the *alpha* pre-factor.
ROCS calculates two Tversky values, one with the query
molecule with *alpha* as the pre-factor and a second with the database
molecule with *alpha* as the pre-factor. Also, note that since shape is
a field property, instead of a simple scalar like a bitvector, shape Tversky
can be larger than 1.0 since the overlap *O_{f,g}* can be larger than
a molecule’s self-overlap, *I_f*.

The OpenEye Shape Toolkit is a set of calculational objects designed to facilitate the calculation of these field-metric quantities. ROCS is an application built on top of the Shape toolkit.

Molecules are traditionally viewed as a set of fused spheres, sometimes
referred to as the CPK model. The common view of molecular volume is then of a
characteristic function that is one (1) inside at least one sphere and zero
(0) outside. How do we calculate the volume of such a seemingly simple
function? The volume of a single sphere is *(4pi r^3)/3* but the complication
for two fused spheres is that we have to account for the shared
volume and not count it twice. For more than two atoms, there are triple
intersections that must be added back in if we have removed the three
pairs of intersections. The general formula for N spheres that explicitly
calculates the volume of every level of overlap and its correct contribution
is:

\[V = 1 - \int \prod_i^N (1 - f_i)dv\]

This is easy to write, not so easy to solve because the analytic formulae for
overlaps of increasing order are highly non-trivial (although they have been
derived to arbitrary order). It is fair to say that this has hindered the
development of shape comparison in many ways. Attempts to use analytic
formulae led to very slow programs and approximate methods, for instance using
grids of points that are turned *in* or *out* by each sphere, do not give
smooth gradients required for minimization. Brian Masek (AstraZeneca) was the
first to attempt to optimize overlaps of molecules using the analytic
approach. His program would take several minutes per minimization. In addition
it would often suffer from a common problem when using functions that vary
sharply (such as solid spheres): it would often get stuck in local
minima. Nevertheless, Brian did have encouraging success using this method to
find similarities not obvious from chemical structure.

The conceptual breakthrough in shape comparison came in 1995 in a paper by
Andrew Grant (AstraZeneca) and Barry Pickup (University of Sheffield)
([Grant-1995], [Grant-1996], [Grant-1997]). They showed that if one let
go of the concept of the characteristic function being
binary, and instead use a sum of continuous functions, *i.e.*
a Gaussian, that the solid-sphere volume, could be recovered to high
accuracy (typically ~0.1%). A sphere has one defining parameter,
its radius, whereas a Gaussian has two defining parameters, its prefactor,
*p*, and its width, *w*:

\[pe^{-wx^2}\]

Grant and Pickup found that by fixing *p* to 2.7 and setting *w* for
each atom such that the volume integral for each atom agreed with its
solid-sphere volume, they achieved remarkable precision. In addition,
the overlap terms between any two atoms, and hence any higher-order
overlaps, are all Gaussian functions themselves because of the
Gaussian Contraction formula (shown here for one spatial variable):

\[\int e^{a(x-x_i)^2}e^{b(x-x_i)^2} = \int e^{(a+b)(x-x_i)^2}\]

*i.e.* two atomic-Gaussians overlap to produce another Gaussian.
Likewise, a three atomic-Gaussian overlap is that of an
overlap-Gaussian with an atomic-Gaussian, hence another Gaussian. The
simplicity of these formulae and the formula for the volume of each
individual Gaussian leads to very efficient algorithms for the
calculation of the volume of a molecule so represented (the OpenEye
method calculates several thousand volumes per second while
calculating intersections up to sixth order).

In addition to simple calculation of molecular volume, which is the
zeroth-order moment of the characteristic function, the ease of
evaluation of intersections allows for accurate calculation of
high-order moments: called the *steric multipoles*. For
instance, if the product formulae for atomic and intersection
Gaussians yields *n* Gaussians, the first order moments are:

\[M_{1,x} = \sum_{i=1}^n \int xe^{a_i|(x-x_i)^2+(y-y_i)^2+(z-z_i)^2|}\]\[M_{1,y} = \sum_{i=1}^n \int ye^{a_i|(x-x_i)^2+(y-y_i)^2+(z-z_i)^2|}\]\[M_{1,z} = \sum_{i=1}^n \int ze^{a_i|(x-x_i)^2+(y-y_i)^2+(z-z_i)^2|}\]

These integrals are easy to solve and their sum can be set to zero by an
appropriate choice of origin: the center of *mass* for the sum of Gaussians.
Second-order moments are found from integrals of the type:

\[M_{2,PQ} = \sum_{i=1}^n \int PQe^{a_i|(x-x_i)^2+(y-y_i)^2+(z-z_i)^2|}\]

where P and Q are chosen from (x,y,z), *e.g.* x^{2}, xy etc.

These moments can be thought of as a symmetric 3*3 matrix which we
refer to as the *mass matrix*. Rotating or translating the molecule
will change the moments and the transform that sets the first-order
moments to zero and diagonalizes the mass-matrix puts the molecule
into its *inertial frame*. By convention we assign the x-axis to the
largest eigenvector of the mass matrix, y-axis to the median
eigenvector, z-axis to the smallest. Note that this orientation is
still not uniquely defined: 180 degree rotations around any axis also
diagonalize the mass-matrix. The eight (2^{3}) possible transforms
that can be generated by combinations of such rotations actually lead
to four unique inertial orientations.

If a molecule is aligned to its inertial frame, all higher-order
steric multipoles become invariant, ignoring certain sign-changes from
the four-fold degeneracy of the inertial frame. As such they, as well
as the second-order moments, are shape *descriptors*. They are still
contractions of the information contained in the characteristic field,
*i.e.* two molecules can have the same steric moments and yet have
different shapes. (Moments are *complete* in that if we calculate
them to infinite order they do exactly define volume but this is
seldom a practical approach!) Nevertheless, they do contain useful
information and can be used as a rapid, approximate, filter for shape
similarity.

The same advantages that allow for the calculation of molecular volume carry over to the calculation of molecular volume overlap. The overlap of volumes are Gaussian contractions, easily tabulated and efficiently retrieved. Andy re-wrote Brian’s program and obtained an order-of-magnitude improvement in performance as well as another remarkable observation: if the starting orientation of each molecule is that given from the inertial frame then very few “false” minima are produced. The smoothness of the Gaussian characteristic function is enough to overcome the problems with convergence in Brian’s program. The four possible “inertial” starting points were enough to find the best, global, overlay between two molecules. This observation and the Gaussian approach are the basis of the OpenEye Shape Toolkit and ROCS program for rapid shape overlay.

But note, despite the algorithmic advantages, a correlation with common perception has been lost. Because the pre-factor of each atomic Gaussian is not unity, the characteristic function does not correspond to the inside/outside description with which we are most comfortable. In the Gaussian model all points in space are to some degree inside and to some degree outside. That is, the Gaussian model typically shows about 0.1% error with respect to the solid sphere model due to the fact that is includes a portion of all points in space inside the volume.

In addition to shape-alignments ROCS, optionally, considers chemistry alignment, known as ‘color’. User specified definitions of chemistry can be included in the superposition and similarity analysis process to facilitate the identification of those compounds that are similar both in shape and chemistry.

Color atoms are described as Gaussians and displayed in vROCS as colored spheres.
The Gaussian for a color atom is relatively hard with a steep gradient.
*Figure: Hard vs. Soft Gaussians* illustrates hard vs
fuzzy Gaussians. Both Gaussians in the figure represent the same volume as
the sphere. However, the hard Gaussian, with the steep gradient, reaches
a probability of zero (0) within the radius of the sphere. The color features
are either matched, if they fall within the sphere radius, or not matched.
In the case of the fuzzy Gaussian there are areas outside the volume
of the sphere (the area under the curve indicated by the two arrows) where
the Gaussian probability is greater than zero. This would allow color features to
match even when they align well outside the sphere representing the color atom.
That situation would lead to less precise alignments and, for that reason,
the ‘hard’ Gaussian is employed.

A sphere described by two different Gaussian functions. The ‘hard’
Gaussian (dashed) is the one employed by ROCS to approximate a color atom sphere.

ROCS comes pre-loaded with two color force fields, Implicit Mills Dean (default) and
Explicit Mills Dean. These are described in color force field files (*.cff) located
in the ROCS data directory. A sample color force field file, `sample.cff`,
is also provided as a template for a user’s own custom force field. The desired force
field file is supplied to ROCS either at the command line using the `-chemff` command
or in the vROCS Preferences menu. Further information on editing color force field
files is given in the section *Color Force Field*.

The color force field is used to measure chemical similarity between the query and the database molecule and to refine shape-based overlays. The color force field file describes:

- Color atom types
- Which functional groups the color atoms should be applied to. ROCS uses only the heavy atoms of molecules, hydrogens are ignored.
- Whether the interaction between color atoms is attractive or repulsive. Interactions between color atoms of the same type are always attractive. The weight term describes the strength of the interaction, relative to the shape gradients and the range term affects the range of the interaction.

The color features described in the Implicit and Explicit Mills Dean color force field files include:

Donor: | Functional groups that can act as H-bond donors e.g. acid-OH |
---|---|

Acceptor: | Functional groups that can act as H-bond acceptors e.g. carboxylate |

Anion: | Functional groups with either localized or delocalized negative charge e.g. tetrazole |

Cation: | Functional groups with either localized or delocalized positive charge e.g. guanidinium |

Hydrophobe: | Terminal or non-terminal aromatic or aliphatic groups e.g. phenyl |

Rings: | Rings of defined size e.g. 4-7 atoms |

A custom force field file can include other features that you define e.g. positive, negative, carbonyl_linker, metal_binder. For each color atom type a set of SMARTS is used to define the specific functional groups to which the color atom will be applied. The Implicit and Explicit Mills Dean force fields differ in these functional group definitions. For example, the Explicit Mills Dean force field allows a primary amine to be an acceptor as well as a donor whereas it is a donor only in the Implicit Mills Dean force field.

The color force field can also be used for post-shape scoring either alone, e.g. ColorTanimoto and Color Tversky, or in combination with shape scores, e.g. TanimotoCombo and TverskyCombo. Some additional scores are available with color:

- ScaledColor
- ComboScore
- ColorScore

These scores are defined in the section *Report File*.