Background

Structure Indexing

The cartridge provides domain indexes for columns that contain structures encoded as Daylight SMILES or MDL MOL (or SDF) blocks, reactions encoded as Daylight SMIRKS or MDL RXN blocks, and multi-conformer molecules encoded as OEBinary byte arrays. Substructure, similarity and exact-match chemical searches are performed by an external Java RMI service, which the Oracle JVM calls out to. The RMI service uses the Java wrappers of the OEChem toolkit for SMILES parsing and chemical structure handling.

A domain index can be created on any table column containing structural information: a varchar2 column containing SMILES or SMIRKS, a clob column containing an MDL MOL or RXN block, or a blob column containing an OEBinary byte array. Each index is uniquely identified by a name of the form index_name = owner.table.column. When an index is built, the RMI server connects to Oracle and extracts structures by Oracle ROWID. The external index is a hash keyed by ROWID whose values hold the fingerprint and the SMILES of each structure. The index is then cached locally as a serialized Java object. Although structure indexes can be created for a variety of structure encodings, index creation and manipulation are much faster for indexes built on SMILES columns.
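The layout of the external index described above can be illustrated with a minimal Python sketch. It is not the cartridge's Java implementation: the dictionary layout, the toy_fingerprint placeholder, the sample ROWID strings and the use of pickle as a stand-in for Java serialization are all assumptions made for illustration.

```python
import pickle


def index_name(owner, table, column):
    """Build a unique index name of the form owner.table.column."""
    return f"{owner}.{table}.{column}"


def toy_fingerprint(smiles):
    """Placeholder fingerprint: one bit per distinct character.

    This is NOT a MACCS fingerprint; it only stands in for one.
    """
    bits = 0
    for ch in set(smiles):
        bits |= 1 << (ord(ch) % 64)
    return bits


def build_index(rows):
    """Build a ROWID-keyed hash from (rowid, smiles) pairs.

    Each value holds the fingerprint and the SMILES of the structure.
    """
    return {
        rowid: {"fingerprint": toy_fingerprint(smiles), "smiles": smiles}
        for rowid, smiles in rows
    }


# Made-up ROWID strings and structures, for illustration only.
rows = [("AAAR3sAAEAAAACXAAA", "CCO"), ("AAAR3sAAEAAAACXAAB", "c1ccccc1")]
index = build_index(rows)

# Cache the index as a serialized object, then restore it.
blob = pickle.dumps(index)
restored = pickle.loads(blob)
```

Keying by ROWID means that a search result can be mapped straight back to a table row without a further lookup.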

The index tracks edits to the base table using a change log (or journal table) that contains ROWIDs and new values. Before performing a search, the domain index loads new entries from the change log into the index. An index rebuild saves these changes and empties the log table.

Substructure search is performed by fingerprinting a query SMARTS pattern (or MDL MOL or RXN query block) and screening it against the fingerprints in the index. Structures that pass the fingerprint screen are then searched exhaustively for a match. Similarity search is performed by fingerprinting the query molecule and computing its Tanimoto coefficient against each target structure; database molecules that exceed a minimum similarity are returned. Exact-match search first filters structures by fingerprint similarity and then compares the canonical SMILES of the query and target structures.
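The change-log handling described above can be sketched as follows. This is only an illustration under stated assumptions: the JournaledIndex class, its method names, the use of a Python list as the journal table and the convention that a new value of None marks a deleted row are all hypothetical.

```python
class JournaledIndex:
    """Toy index that replays a change log before each search."""

    def __init__(self, entries):
        self.entries = dict(entries)  # rowid -> SMILES
        self.applied = 0              # log entries already loaded

    def sync(self, change_log):
        """Load log entries that have not been applied yet."""
        for rowid, new_value in change_log[self.applied:]:
            if new_value is None:     # assumption: None marks a deleted row
                self.entries.pop(rowid, None)
            else:
                self.entries[rowid] = new_value
        self.applied = len(change_log)

    def rebuild(self, change_log):
        """Fold in outstanding changes, then empty the log."""
        self.sync(change_log)
        change_log.clear()
        self.applied = 0
        return dict(self.entries)     # snapshot that would be saved


index = JournaledIndex({"r1": "CCO"})
log = [("r2", "CCN"), ("r1", "CCC")]  # (rowid, new value) pairs

index.sync(log)                       # run before a search
after_first_sync = dict(index.entries)

log.append(("r2", None))              # a later edit removes row r2
index.sync(log)                       # only the new entry is replayed
after_second_sync = dict(index.entries)

saved = index.rebuild(log)            # persists the changes, empties the log
```

Tracking how many entries have been applied lets repeated searches replay only the new part of the log, while a rebuild resets that counter together with the log.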

Substructure search requires the creation of OEMol structures. These can be cached on the RMI server in an LRU (least recently used) cache. They are not currently stored in Oracle and must be regenerated each time the RMI server starts. If the molecule cache is enabled, it can either retain the OEMols generated during searches or be preloaded with all molecules in an index. Performance improves if the server has sufficient memory to cache all the structures in the database.
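The behaviour of an LRU molecule cache can be sketched in a few lines of Python. The LRUCache class and the build_molecule factory are hypothetical stand-ins: the real cache lives in the Java RMI server and holds OEMol objects, whereas this sketch stores simple tuples.

```python
from collections import OrderedDict


class LRUCache:
    """Least recently used cache with a fixed capacity."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, factory):
        """Return the cached value, building it with factory on a miss."""
        if key in self._items:
            self._items.move_to_end(key)      # mark as most recently used
            return self._items[key]
        value = factory(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict least recently used
        return value

    def __contains__(self, key):
        return key in self._items


built = []


def build_molecule(smiles):
    """Hypothetical stand-in for the costly creation of a molecule object."""
    built.append(smiles)
    return ("mol", smiles)


cache = LRUCache(capacity=2)
cache.get("CCO", build_molecule)   # miss: built and cached
cache.get("CCN", build_molecule)   # miss: built and cached
cache.get("CCO", build_molecule)   # hit: CCO becomes most recently used
cache.get("CCC", build_molecule)   # miss: cache is full, so CCN is evicted
```

With a capacity at least as large as the number of structures, nothing is ever evicted, which corresponds to the case where the server can hold every molecule in memory.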

Fingerprints

The fingerprints used in the data cartridge are an implementation of the MACCS 166 keys. These are used for both screenout during substructure searches and in the calculation of molecular similarity for similarity searches.

When performing a similarity search, the query molecule is fingerprinted using the full set of MACCS 166 keys and the similarity of each database molecule to the query molecule is calculated by comparing their fingerprints using the Tanimoto equation.
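A similarity search of this kind can be sketched as below. The sketch assumes that each fingerprint is packed into a Python integer with one bit per key, that the index is a plain dictionary of ROWID to fingerprint, and that molecules exactly at the threshold are returned; none of these details are taken from the cartridge itself.

```python
def popcount(bits):
    """Count the bits set in a non-negative integer."""
    return bin(bits).count("1")


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    union = popcount(fp_a | fp_b)
    if union == 0:
        return 0.0                 # convention for two empty fingerprints
    return popcount(fp_a & fp_b) / union


def similarity_search(query_fp, index, min_similarity):
    """Return (rowid, score) pairs at or above the threshold, best first."""
    hits = []
    for rowid, target_fp in index.items():
        score = tanimoto(query_fp, target_fp)
        if score >= min_similarity:  # assumption: the boundary is inclusive
            hits.append((rowid, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy 6-bit fingerprints; real MACCS fingerprints have 166 keys.
index = {"r1": 0b001111, "r2": 0b000111, "r3": 0b110000}
hits = similarity_search(0b001111, index, min_similarity=0.7)
```

Here r1 is identical to the query (score 1.0), r2 shares three of four bits (0.75) and r3 shares none, so only r1 and r2 are returned.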

When performing a substructure search, the query substructure is fingerprinted using a subset of the MACCS 166 keys. Each database molecule fingerprint is compared to the query substructure fingerprint to determine whether the query substructure could be contained within the database molecule. If so, a more time-consuming atom-by-atom comparison is carried out to confirm or reject the match. Only a subset of the keys is used to fingerprint substructure queries because many keys in the MACCS 166 keyset represent features that cannot be assigned unambiguously in a substructure, as opposed to a complete molecule (e.g. feature no. 34, A=CH2).
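The screen-then-verify pattern described above can be illustrated with a toy example. Everything in it is a stand-in: the "structures" are plain strings of capital letters rather than molecules, toy_fingerprint sets one bit per distinct letter rather than computing MACCS keys, and the expensive atom-by-atom comparison is replaced by a substring test supplied by the caller.

```python
def toy_fingerprint(text):
    """One bit per distinct capital letter (stand-in for structural keys)."""
    bits = 0
    for ch in text:
        bits |= 1 << (ord(ch) - ord("A"))
    return bits


def passes_screen(query_fp, target_fp):
    """A target can only match if it has every bit set in the query."""
    return (query_fp & target_fp) == query_fp


def substructure_search(query, targets, matcher):
    """Screen by fingerprint, then verify survivors with matcher.

    Returns the matching rowids and the number of expensive checks made.
    """
    query_fp = toy_fingerprint(query)
    hits = []
    verified = 0
    for rowid, (target_fp, structure) in targets.items():
        if not passes_screen(query_fp, target_fp):
            continue               # cheap rejection, no detailed comparison
        verified += 1
        if matcher(query, structure):
            hits.append(rowid)
    return hits, verified


structures = {"r1": "XABY", "r2": "BXA", "r3": "XYZ"}
targets = {rowid: (toy_fingerprint(s), s) for rowid, s in structures.items()}

# Substring containment stands in for the atom-by-atom comparison.
hits, verified = substructure_search("AB", targets, lambda q, s: q in s)
```

Row r3 is rejected by the screen alone, r2 passes the screen but fails verification, and r1 is a confirmed hit, so the costly comparison runs twice instead of three times. The screen may pass false positives but, because every query bit must be present in the target, it never discards a true match.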

A separate index type is available for performing similarity searches using OpenEye Graphsim fingerprints. Each index of this type consists of a memory-resident fingerprint database. Tanimoto and Tversky similarities are computed from fingerprints A and B using the following formulas, where onlyA and onlyB are the numbers of bits set in only one of the two fingerprints and bothAB is the number of bits set in both:

Sim_{Tanimoto}(A,B) = \frac{bothAB}{onlyA + onlyB + bothAB}

and

Sim_{Tversky}(A,B) = \frac{bothAB}{\alpha \cdot onlyA + \beta \cdot onlyB + bothAB}
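As a worked illustration of these two formulas, the sketch below computes the three bit counts for a pair of toy fingerprints and evaluates the Tversky similarity for several weightings. It assumes fingerprints packed into Python integers; the function names are illustrative and are not part of the Graphsim API.

```python
def popcount(bits):
    """Count the bits set in a non-negative integer."""
    return bin(bits).count("1")


def bit_counts(fp_a, fp_b):
    """Return (onlyA, onlyB, bothAB) for two integer fingerprints."""
    only_a = popcount(fp_a & ~fp_b)
    only_b = popcount(fp_b & ~fp_a)
    both = popcount(fp_a & fp_b)
    return only_a, only_b, both


def tversky(fp_a, fp_b, alpha, beta):
    """Tversky similarity with weights alpha (onlyA) and beta (onlyB)."""
    only_a, only_b, both = bit_counts(fp_a, fp_b)
    denominator = alpha * only_a + beta * only_b + both
    if denominator == 0:
        return 0.0                 # convention when nothing contributes
    return both / denominator


def tanimoto(fp_a, fp_b):
    """Tanimoto is the Tversky similarity with alpha = beta = 1."""
    return tversky(fp_a, fp_b, alpha=1, beta=1)


fp_a = 0b11110   # bits 1 to 4 set
fp_b = 0b00111   # bits 0 to 2 set
counts = bit_counts(fp_a, fp_b)                       # onlyA=2, onlyB=1, bothAB=2
symmetric = tanimoto(fp_a, fp_b)                      # 2 / (2 + 1 + 2)
weights_a_only = tversky(fp_a, fp_b, alpha=1, beta=0)  # 2 / (2 + 2)
weights_b_only = tversky(fp_a, fp_b, alpha=0, beta=1)  # 2 / (1 + 2)
```

Unequal weights make the measure asymmetric: with alpha = 1 and beta = 0 only the bits unique to A are penalised, so the score reflects how much of A is covered by B.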

Shape

An additional index type is available for performing similarity searches on FastROCS shape databases derived from structure and conformer columns in database tables. When an index is created, the RMI server creates an external file of persistent multi-conformer OEMols. If the RMI server is running on a machine with GPU hardware and is configured to run FastROCS, the external file is loaded into a shape database.

For 2D columns (varchar2 columns containing SMILES or clob columns containing SDF entries), multiple OMEGA jobs are used to create the persistent multi-conformer OEMols. Because generating multi-conformer molecules with OMEGA is time-consuming, on most machines a FastROCS index created from 2D compounds will likely be limited to a smaller number of compounds. For example, on a quad-core workstation with 3 GHz Intel Xeon processors, it took 2 h 18 min to build a FastROCS index of 9M conformers on a table of 100,000 SMILES and 11 h 9 min to build a FastROCS index of 44M conformers on a table of 500,000 SMILES structures.

Multi-conformer molecules may also be stored in Oracle as an OEBinary byte array in a blob column. For FastROCS indexes built on such 3D columns, the external index uses the multi-conformer compounds stored in the database table directly, and thus the index can support any size of table.

On update, new values in 2D columns are added to the FastROCS server by creating a multi-conformer OEMol with the OMEGA toolkit, whereas values in 3D columns can be added to the FastROCS server unmodified. To save space and improve performance, it is recommended that OEMols be stored in rotor-compressed format.
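The two build timings quoted above for 2D columns can be turned into a rough throughput estimate. The sketch below uses only the table sizes, the build times and the 9M conformer count from the text; the estimate_build_hours helper and the 2,000,000-row table are hypothetical, and the extrapolation assumes that build time scales linearly with the number of molecules on comparable hardware.

```python
def build_rate(molecules, hours, minutes):
    """Molecules processed per second for one index build."""
    seconds = hours * 3600 + minutes * 60
    return molecules / seconds


# Timings quoted in the text for the quad-core workstation.
small_rate = build_rate(100_000, hours=2, minutes=18)   # about 12 molecules/s
large_rate = build_rate(500_000, hours=11, minutes=9)   # about 12 molecules/s

# Average conformers per molecule in the 100,000-row example.
conformers_per_molecule = 9_000_000 / 100_000


def estimate_build_hours(molecules, rate_per_second):
    """Hypothetical helper: linear extrapolation of the build time."""
    return molecules / rate_per_second / 3600


# Illustrative extrapolation to a hypothetical 2,000,000-row table.
average_rate = (small_rate + large_rate) / 2
estimated_hours = estimate_build_hours(2_000_000, average_rate)
```

Both runs come out at roughly 12 molecules per second, which suggests near-linear scaling over this range and puts a table of a few million SMILES at a build time of about two days on similar hardware.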
