MaaS Database Preparation

Prerequisite

The steps below assume that the MaaS Client (maascli) has been installed as described in Requirements.

Input files

MaaS prep requires one or two input files, depending on whether you want to include conformers in the database.

Note

Conformers are not currently supported at the scale of 1B compounds.

The primary input file should be a SMILES (.ism) file containing the molecules of interest. There are a few requirements:

  1. Titles must be included. MaaS uses titles for many cross-referencing features.

  2. Titles must be unique. If more than one entry has the same title, only the first is kept; all subsequent molecules with that title are omitted from the database.

  3. Molecules should be represented as registered/expected by chemists. If a molecule is registered without stereochemistry, DO NOT enumerate its stereoisomers in this file. Stereo enumeration should be done on the input to Omega when generating conformers. See below.
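The first two requirements can be checked before running prep. The following is a minimal sketch, not part of maascli: the function name check_titles is hypothetical, and it assumes each record is laid out as a SMILES string, whitespace, then the title (the rest of the line).

```python
from collections import Counter


def check_titles(lines):
    """Check SMILES-file records for missing and duplicated titles.

    Returns (missing, duplicates):
      missing    - 1-based line numbers of records that have no title
      duplicates - titles used by more than one record, in first-seen order
    """
    missing = []
    counts = Counter()
    for lineno, line in enumerate(lines, start=1):
        record = line.strip()
        if not record:
            continue  # skip blank lines
        # Assumed layout: SMILES, whitespace, title (rest of the line).
        parts = record.split(None, 1)
        if len(parts) < 2:
            missing.append(lineno)
        else:
            counts[parts[1]] += 1
    duplicates = [title for title, n in counts.items() if n > 1]
    return missing, duplicates


# Example with in-memory records; a real file can be passed as open(path).
sample = ["CCO ethanol", "c1ccccc1", "CCO ethanol", "CC ethane"]
missing, duplicates = check_titles(sample)
print("records without a title:", missing)
print("duplicated titles:", duplicates)
```

Any title reported as duplicated marks molecules that would be dropped from the database, so it is worth resolving these before prep.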

Conformers

If conformers are to be included, they must carry the same titles as the corresponding molecules in the SMILES file. Omega requires undefined stereocenters to be enumerated before conformers can be created. This is done using the Flipper app (documentation).

As Flipper enumerates stereoisomers, each one retains the title of the input molecule it was generated from. This means that all the stereoisomers in the conformer file can be associated with the matching racemic molecule in the SMILES file.
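The title-based association can be illustrated with a short sketch. This is not how MaaS implements the matching internally; match_conformers is a hypothetical helper that takes plain lists of titles, on the assumption that the titles have already been extracted from the SMILES and conformer files by other means.

```python
from collections import Counter


def match_conformers(smiles_titles, conformer_titles):
    """Associate conformer-file entries with SMILES-file titles.

    Returns (per_title, orphans):
      per_title - Counter mapping each SMILES title to the number of
                  conformer-file entries (stereoisomers) sharing that title
      orphans   - sorted conformer titles with no match in the SMILES file
    """
    known = set(smiles_titles)
    conformer_titles = list(conformer_titles)  # allow generators as input
    per_title = Counter(t for t in conformer_titles if t in known)
    orphans = sorted(set(conformer_titles) - known)
    return per_title, orphans


# Hypothetical titles: mol-1 was enumerated into two stereoisomers.
per_title, orphans = match_conformers(
    ["mol-1", "mol-2"],
    ["mol-1", "mol-1", "mol-3"],
)
print("stereoisomers per title:", dict(per_title))
print("conformer titles missing from the SMILES file:", orphans)
```

Because a Counter returns 0 for absent keys, per_title also shows which SMILES entries have no conformers at all.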

[Figure: maas_prep.svg]

Basic workflow. Files (1) and (3) are then the input to MaaS Prep.

Running prep

maascli database prep has a number of options, but in most cases the defaults are designed to give the best results. (For very large databases such as Enamine REAL, see the notes below.)

$ maascli database prep --help
Usage: maascli database prep [OPTIONS] ID TITLE INPUT MAASDB

Prepare a new database for maas

ID     : Unique ID, string with no spaces or special chars
TITLE  : Title for this database, visible in the UI
INPUT  : Input SMILES filename
MAASDB : Output .maasdb filename

Options:
--version VERSION               Version string to label database
--confs CONFS                   OEZ or OEB file with conformers for same
                                molecules as INPUT
--sss [all|mdl|smarts]          Which fast substructure files to create.
--fps [all|circular|circularvs|tree|treevs|path|pathvs]
                                Which FP types to create.
--numbits [512|1024|2048|4096]  Number of bits for fingerprint creation
--tautomerType [reasonable|none]
                                Tautomer normalization method.
--progress [bar|stderr|log|none]
                                Progress output style: [bar, stderr, log,
                                none]  [default: bar]
--progressDelta FLOAT           Delta time for progress output. (for styles
                                stderr and log)  [default: 60.0]
--force                         Overwrite existing files.

There are four required parameters: ID, TITLE, INPUT and MAASDB. ID must be unique, because it is the identifier used to access this database via the API. TITLE is shown in the UI, so it can be as detailed as needed; put it in quotes if it contains spaces or special characters. If --version is not provided, VERSION is set to a date stamp for the current day (e.g. 2020-08-17).
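When prep is driven from a script, passing the arguments as a list sidesteps the quoting issue for TITLE. The sketch below is an assumption about how one might wrap the command, not part of maascli: build_prep_command is a hypothetical helper, and it uses only the positional arguments and the --version, --confs and --force options listed in the help output above.

```python
import shlex


def build_prep_command(db_id, title, input_file, maasdb,
                       confs=None, version=None, force=False):
    """Build the argument list for `maascli database prep`."""
    argv = ["maascli", "database", "prep", db_id, title, input_file, maasdb]
    if version is not None:
        # Omit --version to let prep apply its default date stamp.
        argv += ["--version", version]
    if confs is not None:
        argv += ["--confs", confs]
    if force:
        argv.append("--force")
    return argv


argv = build_prep_command(
    "emolecules", "eMolecules 2020", "emolecules.ism", "emolecules.maasdb",
    confs="emolecules_confs.oez",
)
# shlex.join (Python 3.8+) shows the shell-quoted equivalent.
print(shlex.join(argv))
```

The list can be passed directly to subprocess.run(argv, check=True), in which case no shell quoting is needed.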

In this example, we will also add the --confs flag to include the .oez file of conformers.

$ ls
emolecules.ism       emolecules_confs.oez

$ maascli database prep emolecules "eMolecules" emolecules.ism emolecules.maasdb --confs emolecules_confs.oez
maas prep
  ID: emolecules
  Title: eMolecules
  Input: /Users/bob/dev/git/maas-server/foo/emolecules.ism
  Output: /Users/bob/dev/git/maas-server/foo/emolecules.maasdb
  Protomer canonicalization: reasonable
   Confs: /Users/bob/dev/git/maas-server/foo/emolecules_confs.oez
  Including FP types: (4096 bits)
    Path
    Circular
    Tree
    PathVS
    TreeVS
    CircularVS
  Including Substructure search types:
    MDL SubSearch Screen
    SMARTS SubSearch Screen
Creating title map       [============================================================] 100.00%  00:00:00
Creating                 [============================================================] 100.00%  00:00:01
Writing circular         [============================================================] 100.00%  00:00:00
Writing circularvs       [============================================================] 100.00%  00:00:00
Writing path             [============================================================] 100.00%  00:00:00
Writing pathvs           [============================================================] 100.00%  00:00:00
Writing tree             [============================================================] 100.00%  00:00:00
Writing treevs           [============================================================] 100.00%  00:00:00
Writing SMARTS           [============================================================] 100.00%  00:00:00
Writing MDL              [============================================================] 100.00%  00:00:00

$ ls
emolecules.ism                       emolecules.maasdb-pathvs.fpbin       emolecules.maasdb-tree.fpbin
emolecules.maasdb                    emolecules.maasdb-sss-MDL.oeb        emolecules.maasdb-treevs.fpbin
emolecules.maasdb-circular.fpbin     emolecules.maasdb-sss-MDL.oeb.idx    emolecules.maasdb.json
emolecules.maasdb-circularvs.fpbin   emolecules.maasdb-sss-SMARTS.oeb     emolecules_confs.oez
emolecules.maasdb-path.fpbin         emolecules.maasdb-sss-SMARTS.oeb.idx maas_prep.log
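A quick way to confirm that a default prep run completed is to check for the files seen in the listing above. The sketch below infers the naming pattern purely from that example listing (all fingerprint types and both substructure types enabled); expected_outputs and missing_outputs are hypothetical helpers, and the pattern may differ for other option combinations or versions.

```python
import os

# Suffixes taken from the example listing above (default --fps and --sss).
FP_TYPES = ("circular", "circularvs", "path", "pathvs", "tree", "treevs")
SSS_TYPES = ("MDL", "SMARTS")


def expected_outputs(maasdb):
    """List the files a default prep run wrote in the example above."""
    names = [maasdb, maasdb + ".json"]
    names += [f"{maasdb}-{fp}.fpbin" for fp in FP_TYPES]
    for sss in SSS_TYPES:
        names += [f"{maasdb}-sss-{sss}.oeb", f"{maasdb}-sss-{sss}.oeb.idx"]
    return names


def missing_outputs(maasdb, directory="."):
    """Return the expected files that are absent from `directory`."""
    return [name for name in expected_outputs(maasdb)
            if not os.path.exists(os.path.join(directory, name))]


for name in expected_outputs("emolecules.maasdb"):
    print(name)
```

Calling missing_outputs("emolecules.maasdb") in the output directory returns an empty list when every file from the example listing is present.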