The most common use of ROCS is overlaying a large collection of molecules onto a query (reference) molecule. For the purposes of this document, we’ll call this large file the dbase (fit) file. The most common format for the dbase file is a multi-conformer OEBinary file created by OpenEye’s OMEGA program; however, this file can be in one of several 3D formats, including SDF, MOL2 and PDB. ROCS determines the input file format from the file extension: .sdf or .mol for SDF, .mol2 for MOL2, and .pdb or .ent for PDB. Gzip-compressed files of these same formats are allowed as well; for example, ROCS will interpret infile.sdf.gz as a gzip’ed SDF file.
Note that even though all these formats are supported, using SDF or MOL2 can result in a loss of speed due to the huge I/O penalty of these formats.
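As a rough illustration of the extension rules described above, the following Python sketch maps a file name to a format label. It is not ROCS code: the function name guess_format and the mapping table are hypothetical, and only the extensions named in this section are included.

```python
from pathlib import PurePath

# Extension-to-format map copied from the text of this section.
# Extensions for other formats are not listed here, so they are left out.
EXTENSION_FORMATS = {
    ".sdf": "SDF",
    ".mol": "SDF",
    ".mol2": "MOL2",
    ".pdb": "PDB",
    ".ent": "PDB",
}


def guess_format(filename):
    """Return (format, is_gzipped); format is None for unknown extensions."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if gzipped:
        # Look at the extension underneath the .gz suffix.
        suffixes = suffixes[:-1]
    fmt = EXTENSION_FORMATS.get(suffixes[-1]) if suffixes else None
    return fmt, gzipped
```

With this sketch, a name such as infile.sdf.gz is reported as gzip-compressed SDF, matching the behavior described in the text.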
ROCS has no provision for conversion of 1D/2D molecules to 3D. The input file must already be 3D. More importantly, ROCS will interpret conformers in the input file as part of a single multi-conformer molecule as long as they:
While this may appear to be a restrictive list, many programs write multi-conformer molecules into SDF or MOL2 files such that the above rules will be satisfied. If the conformers are named differently (e.g. they have a conformer number appended to the base name, like acetsali_1, acetsali_2), ROCS will still consider them part of a single multi-conformer molecule if the criteria above are met. For file formats that are not inherently multi-conformer, this behavior can be turned off with the -scdbase command-line switch. With the -scdbase switch on, ROCS will not attempt to combine multiple conformers into a single multi-conformer molecule.
A new molecule file format, intended specifically for running ROCS on large clusters, is the .rocsdb format. See the makerocsdb section for when to use this file and how to create it.
One other file type is allowed as the dbase file. A file name ending in .list or .lst is assumed to be a list of actual molecule files, one per line. ROCS will then open each in turn and treat the entire collection as a single dbase file. Note that the conformer detection/concatenation code above will not span the gaps between these separate files.
Here is an example list file:
part1.oeb.gz
part2.oeb.gz
part3.oeb.gz
hits.mol2
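To make the one-file-per-line convention concrete, here is a minimal Python sketch that reads such a list file and returns the molecule file names in order. The function name read_list_file is hypothetical, and skipping blank lines is an assumption of the sketch rather than documented ROCS behavior.

```python
from pathlib import Path


def read_list_file(list_path):
    """Return the molecule file names named in a .list/.lst file, in order."""
    path = Path(list_path)
    if path.suffix.lower() not in {".list", ".lst"}:
        raise ValueError(f"{list_path} is not a .list or .lst file")
    lines = path.read_text().splitlines()
    # Assumption of this sketch: blank lines are ignored.
    return [line.strip() for line in lines if line.strip()]
```

The returned names could then be processed one file at a time, which mirrors the point that conformer concatenation does not span the boundaries between files.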
The second required input for a ROCS run is a file containing one or more molecules to be used as the query. ROCS will loop over molecules read in from the dbase file and attempt to overlay each of them against the query. In order to be consistent with other OpenEye software, this query molecule can also be referred to as the reference molecule.
Normally, ROCS treats each molecule in the query file as a single conformer molecule. For each molecule in the query file, ROCS will run a complete loop over the dbase molecules and write out a hits structure file and a report file, depending on the values of other command line switches described below.
Alternatively, ROCS can read queries as multi-conformer molecules by adding the -mcquery command line switch. In this mode, ROCS uses the same rules as described in the Database File section to determine if two consecutive molecules are actually conformers of the same molecule. For each multi-conformer molecule in the query file, ROCS will loop over the dbase molecules’ conformers, comparing them to all query conformers. By default, ROCS will only return the single best overlay of this NxM set of comparisons. More than one can be returned by using the -maxconfs command line switch.
Version 3.0 of ROCS introduced a new type of query called a shape query. It is a format that encompasses multiple elements of shape, including molecules, color features and grids. It can be generated from vROCS and saved in a shape query file with the extension .sq.
ROCS can also use a grid instead of a molecule as a query ([Virtanen-2010]). These grids must be in GRASP, OpenEye, OpenEye ASCII Grid (.agd), CCP4, or XPLOR grid format and can be created with the OpenEye Grid toolkit or with a graphical application like GRASP. Certain ROCS features are not available when using a grid query. For example, the color force field features are not available with a grid query.
A description of the command line interface can be obtained by executing ROCS with the --help option.
prompt> rocs --help
will generate the following output:
Help functions:
  rocs --help simple : Get a list of simple parameters
  rocs --help all : Get a complete list of parameters
  rocs --help <parameter> : Get detailed help on a parameter
  rocs --help html : Create an html help file for this program
File containing molecules or a single grid to use as a shape query.
For molecule queries, available formats (and file extensions) include:
|SDF||.sdf .mol .sdf.gz .mol.gz|
|PDB||.pdb .ent .pdb.gz .ent.gz|
For grid queries, available formats (and file extensions) include:
|Grid File type||Extension|
For shape queries, the only available format and file extension is:
|Shape Query type||Extension|
File containing one or more 3D molecules to overlay against the query from above. This flag supports all the same molecule file formats (not grids or shape queries) as -query, plus the ROCS DB format .rocsdb and the list file (.lst or .list) as described in the Database File section.
The argument for this flag is the name of a file containing control parameters. The control parameter file acts to either replace or augment the command line interface. All parameters necessary for program execution may be provided in the control parameter file, although any command given explicitly on the command line will supersede options found in the parameter file. The application generates a new parameter file containing the full set of execution parameters upon every execution. The name of the parameter file is created by combining the prefix base name with the ‘.parm’ extension.
Specifies the number of processors n when the application is run in MPI mode.
Specifies the name of the file containing processors configuration. For every host this file should contain a line host_name slots=n where n is the number of processors on the host.
Combine contiguous conformers in -query file into a multi-conformer query molecule, following the same rules for combining sequential conformers in the -dbase file. By default, this is false and each connection table in the -query file is treated as a separate query. Labelling the conformer by adding a wart to the name can be set using the -qconflabel parameter.
[default = false]
Prefix used to name output files. Using -prefix FOO will create a hits structure file named like FOO_hits_1.sdf and a report file, FOO_1.rpt, where _1 will be replaced by a sequential number corresponding to the index of the query in the -query file. Additionally, a parameter file containing all options for the current run will be written to FOO.parm. This parameter file can be used with the -param switch.
[default = rocs]
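The naming scheme described for -prefix can be summarized in a few lines of Python. This is only a sketch of the pattern stated above (PREFIX_hits_N.EXT, PREFIX_N.rpt and PREFIX.parm); the function name output_names is hypothetical, and the structure-file extension is passed in explicitly rather than assumed.

```python
def output_names(oformat, prefix="rocs", query_index=1):
    """Build the output file names described for the -prefix switch.

    oformat is the structure-file extension (for example "sdf");
    query_index is the 1-based position of the query in the -query file.
    """
    return {
        "hits": f"{prefix}_hits_{query_index}.{oformat}",
        "report": f"{prefix}_{query_index}.rpt",
        # One parameter file per run, not per query.
        "params": f"{prefix}.parm",
    }
```

For instance, calling output_names("sdf", prefix="FOO") reproduces the FOO_hits_1.sdf, FOO_1.rpt and FOO.parm names used in the description.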
Search the entire dbase file and keep a hitlist, sorted by the score given by the -rankby switch. The size of the hitlist is determined by the integer value N. Note that all members of the hitlist must pass the -cutoff, if given, so the final size can be smaller than the N requested. This switch is ignored if -maxhits is given. Note that if this is set to zero (0) and -maxhits is also zero (0), then no hitlist will be maintained and all results will be streamed directly to the respective output files.
[default = 500]
Cutoff (F) to determine whether a specific overlay should be considered good enough for hitlist inclusion. This is a floating point value and the actual parameter used for the scoring is as defined by the -rankby switch.
[default = -1.0]
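The interplay between the hitlist size and the cutoff can be pictured with a bounded heap. The Python sketch below is an illustration of the idea, not the ROCS implementation: keep_best_hits is a hypothetical name, scores are plain numbers, and treating a score exactly equal to the cutoff as passing is an assumption.

```python
import heapq


def keep_best_hits(scored, besthits=500, cutoff=-1.0):
    """Keep the top-scoring hits that pass the cutoff.

    scored is an iterable of (name, score) pairs. Returns a list of
    (score, name) tuples sorted from best to worst. A besthits of 0
    keeps every passing hit in this sketch.
    """
    heap = []  # min-heap: the worst retained hit sits at heap[0]
    for name, score in scored:
        if score < cutoff:
            continue  # assumption: scores equal to the cutoff pass
        if besthits and len(heap) >= besthits:
            # Push the new hit and drop whichever hit is now the worst.
            heapq.heappushpop(heap, (score, name))
        else:
            heapq.heappush(heap, (score, name))
    return sorted(heap, reverse=True)
```

Because the cutoff is applied before a hit can enter the list, the result can contain fewer than the requested number of hits, as noted above.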
Score to use for ranking the hitlist. Legal values include:
[default = TanimotoCombo]
Maximum number of overlays returned for each comparison of a dbase molecule with a query molecule. This defaults to 1. As an example, if the query has n conformers and a given dbase molecule has m conformers, then a total of nxm overlays will be performed. By default, the single best one (1) will be returned (if it passes any -cutoff given). Choosing an alternate value for -maxconfs will cause up to the top N of these overlays to be returned and merged into the hitlist. In the hitlist, these conformers will not be associated with each other. Throughout a run, some can drop off the hitlist while others remain.
[default = 1]
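The n x m comparison described for -maxconfs can be sketched as picking the top entries of a score matrix. The code below is a hypothetical illustration: best_overlays is not a ROCS function, and the scores are assumed to be already computed, with scores[i][j] holding the score of query conformer i against dbase conformer j.

```python
import heapq


def best_overlays(scores, maxconfs=1):
    """Return the top maxconfs overlays as (score, query_idx, dbase_idx).

    scores is a list of rows, one row per query conformer, with one
    entry per dbase conformer.
    """
    flat = (
        (score, i, j)
        for i, row in enumerate(scores)
        for j, score in enumerate(row)
    )
    # nlargest returns the results sorted from best to worst.
    return heapq.nlargest(maxconfs, flat)
```

With the default of 1 only the single best pair is returned; a larger value returns several overlays of the same dbase molecule, each of which would then compete for a place in the hitlist independently.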
Controls where the conformer index from a database molecule gets labeled on output molecules. The allowed values are none, title, sdtag, and both.
[default = title]
Controls whether the conformer index from a query molecule gets labeled on output molecules. The allowed values are none and title.
[default = title]
Put the query structures at the top of the output structure file. This is very useful for keeping the query structure in the same file as the hits, so that for instance, you only need to load one file into VIDA to browse the results. For a grid query, a copy of the grid will be written to PREFIX_ref.grd, where PREFIX is defined by the -prefix commandline option.
[default = true]
Don’t write a structure file. There are times when all you really want are the numerical results from ROCS. If you don’t want or need an output structure file, you can prevent its creation with this switch.
[default = false]
Instead of writing to PREFIX_hits_n.sdf (for example), where PREFIX is provided by the -prefix command-line flag, write all hit structures to the file provided with this flag. The argument can be a filename or a full/relative path. Also, if the name provided is actually a molecule file format extension (e.g. .sdf, .mol2.gz, .oeb), ROCS will write to stdout using the format derived from the file extension. For example, if the following is used:
then ROCS will write all the hits out to stdout in SDF format.
Note that this option will only work for a single query. If more than one query is provided along with the -hitsfile option, ROCS will issue an error and stop.
Format for the output structure file(s). This option gives a file extension to be used for all output structure files. The format for the file is determined from the extension. Valid values include all the molecule file formats listed in the table above for -query files.
[default = oeb.gz]
This parameter controls whether to attach score information to output molecules as SD data.
Instead of writing to PREFIX_n.rpt where PREFIX is provided by the -prefix commandline flag, write all report information (stats) to the file provided with this flag. Can be filename or full/relative path. Note that if more than one query molecule is provided, this flag will not work unless the -report flag is also set to one to put all report info into one report file.
Controls report file generation. The default, each, writes a separate report file for each query in the -query file. If one is chosen, stats for multiple queries in the same -query file will be placed in a single report file. This is useful for computing a NxN comparison of a file as both the dbase and query. Finally, to prevent ROCS from writing report files, use none.
[default = each]
Determines which stats get placed into report files. Values include hits (the default), best and all. The default is to include just the stats for the compounds in the hitlist. If best is chosen, the report file will include stats for the best overlay(s) for every dbase molecule. The number of best overlays reported per molecule is determined by the value of -maxconfs. Finally, if all is given, stats for every single overlay will be placed in the report file. Be careful: for a multi-conformer query against a large dbase file, all can generate a HUGE amount of data.
[default = hits]
Instead of writing to PREFIX_n.status where PREFIX is provided by the -prefix commandline flag, write all status information to the file provided with this flag. Can be filename or full/relative path. Note that if more than one query molecule is provided, the status written to this file is only for the most recently processed query.
Controls status file generation. The default, each, writes a separate status file for each query in the -query file. If one is chosen, the status for multiple queries in the same -query file will be placed in a single status file. Note that if more than one query molecule is provided, the status written to this file is only for the most recently processed query. Finally, to prevent ROCS from writing status files, use none.
[default = each]
Method for showing job progress on the command line. Choices include:
Add extra verbosity to log file.
Flag that can be used to limit output hits to only those with some minimum shape score. This can be used regardless of which score is chosen (-rankby) for ranking the hitlist. For example, using:
-rankby TanimotoCombo -cutoff 1.1 -tanimoto_cutoff 0.6
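any molecule with Shape Tanimoto <= 0.6 will not be retained. Additionally, because TanimotoCombo is the sum of the Shape and Color Tanimoto scores, the constraint that TanimotoCombo be at least 1.1 sets a minimum Color Tanimoto for each hit: a molecule whose Shape Tanimoto is only just above 0.6 needs a Color Tanimoto of about 0.5, while a molecule with a higher Shape Tanimoto can pass with a correspondingly lower Color Tanimoto.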
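The arithmetic behind combining the two cutoffs can be written out as a small Python sketch. It assumes, as the mention of a sum above suggests, that TanimotoCombo is Shape Tanimoto plus Color Tanimoto; the function names are hypothetical and the default values are the ones used in the example command.

```python
def passes_cutoffs(shape, color, combo_cutoff=1.1, shape_cutoff=0.6):
    """True if a hit survives both the shape cutoff and the combo cutoff.

    Assumes TanimotoCombo = Shape Tanimoto + Color Tanimoto.
    """
    # A Shape Tanimoto at or below the shape cutoff is rejected outright.
    return shape > shape_cutoff and (shape + color) >= combo_cutoff


def minimum_color(shape, combo_cutoff=1.1):
    """Smallest Color Tanimoto that still reaches the combo cutoff."""
    return max(0.0, combo_cutoff - shape)
```

The second helper makes the trade-off explicit: the lower the shape score of a molecule, the higher its color score must be for the sum to reach the cutoff.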
Specifies number (N) of random starting positions to try instead of inertial frame overlay as described in the theory section. Since inertial frame alignment involves 4 (or 8 in the case of highly symmetric molecules) starting positions, setting -randomstarts to a value much larger will result in much slower run times.
Also calculate sub-Tanimoto score. See the Report File section for a complete description of how sub-Tanimoto is calculated.
Specifies starting the search at all heavy atoms of the larger molecule as well as the default inertial starts. The larger molecule is chosen by comparing the self shape-overlap terms of the query and database molecule. The -subrocs option is especially useful when the query and database molecules have a large difference in size.
A color force field is used by default and optimization against shape and color is the default overlap method. As an easy way to run ROCS with just shape overlap, this flag is the equivalent of setting:
-chemff none -optchem false -rankby tanimoto
Perform scoring calculation only on input molecules. Sets the following flags:
-opt false -optchem false -besthits 0 -maxhits 0 -scdbase
Color-force-field name. Either the name of one of the built-in color force fields (ImplicitMillsDean or ExplicitMillsDean) or the name of a user-defined color force field file. The format of this file is given in the Color Force Field section.
[default = ImplicitMillsDean]
Create an input file suitable for input to EON. This file will contain one or more conformers, aligned by ROCS, and output to an OEB file. The query will also be written to the beginning of the file so that this file is the only input required to feed into EON. By default, this file will contain up to 3 conformers of the top 1000 ROCS molecules. The number of conformers per molecule can be controlled with -eon_maxconfs while the total number of molecules can be controlled with -eon_input_size. By default, the file will be named PREFIX_eon_input_N.oeb.gz, but the actual name can be controlled via the -eon_input_file flag.
[default = false]
Number of conformers per molecule to be written to the EON input file. Has no effect unless -eon_input is true.
[default = 3]
Number of top molecules to keep and write to the EON input file. If a value of 0 is given, all ROCS input molecules will be aligned and written to the EON input file.
[default = 1000]
The example commands in this section can be run with files found under the appropriate version directory in examples/rocs under the top level installation directory.
ROCS always requires at the very least a file containing the query molecule(s) and a file containing the database molecule(s). The query file follows the -query command line flag and the database file follows the -dbase flag. When ROCS is given no other arguments besides a query file and a database file, it will attempt to read the first query molecule, fit all database molecules to the query molecule, and write out the top 500 structures that have a Tanimoto Combo score above a given cutoff (default cutoff = -1.0). It is important to note that a matching structure, or hit, is the best fitting conformer of a database molecule. Even if multiple conformers of a molecule pass the cutoff, only the conformer which fits the best will be written out by default.
ROCS writes a structure file and a report file for each query molecule. The -prefix command line switch is used to name these files. The default prefix is rocs. The output structure file is by default sdf so that Shape Tanimoto and other calculated values can be included as tagged data, but the format can be changed by using the -oformat flag or by giving a specific filename using -hitsfile.
Note that as of ROCS 2.4, the defaults include using a color force field (ImplicitMillsDean), optimization against chemistry (-optchem true) and ranking the hitlist via TanimotoCombo (-rankby TanimotoCombo).
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf
will cause structures in the file database.oeb.gz that match query molecule in 4cox.sdf to be written to a file called rocs_hits_1.sdf. A tab-delimited report file containing the scores will be written to rocs_1.rpt. If rocs_hits_1.sdf is viewed in VIDA, hits can be visually compared with the query and the numerical scores will appear in the spreadsheet. Molecules in the hits file and the report file will be ranked by their TanimotoCombo score.
To prevent continually over-writing output files, the -prefix flag allows you to give unique names to the files.
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf -prefix FOO
will write the hit structures into a file named FOO_hits_1.sdf and the overlay values will be in a file called FOO_1.rpt. As you follow the rest of the examples in this section, you may wish to use different prefixes each time so that you can compare how the output files differ.
The -cutoff flag is used to control which database molecules are considered hits. By default this is set at -1.0. The following demonstrates changing the cutoff from the default value:
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf -cutoff 1.0
The difficulty in choosing a cutoff value is that the number of hits at a given value is not usually known a priori, so setting too high a cutoff could result in no hits. The -besthits and -maxhits flags can be used in conjunction with a cutoff value to coax ROCS into giving output of a manageable size. Quick searches can be done to assess an appropriate cutoff value for a particular query molecule. The following demonstrates a search that will give a quick answer:
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf -cutoff 1.0 -maxhits 20
After 20 hits are found above a combo score of 1.0 in database.oeb.gz for the query molecule(s) in 4cox.sdf, the search terminates and the results are written. This option prevents the entire database file from being searched if a sufficient number of hits are found before the end of the database file. Finding the best N hits above a threshold tends to be a more common exercise. If the top N hits of a database, up to a maximum of 100 and above a value of 1.0, are desired, the following search can be done:
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf -cutoff 1.0 -besthits 100
If you just want the best N hits regardless of the cutoff, then using the default cutoff of -1.0 along with -besthits generates the N best:
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf -besthits 100
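The difference between -maxhits and -besthits is that the former stops reading as soon as enough passing hits have been seen, while the latter must scan the whole database. The early-stopping behavior can be sketched in Python as follows; first_hits is a hypothetical name and the scores are assumed to arrive as (name, score) pairs in database order.

```python
from itertools import islice


def first_hits(scored, maxhits, cutoff=-1.0):
    """Return the first maxhits entries that pass the cutoff.

    scored is an iterable of (name, score) pairs. Iteration stops as
    soon as maxhits passing entries have been collected, so the rest
    of the input is never read. In this sketch maxhits must be
    positive; the "0 means off" convention is not modeled.
    """
    passing = (item for item in scored if item[1] >= cutoff)
    return list(islice(passing, maxhits))
```

Note that the hits returned this way are simply the first ones encountered, not necessarily the best ones in the database, which is why this mode is suited to quick exploratory searches.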
If a report file alone is desired, the output of matching structures can be suppressed with the -nostructs option. For example:
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf -cutoff 1.0 -nostructs
will only generate a report file for matching structures but the matches will not be written to a structure file.
By default, ROCS uses an inertial frame alignment to generate 4 separate starting positions, optimizes all 4 overlays and selects the best match of the 4. By default, this inertial frame alignment aligns the centers-of-mass of the two structures being aligned. If either molecule is substantially smaller than the other, this may not be the best starting position, so the choice to use random starting positions is offered. The command:
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf -randomstarts 20
will use 20 random starting positions and keep the best score. Runtime is proportional to the number of starting positions, so using a large number for -randomstarts can significantly slow down a ROCS job.
ROCS also calculates the Tversky coefficient based either on the fit (database) molecule (FitTversky) or on the reference (query) molecule (RefTversky). These scores will appear in the report file and in SD tags if the structures are written to an SD or OEB file. ROCS can use these other scores as the ranking score for the hitlist by using the -rankby switch.
To search a database and find the best 300 hits, scored by the FitTverskyCombo coefficient weighted to each database molecule:
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf -rankby FitTverskyCombo -besthits 300
A chemical force field is used by default (ImplicitMillsDean) but a different one can be specified. Please refer to the chemical force field (CFF) section for a description of how to define a chemical force field. To simply calculate the CFF score after finding the best alignment based on shape use the -chemff option. For example:
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf -chemff ExplicitMillsDean
To turn off all color and run ROCS as shape overlap only, you can use the -shapeonly flag:
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf -shapeonly
To write out a file for input into EON, containing the top 1000 ROCS hits with 3 conformers per output molecule:
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf -eon_input
To write all ROCS hits to the EON input file:
prompt> rocs -dbase database.oeb.gz -query 4cox.sdf -eon_input -eon_input_size 0
The ROCS report file format appears as a tab-delimited file with the following fields. Since the names of the query and the hits are of indeterminate length, fixed size fields for these names could result in loss of information. Unfortunately this gives a file that is hard to read in a terminal session, but it can easily be read into a spreadsheet program or into the spreadsheet in VIDA.
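Since the report is tab-delimited, it can also be loaded with a few lines of Python. The sketch below assumes that the first line of the file is a header row naming the fields; read_report is a hypothetical helper, and no particular column names are assumed.

```python
import csv


def read_report(handle):
    """Read a tab-delimited report into a list of dicts keyed by column name.

    Assumes the first row of the file is a header row.
    """
    return list(csv.DictReader(handle, delimiter="\t"))
```

A report could then be loaded with something like `with open("rocs_1.rpt", newline="") as fh: rows = read_report(fh)`, after which each row is a dictionary of strings that can be converted to numbers as needed.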
The chemical force field can be used to measure chemical complementarity, and to refine shape based superpositions based on chemical similarity. The CFF is composed of SMARTS rules that determine chemical centers, plus rules to determine how such centers interact.
Two color force fields, ImplicitMillsDean and ExplicitMillsDean, are built into ROCS. Both force fields define six similar color types (TYPE entries): hydrogen-bond donors, hydrogen-bond acceptors, hydrophobes, anions, cations, and rings. The ImplicitMillsDean force field is recommended.
ImplicitMillsDean includes a simple pKa model that assumes pH=7. It defines cations, anions, donors and acceptors in such a way that they will be assigned the appropriate value independent of the protonation state in the dbase or query file. For example, if a molecule contains a carboxylate, ImplicitMillsDean will consider it an anionic center independent of whether it is protonated or deprotonated in the dbase file. This is convenient for searching databases which have not had careful curation of their protonation states. The ExplicitMillsDean file has a similar overall interaction model; however, it does not include a pKa model. It interprets the protonation and charge state of each molecule exactly as it is in the database. Thus, if a sulfate is protonated and neutral, it will not consider it an anion.
The hydrogen-bond models in both ImplicitMillsDean and ExplicitMillsDean are extensions of the original model presented by Mills and Dean ([MillsDean-1996]). They both have donors and acceptors segregated into strong, moderate and weak categories.
As an alternative to the built-in force fields, the user can define a new color force field using the format described in this section. The following is a simplified example of a color force field specification.
DEFINE hetero [#7,#8,#15,#16]
DEFINE notNearHetero [!#1;!$($hetero);!$(*[$hetero])]
#
#
TYPE donor
TYPE acceptor
TYPE rings
TYPE positive
TYPE negative
TYPE structural
#
#
PATTERN donor [$hetero;H]
PATTERN acceptor [#8&!$(*~N~[OD1]),#7&H0;!$([D4]);!$([D3]-*=,:[$hetero])]
PATTERN rings [R]~1~[R]~[R]~[R]1
PATTERN rings [R]~1~[R]~[R]~[R]~[R]1
PATTERN rings [R]~1~[R]~[R]~[R]~[R]~[R]1
PATTERN rings [R]~1~[R]~[R]~[R]~[R]~[R]~[R]1
PATTERN positive [+,$([N;!$(*-*=O)])]
PATTERN negative [-]
PATTERN negative [OD1+0]-[!#7D3]~[OD1+0]
PATTERN negative [OD1+0]-[!#7D4](~[OD1+0])~[OD1+0]
PATTERN structural [$notNearHetero]
#
#
INTERACTION donor donor attractive gaussian weight=1.0 radius=1.0
INTERACTION acceptor acceptor attractive gaussian weight=1.0 radius=1.0
INTERACTION rings rings attractive gaussian weight=1.0 radius=1.0
INTERACTION positive positive attractive gaussian weight=1.0 radius=1.0
INTERACTION negative negative attractive gaussian weight=1.0 radius=1.0
INTERACTION structural structural attractive gaussian weight=1.0 radius=1.0
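To show how a specification like the one above breaks down by keyword, here is a small Python sketch that groups the lines of a color force field file. It is not the ROCS parser: parse_cff is a hypothetical name, the file is assumed to hold one statement per line, and treating lines that begin with # as comments is an assumption based on how they are used in the example.

```python
from collections import defaultdict

KEYWORDS = ("DEFINE", "TYPE", "PATTERN", "INTERACTION")


def parse_cff(text):
    """Group the statements of a color force field file by keyword.

    Returns a dict mapping each keyword to a list of token lists,
    e.g. {"TYPE": [["donor"], ["acceptor"]], ...}.
    """
    entries = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' carry no data.
        if not line or line.startswith("#"):
            continue
        keyword, _, rest = line.partition(" ")
        if keyword not in KEYWORDS:
            raise ValueError(f"unknown keyword: {keyword}")
        entries[keyword].append(rest.split())
    return dict(entries)
```

Splitting on whitespace works for the example because none of its SMARTS patterns contain spaces; a pattern with embedded spaces would need a more careful tokenizer.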
There are four basic keywords in a cff file: DEFINE, TYPE, PATTERN, and INTERACTION. The TYPE keyword declares a type, which can be any user-specified string such as “donor”, “acceptor”, “lipophilic anion” etc. The PATTERN keyword is used to associate SMARTS patterns with these types. There is no restriction on the number of patterns that can be associated with a user-defined type. The position in Cartesian space of the PATTERN is taken as the average of the coordinates of the atoms that match the SMARTS pattern. If the desired location of the PATTERN is on a single atom of a larger SMARTS pattern, recursive SMARTS (written as ‘[$(SMARTS)]’) can be used to this effect. Only the first atom in a recursive SMARTS pattern ‘matches’ the molecule; the rest of the SMARTS pattern defines an environment. By writing a SMARTS pattern in recursive notation, the location of the PATTERN will be taken as the atomic position of the first matching atom in the pattern. In order to simplify both reading and writing SMARTS, intermediate SMARTS can be associated with words using the DEFINE keyword. Once defined, these words can then be used as atom primitives in subsequent SMARTS patterns with the $ prefix (see “DEFINE hetero” and “PATTERN donor” above).
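The rule for placing a PATTERN in space amounts to a centroid calculation. The following Python sketch illustrates it for a list of matched atom coordinates; pattern_position is a hypothetical name, and the SMARTS matching itself is assumed to have been done elsewhere.

```python
def pattern_position(coords):
    """Average the (x, y, z) coordinates of the atoms matched by a pattern."""
    if not coords:
        raise ValueError("a pattern must match at least one atom")
    n = len(coords)
    return tuple(sum(atom[k] for atom in coords) / n for k in range(3))
```

A ring pattern that matches several atoms therefore yields a point at the center of those atoms, whereas a recursive SMARTS pattern matches a single atom, so the average collapses to that atom's own position.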
Interactions between types are declared with the INTERACTION keyword. Two user-defined types must be listed, along with whether their interaction is attractive or repulsive. The height and radius of the interaction can be modified by the keywords WEIGHT and RADIUS. At present, the only alternative to a Gaussian decay is invoked by the DISCRETE keyword. A discrete interaction contributes all of WEIGHT if the inter-type distance is less than RADIUS, and zero otherwise. Since it is not differentiable (the gradient of a DISCRETE function is either 0 or infinite), it makes no contribution to optimization.
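The reason a DISCRETE interaction cannot guide optimization can be seen from a short numerical sketch in Python. The function discrete_score below follows the rule stated above (all of WEIGHT inside RADIUS, zero outside); both function names are hypothetical, and the sign convention for attractive versus repulsive interactions is left out.

```python
def discrete_score(distance, weight=1.0, radius=1.0):
    """Step-function interaction: full weight inside the radius, else zero."""
    return weight if distance < radius else 0.0


def finite_difference(f, x, h=1e-6):
    """Central-difference estimate of the derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)
```

Away from the boundary the estimated slope is exactly zero on both sides, so a gradient-based optimizer receives no signal about which direction would improve the score; only at the radius itself does the function jump.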