FastROCS comes with a variety of scripts for configuring, reporting, and debugging.
ShapeDatabaseServer.py <database> [portnumber=8080]
Runs the FastROCS Server. The preferred usage of ShapeDatabaseServer.py is as a service that is left continuously running to avoid the overhead of reading and parsing the database off disk. Run the above command to start a server on a suitable host. By default the service will run on port 8080, though this can be changed by the last command line option.
The server prints dots while loading the database into memory, which can take a considerable amount of time. Once loading is complete, it reports a total and then sleeps, waiting for queries on the specified port. Queries are serviced through the XMLRPC protocol using the built-in libraries that ship with a standard Python distribution.
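Since queries travel over XMLRPC using only the Python standard library, a round trip can be sketched without any OpenEye code. The example below is a minimal illustration, not the FastROCS server: the method name database_size and the helper functions are hypothetical and are not part of the real server interface.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def start_toy_server(host="127.0.0.1", port=0):
    """Start a toy XMLRPC server in a background thread and return it."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)

    # Hypothetical method: NOT part of the real FastROCS server interface.
    def database_size():
        return 3

    server.register_function(database_size, "database_size")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_toy_server(address):
    """Call the hypothetical method on the server at 'host:port'."""
    with ServerProxy("http://" + address) as proxy:
        return proxy.database_size()


def demo():
    """Start the toy server on a free port, query it once, and shut it down."""
    server = start_toy_server()
    host, port = server.server_address
    try:
        return query_toy_server(f"{host}:{port}")
    finally:
        server.shutdown()
        server.server_close()
```

The client only needs a host:port string, which is the same form of address the scripts on this page take on the command line.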
The database load time can be significantly improved by running ShapeDatabasePrep.py on the file first and saving it as uncompressed OEB. The ShapeDatabaseChunker.py program described later will perform such caching.
ShapeDatabasePrep.py <database.oeb> <prepped_database.oeb> [max_confs]
Prepares an OEB file for faster loading (~10x) into ShapeDatabaseServer.py. For the best load performance, store the resulting OEB as .oeb rather than .oeb.gz. A corresponding .oeb.idx index file is also created, which further improves load performance. The disk space lost by not using gzip compression is offset by removing unnecessary information from the OEB file and by the newer PRE-compressed OEB format.
The maximum number of conformers per molecule can also be reduced at the same time by specifying the third max_confs argument.
The PRE-compressed OEB format may not be readable by older OpenEye products built on versions of OEChem prior to OEChem 2.0.2 (2014.Oct).
ShapeDatabaseClient.py <server:port> <query> <results> [num_hits = 100]
Example script to send a query to a specified ShapeDatabaseServer. Rudimentary progress is shown as a fraction printed to standard out.
This is meant to be a simple example that can be customized to a particular client application. For example, the Vida client was adapted from this script.
ShapeDatabaseClientHistogram.py <server:port> <query> <results> [num_hits = 100]
Example script to send a query to a specified ShapeDatabaseServer and print the histogram of scores for the entire database of molecules. The histogram is a simple ASCII representation of the distribution. The histogram will also be updated in ‘real-time’ as the query progresses.
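A simple ASCII histogram of scores can be sketched as follows. This is an illustration only, not the code used by ShapeDatabaseClientHistogram.py: the function name, the bin count, and the assumed score range of 0.0 to 2.0 are all assumptions made for the example.

```python
def ascii_histogram(scores, n_bins=10, max_score=2.0, width=40):
    """Return a text histogram of scores, one line per bin."""
    counts = [0] * n_bins
    for score in scores:
        # Clamp so that a score equal to max_score lands in the last bin.
        idx = min(int(score / max_score * n_bins), n_bins - 1)
        counts[max(idx, 0)] += 1

    peak = max(counts) or 1  # avoid dividing by zero for empty input
    lines = []
    for i, count in enumerate(counts):
        lower_edge = i * max_score / n_bins
        bar = "#" * round(count / peak * width)
        lines.append(f"{lower_edge:4.1f} | {bar} {count}")
    return "\n".join(lines)
```

Reprinting the returned string each time new scores arrive gives a crude live view of the distribution.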
ShapeDatabaseIsLoaded.py [-blocking] [-h] <server:port>
Returns whether the database at server:port has completed the initial load step. By default, the program will return immediately with either true or false. Specifying the -blocking argument will tell the program to wait until the server has finished.
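How -blocking is implemented is not described here. One generic way to get the same effect from the client side is to poll until the server reports that loading has finished. In the sketch below, is_loaded is a hypothetical callable standing in for whatever check the client performs, and the timeout is an addition for the example.

```python
import time


def wait_until_loaded(is_loaded, poll_seconds=1.0, timeout=None, sleep=time.sleep):
    """Poll is_loaded() until it returns True.

    Returns True once loaded, or False if the optional timeout expires first.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while not is_loaded():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        sleep(poll_seconds)
    return True
```

Calling is_loaded() once without the loop corresponds to the default behaviour of returning immediately with true or false.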
To spread a database across multiple servers the database must first be appropriately chunked. A ShapeDatabaseServer.py is started on each host with its corresponding database chunk. ShapeDatabaseProxy.py is then used to tie all the servers together to appear like a single shape database service to client applications. Theoretically, there is no need for the client to know whether it is querying one or many servers.
The general work-flow is as follows:
Split up database into 2 servers
ShapeDatabaseChunker.py database.oeb.gz database_shapedb.oeb.gz 2
Start a shape database server on each host.
# host1:
ShapeDatabaseServer.py database_shapedb_1.oeb.gz
# host2:
ShapeDatabaseServer.py database_shapedb_2.oeb.gz
Start a proxy server to point to all the hosts. Note, a different port number like 8081 may need to be specified if the proxy server is being started on a host that already has a ShapeDatabaseServer.py started on it.
# proxy:
ShapeDatabaseProxy.py host1:8080 host2:8080 8081
Query the proxy server
ShapeDatabaseClient.py proxy:8081 query.sdf overlays.oeb
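For more than two hosts, the command lines in the work-flow above can be generated rather than typed by hand. The sketch below only builds strings and runs nothing. The chunk naming rule (an index inserted before the first extension) is inferred from the example file names above and is an assumption, as are the helper names.

```python
import os


def chunk_names(prefix, n_servers):
    """Guess chunk file names, e.g. db.oeb.gz -> db_1.oeb.gz, db_2.oeb.gz."""
    directory, filename = os.path.split(prefix)
    stem, dot, ext = filename.partition(".")
    return [
        os.path.join(directory, f"{stem}_{i}{dot}{ext}")
        for i in range(1, n_servers + 1)
    ]


def workflow_commands(database, prefix, hosts, server_port=8080, proxy_port=8081):
    """Return the chunker, server, and proxy command lines as strings."""
    chunks = chunk_names(prefix, len(hosts))
    commands = [f"ShapeDatabaseChunker.py {database} {prefix} {len(hosts)}"]
    # One server per host, each loading its own chunk.
    for chunk in chunks:
        commands.append(f"ShapeDatabaseServer.py {chunk} {server_port}")
    servers = " ".join(f"{host}:{server_port}" for host in hosts)
    commands.append(f"ShapeDatabaseProxy.py {servers} {proxy_port}")
    return commands
```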
ShapeDatabaseChunker.py <database> <prefix> <n_servers>
Split a database into n_servers chunks. Due to the nature of the OEShapeDatabase the chunking is performed based upon the number of heavy atoms in each molecule. The OEShapeDatabase will actually triage molecules by heavy atom counts, so it is better to keep molecules with similar heavy atom counts together.
The chunker also takes the opportunity to cache a self shape term into the OEB file using the OESetCachedSelfShape function. This significantly improves (~5x) database load time.
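One plausible way to keep molecules with similar heavy atom counts together is to sort by heavy atom count and cut the sorted order into contiguous slices. The sketch below works on plain integers and is an assumption about the general idea, not the algorithm ShapeDatabaseChunker.py actually uses.

```python
def chunk_by_heavy_atoms(heavy_atom_counts, n_servers):
    """Group molecule indices into n_servers chunks of similar heavy atom count.

    heavy_atom_counts[i] is the heavy atom count of molecule i.
    Returns a list of n_servers lists of molecule indices.
    """
    # Molecule indices ordered from smallest to largest molecule.
    order = sorted(range(len(heavy_atom_counts)), key=heavy_atom_counts.__getitem__)

    # Contiguous slices of near-equal size; the first 'extra' chunks get one more.
    base, extra = divmod(len(order), n_servers)
    chunks, start = [], 0
    for i in range(n_servers):
        size = base + (1 if i < extra else 0)
        chunks.append(order[start:start + size])
        start += size
    return chunks
```

Each chunk then covers a narrow range of heavy atom counts. The real script may balance chunks differently.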
ShapeDatabaseProxy.py <server 1> <server 2> ... <server n> [portnumber=8080]
Starts a server that ties multiple remote ShapeDatabaseServers together so that they appear as a single server. Because the XMLRPC interface is exactly the same, no client code needs to change when migrating from a single server to multiple servers.
The ShapeDatabaseProxy is sufficiently performant that it does not need its own dedicated node. It is perfectly acceptable to run it on the same server as one of the ShapeDatabaseServers.
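How the proxy combines results from its servers is not described here. One way such a merge could work is to pool the hit lists returned by each server and keep the best num_hits overall. In this sketch, hits are hypothetical (score, title) tuples and a higher score is assumed to be better.

```python
import heapq


def merge_hits(per_server_hits, num_hits=100):
    """Merge per-server hit lists into one list of the best num_hits hits.

    per_server_hits is a list of lists of (score, title) tuples.
    """
    all_hits = (hit for hits in per_server_hits for hit in hits)
    # nlargest returns the hits sorted from highest to lowest score.
    return heapq.nlargest(num_hits, all_hits, key=lambda hit: hit[0])
```

Because each server only needs to return its own top num_hits, the merge stays cheap no matter how large the chunks are.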
ShapeDatabaseOEThrowSetLevel.py (-debug|-info|-warning) <server:port>
Adjusts the verbosity of the server running on server:port. Only one of the levels can be specified. The “-debug” verbosity level is useful for obtaining some performance statistics about how well the server is servicing queries.
CustomColorFFPrep.py <input> <output.oeb>
Caches custom color atoms onto molecules. The output must be OEB. OEShapeDatabase, ShapeDatabaseServer.py, and ShapeDatabaseClient.py will honor cached color atoms if present on the molecules. Make sure the script is run on both the database and the query molecules before running a search.
BestShapeOverlay.py <database> [<queries> ... ]
For each query, returns the database molecules sorted by their rank in that search. Attaches the shape Tanimoto as SD data. Demonstrates the simplest possible usage of the OEShapeDatabase, including keeping a parallel list of compressed in-memory molecules to be used as output.
The OEShapeDatabase does not store molecular information, only coordinate information from each conformer. If the original molecule is required for output, it is the responsibility of the programmer to maintain the relationship between OEShapeDatabase indices and molecules. This design avoids forcing additional memory overhead onto an already memory-intensive OEShapeDatabase object.
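The parallel-list bookkeeping can be sketched without the toolkit. In the example below, ToyShapeIndex is a hypothetical stand-in for a coordinate-only database that hands back indices. It is not the OEShapeDatabase API, and its scoring function is a placeholder.

```python
class ToyShapeIndex:
    """Hypothetical coordinate-only index: stores features, returns indices."""

    def __init__(self):
        self._features = []

    def add(self, feature):
        """Store a feature and return the index it was assigned."""
        self._features.append(feature)
        return len(self._features) - 1

    def sorted_scores(self, query, n_hits):
        """Return up to n_hits (index, score) pairs, best score first."""
        scored = [
            (idx, 1.0 / (1.0 + abs(feature - query)))  # placeholder score
            for idx, feature in enumerate(self._features)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:n_hits]


def build(records):
    """Load (name, feature) records, keeping a list parallel to the index."""
    index, molecules = ToyShapeIndex(), []
    for name, feature in records:
        idx = index.add(feature)
        assert idx == len(molecules)  # list position must match index entry
        molecules.append(name)
    return index, molecules


def search(index, molecules, query, n_hits):
    """Map the indices returned by the search back to the stored molecules."""
    return [(molecules[idx], score) for idx, score in index.sorted_scores(query, n_hits)]
```

The index never sees the names. Only the parallel list can turn a hit back into a molecule, which is why the two must be filled in the same order.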
BestShapeOverlayMultiConfQuery.py <database> <queries> [--nHits 100] [--cutoff N] [--tversky]
This script gives an example of searching the database with every conformer of every query. The number of results depends on nHits and has a default value of 100 per search. A cutoff can also be specified and the similarity scoring function can be set to Tversky by adding the --tversky flag to the command line (default=Tanimoto). Results are written to one file per query molecule, with each conformer of the query molecule’s results listed sequentially. The query conformer id is saved as an SD tag for ease of identification.
Running this script with large numbers of queries and high numbers of conformers will deplete system memory rapidly.
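For reference, the two similarity measures differ only in how the overlap is normalized. The sketch below uses the standard definitions in terms of the self-overlaps of query A and database molecule B and their mutual overlap. The function names are hypothetical, and the alpha and beta defaults are assumptions that should be checked against the toolkit documentation.

```python
def shape_tanimoto(o_aa, o_bb, o_ab):
    """Tanimoto: overlap divided by the union of the two self-overlaps."""
    return o_ab / (o_aa + o_bb - o_ab)


def shape_tversky(o_aa, o_bb, o_ab, alpha=0.95, beta=0.05):
    """Tversky: overlap divided by a weighted sum of the self-overlaps.

    The alpha and beta defaults here are assumed values for illustration.
    """
    return o_ab / (alpha * o_aa + beta * o_bb)
```

With alpha close to 1 and beta close to 0, Tversky mostly asks how much of the query is covered by the database molecule, so a small query embedded in a larger molecule can score highly.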
SphereExclusionClustering.py <database> <clusters.oeb> [shape tanimoto cutoff = 0.75]
Uses the OEShapeDatabase to cluster the input database into shape clusters based on a rudimentary clustering algorithm. The output is an OEB file with members of each cluster attached as children to the cluster head molecule.
Conformers from the same molecule may be assigned to two separate clusters. No attempt is made to deal with this problem, as the solution depends on what the clustering will be used for. This script is only meant to show the feasibility of shape clustering.
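Sphere exclusion clustering in its generic form can be sketched in a few lines: take the next unassigned item as a cluster head, pull in everything at or above the cutoff, and repeat on what remains. The sketch below takes an arbitrary similarity callable and is an illustration of the general technique, not the code in SphereExclusionClustering.py.

```python
def sphere_exclusion(items, similarity, cutoff=0.75):
    """Cluster items greedily; returns a list of (head, members) pairs."""
    remaining = list(items)
    clusters = []
    while remaining:
        head = remaining.pop(0)  # next unassigned item becomes a cluster head
        members, rest = [], []
        for item in remaining:
            # Items similar enough to the head join its cluster; others wait.
            (members if similarity(head, item) >= cutoff else rest).append(item)
        clusters.append((head, members))
        remaining = rest
    return clusters
```

The result depends on the input order, since earlier items get the first chance to become cluster heads.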
ShapeDistanceMatrix.py [-shapeOnly] [-dbase] <database> [-matrix] <clusters.csv>
Calculates the distance between every pair of molecules in the database. There is only one entry per molecule, though all conformers are considered in the comparison. This means the conformer used in a particular row or column of the matrix will not be consistent. The complete distance matrix is written to clusters.csv in comma-separated format, useful for feeding into downstream clustering software.
The values written are distances, not Tanimoto scores, so a perfect match is 0.0 rather than the maximum score (2.0 for Tanimoto Combo, 1.0 for shape only). The default is to use Tanimoto Combo. The -shapeOnly flag can be used to get only the shape distance.
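A conversion consistent with the description above is to subtract each score from its maximum possible value. The sketch below assumes exactly that (2.0 for Tanimoto Combo, 1.0 for shape only) and writes rows with the standard csv module. The function names and the two-decimal formatting are choices made for the example.

```python
import csv


def to_distance(score, shape_only=False):
    """Convert a similarity score to a distance (assumed: max score - score)."""
    max_score = 1.0 if shape_only else 2.0
    return max_score - score


def write_matrix(scores, out, shape_only=False):
    """Write a square matrix of scores to the file object out as distances."""
    writer = csv.writer(out)
    for row in scores:
        writer.writerow([f"{to_distance(s, shape_only):.2f}" for s in row])
```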
This script generates O(N^2) data and has O(N^2) runtime, so it is not cheap to run. It can generally handle thousands to tens of thousands of molecules in a reasonable timeframe on a modern GPU and a machine with decent memory and disk space.
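The quadratic growth is easy to put numbers on. The helper below is a back-of-the-envelope estimate only: the figure of 5 bytes per entry (for example, a value such as 0.12 plus a comma) is an assumption, and real output sizes depend on the formatting used.

```python
def matrix_cost(n_molecules, bytes_per_entry=5):
    """Return (number of matrix entries, approximate CSV size in bytes)."""
    entries = n_molecules ** 2
    return entries, entries * bytes_per_entry
```

Under these assumptions, 10,000 molecules give 100 million entries and roughly 500 MB of CSV, while 100,000 molecules would give 10 billion entries.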