Data handling in Orion

Use Cases

There are three main ways that one can manage and process data in Orion: datasets, files, and shard collections. They have tradeoffs related to scalability, parallelizability, and the tools that are available and appropriate to process them.

Dataset-oriented use cases can be split into two categories based on dataset size and usage patterns, so there are broadly four use cases to consider. The exact scale boundaries between them matter less than the general ideas and design tradeoffs; understanding these broad categories helps users design workflows that take maximum advantage of Orion’s scalability.

  1. Small datasets that a medicinal chemist may wish to analyze deeply. This means iterative design, computing new properties on the fly, and complex searches and analysis. In this case, storing the records in the database as datasets and using the various UI analysis/design components are appropriate choices. One uploads data as datasets using the ETL floes and processes the data with floes whose source and sink cubes read from and write to the database.

    Characteristics:

    • Data size: \(10^0 - 10^4\) records

    • I/O pattern: per cell read/write/delete

    • Parallel I/O: small scale (<10 write streams, <100 read streams via replicas)

    • Read after write consistency: <10ms per write

  2. Medium-sized datasets where one wants to do computations and filtering to potentially select more focused sets for further analysis. These might be HTS-type workflows. Here, datasets in the database plus the normal UI components are appropriate, as are the normal ETL floes and dataset-oriented source and sink cubes. Here one needs to be sensitive about scale. On the bigger end of the range, complex records (e.g. many fields on the record, or lots of conformers with lots of their own fields) will take a long time to load and search in the database. Some of the complex UI features and searches may not be available or may be slow. At some point, switching from datasets in the database to S3 files will be a better choice.

    Characteristics:

    • Data size: \(10^4 - 10^6\) records

    • I/O pattern: per cell read/write/delete

    • Parallel I/O: small scale (<10 write streams, <100 read streams via replicas)

    • Read after write consistency: 10ms to 1s per write; more writes imply higher replication lag

  3. S3 Files. Beyond a certain size, it’s not a good use of resources to cram those records into the database. There’s a point where the UI tools don’t allow users to take advantage of the data in the database, so one is just paying extra overhead. Internal database overhead increases with dataset size, which compounds the issue. At that point it’s worth switching to a file-based scheme. One uses files in S3 as the source and sink for one’s floes. A source cube will read from an S3 file, convert to datarecords, and pass those datarecords on. Computational cubes can be reused across all use cases #1-4 since they process datarecords. On the output side, the sink cube will take datarecords and write them to an output S3 file, either as binary records or after conversion to another format.

    Characteristics:

    • Data size: \(10^4 - 10^7\) records

    • I/O pattern: whole file put/get (no random access)

    • Parallel I/O: medium scale (<100 concurrent read streams, single writer), last write wins

    • Read after write consistency: <1s

  4. Shard Collections. This is a parallel version of the file-based scheme (#3), so for massive scale one can take advantage of the collection interface to read and/or write S3 files in parallel. The main workfloes are similar to those for individual files, except that the input can be read by a parallel cube from multiple shards, and/or the output can be written from parallel cubes to multiple shards. A sketch contrasting dataset, file, and shard access follows this list.

    Characteristics:

    • Data size: \(>10^6\) records

    • Mutability: per shard read/write (i.e. an array of S3 files)

    • Parallel I/O: massive scale (many thousands of concurrent shard puts/gets)

    • Read after write consistency: <1s per shard, <10ms for metadata

    See also

    More information about collections can be found in Shard Collections.
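
To make these tradeoffs concrete, here is a minimal sketch (Python) of the read side of the four use cases, assuming the orionclient package. The resource IDs are placeholders, and the exact method names (get_resource, records, list_shards, download_to_file) may differ between orionclient versions.

    from orionclient.session import APISession
    from orionclient.types import Dataset, File, ShardCollection

    # Use cases 1 and 2: per-record access to a dataset stored in the database.
    dataset = APISession.get_resource(Dataset, 123)  # placeholder dataset ID
    for record in dataset.records():
        pass  # process one datarecord at a time

    # Use case 3: whole-file get from S3; parsing into datarecords is up to the floe.
    s3_file = APISession.get_resource(File, 456)  # placeholder file ID
    s3_file.download_to_file("/tmp/input.oedb")

    # Use case 4: per-shard gets from a collection, which parallel cubes can
    # spread across many workers.
    collection = APISession.get_resource(ShardCollection, 789)  # placeholder collection ID
    for shard in collection.list_shards():
        shard.download_to_file("/tmp/shard-%d.oedb" % shard.id)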

Records

In all of the above use cases, the granularity of data being processed is a datarecord. Datarecords are flexible, general purpose data containers that allow arbitrary nesting of fields and records.
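
As a concrete illustration, the following is a minimal sketch of building a datarecord with the datarecord Python package. The field names are purely illustrative, and the child-record call should be treated as an assumption about the API.

    from datarecord import OERecord, OEField, Types

    smiles_field = OEField("smiles", Types.String)
    score_field = OEField("docking_score", Types.Float)

    record = OERecord()
    record.set_value(smiles_field, "c1ccccc1")
    record.set_value(score_field, -7.3)

    # Nesting: attach a child record carrying per-conformer data (assumed API).
    conformer = OERecord()
    conformer.set_value(OEField("conformer_energy", Types.Float), 12.4)
    record.add_child(conformer)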

There is a significant cost to processing complex data records with large amounts of data. Individual field and record sizes should be kept relatively small (under 100MB total per record). If a field has the potential to be large, then its data should be stored as a file or shard in Orion, and the field should store a reference (e.g. orionclient.links.OrionLink). Using the database to store large records (>100MB) is not supported.
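
For example, a large trajectory belongs in Orion file storage rather than on the record itself. The sketch below assumes the orionclient File.upload signature and, for simplicity, stores the bare file ID; in practice the reference would typically be an orionclient.links.OrionLink field.

    from orionclient.session import APISession
    from orionclient.types import File
    from datarecord import OERecord, OEField, Types

    # Upload the bulky data as an Orion file (assumed upload signature).
    big_file = File.upload(APISession, "trajectory.dcd", "/tmp/trajectory.dcd")

    # Keep only a reference on the record; the payload stays out of the database.
    record = OERecord()
    record.set_value(OEField("trajectory_file_id", Types.Int), big_file.id)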

The organization of the fields in a datarecord impacts the types of data processing available in the Orion user interface. Generally, top-level fields in a datarecord can be viewed/filtered in the Analyze page. Child records and nested fields are only accessible using the datarecord API.
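
Since child records cannot be reached from the Analyze page, they have to be walked programmatically. The helper below is a short, hedged sketch of such a traversal; get_children, get_fields, get_name, and get_value are assumed datarecord method names.

    def dump_children(record):
        # Walk nested data that the Analyze page cannot display (assumed API).
        for child in record.get_children():
            for field in child.get_fields():
                print(field.get_name(), child.get_value(field))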

Services

All of these broad use cases share some common characteristics. First, the data processing is generally stream-oriented. That is, the cube/floe API, used for performing computations, works by passing datarecords between connected ports. With some exceptions (e.g. hitlist and other aggregation cubes), data is processed record-at-a-time.
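
To illustrate the record-at-a-time pattern, here is a hedged sketch of a compute cube whose process method receives one datarecord per call and emits it to the downstream port. The class and port names follow the floe API, but the cube itself is hypothetical.

    from floe.api import ComputeCube
    from datarecord import OEField, Types

    class TagRecordsCube(ComputeCube):
        """Hypothetical cube: stamps each record that streams through it."""

        title = "Tag Records"

        def process(self, record, port):
            # One datarecord arrives per call; annotate it and pass it on.
            record.set_value(OEField("status", Types.String), "processed")
            self.success.emit(record)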

Once datarecords are processed, they are stored either in the database as datasets or in files/shards. The database provides an HTTP interface that allows access to and basic queries of datasets; however, it does not provide all possible access methods or queries. File/shard data can be retrieved, but there are no built-in facilities for querying and/or subsetting the file/shard data.

For custom search functionality and specialized data access patterns, separate services in Orion are used to provide access to persistent data with this additional functionality. These are generally designed with APIs that are compatible with the cube/floe API so these data services can act as sources for regular floes.

Example:

  • MAAS/Corporate database collections: These are large (\(10^4 - 10^7\)) sets of molecules that are relatively static, updated on a nightly or weekly basis. They can be a shared resource. They are typically used as input to various workfloes or are subsetted via searching.