Data handling in Orion

Use Cases

There are three main ways that one can manage and process data in Orion: datasets, files, and shard collections. They have tradeoffs related to scalability, parallelizability, and the tools that are available and appropriate to process them.

Dataset-oriented use cases can be split into two categories based on dataset size and usage patterns, so together with files and shard collections there are broadly four use cases to consider. The exact scale boundaries between these use cases are less important than the general ideas and design tradeoffs behind them. Understanding these broad categories can help users design workflows that take maximum advantage of Orion’s scalability.

  1. Small datasets that a medicinal chemist may wish to analyze deeply. This means iterative design, computing new properties on the fly, complex searches, and analysis. In this case, storing the records in the database as datasets and using the various UI analysis/design components are appropriate choices. One uploads data as datasets using the ETL floes and processes the data with floes where the source and sink cubes read/write from the database.

    Characteristics:

    • Data size: \(10^0 - 10^4\) records.

    • I/O pattern: per cell read/write/delete.

    • Parallel I/O: small scale (<10 write streams, <100 read streams via replicas).

    • Read after write consistency: <10ms per write.

  2. Medium-sized datasets where one wants to do computations and filtering to potentially select more focused sets for further analysis. These might be HTS-type workflows. Here, datasets in the database plus normal UI components are appropriate, as are the normal ETL floes and dataset-oriented source and sink cubes. One needs to be sensitive to scale, however. At the bigger end of the range, complex records (for example, many fields on the record or lots of conformers with lots of their own fields) will take a long time to load and search in the database. Some of the complex UI features and searches may not be available or may be slow. At some point, switching from datasets in the database to S3 files will be a better choice.

    Characteristics:

    • Data size: \(10^4 - 10^6\) records.

    • I/O pattern: per cell read/write/delete.

    • Parallel I/O: small scale (<10 write streams, <100 read streams via replicas).

    • Read after write consistency: 10ms to 1s per write; more writes imply higher replication lag.

  3. S3 Files. Beyond a certain size, it’s not a good use of resources to cram those records into the database. There’s a point where the UI tools don’t allow users to take advantage of the data in the database, so one is just paying extra overhead. Internal database overhead increases with dataset size, and that compounds the issue. At that point, it’s worth switching to a file-based scheme. One uses files in S3 as the source and sink for one’s floes. A source cube will read from an S3 file, convert to datarecords, and pass those datarecords on. Computational cubes can be reused across all use cases #1-4 since they process datarecords. On the output side the sink cube will take datarecords and write them to an output S3 file, either as binary records or after conversion to another format.

    Characteristics:

    • Data size: \(10^4 - 10^7\) records.

    • I/O pattern: whole file put/get (no random access).

    • Parallel I/O: medium scale (<100 concurrent read streams, single writer), last write wins.

    • Read after write consistency: <1s.

  4. Shard Collections. This is a parallel version of the file-based scheme (#3), so for massive scale one can take advantage of the collection interface to read and/or write S3 files in parallel. The main workfloes are similar to individual files, except that the input can be read by a parallel cube from multiple shards, and/or the output can be written from parallel cubes to multiple shards.

    Characteristics:

    • Data size: \(>10^6\) records.

    • Mutability: per shard read/write (i.e. an array of S3 files).

    • Parallel I/O: massive scale (many thousands of concurrent shard puts/gets).

    • Read after write consistency: <1s per shard, <10ms for metadata.

    See also

    More information about collections can be found in Shard Collections.
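The four use cases above can be summarized as a rough decision helper. This is only a sketch: the function name `suggest_storage` is hypothetical, and the cutoffs simply take the upper end of each "Data size" range listed above. The ranges in the text overlap, and the scale boundaries matter less than the tradeoffs, so the result should be read as a starting point rather than a rule.

```python
def suggest_storage(num_records: int) -> str:
    """Suggest a storage scheme from an approximate record count.

    Cutoffs are illustrative assumptions taken from the upper bounds of the
    "Data size" ranges in the text; they are not limits enforced by Orion.
    """
    if num_records < 0:
        raise ValueError("num_records must be non-negative")
    if num_records <= 10**4:
        return "small dataset"      # use case 1: database + UI analysis
    if num_records <= 10**6:
        return "medium dataset"     # use case 2: database, watch record complexity
    if num_records <= 10**7:
        return "S3 file"            # use case 3: whole-file put/get
    return "shard collection"       # use case 4: parallel shard I/O


if __name__ == "__main__":
    for n in (500, 250_000, 5_000_000, 2_000_000_000):
        print(n, "->", suggest_storage(n))
```

Record complexity matters as much as record count: a medium-sized dataset with many fields or conformers per record may already be better served by files.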

Records

In all of the above use cases, the granularity of data being processed is a datarecord. Datarecords are flexible, general purpose data containers that allow arbitrary nesting of fields and records.

There is a significant cost to processing complex datarecords with large amounts of data. Individual field and record sizes should be kept relatively small (at most 100MB total per record). If a field has the potential to be large, then its data should be stored as a file or shard in Orion, and the field should store a reference to it (e.g. orionclient.links.OrionLink). Using the database to store large records (>100MB) is not supported.
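The size guidance above can be illustrated with a small check that decides which fields to move out of a record. This is a minimal sketch under stated assumptions: the record is modeled as a plain nested dict rather than through the actual datarecord API, the helper names `payload_size` and `fields_to_externalize` are hypothetical, sizes are rough estimates, and 100MB is read as decimal megabytes.

```python
RECORD_BUDGET_BYTES = 100 * 10**6  # 100MB, read as decimal megabytes (assumption)


def payload_size(value) -> int:
    """Rough payload size in bytes of a field value (illustrative estimate)."""
    if isinstance(value, (bytes, bytearray)):
        return len(value)
    if isinstance(value, str):
        return len(value.encode("utf-8"))
    if isinstance(value, dict):  # nested record
        return sum(payload_size(v) for v in value.values())
    if isinstance(value, (list, tuple)):
        return sum(payload_size(v) for v in value)
    return 8  # flat guess for numbers, booleans, and None


def fields_to_externalize(record: dict, budget: int = RECORD_BUDGET_BYTES) -> list:
    """Return top-level field names to store as files/shards, largest first,
    until the estimated size of what remains fits within the budget."""
    sizes = {name: payload_size(value) for name, value in record.items()}
    total = sum(sizes.values())
    moved = []
    for name in sorted(sizes, key=sizes.get, reverse=True):
        if total <= budget:
            break
        moved.append(name)
        total -= sizes[name]
    return moved


if __name__ == "__main__":
    record = {"title": "abc", "grid": b"x" * 1000, "score": 1.5}
    # A tiny budget is used here only to make the behavior visible.
    print(fields_to_externalize(record, budget=100))
```

Each field returned by the check would then be written to a file or shard, and the record would keep only a reference to it.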

The organization of the fields in a datarecord affects the types of data processing available in the Orion user interface. Generally, top-level fields in a datarecord can be viewed and filtered in the Analyze page. Child records and nested fields are only accessible using the datarecord API.

Services

All of these broad use cases share some common characteristics. First, the data processing is generally stream-oriented. That is, the cube/floe API, used for performing computations, works by passing datarecords between connected ports. With some exceptions (e.g. hitlist and other aggregation cubes), data is processed one record at a time.
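The difference between record-at-a-time processing and aggregation can be sketched with plain Python generators. This is an analogy only and does not use the cube/floe API: `source`, `flag_passing`, and `hitlist` are hypothetical stand-ins for a source cube, a computational cube, and an aggregation cube, and records are modeled as dicts.

```python
import heapq
from typing import Iterable, Iterator, List


def source(scores: Iterable[float]) -> Iterator[dict]:
    """Stand-in for a source cube: emits one record at a time."""
    for index, score in enumerate(scores):
        yield {"id": index, "score": score}


def flag_passing(records: Iterable[dict], threshold: float) -> Iterator[dict]:
    """Stand-in for a computational cube: sees only the current record."""
    for record in records:
        updated = dict(record)  # copy so the upstream record is not mutated
        updated["passes"] = updated["score"] >= threshold
        yield updated


def hitlist(records: Iterable[dict], size: int) -> List[dict]:
    """Stand-in for an aggregation cube: must consume the whole stream
    before it can emit the top-scoring records."""
    return heapq.nlargest(size, records, key=lambda r: r["score"])


if __name__ == "__main__":
    stream = flag_passing(source([0.2, 0.9, 0.5, 0.7]), threshold=0.6)
    for record in hitlist(stream, size=2):
        print(record)
```

The first two stages hold a single record in memory at any time, while the hitlist stage keeps state across records, which is why aggregation is the exception to stream-oriented processing.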

Once datarecords are processed, they are stored either in the database as datasets or in files/shards. The database provides an HTTP interface to allow access to and basic queries of datasets; however, it does not provide all possible access methods or queries. File/shard data can be retrieved, but there are no built-in facilities for querying and/or subsetting the file/shard data.

For custom search functionality and specialized data access patterns, separate services in Orion are used to provide access to persistent data with this additional functionality. These are generally designed with APIs that are compatible with the cube/floe API so these data services can act as sources for regular floes.

Example:

  • MAAS/Corporate database collections: These are large (\(10^4 - 10^7\)) sets of molecules that are relatively static—updated nightly or weekly. They can be a shared resource. They are typically used as input to various workfloes or are subsetted via searching.

Data Organization

The data organization system in Orion implements familiar concepts of paths and folders to organize and access data. Users can catalogue the data according to the way they would like to share data among team members in a project or organization. Orion defines several filesystems that control access to the data and sharing capabilities. Contingent on user role, data can be freely moved between these filesystems.

My Data

Data in this filesystem is private and accessible only by the owner. If not specified otherwise, all resources uploaded to Orion are saved in My Data. This is also the default location for WorkFloes outputs. This filesystem can be accessed with a path formatted like:

/project/<project-id>/user/<username>/<path>

Shared Spaces

In this location, users can create Workspaces. Data in a Workspace is shared between project members selected by the user. Users can create multiple Workspaces, and each Workspace can have different members with different access privileges (read/write/delete). Data moved from other filesystems (e.g. My Data) to a Workspace inherits sharing capabilities set by this Workspace. This filesystem can be accessed with a path formatted like:

/project/<project-id>/workspace/<workspace-name>/<path>

Team Data

Data created in or moved to this filesystem is available to all project members with read, write, and delete privileges. This filesystem can be accessed with a path formatted like:

/project/<project-id>/team/<path>

Organization Data

Data in this filesystem is shared between all members of the Organization. Only stack- and org-admins have read/write privileges to this filesystem. All other users have read-only access. This filesystem can be accessed with a path formatted like:

/organization

Note

/organization is an alias for the organization-wide filesystem. It can alternatively be accessed with the path /project/<org-project-id>/team/<path>.

Trash

Data or folders deleted from any filesystem land in the Trash filesystem. Deleted items are held there for 30 days, after which they are deleted permanently. Like My Data, Trash is private to each user; therefore, data deleted from Organization Data, Team Data, or Workspaces is no longer visible to the users with whom it was shared. Data can be undeleted by moving it back to any other filesystem. Note that the original location of a resource is not stored when it is deleted, so it is up to the user to choose where to restore the data. Datasets, Files, and Collections in Trash can be deleted permanently before the 30-day period has elapsed. Folders cannot be deleted from Trash by the user; they are cleaned up automatically once they no longer contain any data. This filesystem can be accessed with a path formatted like:

/project/<project-id>/trash/<username>/<path>

See also

Orion CLI Folders section
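The path formats listed for My Data, Shared Spaces, Team Data, and Trash can be assembled with a small helper. This is a sketch for illustration: `PATH_TEMPLATES` and `build_path` are hypothetical names and not part of any Orion client library, and the project id, username, and workspace name in the example are made-up values.

```python
# Templates copied from the path formats given in the text.
PATH_TEMPLATES = {
    "my_data": "/project/{project_id}/user/{username}/{path}",
    "workspace": "/project/{project_id}/workspace/{workspace_name}/{path}",
    "team": "/project/{project_id}/team/{path}",
    "trash": "/project/{project_id}/trash/{username}/{path}",
}


def build_path(filesystem: str, path: str = "", **parts) -> str:
    """Fill in the template for the given filesystem.

    Raises KeyError if the filesystem is unknown or a required part
    (for example username) is missing.
    """
    template = PATH_TEMPLATES[filesystem]
    return template.format(path=path.lstrip("/"), **parts)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_path("my_data", "ligands/set1", project_id=12, username="alice"))
    print(build_path("workspace", "hits", project_id=12, workspace_name="kinase"))
    print(build_path("team", "shared/report", project_id=12))
```

Keeping the templates in one table makes the differences between the filesystems easy to see: only My Data and Trash are scoped by username, and only Shared Spaces is scoped by workspace name.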

Tags

Tags can be applied to resources such as files, collections, datasets, and WorkFloes. Tags allow additional filtering of resources in the Orion UI and the Orion CLI. Each resource can carry multiple tags. Orion additionally provides system tags, which are a limited set of fixed metadata. For example, when a dataset is finalized, system tags indicate that molecules with certain features are present in the dataset. Possible system tags are:

  • 2D: contains at least one molecule with 2D coordinates

  • 3D: contains at least one molecule with 3D coordinates

  • multiconformer: contains at least one molecule with multiple conformers

  • receptor: contains at least one molecule or conformer where OEIsReceptor() is True

Similarly, WorkFloes also have a set of system tags:

  • analyze: the floe can be used for analyze operations in the user interface

  • ETL: the floe can be used for ETL operations in the user interface

  • export: the floe can be used for export operations in the user interface

Additionally, any Dataset, File, or Collection that is created as an output of a WorkFloe is tagged with a Job <job-id> tag.
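The "contains at least one" rule behind the dataset system tags can be sketched as follows. This is an illustration under stated assumptions, not how Orion computes the tags: `MolSummary` and `dataset_system_tags` are hypothetical names, and each molecule is reduced to a small summary whose `is_receptor` flag stands in for the result of OEIsReceptor().

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class MolSummary:
    """Hypothetical per-molecule summary used only for this sketch."""
    dimension: int        # 2 or 3, depending on the coordinates present
    num_conformers: int   # number of conformers on the molecule
    is_receptor: bool     # stand-in for the OEIsReceptor() result


def dataset_system_tags(mols: Iterable[MolSummary]) -> Set[str]:
    """Return the system tags implied by the molecules; each tag applies
    as soon as at least one molecule has the corresponding feature."""
    mols = list(mols)  # allow several passes over a generator
    tags = set()
    if any(m.dimension == 2 for m in mols):
        tags.add("2D")
    if any(m.dimension == 3 for m in mols):
        tags.add("3D")
    if any(m.num_conformers > 1 for m in mols):
        tags.add("multiconformer")
    if any(m.is_receptor for m in mols):
        tags.add("receptor")
    return tags


if __name__ == "__main__":
    mols = [MolSummary(2, 1, False), MolSummary(3, 5, False)]
    print(sorted(dataset_system_tags(mols)))
```

Because every tag is an "at least one" condition, a single dataset can carry both the 2D and 3D tags at the same time.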