Data handling in Orion

Use Cases

There are three main ways that one can manage and process data in Orion: datasets, files, and shard collections. They have tradeoffs related to scalability, parallelizability, and the tools that are available and appropriate to process them.

Dataset-oriented use cases can be split into two categories based on dataset size and usage patterns, so there are broadly four use cases to consider. The exact scale boundaries between them matter less than the general ideas and design tradeoffs; understanding these broad categories helps users design workflows that take maximum advantage of Orion’s scalability.

  1. Small datasets that a medicinal chemist may wish to analyze deeply. This means iterative design, computing new properties on the fly, and complex searches and analysis. In this case, storing the records in the database as datasets and using the various UI analysis/design components are appropriate choices. One uploads data as datasets using the ETL floes and processes the data with floes whose source and sink cubes read from and write to the database.

    Characteristics:

    • Data size: \(10^0 - 10^4\) records

    • I/O pattern: per cell read/write/delete

    • Parallel I/O: small scale (<10 write streams, <100 read streams via replicas)

    • Read after write consistency: <10ms per write

  2. Medium-sized datasets where one wants to do computations and filtering to potentially select more focused sets for further analysis. These might be HTS-type workflows. Here, datasets in the database plus the normal UI components are appropriate, as are the normal ETL floes and dataset-oriented source and sink cubes. Here one needs to be sensitive about scale. On the bigger end of the range, complex records (e.g. many fields on the record, or lots of conformers with lots of their own fields) will take a long time to load and search in the database. Some of the complex UI features and searches may not be available or may be slow. At some point, switching from datasets in the database to S3 files will be a better choice.

    Characteristics:

    • Data size: \(10^4 - 10^6\) records

    • I/O pattern: per cell read/write/delete

    • Parallel I/O: small scale (<10 write streams, <100 read streams via replicas)

    • Read after write consistency: 10ms to 1s per write; more writes imply higher replication lag

  3. S3 Files. Beyond a certain size, it’s not a good use of resources to cram those records into the database. There’s a point where the UI tools don’t allow users to take advantage of the data in the database, so one is just paying extra overhead. Internal database overhead increases with dataset size, which compounds the issue. At that point it’s worth switching to a file-based scheme. One uses files in S3 as the source and sink for one’s floes. A source cube will read from an S3 file, convert to datarecords, and pass those datarecords on. Computational cubes can be reused across all use cases #1-4 since they process datarecords. On the output side, the sink cube will take datarecords and write them to an output S3 file, either as binary records or after conversion to another format.

    Characteristics:

    • Data size: \(10^4 - 10^7\) records

    • I/O pattern: whole file put/get (no random access)

    • Parallel I/O: medium scale (<100 concurrent read streams, single writer), last write wins

    • Read after write consistency: <1s

  4. Shard Collections. This is a parallel version of the file-based scheme (#3), so for massive scale one can take advantage of the collection interface to read and/or write S3 files in parallel. The main workfloes are similar to those for individual files, except that the input can be read by a parallel cube from multiple shards, and/or the output can be written from parallel cubes to multiple shards. A sketch contrasting dataset, file, and shard access follows this list.

    Characteristics:

    • Data size: \(>10^6\) records

    • Mutability: per shard read/write (i.e. an array of S3 files)

    • Parallel I/O: massive scale (many thousands of concurrent shard puts/gets)

    • Read after write consistency: <1s per shard, <10ms for metadata

    See also

    More information about collections can be found in Shard Collections.
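
To make these tradeoffs concrete, here is a minimal sketch (Python) of the read side of the four use cases, assuming the orionclient package. The resource IDs are placeholders, and the exact method names (get_resource, records, list_shards, download_to_file) may differ between orionclient versions.

    from orionclient.session import APISession
    from orionclient.types import Dataset, File, ShardCollection

    # Use cases 1 and 2: per-record access to a dataset stored in the database.
    dataset = APISession.get_resource(Dataset, 123)  # placeholder dataset ID
    for record in dataset.records():
        pass  # process one datarecord at a time

    # Use case 3: whole-file get from S3; parsing into datarecords is up to the floe.
    s3_file = APISession.get_resource(File, 456)  # placeholder file ID
    s3_file.download_to_file("/tmp/input.oedb")

    # Use case 4: per-shard gets from a collection, which parallel cubes can
    # spread across many workers.
    collection = APISession.get_resource(ShardCollection, 789)  # placeholder collection ID
    for shard in collection.list_shards():
        shard.download_to_file("/tmp/shard-%d.oedb" % shard.id)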

Records

In all of the above use cases, the granularity of data being processed is a datarecord. Datarecords are flexible, general purpose data containers that allow arbitrary nesting of fields and records.
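
As a concrete illustration, the following is a minimal sketch of building a datarecord with the datarecord Python package. The field names are purely illustrative, and the child-record call should be treated as an assumption about the API.

    from datarecord import OERecord, OEField, Types

    smiles_field = OEField("smiles", Types.String)
    score_field = OEField("docking_score", Types.Float)

    record = OERecord()
    record.set_value(smiles_field, "c1ccccc1")
    record.set_value(score_field, -7.3)

    # Nesting: attach a child record carrying per-conformer data (assumed API).
    conformer = OERecord()
    conformer.set_value(OEField("conformer_energy", Types.Float), 12.4)
    record.add_child(conformer)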

There is a significant cost to processing complex data records with large amounts of data. Individual field and record sizes should be kept relatively small (under 100MB total per record). If a field has the potential to be large, then its data should be stored as a file or shard in Orion, and the field should store a reference (e.g. orionclient.links.OrionLink). Using the database to store large records (>100MB) is not supported.
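
For example, a large trajectory belongs in Orion file storage rather than on the record itself. The sketch below assumes the orionclient File.upload signature and, for simplicity, stores the bare file ID; in practice the reference would typically be an orionclient.links.OrionLink field.

    from orionclient.session import APISession
    from orionclient.types import File
    from datarecord import OERecord, OEField, Types

    # Upload the bulky data as an Orion file (assumed upload signature).
    big_file = File.upload(APISession, "trajectory.dcd", "/tmp/trajectory.dcd")

    # Keep only a reference on the record; the payload stays out of the database.
    record = OERecord()
    record.set_value(OEField("trajectory_file_id", Types.Int), big_file.id)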

The organization of the fields in a datarecord impacts the types of data processing available in the Orion user interface. Generally, top-level fields in a datarecord can be viewed/filtered in the Analyze page. Child records and nested fields are only accessible using the datarecord API.
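
Since child records cannot be reached from the Analyze page, they have to be walked programmatically. The helper below is a short, hedged sketch of such a traversal; get_children, get_fields, get_name, and get_value are assumed datarecord method names.

    def dump_children(record):
        # Walk nested data that the Analyze page cannot display (assumed API).
        for child in record.get_children():
            for field in child.get_fields():
                print(field.get_name(), child.get_value(field))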

Services

All of these broad use cases share some common characteristics. First, the data processing is generally stream-oriented. That is, the cube/floe API, used for performing computations, works by passing datarecords between connected ports. With some exceptions (e.g. hitlist and other aggregation cubes), data is processed record-at-a-time.
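
To illustrate the record-at-a-time pattern, here is a hedged sketch of a compute cube whose process method receives one datarecord per call and emits it to the downstream port. The class and port names follow the floe API, but the cube itself is hypothetical.

    from floe.api import ComputeCube
    from datarecord import OEField, Types

    class TagRecordsCube(ComputeCube):
        """Hypothetical cube: stamps each record that streams through it."""

        title = "Tag Records"

        def process(self, record, port):
            # One datarecord arrives per call; annotate it and pass it on.
            record.set_value(OEField("status", Types.String), "processed")
            self.success.emit(record)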

Once datarecords are processed, they are stored either in the database as datasets or in files/shards. The database provides an HTTP interface that allows access to and basic queries of datasets; however, it does not provide all possible access methods or queries. File/shard data can be retrieved, but there are no built-in facilities for querying and/or subsetting the file/shard data.

For custom search functionality and specialized data access patterns, separate services in Orion are used to provide access to persistent data with this additional functionality. These are generally designed with APIs that are compatible with the cube/floe API so these data services can act as sources for regular floes.

Example:

  • MAAS/Corporate database collections: These are large (\(10^4 - 10^7\)) sets of molecules that are relatively static, updated on a nightly or weekly basis. They can be a shared resource. They are typically used as input to various workfloes or are subsetted via searching.