Data Storage: Datasets, Files, and Collections

Orion has three primary mechanisms for storing data: files, datasets, and collections.

Files

Files are the most basic form of data storage.

  • Files with any OEChem-readable format (including SMI, OEDU, and PDB) can be converted to datasets by ETL floes or by direct upload. This page is another resource that includes a more programmatic approach.

    • Files of any type or format can be uploaded to Orion, but files that are not OEChem-readable are stored only as raw data and are not read or interpreted.

  • Files cannot be made active, and thus cannot be examined in the 3D & Analyze page.

  • Files can be passed to floes that accept them, but most floes do not take files directly.
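
The distinction between OEChem-readable files and raw files can be sketched as a simple pre-upload check. This is a minimal illustration, not Orion's actual logic: the function name is hypothetical, and the suffix set contains only the three formats named above (OEChem reads more formats, and compressed variants are not handled here).

```python
from pathlib import Path

# Illustrative subset only: the three formats named in the text above.
# This is NOT the full list of OEChem-readable formats.
EXAMPLE_READABLE_SUFFIXES = {".smi", ".oedu", ".pdb"}


def upload_outcome(filename: str) -> str:
    """Hypothetical helper: predict how an uploaded file could be used."""
    suffix = Path(filename).suffix.lower()
    if suffix in EXAMPLE_READABLE_SUFFIXES:
        # Readable formats can be converted to a dataset (ETL floe or direct upload).
        return "convertible to dataset"
    # Anything else is kept as raw data and is not read or interpreted.
    return "raw file only"
```

For example, `upload_outcome("ligands.smi")` falls in the first branch, while `upload_outcome("notes.txt")` falls in the second.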

Datasets

Datasets are the most interactive form of data storage in the Orion UI and have the following properties.

  • Molecules and data in datasets can be viewed and plotted in the 3D & Analyze page.

  • Datasets have named columns, and all data in a given column is always of the same type, such as float, string, or molecule.

  • No formal limit exists on the size of datasets in Orion, but:

    • datasets with over 100,000 records are not recommended for use in the 3D & Analyze page.

    • datasets with a very large number of records can be sent to floes, but they are less efficient than collections for processing data in parallel.

    • datasets with over one million records are not recommended and can cause problems in floes. For data at that scale, use a collection instead.
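
The size guidance above can be read as a rough decision rule. The sketch below encodes only the two thresholds stated in this section. The function and constant names are hypothetical and are not part of any Orion API.

```python
# Thresholds taken from the guidance above; all names here are hypothetical.
ANALYZE_PAGE_MAX_RECORDS = 100_000
DATASET_MAX_RECORDS = 1_000_000


def suggest_storage(record_count: int, interactive: bool = False):
    """Return (storage_kind, notes) for a given number of records.

    interactive=True means the data must be viewable in the 3D & Analyze
    page, which only datasets support.
    """
    if record_count < 0:
        raise ValueError("record_count must be non-negative")

    notes = []
    if record_count > DATASET_MAX_RECORDS and not interactive:
        # Beyond one million records, datasets can cause problems in floes.
        return "collection", notes

    if interactive and record_count > ANALYZE_PAGE_MAX_RECORDS:
        notes.append("over 100,000 records: not recommended for the 3D & Analyze page")
    if record_count > DATASET_MAX_RECORDS:
        notes.append("over 1,000,000 records: may cause problems in floes")
    return "dataset", notes
```

A call such as `suggest_storage(2_000_000)` suggests a collection, while `suggest_storage(200_000, interactive=True)` keeps the dataset but attaches a note about the 3D & Analyze page.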

Collections

Collections are specialized storage mechanisms designed to maximize I/O performance in floes like Gigadock.

  • Collections are groups of files in a logical container.

  • Collections can hold the same data as datasets but at a much larger scale: billions of records or molecules.

  • Collections cannot be made active and thus cannot be used in the 3D & Analyze page.

  • Most floes read and write datasets. Collections are used for larger scale calculations such as those in Gigadock and some cryptic pocket detection floes.

  • Collections are not currently strongly typed, so take care to pass the correct collection to any floe that takes one.

  • Gigadock collections that are curated and provided by OpenEye can be found in the Organization Data / OpenEye Data / Gigadocking Collections folder.

  • The most recently updated collections, as well as the Database Curation Policy, can be found here.