Datasets vs. Files vs. Collections

Orion has three primary mechanisms for storing data: datasets, files, and collections. Each is described below.

Datasets

Datasets are the most interactive form of data storage in the Orion UI and have the following properties.

  • Data in datasets can be viewed and plotted in Orion’s Analyze page.

  • Molecules in datasets are viewable in Orion’s 3D page.

  • Datasets have named columns, and all data in a given column is always of the same type (e.g., Float, String, Molecule).

  • No formal limit exists on the size of datasets in Orion, but:

    • datasets with over 100,000 rows cannot be made active, and thus can’t be used in Orion’s Analyze or 3D page. They can still be passed to floes that accept datasets.

    • datasets with over 1 million records are not recommended and can cause problems in floes.
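The size guidance above can be sketched as a small check. This is only an illustration: the function name, constants, and return strings are hypothetical and are not part of any Orion API; the two thresholds are the ones stated in the list above.

```python
# Thresholds taken from the text above; all names here are hypothetical.
ACTIVE_ROW_LIMIT = 100_000          # above this, a dataset cannot be made active
RECOMMENDED_ROW_LIMIT = 1_000_000   # above this, datasets are not recommended


def dataset_size_advice(row_count: int) -> str:
    """Return a short note on how a dataset of this size can be used."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if row_count > RECOMMENDED_ROW_LIMIT:
        return "not recommended: may cause problems in floes"
    if row_count > ACTIVE_ROW_LIMIT:
        return "floes only: cannot be made active for Analyze or 3D"
    return "ok: can be made active"


if __name__ == "__main__":
    for rows in (50_000, 250_000, 5_000_000):
        print(rows, "->", dataset_size_advice(rows))
```

Both limits are treated as strict ("over"), so a dataset with exactly 100,000 rows is still reported as one that can be made active.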

Files

Files are the most basic form of data storage in Orion.

  • Files in some known formats (e.g., molecular file formats) can be converted to datasets by Orion ETL floes.

    • Other than this, Orion generally stores files only as raw data and doesn’t read or interpret them.

  • Files can be of any type/format, just like any file on your local hard disk.

  • Files can’t be made active in Orion, and thus can’t be used in Orion’s Analyze or 3D page.

  • Files can be passed to floes that accept them, but most floes do not take files directly.

Collections

Collections are a specialized storage mechanism designed to maximize I/O performance in floes like Gigadock.

  • Collections are groups of files in a logical container.

  • Collections can hold the same data as datasets but at a much larger scale.

    • Collections can hold billions of records/molecules.

  • Collections cannot be made active and thus can’t be used in Orion’s Analyze or 3D page.

  • Generally only specialized floes like Gigadock read and/or write collections.

  • Collections are currently not strongly typed, so care must be taken to pass the correct collection to a floe that takes a collection.
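As a compact recap, the properties described in the three sections above can be laid out as a lookup table. This is a plain-Python sketch for comparison only: the class, field names, and helper are hypothetical, and nothing here calls or mirrors an Orion API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageTraits:
    can_be_active: bool   # usable in Orion's Analyze and 3D pages
    typed_columns: bool   # named columns with one type per column
    note: str             # short summary drawn from the text above


# Hypothetical summary table; values restate the descriptions above.
STORAGE = {
    "dataset": StorageTraits(True, True,
                             "interactive; cannot be made active over 100,000 rows"),
    "file": StorageTraits(False, False,
                          "raw data of any format; ETL floes can convert known formats"),
    "collection": StorageTraits(False, False,
                                "groups of files; billions of records; specialized floes"),
}


def usable_in_ui(kind: str) -> bool:
    """True if this storage kind can be made active for Analyze or 3D."""
    return STORAGE[kind].can_be_active


if __name__ == "__main__":
    for kind, traits in STORAGE.items():
        print(f"{kind}: active={traits.can_be_active}, {traits.note}")
```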