Orion Integration

Floe was designed and implemented to be a core part of Orion. This documentation describes the integration between the Floe Python API and Orion.

Development Lifecycle

In general, the process of developing a cube is as follows:

  1. The cube is written in Python on a developer’s machine.

  2. Unit tests are written to test the cube’s logic.

  3. The cube source code (along with any dependencies that cannot be installed via pip) is archived into a tar.gz file.

  4. The archive is uploaded to Orion.

WorkFloes may also be developed in a similar fashion.
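For example, steps 1 and 2 above might look like the following minimal sketch, in which the cube's logic lives in a plain function so it can be unit tested without Orion. The cube and function names are illustrative, and the exact process()/port API may vary between Floe versions.

from floe.api import ComputeCube


def double(value):
    # Pure logic kept separate from the cube so it is trivial to unit test.
    return value * 2


class DoublingCube(ComputeCube):
    """Hypothetical cube that doubles each incoming item."""

    def process(self, data, port):
        # Emit the transformed item on the cube's success port.
        self.success.emit(double(data))


# tests/test_doubling.py -- a plain pytest-style unit test of the cube's logic
def test_double():
    assert double(3) == 6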

Packaging

A valid Floe package is an archive (tar.gz) containing Python and JSON files. Both cubes and WorkFloes may be included in a package. Other file types, such as executable binaries or auxiliary data, may also be included.

Note

Be sure that any cubes in a package are importable from the package’s top level directory.

Below is an example directory structure for a Floe package archive containing WorkFloes and cubes:

project/
    cubes/
        some_cubes.py
        some_other_cubes.py
    floes/
        myfloe.py
        myotherfloe.py
    requirements.txt
    manifest.json

This example package contains a requirements.txt file to specify Python dependencies, as well as a manifest.json file which is discussed in the following section.
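A minimal requirements.txt for such a package might look like the following (package names and version pins are purely illustrative):

numpy==1.19.5
requests>=2.25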

Note

macOS users must set the COPYFILE_DISABLE environment variable to 1 before creating an archive with tar. Otherwise, invalid Python files with .py extensions will be included in the archive (due to the AppleDouble format), leading to package inspection errors.
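For example, on macOS the archive for the layout shown above might be created as follows (the archive name is illustrative):

$ export COPYFILE_DISABLE=1
$ tar -czf floe_package.tar.gz project/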

Note

Any files named setup.py are assumed to be for packaging purposes and will not be searched for cubes.

Manifest

Every Floe package must contain a manifest.json file containing information that describes the package.

Example manifest.json:

{
  "requirements": "requirements.txt",
  "name": "PSI4",
  "conda_dependencies": ["psi4=0.3.533", "numpy"],
  "conda_channels": ["psi4"],
  "version": "0.1.1",
  "base_os": "amazonlinux2",
  "python_version": "3.7.3",
  "build_requirements": {"memory": 4096},
  "documentation_root": "docs"
}
  • requirements must contain a relative path (from manifest.json) to a valid pip requirements file

  • name must contain a string name of the package

  • version must contain a string version for the package

  • python_version must contain the version of the desired Python interpreter (Python 3 only)

  • conda_dependencies if provided, must be a list of packages installable from conda

  • conda_channels if provided, must be a list of valid conda channels

  • build_requirements if provided, must be a dictionary of instance prerequisites for building the package environment; it may contain the following keys:

      ◦ gpu if provided, must be >=1 (GPU required for package installation) or 0 (GPU not required for package installation)

      ◦ memory if provided, must be an integer (MiB)

      ◦ disk_space if provided, must be an integer (MiB)

      ◦ instance_type if provided, must be a string of the form described at Instance Type

      ◦ instance_tags (currently unused) if provided, must be a list of strings associated with Orion instance groups

  • documentation_root if provided, must be the relative path to a directory containing at minimum an index.html file

  • base_os specifies which operating system to build the package environment from.

Note

Do not add unnecessary dependencies to your package. Each dependency added increases the size of the environment created in Orion, which also increases the start-up cost for WorkFloes.

Complex dependencies can cause conda to use large amounts of memory and time when building a package. While conda has improved, it may be necessary to increase the memory in build_requirements. Pinning each package to a specific channel (<CHANNEL>::<PACKAGE>) helps conda use fewer resources and also helps ensure it does not accidentally pick the wrong package source.
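For example, channel-pinned conda dependencies in a manifest might look like the following fragment (the packages, versions, and channels are illustrative):

"conda_dependencies": ["psi4::psi4=0.3.533", "conda-forge::numpy"],
"conda_channels": ["psi4", "conda-forge"]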

Package OS & GCC Versions

Note

Orion 2021.1 will update the previous default choice for base_os from ubuntu-14.04 to ubuntu-18.04. Python dependencies which require compilation may be impacted, as the corresponding GCC version installed within the environment has also changed.

The choices for a manifest’s base_os, the corresponding GCC version, and the supported Orion releases are shown in the table below.

Operating System    GCC Version    Orion Releases
ubuntu-14.04        4.8            2018.4+
ubuntu-18.04        7.4            2020.3+
ubuntu-20.04        9.3            2020.3+
amazonlinux1        4.8            2020.1+
amazonlinux2        7.3            2020.1+

NVIDIA Driver Versions

A recent version of the NVIDIA driver is provided on GPU instances. The desired version of CUDA should be installed through conda.

Orion Release    NVIDIA Driver Release
2019.5.*         384.81
2020.1.*         440.44
2020.2.*         440.44
2020.3.*         450.51.06

Warning

Orion 2020.1 changed how GPU packages are built. First, most GPU packages no longer need a gpu build requirement. Second, the cuda9 instance tag was removed from scaling groups, a change that effectively breaks uploads of older GPU packages. The cuda9 instance tag must be removed from the build requirements. Otherwise, the package upload will fail with the message, “Error processing package: unsatisfiable requirements”. Instead of an instance tag, CUDA should be installed as a conda dependency.

Linting & Detection

Once a package is created with a valid manifest.json and requirements.txt, it is almost ready to upload. Upon upload, Orion first constructs a virtual environment containing all of the specified dependencies, which can take several minutes. After building the environment, Orion attempts to discover any cubes and WorkFloes contained in your package. If any errors are encountered while inspecting the package, the inspection will fail. You can find these errors ahead of time by running the same detection code that Orion uses.

Example of running cube & WorkFloe detection:

$ cd projectroot
$ floe detect . --out=result.json

The output of the above command, result.json, will contain JSON representing the cubes & Floes discovered, as well as any errors encountered while doing so.

The Floe detection code can also print a simplified output, which contains any encountered errors:

$ cd projectroot
$ floe detect -s .

Before uploading, it is recommended that you check your cubes & Floes for minor issues. Floe comes with a built-in linting tool, floe lint, which can be used to check many different properties of your cubes and WorkFloes.

$ floe lint --help
usage: floe lint [-h] [-l LEVEL] [-i IGNORE [IGNORE ...]] [-v] [-e] [-j JSON] N [N ...]
Floe Lint

positional arguments:
  N                     List of paths to lint

optional arguments:
  -h, --help            show this help message and exit
  -l LEVEL, --level LEVEL
                        The minimum linting level to report, default 0. Levels 0-4 are the least consequential. Levels >=5 indicate issues
                        that either cause subtle errors or prevent a floe from running at all.
  -i IGNORE [IGNORE ...], --ignore IGNORE [IGNORE ...]
                        List of file patterns to ignore
  -v, --verbose
  -e, --show-errors
  -j JSON, --json JSON  Writes output json to path
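For example, to lint the current package and report only the more consequential issues (level 5 and above):

$ cd projectroot
$ floe lint -l 5 .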

Hiding Cubes & WorkFloes

New in version 0.2.181.

Cube & WorkFloe visibility is controlled by a naming convention. Any cube or WorkFloe whose name starts with an underscore will be skipped during detection. A common use case for this feature is a custom base class.

from floe.api import ComputeCube


class _HiddenBaseCube(ComputeCube):
    """This Cube will not be visible when uploaded to Orion"""
    pass


class MyCustomCube(_HiddenBaseCube):
    """This Cube will be visible"""
    pass

Hardware Requirements

A WorkFloe in Orion can specify hardware requirements and placement strategies on a per-cube basis. The Orion scheduler uses this hardware information and placement strategy to place cubes onto appropriate machines while attempting to minimize cost and startup time. Hardware requirements and placement strategies are exposed via the parameters listed below. For each parameter listed, the requirements pertain to the cube itself if it is serial, or to each copy of the cube if it is parallel.

In order to help maintain fairness in resource scheduling and cost accounting, Orion adjusts the requested amount of hardware according to a Dominant Resource Factor.

Example hardware requirement parameter overrides:

from floe.api import ComputeCube

class ExampleCube(ComputeCube):

    parameter_overrides = {
        "gpu_count": {"default": 1},
        "spot_policy": {"default": "Allowed"},
    }

CPUs

The cpu_count parameter specifies how many CPUs are made available to the cube. One potential use case for allocating more than one CPU per cube is to allow cube authors to call out to an existing binary which has multi-CPU parallelism. CPUs provided to cubes are confined using the Linux cpuset facility.
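For example, a cube that shells out to a multi-threaded binary might request several CPUs using the parameter_overrides pattern shown earlier (the count is illustrative):

from floe.api import ComputeCube


class MultiThreadedToolCube(ComputeCube):

    # Request 4 CPUs so an external multi-threaded binary can use them.
    parameter_overrides = {
        "cpu_count": {"default": 4},
    }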

GPUs

The gpu_count parameter specifies how many GPUs are made available to the cube. The allocated GPUs will be available as /dev/nvidia[n] for each of the n GPUs provided. Additionally, /dev/nvidiactl, /dev/nvidia-uvm, and /dev/nvidia-uvm-tools are available to the cube.

If a cube’s software requires a GPU during installation, then gpu must be set to 1 or greater under build_requirements of the package Manifest.

Memory

The memory_mb parameter specifies the amount of memory in mebibytes (MiB) to be allocated to the cube. Given system overheads, not all the memory allocated will be available. Consider requesting a couple hundred MiB more than is required. If a cube attempts to use more memory than is available, then the cube will be terminated.

Disk Space

The disk_space parameter specifies the amount of temporary disk space in mebibytes (MiB) to be made available to the cube. Given system overheads, not all disk space allocated will be available. Consider requesting a couple hundred MiB more than is required. If a cube attempts to use more space than is available, then system calls should return errors indicating there is no space. See Cube File Systems.
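For example, a cube estimated to need roughly 3.8 GiB of memory and 10 GiB of scratch space might pad both requests by a few hundred MiB (the values are illustrative):

from floe.api import ComputeCube


class LargeScratchCube(ComputeCube):

    parameter_overrides = {
        "memory_mb": {"default": 4096},    # ~3.8 GiB needed, padded for system overhead
        "disk_space": {"default": 10752},  # ~10 GiB needed, padded for system overhead
    }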

Instance Type

The instance_type parameter specifies a comma-separated list of instance types and families to include in or exclude from consideration when choosing where to execute a cube. Entries must be prefixes of valid AWS instance type names. Not all instance types are supported, and other resource requirements should generally be set instead of instance_type to preserve a WorkFloe’s ability to run in a variety of circumstances.

Examples:

Specifier      Meaning
""             Nothing specified or excluded
"t2.medium"    Only match t2.medium
"c5"           Match any c5*
"g,!g4"        Match any g*, except for g4*
"!t2"          Exclude t2*
"m,c5,z1"      Match any m*, c5*, or z1*
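For example, the "g,!g4" specifier from the table above could be applied to a cube as a parameter override:

from floe.api import ComputeCube


class OlderGPUInstanceCube(ComputeCube):

    # Allow any g* GPU instance family except g4*.
    parameter_overrides = {
        "instance_type": {"default": "g,!g4"},
    }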

Instance Tags

The instance_tags parameter specifies a comma-separated list of strings associated with Orion instance groups, limiting which groups are allowed to execute the cube. Instance tags are currently unused.

Spot policy

Cubes running in Orion may run on spot market instances to reduce costs. The spot_policy parameter can be used to control the usage of spot market instances. The following policies are available for the cube:

  • Allowed: Spot market instances may be used

  • Required: Spot market instances must be used

  • Preferred: Spot market instances are preferred

  • NotPreferred: Spot market instances are not preferred

  • Prohibited: Spot market instances must not be used

Note

  • It is possible to specify hardware constraints that are impossible to satisfy.

  • If possible, Orion will attempt to add more capacity to itself to satisfy cube requirements.

  • Hardware requirements and placement strategy parameters have no effect when running locally.

Additional Parallel Cube Parameters

Maximum Failures Per Work Item

Each work item will be retried up to max_failures times. If that limit is exceeded, the item is skipped. The default is 10.

Autoscale Workers

By default, Orion will autoscale the number of workers for each parallel cube based on the number of items of work queued. Autoscaling may be turned off by setting autoscale to False.

Maximum Parallel Workers

The maximum number of parallel workers (i.e., copies of the cube) may be bounded by max_parallel. The default is 1000.

Minimum Parallel Workers

The minimum number of parallel workers (i.e., copies of the cube) may be bounded by min_parallel. The default is 0.
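The sketch below overrides all four parallel cube parameters. It assumes a hypothetical cube that has been made parallel (the parallel cube machinery itself is not shown), and the values are illustrative:

from floe.api import ComputeCube


class MyParallelCube(ComputeCube):

    # These parameters are only meaningful for parallel cubes.
    parameter_overrides = {
        "max_failures": {"default": 3},    # retry each work item at most 3 times
        "autoscale": {"default": False},   # disable worker autoscaling
        "max_parallel": {"default": 256},  # never run more than 256 workers
        "min_parallel": {"default": 4},    # keep at least 4 workers
    }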

Job Scheduling

A job is a WorkFloe instantiated in Orion. Given a perfectly elastic set of resources, each job will be dynamically provisioned with as much hardware as it can use. However, there are limits to resource availability, so Orion schedules job resources according to a notion of fairness that takes into account how much a job has been slowed down by other jobs.

Slowdown Ratio:

1 - (resources a task has) / (resources a task would use given all healthy instances)

  • If the ratio is positive, this task is being blocked by other tasks.

  • If the ratio is 0, this task is not blocked at all.

  • If the ratio is negative, this task has more resources than it needs.

Jobs are placed in a descending queue, where jobs closer to the head of the queue are more likely to receive resources. Queue position is determined by the resources used by all of the jobs owned by the same user, as well as by the resources of the job itself. Therefore, no single user should be able to monopolize resources for extended periods of time.

Example:

Job A has 75 containers and wants 100

Job B has 2 containers and wants 3

Slowdown of Job A is 1 - (75/100) = 0.25

Slowdown of Job B is 1 - (2/3) ≈ 0.33

Job B will be given resources first, then Job A.
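The same calculation, expressed as a small Python sketch using the numbers from the example above:

def slowdown(has, wants):
    # 1 - (resources a task has) / (resources a task would use given all healthy instances)
    return 1 - has / wants


jobs = {"A": (75, 100), "B": (2, 3)}

# A larger slowdown means the job is more blocked, so it is served first.
order = sorted(jobs, key=lambda name: slowdown(*jobs[name]), reverse=True)
print(order)  # ['B', 'A']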

A job’s queue position may be affected by adjusting its weight. Modifying the weight should only be considered in exceptional circumstances.

Cube File Systems

Beginning in Orion 2020.1, the home directories mounted into running Cubes are read-only. This was necessary to enforce disk quotas and to avoid long-standing write-failure modes in the union file systems used by containers. The TMPDIR environment variable points to a scratch file system that is writable but mounted with an option that prevents execution of files within it.
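For example, a cube that needs a writable working file should use TMPDIR rather than its read-only home directory; a minimal sketch:

import os
import tempfile

# TMPDIR points at the writable scratch file system inside Orion; fall back to
# the system default temporary directory when running outside Orion.
scratch = os.environ.get("TMPDIR", tempfile.gettempdir())
with open(os.path.join(scratch, "intermediate.dat"), "w") as fh:
    fh.write("temporary results\n")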

Executables can be added to a package in several ways:

  • executables included in package dependencies

  • the executable bit set on a file in the archive

  • Python scripts run via the interpreter in a subprocess (e.g. python script.py or python -m module), which work without regard to executable bit or location

WorkFloe System Tags

When running in Orion, a WorkFloe can have system tags applied to it that denote some of the properties of the Floe.

Example of applying a system tag:

from floe.api import WorkFloe


job = WorkFloe(
        "Example System Tags Floe",
        title="Example System Tags Floe",
        uuid="<uuid>",
        description="Add a system tag to a Floe",
)
job.tags = ["Orion", "<system tag here>"]

System Tags for Fields

  • AnalyzeReady.AddsOrReplacesFields

  • AnalyzeReady.AddsRecords

  • 3DReady.AddsRecords

  • 3DReady.AddsOrReplacesFields

System Tags for Services

  • ServiceReady.Requires.{maas,mmds,fastrocs}

  • ServiceReady.Uses.{maas,mmds,fastrocs}