Orion Integration

Floe was designed and implemented to be a core part of Orion. This documentation describes the integration between the Floe Python API and Orion.

Development Lifecycle

In general, the process of developing a cube is as follows:

  1. The cube is written in Python on a developer’s machine.

  2. Unit tests are written to test the cube’s logic.

  3. The cube source code (with any non-pip installed dependencies) is archived into a tar.gz file.

  4. The archive is uploaded to Orion.

WorkFloes may also be developed in a similar fashion.

Packaging

A valid Floe package is an archive (tar.gz) containing Python and JSON files. Both cubes and WorkFloes may be included in a package. Other file types, such as executable binaries or auxiliary data, may also be included.

Note

Be sure that any cubes in a package are importable from the package’s top level directory.

Below is an example directory structure for a Floe package archive containing WorkFloes and cubes:

project/
    cubes/
        some_cubes.py
        some_other_cubes.py
    floes/
        myfloe.py
        myotherfloe.py
    requirements.txt
    manifest.json

This example package contains a requirements.txt file to specify Python dependencies, as well as a manifest.json file, which is discussed in the following section.

Note

macOS (OS X) users must set the COPYFILE_DISABLE environment variable to 1 before creating an archive with tar. Otherwise, tar adds AppleDouble metadata files to the archive. Because these files have .py extensions but are not valid Python, they lead to package inspection errors.
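
As an alternative to the tar command, the archive can also be built with Python's standard tarfile module. The following is a minimal sketch and not part of Floe: the names build_package and _skip_junk are hypothetical, and it assumes the archive should mirror the layout shown above, with a single top-level directory. The filter drops AppleDouble entries (._*) and __pycache__ directories.

```python
import tarfile
from pathlib import Path


def _skip_junk(info):
    """Exclude AppleDouble files (._*) and bytecode caches from the archive."""
    name = Path(info.name).name
    if name.startswith("._") or name == "__pycache__":
        return None  # returning None tells tarfile to skip this entry
    return info


def build_package(project_dir, archive_path):
    """Archive project_dir into a gzip-compressed tar file at archive_path."""
    project = Path(project_dir).resolve()
    with tarfile.open(archive_path, "w:gz") as tar:
        # Entries are stored under the project directory's own name,
        # for example project/manifest.json.
        tar.add(str(project), arcname=project.name, filter=_skip_junk)
    return archive_path
```

Write the archive to a location outside the project directory so that it is not included in itself.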

Note

Any files named setup.py are assumed to be for packaging purposes and will not be searched for cubes.

Manifest

Every Floe package must contain a manifest.json file containing information that describes the package.

Example manifest.json:

{
  "requirements": "requirements.txt",
  "name": "PSI4",
  "conda_dependencies": ["psi4=0.3.533", "numpy"],
  "conda_channels": ["psi4"],
  "version": "0.1.1",
  "base_os": "amazonlinux2",
  "python_version": "3.7.3",
  "build_requirements": {"memory": 4096},
  "documentation_root": "docs",
  "classify_floes": true,
  "debug": true,
  "deduplicate_files": true,
  "show_biggest_files": true,
  "show_home_files": true,
  "show_final_environment": true,
  "show_json_output": true,
  "use_mamba_solver": true,
  "prebuild": "prebuild.sh",
  "postbuild": "postbuild.sh"
}

  • requirements must contain a relative path (from manifest.json) to a valid pip requirements file.

  • name must contain a string name of the package.

  • version must contain a string version for the package.

  • python_version must contain the version of the desired Python interpreter (Python 3 only).

  • conda_dependencies, if provided, must be a list of packages installable from conda.

  • conda_channels, if provided, must be a list of valid conda channels.

  • build_requirements, if provided, must be a dictionary specifying the required hardware for building and inspecting the package. Actual hardware resources are determined using the Dominant Resource Factor.

    • cpu, if provided, must be an integer (number of CPUs). Default is 1.

    • gpu, if provided, must be >=1 (GPU required for package installation) or 0 (GPU not required to build and inspect the package). Default is 0.

    • memory, if provided, must be an integer (number of mebibytes). Default is 2048.

    • disk_space, if provided, must be an integer (number of mebibytes). Default is 256.

    • instance_type, if provided, must be a string of the form described at Instance Type.

    • instance_tags (currently unused), if provided, must be a list of strings associated with Orion instance groups.

  • documentation_root, if provided, must be the relative path to a directory containing at minimum an index.html file.

  • base_os specifies the operating system from which the package environment is built. The options are ubuntu-18.04, ubuntu-20.04, and amazonlinux2; the default is ubuntu-20.04.

  • classify_floes, if set to true, will organize floes from this package in the Orion UI by their classification (shown as “Category” in UI).

  • debug, if set to true, enables debug mode during package inspection. The package logs will then contain more information, along with any errors. Default is false.

  • deduplicate_files, if set to true, replaces duplicate files with hard links. Default is false.

  • show_biggest_files, if set to true, runs several du and find commands to highlight the largest directories and files. Default is false.

  • show_home_files, if set to true, lists all files in the home directory. Default is false.

  • show_final_environment, if set to true, exports and shows the final environment. Default is false.

  • show_json_output, if set to true, emits conda output as JSON without progress indicators. Default is false.

  • use_mamba_solver, if set to true, installs and enables the more efficient mamba dependency solver (mamba and its dependencies are uninstalled afterwards). Default is false.

  • prebuild, if provided, must be a relative path (from manifest.json) to a bash script that will be executed as root before pip and all package dependencies are installed. This script is only executed when the uploaded package is built in Orion (that is, when uploading a .tar.gz package).

  • postbuild, if provided, must be a relative path (from manifest.json) to a bash script that will be executed as the floe user after all dependencies have been installed. This script is only executed when the uploaded package is built in Orion (that is, when uploading a .tar.gz package).
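
The field rules above can be checked locally before uploading. The following is a minimal sketch, not the validation Orion performs: the function check_manifest is hypothetical, it assumes that the four fields described without "if provided" are required, and it only checks value types.

```python
import json

# Assumption: fields described above without "if provided" are required.
REQUIRED_FIELDS = {
    "requirements": str,
    "name": str,
    "version": str,
    "python_version": str,
}
OPTIONAL_FIELDS = {
    "conda_dependencies": list,
    "conda_channels": list,
    "build_requirements": dict,
    "documentation_root": str,
    "base_os": str,
    "prebuild": str,
    "postbuild": str,
}


def check_manifest(text):
    """Return a list of problems found in the manifest JSON text."""
    manifest = json.loads(text)
    problems = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in manifest:
            problems.append(f"missing required field: {key}")
        elif not isinstance(manifest[key], expected):
            problems.append(f"{key} must be of type {expected.__name__}")
    for key, expected in OPTIONAL_FIELDS.items():
        if key in manifest and not isinstance(manifest[key], expected):
            problems.append(f"{key} must be of type {expected.__name__}")
    # Only Python 3 interpreters are supported.
    if not str(manifest.get("python_version", "3")).startswith("3"):
        problems.append("python_version must be a Python 3 version")
    return problems
```

An empty list means that no problems were found by these checks; it does not guarantee that the package will build in Orion.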

Note

Do not add unnecessary dependencies to your package. Each dependency added increases the size of the environment created in Orion, which also increases the start-up cost for WorkFloes.

Complex dependencies can cause conda to use large amounts of memory and time when building a package. While conda has improved, it may be necessary to increase the memory in build_requirements. Pinning each package to a specific channel, using the form <CHANNEL>::<PACKAGE>, should help conda use fewer resources. Pinning also helps ensure that conda does not accidentally pick the wrong package source.
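
As an illustration of channel pinning, the sketch below builds the relevant manifest fields in Python and prints them as JSON. The helper pin is hypothetical, the choice of conda-forge as the channel for numpy is an assumption, and the memory value is illustrative only.

```python
import json


def pin(channel, package_spec):
    """Return a conda spec pinned to a channel: <CHANNEL>::<PACKAGE>."""
    return f"{channel}::{package_spec}"


conda_dependencies = [
    pin("psi4", "psi4=0.3.533"),
    pin("conda-forge", "numpy"),  # conda-forge is an assumed channel choice
]
manifest_fragment = {
    "conda_dependencies": conda_dependencies,
    "conda_channels": ["psi4", "conda-forge"],
    "build_requirements": {"memory": 8192},  # illustrative value, in MiB
}
print(json.dumps(manifest_fragment, indent=2))
```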

Package OS and GCC Versions

Note

Orion 2021.1 will update the previous default choice for base_os from ubuntu-14.04 to ubuntu-18.04. Python dependencies which require compilation may be impacted, as the corresponding GCC version installed within the environment has also changed.

The choices for a manifest’s base_os, the corresponding GCC and GLIBC versions, and the Orion version range are shown in the table below.

Operating System   GCC Version   GLIBC Version   Orion Releases   Compatible OpenEye-Toolkits
ubuntu-14.04       4.8           2.19            2018.4-2022.2    <=2023.1.1
ubuntu-18.04       7.4           2.27            2020.3-2022.2    <=2023.1.1
ubuntu-20.04       9.3           2.31            2020.3-2022.1    All
ubuntu-20.04       9.4           2.31            2022.2+          All
ubuntu-22.04       11.4.0        2.35            2023.2+          All
amazonlinux1       4.8           2.17            2020.1-2022.2    <=2023.1.1
amazonlinux2       7.3           2.26            2020.1+          <=2023.1.1
amazonlinux2023    11.4.1        2.34            2023.2+          All

OpenEye-Toolkits Version   GLIBC Min Version
>=2023.2.0                 2.28
<=2023.1.1                 2.17
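
The toolkit compatibility of each base_os follows from comparing its GLIBC version with the minimum the toolkits require. The sketch below encodes the values listed in this section; the function names are hypothetical, and toolkit versions between 2023.1.1 and 2023.2.0 are not covered by the source data, so they are treated here as the older series.

```python
# GLIBC version shipped with each base_os, copied from the values above.
# ubuntu-20.04 is listed twice there, both times with GLIBC 2.31.
BASE_OS_GLIBC = {
    "ubuntu-14.04": (2, 19),
    "ubuntu-18.04": (2, 27),
    "ubuntu-20.04": (2, 31),
    "ubuntu-22.04": (2, 35),
    "amazonlinux1": (2, 17),
    "amazonlinux2": (2, 26),
    "amazonlinux2023": (2, 34),
}


def min_glibc(toolkits_version):
    """Minimum GLIBC for an OpenEye-Toolkits version such as "2023.2.0"."""
    parts = [int(part) for part in toolkits_version.split(".")]
    parts += [0] * (3 - len(parts))  # pad "2023.2" to (2023, 2, 0)
    return (2, 28) if tuple(parts) >= (2023, 2, 0) else (2, 17)


def is_compatible(base_os, toolkits_version):
    """True if the GLIBC of base_os meets the toolkits' minimum."""
    return BASE_OS_GLIBC[base_os] >= min_glibc(toolkits_version)
```

For example, amazonlinux2 ships GLIBC 2.26, which is below the 2.28 minimum of the 2023.2.0 toolkits, so that combination is reported as incompatible.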

Note

To identify the active Orion version, navigate to the Help/Version Info tab in the web interface.

NVIDIA Driver Versions

A recent version of the NVIDIA driver is provided on GPU instances. The desired version of CUDA should be installed through conda.

Orion Release   NVIDIA Driver Release
2019.5.*        384.81
2020.1.*        440.44
2020.2.*        440.44
2020.3.*        450.51.06
2021.2.*        470.57.02
2022.1.*        470.82.01
2022.2.*        470.103.01
2022.3.*        470.103.01
2023.2.*        525.89.02

Caution

Orion 2020.1 changed how GPU packages are built. First, most GPU packages no longer need a gpu build requirement. Second, the cuda9 instance tag was removed from scaling groups, a change that effectively breaks uploads of older GPU packages. The cuda9 instance tag must be removed from the build requirements. Otherwise, the package upload will fail with the message, “Error processing package: unsatisfiable requirements”. Instead of an instance tag, CUDA should be installed as a conda dependency.

Linting and Detection

Once a package has a valid manifest.json file and a requirements.txt file, you are almost ready to upload. Upon upload, Orion first constructs a virtual environment containing all of the specified dependencies, which can take several minutes. After building the environment, Orion attempts to discover any cubes and WorkFloes contained in your package. If any errors are encountered while inspecting the package, the inspection fails. You can find these errors ahead of time by running the same detection code that Orion uses.

Example of running cube and WorkFloe detection:

$ cd projectroot
$ floe detect . --out=result.json

The output of the above command, result.json, will contain JSON representing the cubes and Floes discovered, as well as any errors encountered while doing so.

The Floe detection code can also print a simplified output, which contains any encountered errors:

$ cd projectroot
$ floe detect -s .

Before uploading, it is recommended that you check your cubes and Floes for minor issues. Floe comes with a built-in linting tool, called floe lint, which can be used to check many different properties of your cubes and WorkFloes.

$ floe lint --help
usage: floe lint [-h] [-l LEVEL] [-i IGNORE [IGNORE ...]] [-v] [-e] [-j JSON] N [N ...]
Floe Lint

positional arguments:
  N                     List of paths to lint

optional arguments:
  -h, --help            show this help message and exit
  -l LEVEL, --level LEVEL
                        The minimum linting level to report, default 0. Levels 0-4 are the least consequential. Levels >=5 indicate issues
                        that either cause subtle errors or prevent a floe from running at all.
  -i IGNORE [IGNORE ...], --ignore IGNORE [IGNORE ...]
                        List of file patterns to ignore
  -v, --verbose
  -e, --show-errors
  -j JSON, --json JSON  Writes output json to path

Hiding Cubes and WorkFloes

Cube and WorkFloe visibility is controlled by a naming convention. Any cube or WorkFloe whose name starts with an underscore is skipped during detection. A common use case for this feature is a custom base class.

from floe.api import ComputeCube


class _HiddenBaseCube(ComputeCube):
    """This Cube will not be visible when uploaded to Orion"""
    pass


class MyCustomCube(_HiddenBaseCube):
    """This Cube will be visible"""
    pass

Hardware Requirements

A WorkFloe in Orion can specify hardware requirements and placement strategies on a per cube basis. The Orion scheduler will use the hardware information and placement strategy to place the cubes onto appropriate machines, while attempting to minimize cost and startup time. Hardware requirements and placement strategies are exposed via the parameters listed below. For each parameter listed, the requirements pertain to the cube if serial, or each copy of the cube if parallel.

To help maintain fairness in resource scheduling and cost accounting, Orion adjusts the requested amount of hardware according to a Dominant Resource Factor.

Example hardware requirement parameter overrides:

from floe.api import ComputeCube

class ExampleCube(ComputeCube):

    parameter_overrides = {
        "gpu_count": {"default": 1},
        "spot_policy": {"default": "Allowed"},
    }

CPUs

The cpu_count parameter specifies how many CPUs are made available to the cube. One potential use case for allocating more than one CPU per cube is to allow cube authors to call out to an existing binary which has multi-CPU parallelism. CPUs provided to cubes are confined using the Linux cpuset facility.

GPUs

The gpu_count parameter specifies how many GPUs are made available to the cube. The allocated GPUs will be available at /dev/nvidia[n] for each of the n GPUs provided. Additionally, /dev/nvidiactl, /dev/nvidia-uvm, and /dev/nvidia-uvm-tools are available to the cube.

If a cube’s software requires a GPU during installation, then gpu must be set to 1 or greater under build_requirements of the package Manifest.

Memory

The memory_mb parameter specifies the amount of memory in mebibytes (MiB) to be allocated to the cube. Given system overheads, not all the memory allocated will be available. Consider requesting a couple hundred MiB more than is required. If a cube attempts to use more memory than is available, then the cube will be terminated.

Disk Space

The disk_space parameter specifies the amount of temporary disk space in mebibytes (a mebibyte, or MiB, is 1,048,576 bytes) to be made available to the cube. Given system overheads, not all disk space allocated will be available. Consider requesting a couple hundred MiB more than is required. If a cube attempts to use more space than is available, then system calls should return errors indicating there is no space. See Cube File Systems.
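
The headroom advice for memory and disk space can be applied when setting defaults. This is a minimal sketch: the helper with_headroom and the 256 MiB margin are assumptions chosen for illustration, and the resulting dictionary has the same form as the parameter_overrides example under Hardware Requirements.

```python
def with_headroom(required_mib, headroom_mib=256):
    """Add a margin for system overheads to a requirement given in MiB."""
    return required_mib + headroom_mib


# A cube that needs about 4 GiB of memory and 10 GiB of scratch space.
parameter_overrides = {
    "memory_mb": {"default": with_headroom(4096)},
    "disk_space": {"default": with_headroom(10 * 1024)},
}
```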

Instance Type

The instance_type parameter specifies a comma-separated list of instance types and families, to be included or excluded for consideration in execution of the cube. They must be prefixes of valid instance type names for healthy groups listed by Orion. Plain prefixes are combined with a logical OR. Negated conditions are combined with a logical AND. Negations override inclusion, regardless of specificity.

Prior to Orion 2024.1.1, negations in this parameter did not always work as expected. For example, c,!c5,!c6 would match m6i.xlarge.

See also, AWS instance type names.

Note

Not all instance types are supported, and customers should usually set other resource requirements instead of instance_type, to maintain a WorkFloe’s ability to run in a variety of circumstances.

Examples:

Specifier        Meaning
[blank]          Nothing specified or excluded
t2.medium        Only match t2.medium
c5               Match any c5*
g,!g4            Match any g*, except for g4*
!g,g4            Exclude g*. The inclusion of g4* is overridden by the negation
!t2              Exclude t2*
m,c5,z1          Match any m*, c5*, or z1*
g,!g3.8,!g3.16   Match any g*, except for g3.8* and g3.16*
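
The matching rules described above can be sketched as a small function. This is an illustration of the stated rules only, not Orion's scheduler code: the function name is hypothetical, and the check that each prefix corresponds to a healthy instance group is omitted.

```python
def matches_instance_type(specifier, instance_type):
    """Return True if instance_type is allowed by the specifier string."""
    terms = [term.strip() for term in specifier.split(",") if term.strip()]
    includes = [term for term in terms if not term.startswith("!")]
    excludes = [term[1:] for term in terms if term.startswith("!")]

    # Negations override inclusion, regardless of specificity.
    if any(instance_type.startswith(prefix) for prefix in excludes):
        return False
    # With no plain prefixes, everything that is not excluded is eligible.
    if not includes:
        return True
    # Plain prefixes are combined with a logical OR.
    return any(instance_type.startswith(prefix) for prefix in includes)
```

Under these rules, the specifier c,!c5,!c6 does not match m6i.xlarge, because m6i.xlarge does not start with the only included prefix, c.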

Instance Tags

The instance_tags parameter specifies a comma-separated list of strings associated with Orion instance groups, limiting which are allowed to execute the cube. Instance tags are currently unused.

Spot Policy

Cubes running in Orion may run on spot market instances to reduce costs. The spot_policy parameter can be used to control the usage of spot market instances. The following policies are available for the cube:

  • Allowed: Spot market instances may be used

  • Required: Spot market instances must be used

  • Preferred: Spot market instances are preferred

  • NotPreferred: Spot market instances are not preferred

  • Prohibited: Spot market instances must not be used

Note

  • It is possible to specify hardware constraints that are impossible to satisfy.

  • If possible, Orion will attempt to add more capacity to itself to satisfy cube requirements.

  • Hardware requirements and placement strategy parameters have no effect when running locally.

Capacity States

Orion provides a capacity state for running cubes, which indicates how well the cubes’ desired resource usage is being fulfilled.

See Capacity States for more information.

Additional Parallel Cube Parameters

Maximum Failures Per Work Item

Each work item will be retried up to max_failures times. If that limit is exceeded, the item is skipped. The default is 10.

Autoscale Workers

By default, Orion will autoscale the number of workers for each parallel cube, based on the number of items of work queued. Autoscaling may be turned off by setting autoscale to False.

Maximum Parallel Workers

The maximum number of parallel workers (that is, copies) of a cube may be bounded by max_parallel. The default is 1000.

Minimum Parallel Workers

The minimum number of parallel workers for a cube (that is, copies) may be bounded by min_parallel. The default is 0.
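
These parameters can be given new defaults in the same dictionary form as the hardware parameters. The sketch below only builds that dictionary; it assumes that the parallel parameters are overridden through parameter_overrides on a parallel cube class in the same way as gpu_count and spot_policy, and the values are illustrative.

```python
# Assumed to be assigned to parameter_overrides on a parallel cube class.
parallel_overrides = {
    "max_failures": {"default": 3},   # retry each work item up to 3 times
    "max_parallel": {"default": 50},  # never scale beyond 50 workers
    "min_parallel": {"default": 2},   # keep at least 2 workers
}

# The lower bound should not exceed the upper bound.
assert (
    parallel_overrides["min_parallel"]["default"]
    <= parallel_overrides["max_parallel"]["default"]
)
```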

Experimental Parallel Cube implementation

Advanced users may set the experimental parallel cube parameter in Orion to enable a parallel cube implementation that avoids the 12-hour time limit for a single item of work.

Job Scheduling

A job is a WorkFloe instantiated in Orion. Given a perfectly elastic set of resources, each job will be dynamically provisioned with as much hardware as it can use. However, there are limits to resource availability, so Orion schedules job resources according to a notion of fairness that takes into account how much a job has been slowed down by other jobs.

Slowdown Ratio:

1 - (resources a task has) / (resources a task would use given all healthy instances)

  • If the ratio is positive, this task is being blocked by other tasks.

  • If the ratio is 0, this task is not blocked at all.

  • If the ratio is negative, this task has more resources than it needs.

Jobs are placed in a descending queue, where jobs closer to the head of the queue are more likely to get resources. Queue position is determined by the resources used by all of the jobs owned by the same user, as well as the resources of the job itself. Therefore, no user should be able to monopolize resources for extended periods of time.

Example:

Job A has 75 containers and wants 100

Job B has 2 containers and wants 3

Slowdown of Job A is 1 - (75/100) = .25

Slowdown of Job B is 1 - (2/3) = .33

Job B will be given resources first, then Job A
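
The arithmetic in this example can be reproduced with a few lines of Python. This is a simplified sketch: it orders jobs by slowdown ratio alone and ignores the other jobs of the same user, which Orion also takes into account. The function name and the container counts used as "resources" are for illustration.

```python
def slowdown(has, wants):
    """Slowdown ratio: 1 - (resources held) / (resources wanted)."""
    return 1 - has / wants


# (containers held, containers wanted) for each job in the example
jobs = {"A": (75, 100), "B": (2, 3)}

# Descending queue: the most slowed-down job is served first.
queue = sorted(jobs, key=lambda name: slowdown(*jobs[name]), reverse=True)
print(queue)  # ['B', 'A']
```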

A job’s queue position may be affected by adjusting its weight. Modifying the weight should only be considered in exceptional circumstances.

Job Logs

When debugging floes and cubes, the job logs contain most of the information available, including:

  • the environment log generated while processing the package

  • stdout and stderr messages from all cubes

  • the job specification and parameters

  • job scaling activity

  • a graph of metrics collected (metrics collection can be specified when starting the floe)

The job logs can be viewed in a running floe or exported from a completed floe. There are slight differences between running and completed logs owing to some post-completion processing. On the primary Orion stack, job logs are retained for one month; on customer stacks the retention time may be longer. Viewing or exporting job logs does not carry any significant performance implications, although printing significant amounts of data to stdout is not recommended. The primary cost associated with job logs is S3 storage cost.

Cube File Systems

Beginning in Orion 2020.1, the home directories mounted into running cubes are read-only. This was necessary to enforce disk quotas and to avoid long-standing write-failure modes in the union file systems used by containers. The TMPDIR environment variable points to a scratch file system, which is writable but mounted with an option that prevents execution of files within it.

Executables can be added to a package in several ways:

  • executables included in package dependencies

  • executable bit set on file in archive

  • Python scripts run by python script.py in a subprocess work without regard to executable bit or location
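
The last option can be combined with the scratch file system. The following is a minimal sketch using only the standard library: the function name and the file name helper.py are hypothetical. Because the script is passed to the interpreter as an argument, it does not need the executable bit and is not executed directly from the scratch mount.

```python
import os
import subprocess
import sys
import tempfile


def run_script_from_scratch(source):
    """Write a Python script to scratch space and run it in a subprocess."""
    # tempfile honors TMPDIR, which points at the writable scratch file system.
    with tempfile.TemporaryDirectory() as scratch:
        script = os.path.join(scratch, "helper.py")
        with open(script, "w") as handle:
            handle.write(source)
        # Run the script with the same interpreter as the calling process.
        result = subprocess.run(
            [sys.executable, script],
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout
```

With check=True, a non-zero exit status of the script raises subprocess.CalledProcessError, so failures are not silently ignored.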

WorkFloe System Tags

When running in Orion, a WorkFloe can have system tags applied to it that denote some of the properties of the Floe.

Example of applying a system tag:

from floe.api import WorkFloe


job = WorkFloe(
    "Example System Tags Floe",
    title="Example System Tags Floe",
    uuid="<uuid>",
    description="Add a system tag to a Floe",
)
job.tags = ["Orion", "<system tag here>"]

System Tags for Fields

  • AnalyzeReady.AddsOrReplacesFields

  • AnalyzeReady.AddsRecords

  • 3DReady.AddsRecords

  • 3DReady.AddsOrReplacesFields

System Tags for Services

  • ServiceReady.Requires.{maas,mmds,fastrocs}

  • ServiceReady.Uses.{maas,mmds,fastrocs}