Orion Integration

Floe was designed and implemented to be a core part of Orion. This documentation describes the integration between the Floe Python API and Orion.

Development Lifecycle

In general, the process of developing a cube is as follows:

  1. The cube is written in Python on a developer’s machine.

  2. Unit tests are written to test the cube’s logic.

  3. The cube source code (along with any dependencies that cannot be installed via pip) is archived into a tar.gz file.

  4. The archive is uploaded to Orion.

WorkFloes may also be developed in a similar fashion.
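For example, steps 1 and 2 above might look like the following minimal sketch, in which the cube's logic lives in a plain function so it can be unit tested without Orion. The cube and function names are illustrative, and the exact process()/port API may vary between Floe versions.

from floe.api import ComputeCube


def double(value):
    # Pure logic kept separate from the cube so it is trivial to unit test.
    return value * 2


class DoublingCube(ComputeCube):
    """Hypothetical cube that doubles each incoming item."""

    def process(self, data, port):
        # Emit the transformed item on the cube's success port.
        self.success.emit(double(data))


# tests/test_doubling.py -- a plain pytest-style unit test of the cube's logic
def test_double():
    assert double(3) == 6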

Packaging

A valid Floe package is an archive (tar.gz) containing Python and JSON files. Both cubes and WorkFloes may be included in a package. Other file types, such as executable binaries or auxiliary data, may also be included.

Note

Be sure that any cubes in a package are importable from the package’s top level directory.

Below is an example directory structure for a Floe package archive containing WorkFloes and cubes:

project/
    cubes/
        some_cubes.py
        some_other_cubes.py
    floes/
        myfloe.py
        myotherfloe.py
    requirements.txt
    manifest.json

This example package contains a requirements.txt file to specify Python dependencies, as well as a manifest.json file which is discussed in the following section.
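A minimal requirements.txt for such a package might look like the following (package names and version pins are purely illustrative):

numpy==1.19.5
requests>=2.25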

Note

macOS users must set the COPYFILE_DISABLE environment variable to 1 before creating an archive with tar. Otherwise, invalid Python files with .py extensions will be included in the archive (due to the AppleDouble format), leading to package inspection errors.
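For example, on macOS the archive for the layout shown above might be created as follows (the archive name is illustrative):

$ export COPYFILE_DISABLE=1
$ tar -czf floe_package.tar.gz project/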

Note

Any files named setup.py are assumed to be for packaging purposes and will not be searched for cubes.

Manifest

Every Floe package must contain a manifest.json file containing information that describes the package.

Example manifest.json:

{
  "requirements": "requirements.txt",
  "name": "PSI4",
  "conda_dependencies": ["psi4=0.3.533", "numpy"],
  "conda_channels": ["psi4"],
  "version": "0.1.1",
  "base_os": "amazonlinux2",
  "python_version": "3.7.3",
  "build_requirements": {"memory": 4096},
  "documentation_root": "docs"
}
  • requirements must contain a relative path (from manifest.json) to a valid pip requirements file

  • name must contain a string name of the package

  • version must contain a string version for the package

  • python_version must contain the version of the desired Python interpreter (Python 3 only)

  • conda_dependencies if provided, must be a list of packages installable from conda

  • conda_channels if provided, must be a list of valid conda channels

  • build_requirements if provided, must be a dictionary of instance prerequisites for building the package environment; it may contain the following keys:

      ◦ gpu if provided, must be >=1 (GPU required for package installation) or 0 (GPU not required for package installation)

      ◦ memory if provided, must be an integer (MiB)

      ◦ disk_space if provided, must be an integer (MiB)

      ◦ instance_type if provided, must be a string of the form described at Instance Type

      ◦ instance_tags (currently unused) if provided, must be a list of strings associated with Orion instance groups

  • documentation_root if provided, must be the relative path to a directory containing at minimum an index.html file

  • base_os specifies which operating system to build the package environment from.

Note

Do not add unnecessary dependencies to your package. Each dependency added increases the size of the environment created in Orion, which also increases the start-up cost for WorkFloes.

Complex dependencies can cause conda to use large amounts of memory and time when building a package. While conda has improved, it may be necessary to increase the memory in build_requirements. Pinning each package to a specific channel (<CHANNEL>::<PACKAGE>) helps conda use fewer resources and also helps ensure it does not accidentally pick the wrong package source.
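For example, channel-pinned conda dependencies in a manifest might look like the following fragment (the packages, versions, and channels are illustrative):

"conda_dependencies": ["psi4::psi4=0.3.533", "conda-forge::numpy"],
"conda_channels": ["psi4", "conda-forge"]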

Package OS & GCC Versions

Note

Orion 2021.1 will update the previous default choice for base_os from ubuntu-14.04 to ubuntu-18.04. Python dependencies which require compilation may be impacted, as the corresponding GCC version installed within the environment has also changed.

The choices for a manifest’s base_os, the corresponding GCC version, and the supported Orion releases are shown in the table below.

Operating System    GCC Version    Orion Releases
ubuntu-14.04        4.8            2018.4+
ubuntu-18.04        7.4            2020.3+
ubuntu-20.04        9.3            2020.3+
amazonlinux1        4.8            2020.1+
amazonlinux2        7.3            2020.1+

NVIDIA Driver Versions

A recent version of the NVIDIA driver is provided on GPU instances. The desired version of CUDA should be installed through conda.

Orion Release    NVIDIA Driver Release
2019.5.*         384.81
2020.1.*         440.44
2020.2.*         440.44
2020.3.*         450.51.06

Warning

Orion 2020.1 changed how GPU packages are built. First, most GPU packages no longer need a gpu build requirement. Second, the cuda9 instance tag was removed from scaling groups, a change that effectively breaks uploads of older GPU packages. The cuda9 instance tag must be removed from the build requirements. Otherwise, the package upload will fail with the message, “Error processing package: unsatisfiable requirements”. Instead of an instance tag, CUDA should be installed as a conda dependency.

Linting & Detection

Once a package is created with a valid manifest.json and requirements.txt, it is almost ready to upload. Upon upload, Orion first constructs a virtual environment containing all of the specified dependencies, which can take several minutes. After building the environment, Orion attempts to discover any cubes and WorkFloes contained in your package. If any errors are encountered while inspecting the package, the inspection will fail. You can find these errors ahead of time by running the same detection code that Orion uses.

Example of running cube & WorkFloe detection:

$ cd projectroot
$ floe detect . --out=result.json

The output of the above command, result.json, will contain JSON representing the cubes & Floes discovered, as well as any errors encountered while doing so.

The Floe detection code can also print a simplified output, which contains any encountered errors:

$ cd projectroot
$ floe detect -s .

Before uploading, it is recommended that you check your cubes & Floes for minor issues. Floe comes with a built-in linting tool, floe lint, which can be used to check many different properties of your cubes and WorkFloes.

$ floe lint --help
usage: floe lint [-h] [-l LEVEL] [-i IGNORE [IGNORE ...]] [-v] [-e] [-j JSON] N [N ...]
Floe Lint

positional arguments:
  N                     List of paths to lint

optional arguments:
  -h, --help            show this help message and exit
  -l LEVEL, --level LEVEL
                        The minimum linting level to report, default 0. Levels 0-4 are the least consequential. Levels >=5 indicate issues
                        that either cause subtle errors or prevent a floe from running at all.
  -i IGNORE [IGNORE ...], --ignore IGNORE [IGNORE ...]
                        List of file patterns to ignore
  -v, --verbose
  -e, --show-errors
  -j JSON, --json JSON  Writes output json to path
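For example, to lint the current package and report only the more consequential issues (level 5 and above):

$ cd projectroot
$ floe lint -l 5 .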

Hiding Cubes & WorkFloes

New in version 0.2.181.

Cube & WorkFloe visibility is controlled by a naming convention. Any cube or WorkFloe whose name starts with an underscore will be skipped during detection. A common use case for this feature is a custom base class.

from floe.api import ComputeCube


class _HiddenBaseCube(ComputeCube):
    """This Cube will not be visible when uploaded to Orion"""
    pass


class MyCustomCube(_HiddenBaseCube):
    """This Cube will be visible"""
    pass

Hardware Requirements

A WorkFloe in Orion can specify hardware requirements and placement strategies on a per-cube basis. The Orion scheduler uses this hardware information and placement strategy to place cubes onto appropriate machines while attempting to minimize cost and startup time. Hardware requirements and placement strategies are exposed via the parameters listed below. For each parameter listed, the requirements pertain to the cube itself if it is serial, or to each copy of the cube if it is parallel.

In order to help maintain fairness in resource scheduling and cost accounting, Orion adjusts the requested amount of hardware according to a Dominant Resource Factor.

Example hardware requirement parameter overrides:

from floe.api import ComputeCube

class ExampleCube(ComputeCube):

    parameter_overrides = {
        "gpu_count": {"default": 1},
        "spot_policy": {"default": "Allowed"},
    }

CPUs

The cpu_count parameter specifies how many CPUs are made available to the cube. One potential use case for allocating more than one CPU per cube is to allow cube authors to call out to an existing binary which has multi-CPU parallelism. CPUs provided to cubes are confined using the Linux cpuset facility.
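For example, a cube that shells out to a multi-threaded binary might request several CPUs using the parameter_overrides pattern shown earlier (the count is illustrative):

from floe.api import ComputeCube


class MultiThreadedToolCube(ComputeCube):

    # Request 4 CPUs so an external multi-threaded binary can use them.
    parameter_overrides = {
        "cpu_count": {"default": 4},
    }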

GPUs

The gpu_count parameter specifies how many GPUs are made available to the cube. The allocated GPUs will be available as /dev/nvidia[n] for each of the n GPUs provided. Additionally, /dev/nvidiactl, /dev/nvidia-uvm, and /dev/nvidia-uvm-tools are available to the cube.

If a cube’s software requires a GPU during installation, then gpu must be set to 1 or greater under build_requirements of the package Manifest.

Memory

The memory_mb parameter specifies the amount of memory in mebibytes (MiB) to be allocated to the cube. Given system overheads, not all the memory allocated will be available. Consider requesting a couple hundred MiB more than is required. If a cube attempts to use more memory than is available, then the cube will be terminated.

Disk Space

The disk_space parameter specifies the amount of temporary disk space in mebibytes (MiB) to be made available to the cube. Given system overheads, not all disk space allocated will be available. Consider requesting a couple hundred MiB more than is required. If a cube attempts to use more space than is available, then system calls should return errors indicating there is no space. See Cube File Systems.
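For example, a cube estimated to need roughly 3.8 GiB of memory and 10 GiB of scratch space might pad both requests by a few hundred MiB (the values are illustrative):

from floe.api import ComputeCube


class LargeScratchCube(ComputeCube):

    parameter_overrides = {
        "memory_mb": {"default": 4096},    # ~3.8 GiB needed, padded for system overhead
        "disk_space": {"default": 10752},  # ~10 GiB needed, padded for system overhead
    }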

Instance Type

The instance_type parameter specifies a comma-separated list of instance types and families to include in or exclude from consideration when choosing where to execute a cube. Entries must be prefixes of valid AWS instance type names. Not all instance types are supported, and other resource requirements should generally be set instead of instance_type to preserve a WorkFloe’s ability to run in a variety of circumstances.

Examples:

Specifier      Meaning
""             Nothing specified or excluded
"t2.medium"    Only match t2.medium
"c5"           Match any c5*
"g,!g4"        Match any g*, except for g4*
"!t2"          Exclude t2*
"m,c5,z1"      Match any m*, c5*, or z1*
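For example, the "g,!g4" specifier from the table above could be applied to a cube as a parameter override:

from floe.api import ComputeCube


class OlderGPUInstanceCube(ComputeCube):

    # Allow any g* GPU instance family except g4*.
    parameter_overrides = {
        "instance_type": {"default": "g,!g4"},
    }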

Instance Tags

The instance_tags parameter specifies a comma-separated list of strings associated with Orion instance groups, limiting which groups are allowed to execute the cube. Instance tags are currently unused.

Spot policy

Cubes running in Orion may run on spot market instances to reduce costs. The spot_policy parameter can be used to control the usage of spot market instances. The following policies are available for the cube:

  • Allowed: Spot market instances may be used

  • Required: Spot market instances must be used

  • Preferred: Spot market instances are preferred

  • NotPreferred: Spot market instances are not preferred

  • Prohibited: Spot market instances must not be used

Note

  • It is possible to specify hardware constraints that are impossible to satisfy.

  • If possible, Orion will attempt to add more capacity to itself to satisfy cube requirements.

  • Hardware requirements and placement strategy parameters have no effect when running locally.

Additional Parallel Cube Parameters

Maximum Failures Per Work Item

Each work item will be retried up to max_failures times. If that limit is exceeded, the item is skipped. The default is 10.

Autoscale Workers

By default, Orion will autoscale the number of workers for each parallel cube based on the number of items of work queued. Autoscaling may be turned off by setting autoscale to False.

Maximum Parallel Workers

The maximum number of parallel workers (i.e., copies of the cube) may be bounded by max_parallel. The default is 1000.

Minimum Parallel Workers

The minimum number of parallel workers (i.e., copies of the cube) may be bounded by min_parallel. The default is 0.
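The sketch below overrides all four parallel cube parameters. It assumes a hypothetical cube that has been made parallel (the parallel cube machinery itself is not shown), and the values are illustrative:

from floe.api import ComputeCube


class MyParallelCube(ComputeCube):

    # These parameters are only meaningful for parallel cubes.
    parameter_overrides = {
        "max_failures": {"default": 3},    # retry each work item at most 3 times
        "autoscale": {"default": False},   # disable worker autoscaling
        "max_parallel": {"default": 256},  # never run more than 256 workers
        "min_parallel": {"default": 4},    # keep at least 4 workers
    }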

Job Scheduling

A job is a WorkFloe instantiated in Orion. Given a perfectly elastic set of resources, each job will be dynamically provisioned with as much hardware as it can use. However, there are limits to resource availability, so Orion schedules job resources according to a notion of fairness that takes into account how much a job has been slowed down by other jobs.

Slowdown Ratio:

1 - (resources a task has) / (resources a task would use given all healthy instances)

  • If the ratio is positive, this task is being blocked by other tasks.

  • If the ratio is 0, this task is not blocked at all.

  • If the ratio is negative, this task has more resources than it needs.

Jobs are placed in a descending queue, where jobs closer to the head of the queue are more likely to receive resources. Queue position is determined by the resources used by all of the jobs owned by the same user, as well as by the resources of the job itself. Therefore, no single user should be able to monopolize resources for extended periods of time.

Example:

Job A has 75 containers and wants 100

Job B has 2 containers and wants 3

Slowdown of Job A is 1 - (75/100) = 0.25

Slowdown of Job B is 1 - (2/3) ≈ 0.33

Job B will be given resources first, then Job A.
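The same calculation, expressed as a small Python sketch using the numbers from the example above:

def slowdown(has, wants):
    # 1 - (resources a task has) / (resources a task would use given all healthy instances)
    return 1 - has / wants


jobs = {"A": (75, 100), "B": (2, 3)}

# A larger slowdown means the job is more blocked, so it is served first.
order = sorted(jobs, key=lambda name: slowdown(*jobs[name]), reverse=True)
print(order)  # ['B', 'A']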

A job’s queue position may be affected by adjusting its weight. Modifying the weight should only be considered in exceptional circumstances.

Cube File Systems

Beginning in Orion 2020.1, the home directories mounted into running Cubes are read-only. This was necessary to enforce disk quotas and to avoid long-standing write-failure modes in the union file systems used by containers. The TMPDIR environment variable points to a scratch file system that is writable but mounted with an option that prevents execution of files within it.
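For example, a cube that needs a writable working file should use TMPDIR rather than its read-only home directory; a minimal sketch:

import os
import tempfile

# TMPDIR points at the writable scratch file system inside Orion; fall back to
# the system default temporary directory when running outside Orion.
scratch = os.environ.get("TMPDIR", tempfile.gettempdir())
with open(os.path.join(scratch, "intermediate.dat"), "w") as fh:
    fh.write("temporary results\n")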

Executables can be added to a package in several ways:

  • executables included in package dependencies

  • the executable bit set on a file in the archive

  • Python scripts run via the interpreter in a subprocess (e.g. python script.py or python -m module), which work without regard to executable bit or location

WorkFloe System Tags

When running in Orion, a WorkFloe can have system tags applied to it that denote some of the properties of the Floe.

Example of applying a system tag:

from floe.api import WorkFloe


job = WorkFloe(
        "Example System Tags Floe",
        title="Example System Tags Floe",
        uuid="<uuid>",
        description="Add a system tag to a Floe",
)
job.tags = ["Orion", "<system tag here>"]

System Tags for Fields

  • AnalyzeReady.AddsOrReplacesFields

  • AnalyzeReady.AddsRecords

  • 3DReady.AddsRecords

  • 3DReady.AddsOrReplacesFields

System Tags for Services

  • ServiceReady.Requires.{maas,mmds,fastrocs}

  • ServiceReady.Uses.{maas,mmds,fastrocs}