Orion Integration
Floe was designed and implemented to be a core part of Orion. This documentation describes the integration between the Floe Python API and Orion.
Development Lifecycle
In general, the process of developing a cube is as follows:
The cube is written in Python on a developer’s machine.
Unit tests are written to test the cube’s logic.
The cube source code (with any non-pip installed dependencies) is archived into a tar.gz file.
The archive is uploaded to Orion.
WorkFloes may also be developed in a similar fashion.
Packaging
A valid Floe package is an archive (tar.gz) containing Python and JSON files. Both cubes and WorkFloes may be included in a package. Other file types, such as executable binaries or auxiliary data, may also be included.
Note
Be sure that any cubes in a package are importable from the package’s top level directory.
Below is an example directory structure for a Floe package archive containing WorkFloes and cubes:
project/
cubes/
some_cubes.py
some_other_cubes.py
floes/
myfloe.py
myotherfloe.py
requirements.txt
manifest.json
This example package contains a requirements.txt file to specify Python dependencies, as well as a manifest.json file, which is discussed in the following section.
Note
macOS (OS X) users must set the COPYFILE_DISABLE environment variable to 1 before creating an archive with tar. Otherwise, invalid Python files with .py extensions will be included in the archive (due to the AppleDouble format), leading to package inspection errors.
Note
Any files named setup.py are assumed to be for packaging purposes and will not be searched for cubes.
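As a rough illustration of the packaging step, the sketch below builds a tar.gz archive with Python's standard tarfile module rather than the tar command. It skips AppleDouble files (names beginning with ._) so they cannot end up in the archive. The helper name build_package, the exclusion of __pycache__ directories, and keeping the project directory as the archive's top level are assumptions made for this example; they are not part of Floe or Orion.

```python
import tarfile
from pathlib import Path


def build_package(project_dir, archive_path):
    """Archive project_dir into a tar.gz, skipping AppleDouble files.

    Hypothetical helper for illustration; not part of the Floe API.
    """
    project_dir = Path(project_dir).resolve()

    def keep(tarinfo):
        name = Path(tarinfo.name).name
        # AppleDouble companions look like "._some_cubes.py"
        if name.startswith("._") or name == "__pycache__":
            return None  # returning None excludes the entry
        return tarinfo

    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores paths relative to the project's parent directory
        archive.add(str(project_dir), arcname=project_dir.name, filter=keep)
    return archive_path
```

Write the archive to a location outside the project directory so that it does not try to include itself.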
Manifest
Every Floe package must contain a manifest.json file containing information that describes the package.
Example manifest.json:
{
"requirements": "requirements.txt",
"name": "PSI4",
"conda_dependencies": ["psi4=0.3.533", "numpy"],
"conda_channels": ["psi4"],
"version": "0.1.1",
"base_os": "amazonlinux2",
"python_version": "3.7.3",
"build_requirements": {"memory": 4096},
"documentation_root": "docs",
"classify_floes": true,
"debug": true,
"deduplicate_files": true,
"show_biggest_files": true,
"show_home_files": true,
"show_final_environment": true,
"show_json_output": true,
"use_mamba_solver": true,
"prebuild": "prebuild.sh",
"postbuild": "postbuild.sh"
}
requirements must contain a relative path (from manifest.json) to a valid pip requirements file.
name must contain a string name of the package.
version must contain a string version for the package.
python_version must contain the version of the desired Python interpreter (Python 3 only).
conda_dependencies, if provided, must be a list of packages installable from conda.
conda_channels, if provided, must be a list of valid conda channels.
build_requirements, if provided, must be a dictionary specifying the required hardware for building and inspecting the package. Actual hardware resources are determined using the Dominant Resource Factor. It may contain the following keys:
    cpu, if provided, must be an integer (number of CPUs). Default is 1.
    gpu, if provided, must be >=1 (GPU required for package installation) or 0 (GPU not required to build and inspect the package). Default is 0.
    memory, if provided, must be an integer (number of mebibytes). Default is 2048.
    disk_space, if provided, must be an integer (number of mebibytes). Default is 256.
    instance_type, if provided, must be a string of the form described at Instance Type.
    instance_tags (currently unused), if provided, must be a list of strings associated with Orion instance groups.
documentation_root, if provided, must be the relative path to a directory containing at minimum an index.html file.
base_os specifies which operating system to build the package environment from. The options are ubuntu-18.04, ubuntu-20.04, and amazonlinux2, with the default operating system being ubuntu-20.04.
classify_floes, if set to true, will organize floes from this package in the Orion UI by their classification (shown as “Category” in UI).
debug enables debug mode during package inspection. The package logs will contain more information, along with any errors. Default: false.
deduplicate_files replaces duplicate files with hard links. Default: false.
show_biggest_files runs several du and find commands to highlight the largest directories and files. Default: false.
show_home_files lists all files in the home directory. Default: false.
show_final_environment exports and shows the final environment. Default: false.
show_json_output emits conda output in JSON without progress indicators. Default: false.
use_mamba_solver installs and enables the more efficient mamba dependency solver (mamba and its dependencies are later uninstalled). Default: false.
prebuild, if provided, must be a relative path (from manifest.json) to a bash script that will be executed as root before pip and all package dependencies are installed. This script is only executed when uploading a package that is building in Orion (i.e. uploading a .tar.gz package).
postbuild, if provided, must be a relative path (from manifest.json) to a bash script that will be executed as the floe user after all dependencies have been installed. This script is only executed when uploading a package that is building in Orion (i.e. uploading a .tar.gz package).
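The field rules above lend themselves to a quick local check before uploading. The following is a minimal sketch, not the validation Orion itself performs: the function name check_manifest is hypothetical, and treating requirements, name, version, and python_version as the required keys is an assumption based on the "must contain" wording above.

```python
import json

# Keys described above with "must contain"; treated as required here (assumption).
REQUIRED_KEYS = ("requirements", "name", "version", "python_version")


def check_manifest(manifest):
    """Return a list of human-readable problems found in a manifest dict."""
    problems = []
    for key in REQUIRED_KEYS:
        if not isinstance(manifest.get(key), str):
            problems.append(f"{key} must be a string")
    if not str(manifest.get("python_version", "")).startswith("3"):
        problems.append("python_version must be a Python 3 version")
    for key in ("conda_dependencies", "conda_channels"):
        if key in manifest and not isinstance(manifest[key], list):
            problems.append(f"{key} must be a list")
    build = manifest.get("build_requirements", {})
    if not isinstance(build, dict):
        problems.append("build_requirements must be a dictionary")
    else:
        for key in ("cpu", "memory", "disk_space"):
            if key in build and not isinstance(build[key], int):
                problems.append(f"build_requirements.{key} must be an integer")
    return problems


example = json.loads("""
{
  "requirements": "requirements.txt",
  "name": "PSI4",
  "version": "0.1.1",
  "python_version": "3.7.3",
  "conda_dependencies": ["psi4=0.3.533", "numpy"],
  "build_requirements": {"memory": 4096}
}
""")
print(check_manifest(example))  # prints []
```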
Note
Do not add unnecessary dependencies to your package. Each dependency added increases the size of the environment created in Orion, which also increases the start-up cost for WorkFloes.
Complex dependencies can cause conda to use large amounts of memory and time when building a package.
While conda has improved, it may be necessary to increase the memory in build_requirements.
Pinning each package to a specific channel (<CHANNEL>::<PACKAGE>) should help conda use fewer resources. Pinning also helps ensure it does not accidentally pick the wrong package source.
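To make the <CHANNEL>::<PACKAGE> form concrete, here is a small sketch that builds pinned entries for conda_dependencies. The helper pin_to_channel is hypothetical, and the choice of conda-forge as the source for numpy is only an illustrative assumption.

```python
def pin_to_channel(channel, package_spec):
    """Prefix a conda package spec with its channel: <CHANNEL>::<PACKAGE>."""
    return f"{channel}::{package_spec}"


# Fragment of a manifest, expressed as a Python dict for illustration
manifest_fragment = {
    "conda_channels": ["psi4", "conda-forge"],
    "conda_dependencies": [
        pin_to_channel("psi4", "psi4=0.3.533"),
        pin_to_channel("conda-forge", "numpy"),
    ],
}
print(manifest_fragment["conda_dependencies"])
# prints ['psi4::psi4=0.3.533', 'conda-forge::numpy']
```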
Package OS and GCC Versions
Note
Orion 2021.1 will update the previous default choice for base_os from ubuntu-14.04 to ubuntu-18.04.
Python dependencies which require compilation may be impacted, as the corresponding GCC version installed within the environment has also changed.
The choices for a manifest’s base_os, along with the corresponding GCC version, GLIBC version, Orion version range, and compatible OpenEye-Toolkits versions, are shown in the table below.
Operating System | GCC Version | GLIBC Version | Orion Releases | Compatible OpenEye-Toolkits |
---|---|---|---|---|
ubuntu-14.04 | 4.8 | 2.19 | 2018.4-2022.2 | <=2023.1.1 |
ubuntu-18.04 | 7.4 | 2.27 | 2020.3-2022.2 | <=2023.1.1 |
ubuntu-20.04 | 9.3 | 2.31 | 2020.3-2022.1 | All |
ubuntu-20.04 | 9.4 | 2.31 | 2022.2+ | All |
ubuntu-22.04 | 11.4.0 | 2.35 | 2023.2+ | All |
amazonlinux1 | 4.8 | 2.17 | 2020.1-2022.2 | <=2023.1.1 |
amazonlinux2 | 7.3 | 2.26 | 2020.1+ | <=2023.1.1 |
amazonlinux2023 | 11.4.1 | 2.34 | 2023.2+ | All |
OpenEye-Toolkits Version | GLIBC Min Version |
---|---|
>=2023.2.0 | 2.28 |
<=2023.1.1 | 2.17 |
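The two tables above can be combined to check whether a given base_os can host a given OpenEye-Toolkits release. The sketch below encodes the GLIBC values from the tables; the function names are hypothetical, and treating every version below 2023.2.0 as needing only GLIBC 2.17 is an assumption, since the table does not cover versions between 2023.1.1 and 2023.2.0.

```python
# GLIBC versions copied from the base_os table, as (major, minor) tuples
BASE_OS_GLIBC = {
    "ubuntu-14.04": (2, 19),
    "ubuntu-18.04": (2, 27),
    "ubuntu-20.04": (2, 31),
    "ubuntu-22.04": (2, 35),
    "amazonlinux1": (2, 17),
    "amazonlinux2": (2, 26),
    "amazonlinux2023": (2, 34),
}


def min_glibc_for_toolkits(toolkits_version):
    """Minimum GLIBC for a toolkits version given as a tuple, e.g. (2023, 2, 0)."""
    # Assumption: anything older than 2023.2.0 only needs GLIBC 2.17
    return (2, 28) if toolkits_version >= (2023, 2, 0) else (2, 17)


def is_compatible(base_os, toolkits_version):
    """True if the base_os ships a GLIBC new enough for the toolkits version."""
    return BASE_OS_GLIBC[base_os] >= min_glibc_for_toolkits(toolkits_version)


print(is_compatible("amazonlinux2", (2023, 2, 0)))  # prints False
print(is_compatible("ubuntu-20.04", (2023, 2, 0)))  # prints True
```

The results agree with the Compatible OpenEye-Toolkits column: every base_os whose GLIBC is older than 2.28 is limited to toolkits <=2023.1.1.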
Note
To identify the active Orion version, navigate to the Help/Version Info tab in the web interface.
NVIDIA Driver Versions
A recent version of the NVIDIA driver is provided on GPU instances. The desired version of CUDA should be installed through conda.
Orion Release | NVIDIA Driver Release |
---|---|
2019.5.* | 384.81 |
2020.1.* | 440.44 |
2020.2.* | 440.44 |
2020.3.* | 450.51.06 |
2021.2.* | 470.57.02 |
2022.1.* | 470.82.01 |
2022.2.* | 470.103.01 |
2022.3.* | 470.103.01 |
2023.2.* | 525.89.02 |
Caution
Orion 2020.1 changed how GPU packages are built. First, most GPU packages no longer need a gpu build requirement. Second, the cuda9 instance tag was removed from scaling groups, a change that effectively breaks uploads of older GPU packages. The cuda9 instance tag must be removed from the build requirements. Otherwise, the package upload will fail with the message, “Error processing package: unsatisfiable requirements”. Instead of an instance tag, CUDA should be installed as a conda dependency.
Linting and Detection
Once a package is created with valid manifest.json and requirements.txt files, you are almost ready to upload. Upon upload, Orion will first construct a virtual environment containing all of the dependencies specified, which can take several minutes. After building the environment, Orion will attempt to discover any cubes and WorkFloes contained in your package. If any errors are encountered while inspecting the package, the inspection will fail. You can find these errors ahead of time by running the same detection code that Orion uses.
Example of running cube and WorkFloe detection:
$ cd projectroot
$ floe detect . --out=result.json
The output of the above command, result.json, will contain JSON representing the cubes and Floes discovered, as well as any errors encountered while doing so.
The Floe detection code can also print a simplified output, which contains any encountered errors:
$ cd projectroot
$ floe detect -s .
Before uploading, it is recommended that you check your cubes and Floes for minor issues. Floe comes with a built-in linting tool, called floe lint, which can be used to check many different properties of your cubes and WorkFloes.
$ floe lint --help
usage: floe lint [-h] [-l LEVEL] [-i IGNORE [IGNORE ...]] [-v] [-e] [-j JSON] N [N ...]

Floe Lint

positional arguments:
  N                     List of paths to lint

optional arguments:
  -h, --help            show this help message and exit
  -l LEVEL, --level LEVEL
                        The minimum linting level to report, default 0. Levels 0-4 are the least consequential. Levels >=5 indicate issues
                        that either cause subtle errors or prevent a floe from running at all.
  -i IGNORE [IGNORE ...], --ignore IGNORE [IGNORE ...]
                        List of file patterns to ignore
  -v, --verbose
  -e, --show-errors
  -j JSON, --json JSON  Writes output json to path
Hiding Cubes and WorkFloes
Cube and WorkFloe visibility is controlled by a naming convention. Any cube or WorkFloe with a name starting with an underscore will be skipped during detection.
A common use case for this feature is to have a custom base class.
from floe.api import ComputeCube


class _HiddenBaseCube(ComputeCube):
    """This Cube will not be visible when uploaded to Orion"""
    pass


class MyCustomCube(_HiddenBaseCube):
    """This Cube will be visible"""
    pass
Hardware Requirements
A WorkFloe in Orion can specify hardware requirements and placement strategies on a per cube basis. The Orion scheduler will use the hardware information and placement strategy to place the cubes onto appropriate machines, while attempting to minimize cost and startup time. Hardware requirements and placement strategies are exposed via the parameters listed below. For each parameter listed, the requirements pertain to the cube if serial, or each copy of the cube if parallel.
To help maintain fairness in resource scheduling and cost accounting, Orion adjusts the requested amount of hardware according to a Dominant Resource Factor.
Example hardware requirement parameter overrides:
from floe.api import ComputeCube


class ExampleCube(ComputeCube):
    parameter_overrides = {
        "gpu_count": {"default": 1},
        "spot_policy": {"default": "Allowed"},
    }
CPUs
The cpu_count parameter specifies how many CPUs are made available to the cube. One potential use case for allocating more than one CPU per cube is to allow cube authors to call out to an existing binary which has multi-CPU parallelism. CPUs provided to cubes are confined using the Linux cpuset facility.
GPUs
The gpu_count parameter specifies how many GPUs are made available to the cube. The allocated GPUs will be available as /dev/nvidia[n] for each of the n GPUs provided. Additionally, /dev/nvidiactl, /dev/nvidia-uvm, and /dev/nvidia-uvm-tools are available to the cube.
If a cube’s software requires a GPU during installation, then gpu must be set to 1 or greater under build_requirements of the package Manifest.
Memory
The memory_mb parameter specifies the amount of memory in mebibytes (MiB) to be allocated to the cube. Given system overheads, not all the memory allocated will be available. Consider requesting a couple hundred MiB more than is required. If a cube attempts to use more memory than is available, then the cube will be terminated.
Disk Space
The disk_space parameter specifies the amount of temporary disk space in mebibytes (a mebibyte, or MiB, is 1,048,576 bytes) to be made available to the cube. Given system overheads, not all disk space allocated will be available. Consider requesting a couple hundred MiB more than is required. If a cube attempts to use more space than is available, then system calls should return errors indicating there is no space.
See Cube File Systems.
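The advice to request somewhat more memory and disk than strictly needed can be folded into a small helper. This is only a sketch: with_headroom is a hypothetical name, the 200 MiB padding is an assumed reading of "a couple hundred MiB", and the {"default": ...} override shape is extrapolated from the hardware override example earlier in this section.

```python
def with_headroom(required_mib, headroom_mib=200):
    """Pad a resource estimate (in MiB) to allow for system overhead."""
    return required_mib + headroom_mib


# Intended to be assigned as a cube's parameter_overrides class attribute
parameter_overrides = {
    "memory_mb": {"default": with_headroom(4096)},
    "disk_space": {"default": with_headroom(10240)},
}
print(parameter_overrides["memory_mb"])  # prints {'default': 4296}
```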
Instance Type
The instance_type parameter specifies a comma-separated list of instance types and families, to be included or excluded for consideration in execution of the cube. They must be prefixes of valid instance type names for healthy groups listed by Orion. Plain prefixes are combined with a logical OR. Negated conditions are combined with a logical AND. Negations override inclusion, regardless of specificity.
Prior to Orion 2024.1.1, negations in this parameter did not always work as expected. For example, c,!c5,!c6 would match m6i.xlarge.
See also AWS instance type names.
Note
Not all instance types are supported, and customers should usually set other resource requirements instead of instance_type, to maintain a WorkFloe’s ability to run in a variety of circumstances.
Examples:
Specifier | Meaning |
---|---|
| Nothing specified or excluded |
| Only match t2.medium |
| Match any c5* |
| Match any g*, except for g4* |
| Exclude g*. The inclusion of g4* is overridden by the negation |
| Exclude t2* |
| Match any m*, c5*, or z1* |
| Match any g*, except for g3.8* and g3.16* |
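The matching rules described in this section (plain prefixes ORed together, negated prefixes ANDed together, negation overriding inclusion) can be sketched in a few lines. This is an illustration of the stated semantics only, not Orion's scheduler code; the function name matches_instance_type is hypothetical, and the ! negation syntax is taken from the c,!c5,!c6 example above.

```python
def matches_instance_type(specifier, instance_type):
    """Evaluate an instance_type specifier against one instance type name."""
    terms = [term.strip() for term in specifier.split(",") if term.strip()]
    included = [t for t in terms if not t.startswith("!")]
    excluded = [t[1:] for t in terms if t.startswith("!")]

    # Negated prefixes are ANDed: a single matching negation rules the type out
    if any(instance_type.startswith(prefix) for prefix in excluded):
        return False
    # Plain prefixes are ORed; with none given, nothing is required
    return not included or any(instance_type.startswith(p) for p in included)


print(matches_instance_type("c,!c5,!c6", "c4.large"))    # prints True
print(matches_instance_type("c,!c5,!c6", "m6i.xlarge"))  # prints False
print(matches_instance_type("g4,!g", "g4dn.xlarge"))     # prints False
```

The last call shows a negation overriding a more specific inclusion, as stated above.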
Spot policy
Cubes running in Orion may run on spot market instances to reduce costs. The spot_policy parameter can be used to control the usage of spot market instances. The following policies are available for the cube:
Allowed: Spot market instances may be used
Required: Spot market instances must be used
Preferred: Spot market instances are preferred
NotPreferred: Spot market instances are not preferred
Prohibited: Spot market instances must not be used
Note
It is possible to specify hardware constraints that are impossible to satisfy.
If possible, Orion will attempt to add more capacity to itself to satisfy cube requirements.
Hardware requirements and placement strategy parameters have no effect when running locally.
Capacity States
Orion provides a capacity state for running cubes, which indicates how fully each cube's desired resource usage is being met.
See Capacity States for more information.
Additional Parallel Cube Parameters
Maximum Failures Per Work Item
Each work item will be retried up to max_failures times. If that limit is exceeded, the item is skipped. The default is 10.
Autoscale Workers
By default, Orion will autoscale the number of workers for each parallel cube, based on the number of items of work queued. Autoscaling may be turned off by setting autoscale to False.
Maximum Parallel Workers
The maximum number of parallel workers for a cube (that is, copies) may be bounded by max_parallel. The default is 1000.
Minimum Parallel Workers
The minimum number of parallel workers for a cube (that is, copies) may be bounded by min_parallel. The default is 0.
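As a sketch of how the parallel cube parameters above might be set together, the dictionary below follows the same {"default": ...} override shape used in the hardware requirements example. Applying that shape to these parameters is an assumption, the values are illustrative, and check_parallel_bounds is a hypothetical helper rather than part of Floe.

```python
# Hypothetical overrides for a parallel cube; the values are illustrative only
parameter_overrides = {
    "max_failures": {"default": 3},   # retry each work item up to 3 times
    "autoscale": {"default": True},   # the default behavior, stated explicitly
    "max_parallel": {"default": 50},  # upper bound on the number of copies
    "min_parallel": {"default": 2},   # lower bound on the number of copies
}


def check_parallel_bounds(overrides):
    """Return True if the worker bounds are consistent (0 <= min <= max)."""
    # Fall back to the documented defaults of 0 and 1000 when a bound is unset
    low = overrides.get("min_parallel", {}).get("default", 0)
    high = overrides.get("max_parallel", {}).get("default", 1000)
    return 0 <= low <= high


print(check_parallel_bounds(parameter_overrides))  # prints True
```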
Experimental Parallel Cube implementation
Advanced users may utilize the experimental parallel cube parameter in Orion to enable an experimental parallel cube implementation that avoids the 12-hour time limit for a single item of work.
Job Scheduling
A job is a WorkFloe instantiated in Orion. Given a perfectly elastic set of resources, each job will be dynamically provisioned with as much hardware as it can use. However, there are limits to resource availability, so Orion schedules job resources according to a notion of fairness that takes into account how much a job has been slowed down by other jobs.
- Slowdown Ratio: 1 - (resources a task has) / (resources a task would use given all healthy instances)
If the ratio is positive, this task is being blocked by other tasks.
If the ratio is 0, this task is not blocked at all.
If the ratio is negative, this task has more resources than it needs.
Jobs are placed in a descending queue, where jobs closer to the head of the queue are more likely to get resources. Queue position is determined by the resources used by all of the jobs owned by the same user, as well as the resources of the job itself. Therefore, no user should be able to monopolize resources for extended periods of time.
Example:
Job A has 75 containers and wants 100
Job B has 2 containers and wants 3
Slowdown of Job A is 1 - (75/100) = 0.25
Slowdown of Job B is 1 - (2/3) ≈ 0.33
Job B has the higher slowdown, so it will be given resources first, then Job A
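The example can be reproduced with a few lines of Python. This sketch covers only the slowdown ratio itself; it ignores the per-user resource accounting and job weights that also influence queue position, and the function name slowdown is hypothetical.

```python
def slowdown(has, wants):
    """Slowdown ratio: 1 - (resources held) / (resources wanted)."""
    return 1 - has / wants


# (containers held, containers wanted) for each job in the example
jobs = {"A": (75, 100), "B": (2, 3)}

# The most slowed-down job comes first
order = sorted(jobs, key=lambda name: slowdown(*jobs[name]), reverse=True)
print(order)  # prints ['B', 'A']
```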
A job’s queue position may be affected by adjusting its weight. Modifying the weight should only be considered in exceptional circumstances.
Job Logs
When debugging floes and cubes, the job logs contain most of the information available, including:
- the environment log generated while processing the package
- stdout and stderr messages from all cubes
- the job specification and parameters
- job scaling activity
- a graph of metrics collected (metrics collection can be specified when starting the floe)
The job logs can be viewed in a running floe or exported from a completed floe. There are slight differences between running and completed logs owing to some post-completion processing. On the primary Orion stack, job logs are retained for one month; on customer stacks the retention time may be longer. Viewing or exporting job logs does not carry any significant performance implications, although printing significant amounts of data to stdout is not recommended. The primary cost associated with job logs is S3 storage cost.
Cube File Systems
Beginning in Orion 2020.1, the home directories mounted into running cubes are read-only. This was necessary to enforce disk quotas and to avoid long-standing write-failure modes in the union file systems used by containers. The TMPDIR environment variable points to a writable scratch file system, which is mounted with an option that prevents execution of files within it.
Executables can be added to a package in several ways:
executables included in package dependencies
executable bit set on file in archive
Python scripts run by python -m script.py in a subprocess work without regard to executable bit or location
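The sketch below illustrates the last point together with the scratch file system described above: it writes a small script into a temporary directory (Python's tempfile module honors TMPDIR) and runs it by handing the path to the interpreter, so the script needs no executable bit. The file name hello.py and its contents are made up for the example.

```python
import os
import subprocess
import sys
import tempfile

# tempfile honors TMPDIR, so inside a cube this lands on the scratch file system
with tempfile.TemporaryDirectory() as scratch:
    script = os.path.join(scratch, "hello.py")
    with open(script, "w") as handle:
        handle.write('print("hello from scratch")\n')
    # Run through the interpreter; the script itself is never executed directly
    result = subprocess.run(
        [sys.executable, script], capture_output=True, text=True, check=True
    )

output = result.stdout.strip()
print(output)  # prints hello from scratch
```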