Scaling Groups

This page displays Orion scaling group information.

[Figure: Screenshot of the Scaling Groups page]

The Scaling Groups subpage details the compute scaling groups available to Orion. The two tables list spot and non-spot (on-demand) EC2 instances, respectively.

Each table contains the following columns.

Type

EC2 instance type and specification.

  • GPU types are highlighted in green.
  • Hovering over the info icon provides further details about the
    Auto Scaling Group (ASG).
  • Group Name represents either an internal or an AWS group name.
  • Affinity is a scheduler feature that increases the scheduler's preference
    for a group (the default is 0).

State

Indicates the state of an ASG.

  • Healthy: New instances can be added.
  • Drain: Due to administrative action, existing instances will not
    be given more work and will be terminated as they finish their current work.
    New instances will not be added.
  • Not Configured: The group is new and will attempt to start an instance
    to finish its configuration.
  • Scaling Suspended: Existing instances accept work, but new instances
    cannot be added because of insufficient AWS capacity.
  • Failed: Due to persistent errors, existing instances will not be given more
    work and will be terminated. New instances will not be added.
  • Suspect: Non-scaling AWS errors are occurring. Existing instances will
    accept work, but new instances cannot be added. If errors continue, the group
    state will move to “Failed”.
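The state list above can be summarized as a small lookup table. The following is a minimal Python sketch for illustration only: `StateBehavior`, `ASG_STATES`, and `next_state` are hypothetical names, not part of any Orion API. `None` marks behavior this page does not specify, treating existing Healthy instances as accepting work is an assumption, and the only transition modeled is the Suspect-to-Failed escalation described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StateBehavior:
    """Hypothetical summary of what each ASG state allows."""
    existing_accept_work: Optional[bool]  # None: not specified on this page
    can_add_instances: bool
    terminating_existing: bool


ASG_STATES = {
    # Assumption: existing Healthy instances accept work.
    "Healthy": StateBehavior(True, True, False),
    "Drain": StateBehavior(False, False, True),
    # Not Configured attempts to start an instance; existing work is unspecified.
    "Not Configured": StateBehavior(None, True, False),
    "Scaling Suspended": StateBehavior(True, False, False),
    "Failed": StateBehavior(False, False, True),
    "Suspect": StateBehavior(True, False, False),
}


def next_state(state: str, errors_persist: bool) -> str:
    """Model only the documented transition: Suspect escalates to Failed."""
    if state == "Suspect" and errors_persist:
        return "Failed"
    return state
```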

Scaling

ASG policy:

  • Adjusts the desired capacity of the group, between the minimum and maximum capacity values.
  • Launches or terminates instances as needed. The number of instances increases
    (up arrow) or decreases (down arrow) dynamically to meet changing conditions.
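The bound on desired capacity can be sketched as a simple clamp between the minimum and maximum sizes, plus a helper that maps a capacity change to the up/down arrow. This is a minimal Python illustration of the min/max rule only, not the AWS or Orion scaling policy; `clamp_desired` and `scaling_direction` are hypothetical names.

```python
def clamp_desired(requested: int, min_size: int, max_size: int) -> int:
    """Keep a requested desired capacity within [min_size, max_size]."""
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return max(min_size, min(requested, max_size))


def scaling_direction(current_desired: int, new_desired: int) -> str:
    """Return 'up', 'down', or 'steady', mirroring the arrows in the table."""
    if new_desired > current_desired:
        return "up"
    if new_desired < current_desired:
        return "down"
    return "steady"
```

For example, a request for 25 instances in a group with Min. Size 0 and Max. Size 10 is capped at 10, and moving from 4 to 10 desired instances shows as scaling up.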

Details

Shows the state of the ASG and whether it is Active or Deactivated.

Healthy: Number of instances currently available to process work.

Desired Instances: Number of instances the scheduler would like to have as healthy.

Min. Size: Minimum size of the group (the default is 0).

Max. Size: Maximum size of the group (useful for limiting spend in Orion).

Usage

Amount of resources currently provided by the ASG.

Cost/Hour

Hourly instance cost. For spot instances, this value is updated regularly.
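Because Max. Size caps the number of instances, it also caps a group's hourly spend. The minimal Python sketch below assumes that Cost/Hour is a per-instance figure (as "Hourly instance cost" suggests) and that every instance in a group has the same instance type; `hourly_spend` is a hypothetical helper. Since spot prices are updated regularly, any result is only a snapshot.

```python
def hourly_spend(instance_count: int, cost_per_hour: float) -> float:
    """Approximate hourly spend for a group of identical instances."""
    if instance_count < 0 or cost_per_hour < 0:
        raise ValueError("count and cost must be non-negative")
    return instance_count * cost_per_hour


# Illustrative inputs only: 4 healthy instances, Max. Size 10, $0.50/hour each.
current = hourly_spend(4, 0.50)   # current spend with the healthy count
ceiling = hourly_spend(10, 0.50)  # upper bound set by Max. Size
```

With these example inputs, the group currently costs about $2.00 per hour and can never exceed about $5.00 per hour at that price.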

Pool

Scheduler feature used to segregate tasks into separate scaling groups.

Edit

Allows an Orion Stack admin to manage an ASG. Available options
are Min Size, Max Size, Min Reserve, Affinity, and State.

As tasks are submitted to Orion, the scheduler decides where to place the work based on the hardware and spot requirements of the cubes, as well as other factors such as pool and affinity. As the workload grows, more instances are launched. This first appears as an increase in the desired instance count; shortly afterward, the healthy count should rise to match it. Neither the desired nor the healthy count can exceed the maximum size.
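A greatly simplified picture of placement is: discard groups whose hardware, spot setting, pool, or state rule them out, then prefer the group with the highest affinity. This is a hypothetical Python sketch, not Orion's scheduler. `Group`, `CubeRequirements`, and `choose_group` are illustrative names, the rule that GPU groups serve only GPU cubes is an assumption, and the real scheduler may weigh factors not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Group:
    """Hypothetical view of one scaling group."""
    name: str
    gpu: bool                   # GPU instance type?
    spot: bool                  # spot (True) or on-demand (False)
    pool: Optional[str] = None  # pool the group serves, if any
    affinity: int = 0           # higher is preferred; the default is 0
    state: str = "Healthy"


@dataclass
class CubeRequirements:
    """Hypothetical hardware, spot, and pool needs of a cube."""
    needs_gpu: bool
    allow_spot: bool
    pool: Optional[str] = None


def choose_group(groups: List[Group], req: CubeRequirements) -> Optional[Group]:
    eligible = [
        g for g in groups
        # Drain and Failed groups are not given more work
        if g.state not in ("Drain", "Failed")
        # assumption: GPU groups serve only GPU cubes, and vice versa
        and g.gpu == req.needs_gpu
        # spot groups are used only if the cube tolerates spot instances
        and (req.allow_spot or not g.spot)
        # pools keep work segregated into their own groups
        and g.pool == req.pool
    ]
    if not eligible:
        return None
    # Prefer the highest affinity; on ties, max() keeps the first in list order.
    return max(eligible, key=lambda g: g.affinity)
```

Raising a group's affinity above the default of 0 makes it win among otherwise eligible groups, which matches the description of affinity as a preference rather than a hard requirement.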

As work completes and Orion begins to scale down, the desired count drops much faster than the healthy count. There are two reasons for this: (1) those instances are likely still working on their current tasks, and (2) Orion does not terminate instances as soon as they finish their work, because instance startup takes real time (several minutes, depending on instance type and pricing model). Instead, idle instances remain as hot instances, ready for new work.
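The lag between the desired and healthy counts during scale-down can be illustrated with a hypothetical idle grace period: an instance becomes a termination candidate only when it is not busy and has been idle for at least that long. The grace period, `Instance`, and `instances_to_terminate` are assumptions for illustration; this page does not say how long Orion keeps hot instances or how it picks which ones to stop.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    busy: bool           # still working on a task
    idle_minutes: float  # minutes since it last finished work


def instances_to_terminate(instances: List[Instance], desired: int,
                           idle_grace_minutes: float) -> List[Instance]:
    """Pick surplus instances that are idle past the (hypothetical) grace period."""
    surplus = len(instances) - desired
    if surplus <= 0:
        return []
    candidates = [i for i in instances
                  if not i.busy and i.idle_minutes >= idle_grace_minutes]
    # Stop the longest-idle instances first.
    candidates.sort(key=lambda i: i.idle_minutes, reverse=True)
    return candidates[:surplus]
```

With four healthy instances (one busy, one idle for 2 minutes, two idle for 20 and 30 minutes), a desired count of 0, and a 10-minute grace period, only two instances are terminated, so the healthy count stays at 2 even though the desired count is already 0.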

If the desired count stays above the healthy count for a long period, spot availability is probably limited or nonexistent.
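This rule of thumb can be turned into a simple check over periodic samples of the two counts. In the minimal Python sketch below, `spot_shortfall_suspected` and its `threshold_minutes` parameter are hypothetical; the page does not define how long counts as "a long period", so the threshold is a value you would choose yourself. The check is only meaningful for spot groups.

```python
from typing import List, Optional, Tuple


def spot_shortfall_suspected(samples: List[Tuple[float, int, int]],
                             threshold_minutes: float) -> bool:
    """samples: (minute, desired, healthy) tuples in time order.

    Returns True if desired stays above healthy for at least
    threshold_minutes without interruption.
    """
    shortfall_start: Optional[float] = None
    for minute, desired, healthy in samples:
        if desired > healthy:
            if shortfall_start is None:
                shortfall_start = minute  # shortfall begins
            if minute - shortfall_start >= threshold_minutes:
                return True
        else:
            shortfall_start = None  # healthy caught up; reset
    return False
```

A group that asks for 5 instances but stays at 2 to 3 healthy for 30 minutes would be flagged with a 30-minute threshold, while a brief gap that closes during normal instance startup would not.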