Ionic Funding Model and Server Service

Overview

This document describes how the Ionic cluster and its associated Server Service are administered and managed, including:

  • how faculty may contribute to the cluster in exchange for increased levels of access,
  • when and how servers are added and ultimately removed from the cluster, and
  • an overview of how CS Staff manages the cluster.

In this document, the terms server and node are used largely interchangeably. Server generally refers to a physical device (e.g., purchasing, maintenance, billing) while node generally refers to a computational resource (e.g., job scheduling).

Ionic Cluster

The Ionic cluster is a Beowulf cluster operated and maintained by CS Staff as a departmental resource for the benefit of those with regular CS accounts (as opposed to limited CS accounts). Technical details on how to gain access to and use the Ionic cluster are found on the CS Cluster Computing page and in the Cluster FAQ.

From time to time, new servers are added to Ionic and old servers are removed. As a result, Ionic is heterogeneous with respect to the precise make and model (and, therefore, the specifications) of its constituent servers.

Funding Model

Ionic is funded by CS research groups through up-front contributions to purchase new server hardware and ongoing contributions to cover the costs of operating and maintaining the cluster. The CS Department also contributes to Ionic via initial and ongoing contributions.

The ongoing contributions are recovered through the Server Service. The Server Service is a flat-rate (per server-day) billing of most (but not all) of the expenses associated with operating and maintaining servers for the Ionic cluster.

Resource Pooling

While we encourage new research groups to enable sharing for their partitions, this is not currently a requirement. Enabling sharing gives access to a larger set of resources for short-lived jobs and is described in more detail below (see: Using the Ionic Cluster). There are pros and cons to pooling resources for short-lived jobs, and this is best discussed with CS Staff when contemplating a contribution to the Ionic cluster (see: Note 1).

Server Hardware

Those who wish to make hardware contributions to the Ionic cluster must coordinate with CS Staff to ensure that the equipment/purchase will meet purchasing requirements (e.g., vendor, warranty), data center physical limitations (e.g., power, cooling, size, and networking), and compatibility with the existing Ionic cluster environment (e.g., operating system, remote management, reliability, CS Staff maintainability).

Server Sponsorship and Longevity

Servers are "sponsored" by the group that covers their ongoing Server Service charges and server-specific repair costs. Generally, this is the group that covered the cost of the initial purchase. Sponsorship can be transferred; on the date of transfer, recurring costs and associated benefits transfer to the new group.

Generally, we expect servers to remain part of the Ionic cluster until they are five years old, at which point they will be scheduled for removal (see: Note 2). At the discretion of CS Staff, servers older than five years (and which have ongoing sponsorship) may remain in the cluster. When CS Staff schedules otherwise-working servers for removal, advance notice will be given to the sponsor and reasonable accommodations will be made to limit disruption to research needs.

Sponsors can request that a server be removed from the Ionic cluster and returned to them. This can happen, for example, when their research needs are no longer aligned with having the server in the cluster or when repair costs for a failed server become excessive.

CS Staff may remove servers if they cannot be repaired or are no longer supported by the system software. Additionally, the data center housing the Ionic cluster has space, power, and cooling limitations. Any influx of new hardware may, by necessity, force the early retirement of otherwise working older/less-capable hardware.

Using the Ionic Cluster

Computational jobs are scheduled on the Ionic cluster using the Slurm Workload Manager. When a user with a regular CS login account enables their access to the Ionic cluster, they will initially be associated with (only) the "allcs" Slurm account. Note that the term "account" is overloaded: a CS user (login) account corresponds to a particular person and the account name is that person's netid; a Slurm account is an "association" of, among other things, a group of user accounts. CS user accounts can be associated with multiple Slurm accounts.

Research groups that sponsor servers are assigned a name that is used for both the name of the Slurm account and the name of a Slurm partition. The Slurm partition is the set of servers sponsored by the research group. New group members who have already enabled their cluster access are added to the group's Slurm account through a request by an existing member of the group to CS Staff. Once this request is processed, that new group member will have full access to the group servers when they specify the group's Slurm account for a job.
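
As a quick sanity check, a user can list which Slurm accounts their CS login is associated with using a standard Slurm accounting command (a sketch; the output format is abbreviated here):

  # List the Slurm accounts (associations) for the current user.
  sacctmgr show associations user=$USER format=Account,User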

Department-sponsored servers are associated with the "allcs" Slurm account and are in the "cs" Slurm partition. Note that, for this special case, the name of the Slurm account does not match the name of the partition.

When a user submits a computational job to the Ionic cluster, Slurm first determines where the job can run and then establishes (and periodically updates) a priority that determines when it will run.

Server Access (Where will the job run?)

The Slurm partition (i.e., the named subset of server nodes) where a submitted job runs depends on the requested wall-clock time and the requested Slurm account. This subset can be further restricted by command-line options pertaining to memory size, GPU, CPU, or even requests for specific nodes. A user may specify any Slurm account of which they are a member. If one does not specify a Slurm account, the "allcs" account (of which every user is a member) is used as the default.
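
For example, the Slurm partitions visible on the cluster, along with a summary of their node states, can be listed with a standard sinfo invocation:

  # Summarize partitions and the aggregate state of their nodes.
  sinfo --summarize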

Jobs Requesting >1 Hour of Wall-Clock Time

If one submits a job requesting more than one hour of wall-clock time, the servers where that job could run are those associated with the Slurm account specified on the job submission command. Note that every job submission will use a specific Slurm account. If a Slurm account is not specified explicitly, the nominal default is the "allcs" Slurm account.

For example, explicitly specifying the Slurm account "groupX" will restrict the servers available to those associated with the "groupX" research group.

Note that if the "allcs" Slurm account is specified (either explicitly or by default), the available servers will be those provided at the departmental level (and available to anyone who has enabled their cluster access).
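
A minimal sketch of both cases follows; the flags are standard Slurm options and train.sh is a placeholder batch script:

  # Request 12 hours of wall-clock time on the servers sponsored by "groupX".
  sbatch --account=groupX --time=12:00:00 train.sh

  # With no --account specified, the job defaults to the "allcs" account and
  # is limited to the department-sponsored servers.
  sbatch --time=12:00:00 train.sh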

Jobs Requesting ≤1 Hour of Wall-Clock Time

A key property of a Slurm account that affects "where" jobs could run is whether the Slurm account has sharing enabled.

If one submits a job requesting ≤1 hour of wall-clock time and specifies a Slurm account where sharing is not enabled, then the servers where that job could run are those associated with the Slurm account specified on the job submission command.

If one submits a job requesting ≤1 hour of wall-clock time and specifies a Slurm account where sharing is enabled, then the set of servers where that job could run is the union of all the servers in the Ionic cluster associated with Slurm accounts that also have sharing enabled.
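
For example (again using the hypothetical "groupX" account and a placeholder script):

  # A job of at most one hour; if "groupX" has sharing enabled, this job is
  # eligible to run on any sharing-enabled server in the cluster.
  sbatch --account=groupX --time=01:00:00 short_job.sh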

Job Priority Model (When will the job run?)

The scheduling and job priority details of the Slurm Workload Manager are complex. What is described in this section is not authoritative and is only meant to provide high-level guidance and understanding of both the Slurm scheduler and our configuration for the Ionic cluster.

In our configuration, we are using the "FairShare" scheduler with "back-filling" enabled.

While the finer details of the scheduler are beyond the scope of this document, the scheduler configuration file is available for inspection.
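
For orientation only, the relevant settings look something like the following slurm.conf excerpt; the parameter names are standard Slurm options, but the values shown are arbitrary and do not reflect the actual Ionic configuration:

  # Illustrative slurm.conf excerpt (arbitrary values, not Ionic's actual settings)
  SchedulerType=sched/backfill          # enable back-fill scheduling
  PriorityType=priority/multifactor     # multifactor ("FairShare") job priority
  PriorityWeightFairshare=100000        # weight given to fair-share usage
  PriorityWeightAge=1000                # weight given to time spent in the queue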

At a high-level, the FairShare scheduler aims to provide fairness in access to resources across Slurm accounts (representing research groups) and individual users.

Submitted jobs derive their priority from a variety of factors, such as: how long a job has been waiting to start (while jobs are in the queue, their relative priority steadily increases over time so that none are starved), how many jobs the requesting user already has pending, and the requesting user's recent historical resource usage.
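
Users can inspect the individual factors contributing to the priority of their own pending jobs with sprio, for example:

  # Show the priority factors (age, fair-share, etc.) for the current user's
  # pending jobs.
  sprio --long --user=$USER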

Additionally, if one submits a job requesting ≤1 hour of wall-clock time and specifies a Slurm account where sharing is enabled (thus scheduled on the set of all shared servers), that job will have a priority boost applied.

This boost is based on the "value" of the Slurm account's in-service servers. This value is calculated from the initial cost of each sponsored server as well as the age of each server. That is, the cost of a server is a proxy for its initial value, and this value decays over time. The value (i.e., the up-front contribution) of the shared server hardware is used as an input to the calculation that determines the amount of enhanced access a job is ultimately given.
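
Purely as an illustration of the idea (the actual decay function and boost calculation are not specified in this document), a linear five-year depreciation of a server's value would look like:

  # Hypothetical linear depreciation of a sponsored server's "value".
  # The cost and age shown are example inputs; the real calculation may differ.
  awk -v cost=20000 -v age_years=2 'BEGIN {
      remaining = cost * (1 - age_years / 5);
      if (remaining < 0) remaining = 0;
      printf "remaining value: %.2f\n", remaining;
  }'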

While jobs submitted via a specific group's Slurm account do have conceptual priority on that group's servers over jobs not submitted via that account (i.e., sub-one-hour jobs running on shared nodes), we do not preempt jobs on the Ionic cluster. Therefore, any subsequent job requiring in-use resources (regardless of relative priority) will wait.
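
For those who want to confirm this behavior, the preemption-related settings of the running Slurm configuration can be inspected directly:

  # Show the preemption-related settings from the running Slurm configuration;
  # a no-preemption setup corresponds to PreemptType=preempt/none.
  scontrol show config | grep -i preempt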

Because the Ionic cluster uses a FairShare scheduler, there are some transient cases where, to an outside observer, a seemingly lower-priority job might start before a seemingly higher-priority job requesting the same resources. This can occur, for example, if user A has jobs in the queue and has recently used a lot of resources (temporarily depleting some of their "share"). If user B (who has been idle in the recent past) then submits a job, user B may have enough "share" that, combined with other factors, they appear to jump the line. This situation soon resolves itself: user B's fair share decreases as their jobs complete, and user A's share recovers as their recent usage decays.

It is important to keep in mind that Slurm has a back-fill scheduler. This means, for example, that if the scheduler determines that all of the jobs in a group's queue are blocked waiting for resources, and there happens to be an otherwise unused gap of time in the schedule that can accommodate a newly submitted job (i.e., it can run to completion without delaying "higher-priority" jobs), then that job will start and finish before an otherwise "higher-priority" job starts. To an outside observer, this can make it appear that priorities are not always being applied "correctly."
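
One way to see how the scheduler currently expects pending work to play out is to ask for estimated start times; these estimates change as jobs complete and back-filling occurs:

  # Show Slurm's current estimate of when the current user's pending jobs
  # will start.
  squeue --start --user=$USER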

Server Service

The Server Service is a flat-rate, per server-day recharge center fee that encapsulates most of the ongoing costs associated with operating and maintaining servers in the Ionic cluster (see: Note 3). The rate is established for each fiscal year based on estimated expenses for the year as well as any Server Service deficit or surplus carried over from the prior year. The rate is in units of dollars per server-day and is billed monthly. For the actual rate, see the Rate Sheet.

Effort and Costs that are Included in the Server Service

The following items are generally included on a best-effort basis in the Ionic Server Service:

  • Assistance with specifying and purchasing servers for Ionic
  • Hardware deployment
  • Normal, reasonable hardware maintenance staff time
  • Normal, recurring system administration (including periodic OS updates)
  • Hardware costs for "standard" equipment associated with Ionic (e.g., normal top-of-rack switches)
  • Staff time for discussing/implementing/documenting modifications to the service itself
  • CS Staff-initiated decommissioning, removal, and disposal of equipment at the HPCRC

Effort or Costs that are Excluded from the Server Service

The following items are excluded from the Server Service:

  • Repair costs (time and materials) for out-of-warranty hardware
  • Staff time for group-specific RT tickets (e.g., cluster access addition/removal, usage coaching, etc.)
  • "Emergency" staff response to node problems (i.e., specific requests to attend to a failed node immediately)
  • Dedicated equipment (e.g., non-standard networking equipment) specific to a research group
  • Research-specific software installations
  • Transferring/moving equipment to/from the HPCRC

For the cases above, when CS Staff is able (i.e., has the capacity) to perform the work, staff time will be charged as Professional Services to the group associated with the requester or to the group sponsoring the specific server(s).

Related Services and Fees

The other service offerings from CS Staff are described on the About the Computing Facilities Funding Model page. This section describes how they intersect with users of the Ionic cluster.

User Accounts

Ionic users need a regular CS account. Members of the CS department have their account fees paid by the department. Collaborators outside the CS department generally have their account fees paid by a CS faculty member.

Note that CS-affiliated faculty cannot directly sponsor CS accounts; a regular CS faculty member must do so. Account fees can be paid with any valid University chartstring.

Storage

Ionic jobs generally need to use project space for storage. This is charged separately.

Professional Services

The "server service" is not all-inclusive of anything having to do with Ionic. It is meant to level-out behind-the-scenes costs (e.g., OS upgrades) over the fiscal year. Support tickets that are already answered in the CS Guide or are related to the specific needs of a research group will generally be charged as Professional Services time to the relevant group.

Cluster System Administration

CS Staff has developed local best practices based on the work of others who manage GPU clusters, our person-decades of experience, the specific equipment in the cluster, and the specific environment in which the cluster operates.

The goal of CS Staff is to maximize availability, utility, and fairness for the researchers using the Ionic cluster given the resources and constraints at hand. At a strategic level, we work to address such resource limitations and other constraints over the long term.

This section provides some high-level motivation and considerations that inform how we administer the cluster.

Resuming Failed Nodes

Except for "node401" through "node403," (see below) a failure of any single GPU in a node will take that entire node out of service. This is because the GPUs cannot be individually reset live and the scheduler cannot determine which GPU failed. Therefore, the node has to be removed from service to prevent job failures. When a node fails, CS Staff performs a manual intervention to bring it back online. This intervention involves an assessment of the conditions that led to the failure. Depending on that assessment, the intervention may include a physical trip to the data center before the node can be brought back online.

Data-Center-Class GPUs

The GPUs on "node401" through "node403" are data-center-class. On these systems, each GPU can be individually reset while the system is running so most common GPU failures do not result in a node outage.

Node Health Checks

Because hardware failures are a fact of life in clusters, we run a node health check every 5 minutes and at the start and end of each job. We use the Node Health Check (NHC) framework developed at Lawrence Berkeley National Laboratory (LBNL).

If a health check indicates trouble with a node, it will set the Slurm state of the node to "draining." Similarly, if a subsequent health check finds no trouble, the Slurm state will be reset back to normal. While in the "draining" state, the scheduler will not schedule new jobs on that node. It is important to note that (1) existing work on that node will continue to run, (2) a node could toggle in and out of the draining state during the lifetime of a job, and (3) if the running jobs are already consuming enough resources that Slurm could not schedule anything anyway (e.g., jobs on the node are already using all the memory), then the draining state has no impact on the availability of the node, even if there are other unused resources on that node.

Said another way, if a health check sets the state of a node to "draining," then the "reported" unavailable resources due to the health check are all resources not allocated on the node while it is in this state. However, if any resource type (i.e., CPU, memory, GPUs) is already fully in use, then the other resources are effectively unavailable anyway. Therefore, the "reported" unavailable resources due to the health check are a conservative upper bound.
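
The state of, and recorded reason for, any drained or draining nodes can be inspected with standard Slurm commands, for example:

  # List nodes that are down, drained, or draining, along with the reason
  # recorded (e.g., by the health check).
  sinfo --list-reasons

  # Show full state details for a specific node.
  scontrol show node node401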

Temperature-based Controls

Based on experience, higher GPU temperatures correlate with GPU failures. Because of this, one of the parameters we measure with the periodic node health checks is the temperature reported by each GPU. While the reported temperature is nominally in degrees, it is best to simply treat it as a number where higher values correlate with an increasing likelihood of GPU failure.

Our control mechanism has three high-level knobs that adjust the behavior of when we mark a node as "draining" due to temperature:

  1. Temperature Threshold for GPU readings in a given node
  2. Percentage of GPUs above the threshold in a given node
  3. The sampling period of the measurements

If it looks like we are entering a regime where at least one GPU in a node could fail (taking the entire node out of service) before the next temperature measurement, we mark the node as "draining" -- work continues, but nothing more is scheduled on the node while it is in that state. The node will be checked again at the next sampling period.

Note that these settings are trying to predict whether a failure is likely in the near future. Based on experience, when more GPUs in a node are above the temperature threshold, there is an increased likelihood that a GPU in that node will fail. Anecdotally, heat from adjacent GPUs seems to affect the other GPUs in the node.

As of this writing, we are using a sampling period of 5 minutes and a 40% GPU percentage. The temperature thresholds are set on a per-node basis which allows for customization based on the node type, the node's position in the rack, and the type of GPUs in the node.
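
The decision itself is simple; the following is a minimal sketch of the logic (this is not the actual NHC check used on Ionic, and the threshold value shown is an arbitrary example):

  # Count GPUs at or above a per-node temperature threshold and compare the
  # percentage against the 40% trigger described above.
  THRESHOLD=${THRESHOLD:-80}   # example per-node threshold; real values vary by node
  TRIGGER_PCT=40               # percentage of hot GPUs that triggers "draining"

  total=0
  hot=0
  while read -r temp; do
      total=$((total + 1))
      [ "$temp" -ge "$THRESHOLD" ] && hot=$((hot + 1))
  done < <(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)

  if [ "$total" -gt 0 ] && [ $((hot * 100)) -ge $((TRIGGER_PCT * total)) ]; then
      echo "node would be marked draining: $hot of $total GPUs at or above $THRESHOLD"
  fi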

While this control mechanism is somewhat ad hoc, we have found it effective at finding a stable and robust sweet spot that does a good job of maximizing cluster availability.

End-User Visibility

The binary nature of "draining" versus not draining can potentially mislead folks about the availability of the cluster. Just because a node is in the draining state does not necessarily mean that there is a loss of availability. Here is an excerpt of the output of a particular sinfo incantation:

NODES PARTITION CPUS(A/I/O/T) GRES            GRES_USED                MEMORY  ALLOCMEM REASON                                  
1     visualai  64/0/32/96    gpu:rtx_3090:10 gpu:rtx_3090:8(IDX:0-7)  515096  233472   NHC: check_nv_smi_temp: 5 GPUs are overh
1     pvl       90/0/6/96     gpu:rtx_3090:10 gpu:rtx_3090:10(IDX:0-9) 515096  409600   NHC: check_nv_smi_temp: 6 GPUs are overh

The visualai line describes a node that has 10 GPUs (the ":10" at the end of the GRES value) where 8 GPUs are in use (the ":8" in the GRES_USED value) and 5 GPUs are in an overheat state. Because 5 >= 40% of 10, this node is in the draining state and the 2 GPUs not in use are considered "held for temperature" and unavailable.

The pvl line describes a node that has 10 GPUs (the ":10" at the end of the GRES value) where all 10 GPUs are in use (the ":10" in the GRES_USED value) and 6 GPUs are in an overheat state. Because 6 >= 40% of 10, this node is in the draining state; however, since all GPUs are in use, none are considered unavailable. Also, because all the GPUs are in use, this node would not be available for a GPU job even if it were not in the "draining" state. If the job on this node follows the typical pattern of load input data, process data, save output data, it is quite possible that the GPUs could cool down, as measured by the node health check, while still in the "save output data" phase, in which case the node would return to the normal state before the job completes.
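
For reference, output with columns like those above can be produced with a sinfo invocation along the following lines (an assumption; the exact incantation used may differ):

  # Per-node view showing CPU state, GRES, GRES in use, memory, and reason.
  sinfo --Node --Format=Nodes,Partition,CPUsState,Gres,GresUsed,Memory,AllocMem,Reason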

End Notes

Note 1: Node Sharing

Currently, all research groups (as well as the department-sponsored servers) are participating in sharing except those with the Slurm account names pci, vertaix, and visualai.

Note 2: Lifetime

Existing server hardware that turns five years old or older in FY2025 may remain in the cluster until 2026-01-31.

Note 3: Non-Ionic Nodes Covered by the Server Service

As of April 2025, there are a few servers not in Ionic that are billed at the server service rate as they are managed similarly to the Ionic cluster nodes. These grandfathered servers are part of a service CS Staff no longer offers.
