CS Cluster Computing

Overview

The department provides a Beowulf cluster, known as ionic, for users who need a high performance computing (HPC) cluster environment in order to perform their work.

The primary way of using the cluster is to submit a batch job, along with a description of its resource requirements (CPU, RAM, run time), to the scheduler.  When resources become available, the job will run.

Access

Access to the ionic cluster is restricted to current members of the CS department through their CS account (alumni or limited guest accounts are ineligible).  Before attempting to use the cluster, it is important that you join the beowulf mailing list by visiting this link:

    https://lists.cs.princeton.edu/mailman/listinfo/beowulf

This low-traffic mailing list is used to announce outages or updates to the cluster.  More importantly, joining this list gets you added to the access list necessary to schedule jobs through the Slurm scheduler. It is also the place for users to coordinate usage during critical times such as conference deadlines.  (See the Coordinating with Other Users section below for more details.)

NOTE: The cluster is only accessible to users in the CS department.  If you are interested in using a cluster that can be used collaboratively with users outside the CS department, OIT has several clusters available at the TIGRESS High Performance Computing Center.

Hardware

The ionic cluster is heterogeneous, built from server hardware acquired from a variety of sources.  It currently occupies three racks in the CS section of the University Data Center.  Each server (also called a node) runs Linux; one server is designated the head node, and all of the others are compute nodes.

Network

Each physical rack of the cluster contains an Ethernet switch.  As these sit at the top position in the rack, they are called, unimaginatively, top-of-rack (TOR) switches.  Each server has a 1 Gb/sec Ethernet uplink to its TOR switch.  The TOR switch in the rack that contains the head node has a 10 Gb/sec Ethernet uplink to the department's core switch.  The core switch, in turn, has a 10 Gb/sec path to the CS storage system to access home directories and project space.  Racks that do not contain the head node each have a 10 Gb/sec link from their TOR switch to the TOR switch of the rack that does contain the head node.

Node Specifications

As mentioned, the ionic cluster is heterogeneous.  Typically, a few servers are added each year while a few older servers are removed (either due to failure or extreme old age).  As a result, the cluster contains servers of varying specifications.  The best sources of information about the hardware specifications for the nodes are the Ganglia page for the cluster at http://ionic.cs.princeton.edu/ and the /proc file system on each node.  If you need specific details about the physical characteristics of a server that aren't available via Ganglia or the /proc filesystem on the node itself, contact CS Staff.

As of March 2017, the ionic cluster consists of a combination of Fujitsu Primergy RX200 systems (12 nodes), Fujitsu Primergy CX2570 systems (2 nodes with 2 NVidia K80 GPUs each), Fujitsu Primergy RX2530 systems (4 nodes), SuperMicro 8016B-TF systems (6 nodes), and Dell PowerEdge R730 systems (16 nodes).  The cluster uses a Fujitsu Primergy RX200 S6 server as a head node. All of the other systems are compute nodes.

The 6 SuperMicro systems were added to the cluster in May 2014. These systems each have 40 CPU cores and 512 GB of RAM, and are available to all users of the cluster. They were purchased under a cost sharing arrangement between the Computer Vision Group and the CS Department. In addition, the CPUs were from a donation by Intel to Prof. Kai Li.

The 16 Dell PowerEdge systems were added to the cluster in March 2017. Each node has 28 CPU cores and 256 GB of RAM, and these nodes are schedulable by all department users for jobs with a walltime of one hour or less. These nodes were contributed by Prof. Ben Raphael.

Storage

Compute nodes in the cluster have access to three types of disk storage:

  • local: local disk space that is not used by the operating system is consolidated into a partition at /scratch.  Accessing this storage will have the least network contention.
  • cluster: The /scratch partition of the head node is available to the compute nodes via NFS at /scratch/network.  Accessing this storage will not be affected by network contention outside the cluster.
  • remote: Every node (the compute nodes and the head node) mounts home directories and project space via NFS from the department's central file system.

The local and cluster storage is intended for staging input files that are read multiple times, as well as for holding intermediate files. By convention, if you use /scratch or /scratch/network, create and use a subdirectory named after your username:

    mkdir -p /scratch/$USER ; cd /scratch/$USER

The local and cluster storage is considered temporary (i.e., "scratch"), and users are expected to clean up their files at the end of each job.  Nodes that fail for one reason or another are often re-imaged without warning, resulting in the loss of any data stored in the /scratch partition.

Given the nature of the cluster and its network topology, it is possible for poorly designed jobs to overwhelm the network connection back to the remote file system.  If your job will need to access the same remote file multiple times and the data will fit on local disk, we encourage you to structure your code to first copy the data to the node and then access it locally.  This should result in increased performance of not just your job but the other jobs running on the cluster as well.
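As a rough sketch of this pattern (input.dat and my_program are hypothetical names used only for illustration), a job script might stage its data as follows:

    # Sketch only: stage input from NFS to local scratch, compute, then clean up
    mkdir -p /scratch/$USER
    cp $HOME/input.dat /scratch/$USER/           # one copy over the network
    cd /scratch/$USER
    $HOME/my_program input.dat > results.out     # repeated reads now hit local disk
    cp results.out $HOME/                        # copy results back to the file server
    rm -f /scratch/$USER/input.dat /scratch/$USER/results.out   # clean up scratch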

If you are timing your code, be aware that accessing NFS-mounted file systems introduces both latency and uncertainty. Be sure to time only the computation, otherwise what you will be measuring may be dominated by contention on: the Ethernet switches from the node to the NFS server, the NFS server itself, the SAN switch to the disk arrays, and the disk arrays themselves.

Software

Cluster Software Configuration

The cluster is built using the Springdale (formerly PUIAS) Linux distribution, a rebranded recompile of Red Hat Enterprise Linux. Scheduling is handled by the Slurm workload manager. MPICH, LAM, and other library collections useful in cluster environments are also installed. To get started with Slurm, you are encouraged to review the documentation and tutorials available from SchedMD, particularly the Quick Start User Guide. For users familiar with Torque/Maui, switching to Slurm should be a fairly simple process of converting your submission scripts. The rest of this page also provides a quick introduction with examples that will help most users get started with Slurm. MATLAB version R2014a with the parallel toolkit is also available. Note that our license for the cluster toolkit is 32 seats, so you will not be able to use this toolkit to achieve parallelism beyond 32 workers.

Software Packages and Libraries

If you need a software package or library that is not currently installed on the compute nodes, please contact CS Staff.  If it is already available in a repository or as an RPM for our distribution, we should be able to install it within one business day.  Packages not available as an RPM take longer to prepare for our automated configuration management system.

Please note that we might not globally install highly specialized packages (that are unlikely to be used by others in the department) or packages that conflict with already installed packages.  In these cases, you may need to install them locally in your home directory or project space and access them there.
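For example (a sketch only; somepackage and the $HOME/sw prefix are arbitrary names), a Python library can often be installed with pip's --user option if pip is available on the nodes, and a typical source package can be built with an installation prefix under your home directory:

    # if pip is available on the nodes (hypothetical package name):
    pip install --user somepackage

    # or, for a typical autoconf-style source package:
    ./configure --prefix=$HOME/sw && make && make install
    export PATH=$HOME/sw/bin:$PATH    # make locally installed tools visible to your jobs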

Using the Cluster - Overview

We recommend that you read the Slurm Quick Start User Guide from SchedMD for information on submitting and managing jobs.

Coordinating with Other Users

The scheduler will assign jobs to resources using the information it has available. Note that we do not assign priorities to users or groups. As a result, there are times (especially near conference deadlines) when users may need to coordinate amongst themselves. (For example, to request that others hold off on submitting jobs for a few days.) To facilitate this communication and coordination, all CS Cluster users must subscribe to the local beowulf mailing list.

Connecting to the Cluster

To use the ionic cluster, simply 'ssh' into ionic using your CS username and password. This will put you on the head node in your home directory. From there, submit jobs to the Slurm scheduler using the sbatch command (for single process jobs), or combine that with srun for MPI jobs. Details are below.
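For example (assuming the head node answers at ionic.cs.princeton.edu, as the Ganglia URL above suggests, and using a hypothetical job script named myjob.slurm):

    ssh yourusername@ionic.cs.princeton.edu
    sbatch myjob.slurm          # submit a batch job to the scheduler
    squeue -u yourusername      # check the status of your jobs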

The head node, ionic, may be used for interactive work such as file editing. The head node may not be used for computationally intensive processes (e.g., MATLAB). As such, most computational software is completely absent from the head node.

Email Notification

It is possible to get email notification when a job begins, ends, or aborts. If you submit many jobs in a short period of time, make sure your email account is prepared to receive them. That is, make sure you have sufficient space in your Inbox and that your email provider will accept messages at potentially high rates (messages/second). The CS email system will accept messages from the cluster at high rates. Because external sites frequently blacklist or throttle our mail servers when inundated with messages at too high a rate, you must not send cluster email notifications off-campus. If you are forwarding your CS or Princeton email account to an external provider, such as Gmail, please set your filters in such a way as to prevent the forwarding of messages from the cluster. These messages can be identified by the sender address of slurm [at] cs.princeton.edu.

High Output Volume Jobs

Jobs that send a large amount of output to STDOUT or STDERR should redirect that output appropriately in the job submission script, as shown below. Otherwise, compute node spooling space may fill, causing running jobs to fail and preventing new jobs from starting.
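One way to do this (a sketch only; the file names are arbitrary, and the target directory must already exist on the node) is with the --output and --error options in the submission script, or by redirecting within the script body itself:

    #SBATCH --output=/scratch/%u/myjob-%j.out   # %u = username, %j = job ID
    #SBATCH --error=/scratch/%u/myjob-%j.err

    # or, inside the script body:
    ./my_noisy_program > /scratch/$USER/output.log 2>&1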

Cluster Maintenance, Node Failure, Checkpointing

The cluster may be brought down by hardware failure or scheduled/unscheduled maintenance. Long running jobs should be designed so that they regularly checkpoint their state so that they can be resumed in the event of an interruption without having to start the entire job again.
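The details depend entirely on your application, but as a rough sketch (my_solver and its --resume/--checkpoint flags are hypothetical), a job script might look for an existing checkpoint file on the central file system and restart from it:

    CKPT=$HOME/checkpoints/myjob.ckpt
    mkdir -p $HOME/checkpoints
    if [ -f "$CKPT" ]; then
        ./my_solver --resume "$CKPT"       # continue from the saved state
    else
        ./my_solver --checkpoint "$CKPT"   # start fresh, writing periodic checkpoints
    fi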

Using the Cluster - Details

Submitting Cluster Jobs

There are a number of different job types that can be run on a cluster, including MPI jobs, single processor jobs, and interactive jobs. These types are discussed in more detail below. MPI jobs run a single application on multiple CPUs, using MPI library calls to coordinate the processing among the different CPUs. Single processor jobs, as the name implies, run on just a single CPU. Both are typically initiated by creating a shell script job file. Interactive jobs allow a user to run computationally intensive interactive programs on the cluster while developing the workload into batch jobs.

Jobs are submitted to the cluster using the sbatch command. The job script always runs from the user's current working directory. The submission directory is available to the running job as the SLURM_SUBMIT_DIR environment variable. Additional shell environment variables, maintained by the scheduling system, will be available to the script at runtime. For a list of these variables, see the 'man' page for the sbatch command.
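For instance, a job script can record a few of these variables at the start of its output (a minimal sketch; the exact set of variables available depends on how the job was submitted):

    echo "Submitted from: $SLURM_SUBMIT_DIR"
    echo "Job ID:         $SLURM_JOB_ID"
    echo "Node list:      $SLURM_JOB_NODELIST"
    echo "Tasks:          $SLURM_NTASKS"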

Scheduler Usage Guidelines and Tips

  • All jobs should be submitted using the Slurm scheduling system. If you are unfamiliar with how to use Slurm to schedule jobs, have a look at the rest of this document, as well as man sbatch, man salloc, and man srun on the head node.
  • To run interactive sessions on the cluster, run: salloc
  • Job Resource Defaults: The following defaults are in effect for every job submitted on the ionic cluster. If your job needs more memory or time than the default, you will need to specify those resources in your job submission file (see the sketch after this list and the examples below). Note that the memory available to the scheduler is somewhat less than the physical amount listed on the Ganglia status page: four jobs each requiring 2 GB of memory will not fit on a node with 8 GB of physical memory, since some is held back by the operating system.
    • mem = 1024mb
    • ncpus = 1
    • nodes = 1
    • walltime = 24:00:00
  • One process per CPU: Your jobs should run only as many processes as the number of CPUs you have requested. If you use the system default of a single CPU, your job should not invoke multiple programs in parallel. If you must run multiple concurrent programs, be sure to request as many CPUs as you will have simultaneously running processes. Starting more worker processes than the number of CPUs you have requested will make your job take longer to run, as it will be competing with itself for CPU time.
  • The scheduler uses the requested wall clock time when planning job placement. The default is 24 hours.
  • The maximum wall clock time that can be specified is 7 days. Jobs requesting more than this will be rejected.
  • The scheduler rewards users for their accuracy in specifying how long their jobs will run. It will punish users who are grossly inaccurate at specifying walltimes for their jobs. It does this in an attempt to provide more certainty for its own scheduling algorithms. Wildly inaccurate walltime parameters will result in a lower priority for jobs scheduled in the future.
  • For a view of available resources on the cluster, use the sinfo command. See man sinfo for more details.
  • To get a quick look at the job queue, run squeue. See man squeue for more details.
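As a sketch of overriding the defaults above (the values are chosen purely for illustration), a submission script for a multithreaded program that needs 8 GB of RAM, 4 CPUs, and 48 hours might begin with:

    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4    # one task with 4 CPUs: matches a program running 4 threads
    #SBATCH --mem=8G             # per-node memory request
    #SBATCH --time=48:00:00      # well under the 7-day maximum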

Will the Scheduler Kill my Job for Exceeding the walltime Parameter?

Yes; if your job exceeds the requested walltime, it will be killed by the scheduler. If you don't request a walltime, the default is 24 hours. The maximum walltime you can request for a given job is 168 hours (one week).

For this reason, it is a good idea to ensure your long-running jobs have some checkpointing capability so that if your job is killed (or fails for some other reason), you don't lose all of the work already performed.

Using GPUs

If you wish to use the GPU boards, you must request GPU resources with the --gres option, specifying how many GPU devices your job needs. For example, sbatch options like these:

    #SBATCH -N 1
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:1

will request a GPU on a node with GPU capability.

IMPORTANT: Please ensure that your GPU code respects the $CUDA_VISIBLE_DEVICES variable when binding to a GPU, as this is the mechanism by which the GPUs are scheduled and allocated. Jobs found to be using GPUs not allocated to them may be killed.
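As a quick sanity check (a sketch only; my_gpu_program is a placeholder), a GPU job script can print the devices Slurm has assigned before launching the real workload:

    #SBATCH -N 1
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:1

    echo "Allocated GPU(s): $CUDA_VISIBLE_DEVICES"   # set by Slurm for this job
    ./my_gpu_program    # well-behaved code binds only to the devices listed above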

Interactive Jobs

Interactive jobs on the ionic cluster are initiated with the salloc command on the head node.

    salloc srun --pty $SHELL -l

You may also want to specify non-default resource quantities:

    salloc --gres=gpu:1 -c 2 --mem=4G srun --pty $SHELL -l

Please use salloc to run interactive jobs rather than other mechanisms. Using other mechanisms to start an interactive job can lead to failures in job accounting, making it appear as though your jobs are running extremely inefficiently.

Interactive jobs should be used as a troubleshooting mechanism while creating batch submissions for your work. If you need help forming your work into batch submissions, please contact CS Staff and we will be happy to help.

We also ask that you avoid "squatting" on compute nodes - opening long interactive sessions without actively using them - since this ties up cluster resources that could be performing work for other users. Accordingly, please make sure to exit the session when you are finished.

Interactive jobs that appear to be idle for long periods of time may be cancelled by CS Staff when there are jobs in the queue waiting to use the resources.

MPI Jobs

MPI jobs on the ionic cluster should be executed using the sbatch and srun commands. Here is an example MPI job submission script. The script assumes that the script file and the MPI program binary are in the same directory. It requests 5 hours of wall clock time and 4 processors on each of 3 nodes, for a total of 12 CPUs. Be sure to replace nobody@cs.princeton.edu with your actual email address.

#!/bin/bash
#
#***
#*** "#SBATCH" lines must come before any non-blank, non-comment lines ***
#***
#
# 3 nodes, 4 CPUs per node (total 12 CPUs), wall clock time of 5 hours
#
#SBATCH -N 3                  ## Node count
#SBATCH --ntasks-per-node=4   ## Processors per node
#SBATCH -t 5:00:00            ## Walltime
#
# send mail if the process fails
#SBATCH --mail-type=fail
# Remember to set your email address here instead of nobody
#SBATCH --mail-user=nobody@cs.princeton.edu
#

module load mpi

srun ./my_mpi_program

Single Processor Jobs

Single processor jobs are essentially programs that you wish to run in a batch mode. Just about anything that you would do with a non-interactive program or shell script can be done as a single processor job on the ionic cluster. Of course, the simplest form of job is one that runs a single long-running program. You will notice that the job script below is very similar to the one used for MPI jobs. Again, be sure to replace nobody@cs.princeton.edu with your actual email address.

#!/bin/bash
#
#***
#*** "#SBATCH" lines must come before any non-blank, non-comment lines ***
#***
#
# 1 node, 1 CPU per node (total 1 CPU), wall clock time of 30 hours
#
#SBATCH -N 1                  ## Node count
#SBATCH --ntasks-per-node=1   ## Processors per node
#SBATCH -t 30:00:00           ## Walltime
#
# send mail if the process fails
#SBATCH --mail-type=fail
# Remember to set your email address here instead of nobody
#SBATCH --mail-user=nobody@cs.princeton.edu
#

srun ./my_simple_data_crunching_program

In addition to walltime, nodes, and tasks per node, you can specify the maximum amount of physical or virtual memory that any single process in your job is likely to use. Using the output in the job-completion email, you can add either the --mem (per node) or --mem-per-cpu option to the script.

The last line of output in the job completion email includes the amount of memory used by the job. If it were to report that your job used something just below 1 GB of physical memory, you could modify the script with a line like this:

#SBATCH --mem=1GB

For future runs of the job, this modification will give the scheduler a hint that the job is not expected to use more than 1GB of memory.

Using srun to duplicate Single Processor Jobs

If you have a single processor job that needs to be run multiple times with exactly the same command arguments, and you'd like to get away with a single job script and a single sbatch command, you can modify the srun line to replicate your job for you. The example script below runs 6 instances of the same command, one per CPU on each of 6 different nodes. As with the 2 examples above, be sure to replace nobody@cs.princeton.edu with your actual email address.

#!/bin/bash
#
#***
#*** "#SBATCH" lines must come before any non-blank, non-comment lines ***
#***
#
# 6 nodes, 1 CPU per node (total 6 CPUs), wall clock time of 3 hours
#
#SBATCH -N 6                  ## Node count
#SBATCH --ntasks-per-node=1   ## Processors per node
#SBATCH -t 3:00:00            ## Walltime
#
# send mail if the process fails
#SBATCH --mail-type=fail
# Remember to set your email address here instead of nobody
#SBATCH --mail-user=nobody@cs.princeton.edu
#

srun -n 6 ./my_simple_program

When the above script is submitted using sbatch, ./my_simple_program is run once on each of 6 different machines.
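If each instance needs to keep its output separate, one common approach (a sketch, not specific to this cluster) is to key file names off the SLURM_PROCID environment variable, which srun sets to a distinct value (0 through 5 in this example) for each task:

    srun -n 6 bash -c './my_simple_program > output.$SLURM_PROCID 2>&1'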

Diagnosing Problems

The 2 major problems you will most likely encounter are jobs that don't start, and jobs that terminate unexpectedly.

The first thing to look at when diagnosing a job problem is the STDOUT and STDERR output of the job. Unless you direct them elsewhere, STDOUT and STDERR will both be written to a file named slurm-[JOBID].out.

Another command useful for diagnosis is the sacct command. This command provides accounting information which may be helpful in diagnosing unexpected behavior. Try a command like sacct -j [JOBID] or sacct -j [JOBID] -l.
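A slightly more targeted variant (using standard sacct field names) narrows the output to the columns most useful for spotting problems:

    sacct -j [JOBID] --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS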