CS Cluster Computing

Overview

The department provides a Beowulf cluster, known as ionic, for users who need a high performance computing (HPC) cluster environment in order to perform their work.

The primary way of using the cluster is to submit batch jobs and a description of the resource requirements (CPU, RAM, run time) to the scheduler.  When resources are available, the job will run.

If you have questions or concerns, please read this page AND the Cluster FAQ. If you can't find an answer in either place, reach out to CS Staff.

Access

Access to the ionic cluster is restricted to current members of the CS department through their CS account (limited guest accounts are ineligible). If you do not yet have a CS Account, please first see the Getting a CS Account page to acquire an account. Once you have a CS Account, but before attempting to use the cluster, it is important that you subscribe to the beowulf mailing list at:

    https://lists.cs.princeton.edu/mailman/listinfo/beowulf

This low-traffic mailing list is used to announce outages or updates to the cluster.  Crucially, joining this list gets you added to the access list necessary to schedule jobs through the Slurm scheduler. It is also the place for users to coordinate usage during critical times such as conference deadlines.  (See the Coordinating with Other Users section below for more details.)

NOTE: The cluster is only accessible to users in the CS department.  If you are interested in using a cluster that can be used collaboratively with users outside the CS department, OIT has several clusters available at the TIGRESS High Performance Computing Center.

Hardware

The heterogeneous ionic cluster is comprised of server hardware from a variety of sources.  It currently occupies thirteen racks in the CS section of the University Data Center.  Each server (also called, nodes) runs Linux and one of the servers is designated the head node.  All other servers are compute nodes.

Network

Each group of racks in the cluster contains an Ethernet switch.  As these are at the top position in the rack they are called, unimaginatively, top-of rack (TOR) switches.  Each server has a 10Gbps Ethernet uplink to its TOR switch.  These TOR switches uplink at 120Gbps each to a spine switch connected to the department's central storage cluster at 200Gbps to access home directories and project space.

Node Specifications

As mentioned, the ionic cluster is heterogeneous.  Typically, a few servers are added each year while a few older servers are removed (either due to failure or extreme old age).  As a result, the cluster contains servers of varying specifications. If you need specific details about the physical characteristics of a server, contact CS Staff.

As of March 2024, the public portion of the ionic cluster consists of:

  • 6 x Fujitsu Primergy CX2570 systems
  • 3 x Fujitsu Primergy RX2530 systems
  • 14 x Dell PowerEdge R730 systems
  • 3 x Tyan Thunder servers with 10 x NVidia RTX A5000 GPUs
  • 3 x Tyan Thunder servers with 10 x NVidia RTX A6000 GPUs

The cluster also contains hardware owned by individual research groups within the CS Department, many of which are shared, on a limited basis, with other department members. As of March 2024, these include:

  • 5 x Tyan Thunder servers with 10 x NVidia GeForce GTX 1080 Ti GPUs each
  • 22 x Gigabyte Cyclops 2U servers with 8 x NVidia GeForce RTX 2080 Ti GPUs each
  • 11 x Tyan Thunder servers with 10 x NVidia GeForce RTX 2080 Ti GPUs each
  • 15 x Gigabyte Cyclops 4U servers with 10 x NVidia GeForce RTX 3090 Turbo GPUs each
  • 1 x Tyan Thunder servers with 10 x NVidia Tesla A40 GPUs
  • 1 x Tyan Thunder server with 8x NVidia Tesla A100 80GB GPUs
  • 1 x Tyan Thunder servers with 10 x NVidia Quadro RTX 6000 GPUs
  • 4 x Tyan Thunder servers with 10 x NVidia RTX A5000 GPUs
  • 10 x Tyan Thunder servers with 10 x NVidia RTX A6000 GPUs each
  • 2 x Lenovo SR670 V2 servers with 8 x NVidia L40 GPUs each

Storage

 

Compute nodes in the cluster have access to three types of disk storage:

  • local: local disk space that is not used by the operating system is consolidated into a partition at /scratch.  Accessing this storage will have the least network contention.
  • cluster: the /n/fs/scratch partition on login nodes (cycles) is available to the compute nodes via NFS.  
  • remote: Each compute node mount home directories and project space via NFS from the department's central file system.

The local storage is intended to be used to stage input files (that are read multiple times) as well as intermediate files. By convention, if you use /scratch or /scratch/network, create and use a subdirectory with your username:

    mkdir -p /scratch/$USER ; cd /scratch/$USER

The local and cluster storage is considered temporary (i.e., "scratch") and users are expected to clean up their files at the end of each job.  Nodes that fail for one reason or another are often re-imaged without warning resulting in a loss of any data stored in the /scratch partition.

Given the nature of the cluster and its network topology, it is possible for poorly designed jobs to overwhelm the network connection back to the remote file system.  If your job will need to access the same remote file multiple times and the data will fit on local disk, we encourage you to structure your code to first copy the data to the node and then access it locally.  This should result in increased performance of not just your job but the other jobs running on the cluster as well.

If you are timing your code, be aware that accessing NFS-mounted file systems introduces both latency and uncertainty. Be sure to time only the computation, otherwise what you will be measuring may be dominated by contention on: the Ethernet switches from the node to the NFS server, the NFS server itself, the SAN switch to the disk arrays, and the disk arrays themselves.

Software

Cluster Software Configuration

The cluster is built using the Springdale (formerly, PUIAS) Linux distribution which is a clone recompile of RedHat Enterprise Linux, with rebranding. Scheduling is done with the Slurm workload manager. MPICH, LAM and other library collections of use in cluster environments are also installed on the cluster. To get started with Slurm, you are encouraged to review the documentation and tutorials available from SchedMD, particularly the Quick Start User Guide. For users familiar with Torque/Maui, switching to Slurm should be a fairly simple process of converting your submission scripts. This page also provides a very quick intro with examples that will help most folks get started with Slurm. MATLAB version R2014a with the parallel toolkit is also available. It is worth noting that our license is 32-seat for the cluster toolkit; you will not be able to use this toolkit and achieve higher parallelism.

Software Packages and Libraries

To find out about available software, please check our software page.

If you need a software package or library that is not currently installed on the compute nodes, please contact CS Staff.  If it is already available in a repository or as an RPM for our distribution, we should be able to install it within one business day.  Packages not available as an RPM take longer to prepare for our automated configuration management system.

Please note that we might not globally install highly specialized packages (that are unlikely to be used by others in the department) or packages that conflict with already installed packages.  In these cases, you may need to install them locally in your home directory or project space and access them there.

Jupyter Notebook on CS Ionic Cluster via ssh tunnel

With some additional setup, it is possible to start the Jupyter server on the ionic cluster, forward the connection to a local system, and connect using a local browser.  

By port forwarding from the cluster to your local system (port 10001 in this example), you can open any browser on your local computer to "http://localhost:10001".

example:  ssh -L 10001:node907.ionic.cs.princeton.edu:10001 cycles.cs.princeton.edu

Please make sure any listening ports you establish are in the range between 10000 and 60000.  Ports below 10000 or above 60000 will not work.

Using the Cluster - Overview

We recommend that you read the Slurm Quick Start User Guide from SchedMD for information on submitting and managing jobs.

Coordinating with Other Users

The scheduler will assign jobs to resources using the information it has available. Note that, in general, we do not assign priorities to users or groups. As a result, there are times (especially near conference deadlines) when users may need to coordinate amongst themselves. (For example, to request that others hold off on submitting jobs for a few days.) To facilitate this communication and coordination, all CS Cluster users must subscribe to the local beowulf mailing list.

Connecting to the Cluster

To use the ionic cluster, simply 'ssh' into one of CS Department login nodes - cycles using your CS username and password.  From there, submit jobs to the Slurm scheduler using the sbatch command (for single process jobs), or combine that with srun for MPI jobs. Details are below.

Email Notification

It is possible to get email notification when a job begins, ends, or aborts. If you submit many jobs in a short period of time, make sure your email account is prepared to receive them. That is, make sure you have sufficient space in your Inbox and that your email provider will accept messages at potentially high rates (messages/second). The CS email system will accept messages from the cluster at high rates. Because external sites frequently block or throttle our mail servers when inundated with messages at too high a rate, you must not send cluster email notifications off-campus. If you are forwarding your CS or Princeton email account to an external provider, such as Gmail, please set your filters in such a way as to prevent the forwarding of messages from the cluster. These messages can be identified by the sender address of slurm [at] cs.princeton.edu.

High Output Volume Jobs

Those jobs that send a large amount of output to STDOUT or STDERR, should redirect the output appropriately in the job submission script. Otherwise compute node spooling space may fill, causing running jobs to fail and preventing new jobs from starting.

Cluster Maintenance, Node Failure, Checkpointing

The cluster may be brought down by hardware failure or scheduled/unscheduled maintenance. Long running jobs should be designed so that they regularly checkpoint their state so that they can be resumed in the event of an interruption without having to start the entire job again.

Using the Cluster - Details

Submitting Cluster Jobs

There are a number of different job types that can be run on a cluster, including MPI jobs, single processor jobs, and interactive jobs. These type are discussed in more detail below. MPI jobs run a single application on multiple CPUs using MPI library calls to coordinate the processing among the different CPUs. Single processor jobs, as the name implies, run on just a single CPU. Both jobs are typically initiated by creating a job file shell script. Interactive jobs allow a user to run computationally intensive interactive programs on the cluster while developing the workload into batch jobs.

Jobs are submitted to the cluster using the sbatch command. The job script always runs from the user's current working directory. The submission directory is available to the running job as the SLURM_SUBMIT_DIR environment variable. Additional shell environment variables, maintained by the scheduling system, will be available to the script at runtime. For a list of these variables, see the 'man' page for the sbatch command.

Scheduler Usage Guidelines and Tips

  • All jobs should be submitted using the Slurm scheduling system. If you are unfamiliar with how to use Slurm to schedule jobs, have a look at more of this document, and man sbatchman salloc, and man srun.
  • To run interactive sessions on the cluster, run: salloc
  • Job Resource Defaults:The following defaults are in effect for every job submitted on the ionic cluster. If your job needs more memory or time than the default, you will need to specify resources in your job submission file (see the examples below). Note that the memory available to the scheduler is the amount listed on the Ganglia status page.  Four jobs each requiring 2GB of memory will not fit on a node with 8GB of physical memory as some is held back by the operating system.
    • mem = 1024mb
    • ncpus = 1
    • nodes = 1
    • walltime = 24:00:00
  • One process per CPU:Your jobs should run only as many processes as the number of CPUs that you have requested. If you use the system default of a single CPU, your job should not invoke multiple programs in parallel. If you must run multiple, concurrent programs, be sure to specify as many CPUs as you have simultaneous running processes in your job. Starting more work processes than the number of CPUs you have requested will result in your job taking longer to run, as it will be competing with itself for CPU time.
  • The system uses the wallclock time to calculate the scheduling. The default time is 24 hours.
  • The maximum wall clock time that can be specified is 7 days. Jobs requesting more than this will be rejected.
  • The scheduler works best when wall times are accurately estimated. If jobs complete much earlier than their allocated time, the Backfill scheduler becomes less able to fill unused time.
  • To run a job in a private partition, you must be a member of the appropriate Account, and specify that account either at the command line or in your batch submission using the -A argument.
  • For a view of available resources on the cluster, use the sinfo command. See man sinfo for more details.
  • To get a quick look at the job queue, run squeue. See man squeue for more details.

Will the Scheduler Kill my Job for Exceeding the walltime Parameter?

Yes; if your job exceeds the requested walltime, it will be killed by the scheduler. If you don't request a walltime, the default is 24 hours. The maximum walltime you can request for a given job is 168 hours (one week).

For this reason, it is a good idea to ensure your long-running jobs have some checkpointing capability so that if your job is killed (or fails for some other reason), you don't lose all of the work already performed.

Using GPUs

If you wish to use the GPU boards, you must specify that you need GPU resources using the --gres option to specify how many GPU devices your job needs. Therefore, sbatch options like this:

    #SBATCH -N 1
    #SBATCH --gres=gpu:1 --ntasks-per-node=1 -N 1

will request a GPU on a node with GPU capability.

IMPORTANT: Please ensure that your GPU code respects the $CUDA_VISIBLE_DEVICES variable when binding to a GPU, as this is the mechanism by which the GPUs are scheduled and allocated. Jobs found to be using GPUs not allocated to them may be killed.

Interactive Jobs

Interactive jobs on the ionic cluster are initiated with the salloc command from the login node.

    salloc srun --pty $SHELL -l

You may also want to specify non-default resource quantities:

    salloc --gres=gpu:1 -c 2 --mem=4G srun --pty $SHELL -l

If you are working with graphics and need X11 enabled, add --x11.

Please use salloc to run interactive jobs rather than other mechanisms. Using other mechanisms to start an interactive job can lead to failures in job accounting, making it appear as though your jobs are running extremely inefficiently.

Interactive jobs should be used as a troubleshooting mechanism while creating batch submissions for your work. If you need help forming your work into batch submissions, please contact CS Staff and we will be happy to help.

We also ask that you avoid "squatting" on compute nodes - opening long interactive sessions without actively using them - since this ties up cluster resources that could be performing work for other users. Accordingly, please make sure to exit the session when you are finished.

Interactive jobs that appear to be idle for long periods of time may be cancelled by CS Staff when there are jobs in the queue waiting to use the resources.

MPI Jobs

MPI jobs on the ionic cluster should be executed using sbatch and srun commands. Here is an example MPI job submission script. This script "assumes" that the script file and the MPI program binary are in the same directory. It specifies 5 hours of "wall clock" time and all 4 processors on each of 3 nodes, for a total of 12 CPUs. Be sure to replace nobody@cs.princeton.edu with your actual email address.

#!/bin/bash
#
#***
#*** "#SBATCH" lines must come before any non-blank, non-comment lines ***
#***
#
# 3 nodes, 4 CPUs per node (total 12 CPUs), wall clock time of 5 hours
#
#SBATCH -N 3                  ## Node count
#SBATCH --ntasks-per-node=4   ## Processors per node
#SBATCH -t 5:00:00            ## Walltime
#
# send mail if the process fails
#SBATCH --mail-type=fail
# Remember to set your email address here instead of nobody
#SBATCH --mail-user=nobody@cs.princeton.edu
#

module load mpi

srun ./my_mpi_program

Single Processor Jobs

Single processor jobs are essentially programs that you wish to run in a batch mode. Just about anything that you would do with a non-interactive program or shell script can be done as a single processor job on the ionic cluster. Of course, the simplest form of job is one that runs a single long-running program. You will notice that the job script below is very similar to the one used for MPI jobs. Again, be sure to replace nobody@cs.princeton.edu with your actual email address.

#!/bin/bash
#
#***
#*** "#SBATCH" lines must come before any non-blank, non-comment lines ***
#***
#
# 1 node, 1 CPU per node (total 1 CPU), wall clock time of 30 hours
#
#SBATCH -N 1                  ## Node count
#SBATCH --ntasks-per-node=1   ## Processors per node
#SBATCH -t 30:00:00           ## Walltime
#
# send mail if the process fails
#SBATCH --mail-type=fail
# Remember to set your email address here instead of nobody
#SBATCH --mail-user=nobody@cs.princeton.edu
#

srun ./my_simple_data_crunching_program

In addition to walltime, nodes and tasks-per-node, you can specify the maximum amount of physical or virtual memory that any single process in your job is likely to use. Using the output in the job-completion email, you can add either the mem (per node) or mem-per-cpu options to the script.

The last line of output in the job completion email includes the amount of memory used by the job. If it were to report that your job used something just below 1 GB of physical memory, you could modify the script with a line like this:

#SBATCH --mem=1GB

For future runs of the job, this modification will give the scheduler a hint that the job is not expected to use more than 1GB of memory.

Using srun to duplicate Single Processor Jobs

If you have a single processor job that needs to be run multiple times with exactly the same command arguments and you'd like to get away with a single job script and sbatch command, you can modify the srun line to replicate your job for you. The example script below will run 6 instances of the same command on a single CPU on each of 6 different nodes. As with the 2 examples above, be sure to replace nobody@CS.Princeton.EDU with your actual email address.

#!/bin/bash
#
#***
#*** "#SBATCH" lines must come before any non-blank, non-comment lines ***
#***
#
# 6 nodes, 1 CPU per node (total 6 CPUs), wall clock time of 3 hours
#
#SBATCH -N 6                  ## Node count
#SBATCH --ntasks-per-node=1   ## Processors per node
#SBATCH -t 3:00:00            ## Walltime
#
# send mail if the process fails
#SBATCH --mail-type=fail
# Remember to set your email address here instead of nobody
#SBATCH --mail-user=nobody@cs.princeton.edu
#

srun -n 6 ./my_simple_program

The result of the above script getting submitted using sbatch is that ./my_simple_program gets run once on 6 different machines.

Diagnosing Problems

The 2 major problems you will most likely encounter are jobs that don't start, and jobs that terminate unexpectedly.

The first thing to look at when diagnosing a job problem is the STDIN and STDOUT output of the job. Unless you direct them elsewhere, STDIN and STDOUT will both be written to a file named slurm-[JOBID].out.

Another command useful for diagnosis is the sacct command. This command provides accounting information which may be helpful in diagnosing unexpected behavior. Try a command like sacct -j [JOBID] or sacct -j [JOBID] -l.