Novus HPC Cluster

Novus is ARCS's new HPC cluster. Be one of the first to explore the new potential of Novus: early adopters will enjoy a unique advantage — virtually full access to the entire HPC cluster! Say goodbye to resource contention and job wait times, as you will have Novus all to yourself, ensuring unparalleled speed and responsiveness.

The new cluster has 1,712 total cores and 9.3TB of total memory, with ample scratch space at 80TB.

On the new Novus HPC cluster, SLURM is the scheduler used for submitting, monitoring, and controlling jobs. Although existing users are very familiar with Sun Grid Engine (SGE), we are switching to SLURM because it offers several advantages over SGE and is used at many of the large national HPC centers. In fact, SLURM is used at about 60% of the top 500 supercomputer centers around the world!

Trivia: SLURM is the acronym for "Simple Linux Utility for Resource Management."

The good news is that switching from SGE to SLURM is easy. See the FAQs below for common SGE commands mapped to their equivalent SLURM commands. We have also had success using AI tools such as Microsoft Copilot and ChatGPT to convert SGE submit scripts to Slurm scripts: simply paste in your SGE script and ask for it to be converted to Slurm format. Some minor edits will still be necessary, such as updating the SGE queue name to the matching partition name on Novus.

Request Access

FAQ

There are a couple of options for accessing the HPC cluster. The first method is to SSH into novus.dri.oregonstate.edu using your ONID username and password.
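For example, from a terminal on your workstation (replace your_onid with your ONID username):

    ssh your_onid@novus.dri.oregonstate.edu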

The second option is to access the HPC cluster using the OnDemand web interface at https://novus.dri.oregonstate.edu.

Command line conversions

COMMAND                     SGE                   SLURM
Interactive login           qlogin or qrsh        srun <args> --pty bash
Cluster status              qhost                 sinfo
Job submission              qsub <job_file>       sbatch <job_file>
Job deletion                qdel <job_id>         scancel <job_id>
Job status                  qstat                 squeue
Job status by ID            qstat -j <job_id>     squeue -j <job_id>
Job status by user          qstat -u <username>   squeue -u <username>
Job hold                    qhold <job_id>        scontrol hold <job_id>
Job release                 qrls <job_id>         scontrol release <job_id>
Queue list                  qconf -sql            scontrol show partition
Queue details               qconf -sq <queue>     scontrol show partition <queue>
Node list                   qhost                 scontrol show nodes
Node details                qhost -F <node>       scontrol show node <node>
Monitor all jobs            qacct                 sacct
Monitor a job               qacct -j <job_id>     sacct -j <job_id>
Cluster status              qhost -q              sinfo
Job status detailed         qstat -j <job_id>     scontrol show job <job_id>
Show expected start time    qstat -j <job_id>     squeue -j <job_id> --start
X forwarding                qsh <args>            salloc <args> or srun <args> --pty
Review job resource usage   qacct -j <job_id>     sacct -j <job_id>

ENVIRONMENT VARIABLE        SGE                   SLURM
Job name                    $JOB_NAME             $SLURM_JOB_NAME
Job ID                      $JOB_ID               $SLURM_JOB_ID
Submission directory        $SGE_O_WORKDIR        $SLURM_SUBMIT_DIR
Submission host             $SGE_O_HOST           $SLURM_SUBMIT_HOST
Node list                   $PE_HOSTFILE          $SLURM_JOB_NODELIST
Job user                    $USER                 $SLURM_JOB_USER
Job array index             $SGE_TASK_ID          $SLURM_ARRAY_TASK_ID
First array task            $SGE_TASK_FIRST       $SLURM_ARRAY_TASK_MIN
Last array task             $SGE_TASK_LAST        $SLURM_ARRAY_TASK_MAX
Queue name                  $QUEUE                $SLURM_JOB_PARTITION
Number of allocated procs   $NSLOTS               $SLURM_NTASKS
Number of allocated nodes   $NHOSTS               $SLURM_JOB_NUM_NODES
Hostname                    $HOSTNAME             $SLURM_SUBMIT_HOST

DESCRIPTION                   SGE                                   SLURM
Script directive              #$                                    #SBATCH
Assign environment settings   -V <variable=value,...>               Default in SLURM, no need to specify
Set working dir               -wd <dir_path>                        --chdir=<dir_path> or -D <dir_path>
Assign job name               -N <job_name>                         --job-name=<job_name> or -J <job_name>
Output file                   -o <output_file>                      -o <output_file> or --output=<output_file>
Error file                    -e <error_file>                       -e <error_file> or --error=<error_file>
Queue                         -q <queue_name>                       --partition=<queue> or -p <queue>
Job/Task allocation           -pe <parallel env>                    --nodes=<nodes> or -N <nodes>
                                                                    --ntasks=<cores> or -n <cores>
                                                                    --ntasks-per-core=<tasks_per_core>
                                                                    --ntasks-per-node=<tasks_per_node>
                                                                    --ntasks-per-socket=<tasks_per_socket>
                                                                    --cpus-per-task=<cpus_per_task>
Memory allocation             -l h_vmem=<float>G                    --mem=<float>G or --mem-per-cpu=<float>G
Time limit                    -l h_rt=<HH:MM:SS>                    --time=<HH:MM:SS> or -t <HH:MM:SS>
Project name                  -P <project_name>                     --account=<project_name> or -A <project_name>
Array range and increment     -t <start_num>-<end_num>:<increment>  --array=<start_num>-<end_num>:<increment> or
                                                                    -a <start_num>-<end_num>:<increment>
Notification emails           -M <email_address>                    --mail-user=<email_address>
                                                                    --mail-type=<BEGIN | END | FAIL | REQUEUE | ALL>
Job dependency                -hold_jid <job_id | job_name>         --dependency=after:job_id[:job_id...]
                                                                    --dependency=afterany:job_id[:job_id...]
                                                                    --dependency=afternotok:job_id[:job_id...]
                                                                    --dependency=afterok:job_id[:job_id...]
                                                                    --dependency=aftercorr:job_id
                                                                    --dependency=singleton
Begin Time                    -a <YYMMDDhhmm>                       --begin=<YYYY-MM-DD[THH:MM[:SS]]>
Generic Resources Scheduling  (none)                                --gpus=<count> or --gpus-per-node=<count>

Notes on memory allocation: -l h_vmem assigns memory per core; --mem assigns memory per node; --mem-per-cpu assigns memory per core.

SGE Script

#!/bin/bash
#
#
#$ -N my_job
#$ -j y
#$ -o my_job.output
# Current working directory
#$ -cwd
#$ -M <email_address>
#$ -m bea
# Request 4 hours run time
#$ -l h_rt=4:0:0
# Specify the project for the job
#$ -P project_name_here
# Set memory for the job
#$ -l mem=2G
# Load the required version of R
module load R/4.2.1
echo "start R job"
Rscript input.r
echo "R Finished"

Slurm Script

#!/bin/bash -l
# NOTE the -l flag, used to create a login shell
#
#SBATCH -J my_job
#SBATCH -o my_job.output
#SBATCH -e error.output
# Working directory (defaults to the submission directory in Slurm)
#SBATCH -D ./
#SBATCH --mail-user=<email_address>
#SBATCH --mail-type=ALL
# Request 4 hours run time
#SBATCH -t 4:0:0
# Specify the project for the job
#SBATCH -A project_name_here
# Set memory for the job (in MB)
#SBATCH --mem=2000
# Load the required version of R
module load R/4.2.1
echo "start R job"
Rscript input.r
echo "R Finished"
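Once your Slurm script is saved (here assumed to be named my_job.sh for illustration), submit it with sbatch and check its status with squeue:

    sbatch my_job.sh
    squeue -u <username>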

Option 1: Using rsync. This is the recommended way to copy your data over to the Novus cluster. Log in to the DRI HPC via SSH or through the OnDemand web GUI, open a terminal tab session (if using OnDemand), and then run the following command to copy a folder to your home folder on Novus.

rsync -avz ./{folder_name}/ ONID@novus.dri.oregonstate.edu:/home/ONID/{folder_name}/

How storage works on Novus cluster

The Novus cluster has access to two primary storage pools. The first is the basic bulk storage where your home folder resides; each user gets a 100GB storage quota for their home folder. The second storage pool is the scratch space, a high-speed all-flash storage array directly connected to the Novus cluster.

Slurm, the job scheduler, has been configured only to allow jobs to run in the scratch space. We do not want jobs running on the slower bulk storage space where your home folder resides. 

When you log in to the Novus cluster, in your home folder you will see three folders:

  • globus – Place files here for the Globus service to access. Globus is a service that lets you share data with collaborators around the world. You specify what data, and which colleagues can have access.
  • novus – This is the primary scratch space on the cluster. This is the folder where you want your jobs to run.  Slurm will only submit jobs so long as the working directory is under this folder.
  • ondemand – This folder is in the same scratch space on the cluster but in a separate folder where all OnDemand actions will take place. OnDemand is the web interface to the Novus cluster.

Note: Everything in the scratch space is meant for temporary storage only. Once your job(s) finish, all data should be moved back to your home folder or copied to some other storage location. A process will automatically delete any data on the scratch storage space that has not been accessed in 90 days.
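For example, a typical workflow (the folder and file names below are just placeholders) is to stage input data into a project folder under ~/novus, submit the job from there, and copy the results back to your home folder when the job completes:

    cd ~/novus
    mkdir my_project                        # hypothetical project folder in the scratch space
    cp -r ~/input_data ./my_project/        # stage input data from your home folder
    cd my_project
    sbatch my_job.sh                        # submit from inside the scratch space
    # ...after the job finishes...
    cp -r results ~/my_project_results      # copy results back to your home folder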

Globus lets you share data on your storage systems with collaborators at other institutions. You specify what data. You specify which colleagues. More information about Globus can be found here:

https://www.globus.org/

Globus has been integrated into the OnDemand Web interface to Novus.  Just point your web browser to: https://novus.dri.oregonstate.edu to log into the OnDemand web interface.

In Slurm, job queues are called partitions.  You can get the list of partitions by running the command "sinfo".  The general partitions most users will use are:

  • dri.q - General shared compute nodes.
    • Total CPU cores: 1,296
    • Memory: 7,887GB
    • Max CPU cores on a node:  48
    • Max Memory on a node: 382GB.
  • preempt.q - This includes all compute nodes in dri.q plus PI-owned compute nodes if they are free. As the name implies, if your job lands on a PI-owned compute node your job could be terminated if anyone in the PI group starts a job. 
    • Total CPU cores: 1,664
    • Total Memory: 8,730GB
    • Max CPU cores on a node: 256
    • Max Memory on a node: 512GB
  • gpu.q - Shared GPU node. 
  • preempt-gpu.q - Same as preempt.q but with GPU nodes.
  • test.q - Small partition for testing small jobs.
    • Total CPU cores: 24
    • Total Memory: 2GB

For example, to specify a partition for your job to run on, be sure to include the following line in your submit script:

#SBATCH --partition=dri.q
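You can also pass the partition on the command line. For example, to request an interactive shell on the test.q partition (the task count and time limit below are just illustrative values):

    srun --partition=test.q --ntasks=1 --time=00:30:00 --pty bash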

 

Multiple ways exist to copy/move data off the Novus cluster. The most basic way is to download data to your local system using the Open OnDemand web interface to the Novus cluster (https://novus.dri.oregonstate.edu). From the file explorer pane you can select a file and then click the download button on the top menu bar. If you need to copy a folder with all of its sub-folders and content, we recommend you first compress and package the folder into a single zip file using the zip command:

    zip -r file.zip {folder-name}

This will create a single compressed file with all the contents of the specified folder. Then you can download the zip file using the OnDemand web interface.
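After downloading, you can extract the archive on your local machine (assuming the unzip tool is installed there):

    unzip file.zip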

Copy data to another storage server or cloud storage service

In this instance, we recommend you use the rclone tool. rclone allows you to copy data to many well-known cloud storage providers like Box or Microsoft OneDrive. rclone can also copy data to local OSU Windows-based/Linux Samba file servers. Instructions on how to set these up can be found here: https://rclone.org/docs/
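As a hedged example, once you have configured a remote named box-remote with rclone config (the remote name and paths here are placeholders), you could copy a results folder to it with:

    rclone copy ~/novus/my_project/results box-remote:novus-results --progress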

Using sshfs

The sshfs command is a client tool that uses SSHFS to mount a remote file system from another server to a folder in your home folder on Novus. This works only with Unix-based file servers that allow SSH connections. To use sshfs you will need to create an empty folder to use as the mount point. See the example below:

1. cd ~  (Make sure you are in your home folder)

2. mkdir mnt   (This will be the folder to use as the mount point)

3. sshfs [ONID@]host:[dir-path] ./mnt  (Replace host with the server name, replace dir-path with the directory path on the remote server you want to connect to)

After this, you will be able to list and work in the ./mnt folder as if it were a local file system on Novus.  
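For example, once the remote file system is mounted, you could copy results from the scratch space to it with a command like this (the paths are placeholders):

    cp -r ~/novus/my_project/results ~/mnt/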

NOTE: Be sure to unmount the ./mnt folder when you are finished by running the following command:

  fusermount -u ~/mnt

 

To transfer data, you can also mount an on-prem Windows/Linux Samba share to a folder in your home folder. On the Novus command line, run the following commands:

1. mkdir mnt  (This folder will be used as the mount point for the rclone command)

2. rclone config  (To configure rclone)

   Select n for a new remote

   Give the remote a name, e.g. remote

   Select the number for the "SMB/CIFS" option; at the time of this writing it is option 46

   For hostname, type in the name of the Windows/Linux Samba server

   For username, type in your ONID account name

   For the SMB port, leave the default

   For the SMB password, choose y and type in your ONID password. rclone will store your password in encrypted form.

   For domain, type in ONID

   For spn, leave blank

   For edit advanced config, answer No

   Then type y to keep these remote settings.

   Then type q to quit.

To mount your share on the Windows/Linux Samba server, do the following:

1. Run rclone to mount the share on the newly created folder (mnt) and place the process in the background with "&".

    rclone mount remote:share-name/folderpath ./mnt --vfs-cache-mode writes &

After this, you should be able to list the contents of the mounted share under the mnt folder. You can now copy files to and from this mount point. When you are finished, be sure to unmount the folder by typing this command:

   kill %1   (This will kill the rclone process you placed running in the background above)

The rclone utility is a very powerful tool and can allow you to copy data to other cloud services like Microsoft OneDrive and Box. 

Each Conda environment is isolated, meaning you can install different versions of Python and packages without conflicts. This is especially useful when working on multiple projects that require different dependencies.

To set up your own custom Python environment, follow the steps below:

1. Log in to the Novus head node either by SSH or through the OnDemand web interface and start a terminal window tab. If this is your first time creating a Python environment, you will need to run conda init followed by source ~/.bashrc

2. To create your Python environment: conda create --name myenv

If you need a specific version of Python, add python={version} to the above command.
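For example, to create an environment with a specific Python version (3.11 here is just an example version):

    conda create --name myenv python=3.11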

3. Activate your new Python environment: conda activate myenv

4. Now you can install Python packages, for example: conda install -c conda-forge tensorflow

5. To access your new Python environment with the interactive Jupyter notebook on OnDemand, you will need to install the ipykernel: conda install -c anaconda ipykernel

6. Next you need to add your new kernel with your custom Python environment: python -m ipykernel install --user --name=myenv

7. You can now deactivate your Python environment: conda deactivate

8. Start your interactive Jupyter notebook in OnDemand and select your new Python kernel.