Running on a cluster
One of the reasons for this Python port is to make it easier to run MetaWards analyses at scale on an HPC cluster. MetaWards supports parallelisation using MPI (via mpi4py) or simple networking (via scoop).
MetaWards will automatically detect most of what it needs, so you don't need to write a complicated HPC job script.
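If your Python environment does not already include these backends, they can normally be installed from PyPI. This is a minimal sketch, assuming pip-based installation is appropriate on your cluster (many HPC sites instead provide an MPI-aware mpi4py build via their module system, and building mpi4py needs an MPI compiler wrapper such as mpicc on the PATH):
# install the optional parallel backends into the active environment
pip install mpi4py
pip install scoop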
MetaWards will look for a hostfile via either the PBS environment variable PBS_NODEFILE, the Slurm environment variable SLURM_HOSTFILE, or a hostfile passed directly via the --hostfile command line argument.
It will then combine the information in that hostfile with the number of threads per model run requested by the user and the number of cores per compute node (set in the environment variable METAWARDS_CORES_PER_NODE, or passed via the command line option --cores-per-node) to work out how many parallel scoop or MPI processes to start, and will start those in a round-robin fashion across the cluster. Distribution of work to the nodes is handled by the scoop or mpi4py work pools.
What this means is that the job scripts you need to write are very simple.
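If you are not running under PBS or Slurm, you can still describe the nodes yourself and let MetaWards perform the same round-robin calculation. This is a minimal sketch: the hostnames and the file name hostfile.txt are illustrative, and a simple one-hostname-per-line format is assumed:
# hostfile.txt lists the compute nodes, e.g.
#   node001
#   node002
#   node003
#   node004
# 64 cores per node / 16 threads per model run = 4 processes per node,
# so 4 nodes give 16 model runs in parallel
export METAWARDS_CORES_PER_NODE=64   # or pass --cores-per-node 64
metawards --hostfile hostfile.txt --nthreads 16 \
          --input ncovparams.csv --repeats 8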
Example PBS job script
Here is an example job script for a PBS cluster:
#!/bin/bash
#PBS -l walltime=01:00:00
#PBS -l select=4:ncpus=64:mem=64GB
# The above sets 4 nodes with 64 cores each
# source the version of metawards we want to use
# (assumes your python environments are in $HOME/envs)
source $HOME/envs/metawards-0.6.0/bin/activate
# change into the directory from which this job was submitted
cd $PBS_O_WORKDIR
# if you need to change the path to the MetaWardsData repository,
# then update the below line and uncomment
#export METAWARDSDATA="$HOME/GitHub/MetaWardsData"
metawards --additional ExtraSeedsBrighton.dat \
--input ncovparams.csv --repeats 8 --nthreads 16
The above job script will run 8 repeats of the adjustable parameter sets in ncovparams.csv. The jobs will be run using 16 cores per model run, over 4 nodes with 64 cores per node (so 256 cores in total, running 16 model runs in parallel). The runs will take only a minute or two to complete, which is why it is not worth requesting more than one hour of walltime.
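If MetaWards cannot detect the number of cores per node on your system, a hedged addition to the job script is to state it explicitly before the metawards line (the value 64 matches the nodes requested above):
# tell MetaWards how many cores each compute node provides
export METAWARDS_CORES_PER_NODE=64
Equivalently, you could add --cores-per-node 64 to the metawards command.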
The above job script can be submitted to the cluster using the PBS qsub command, e.g. if the script is called submit.sh, then you could type:
qsub submit.sh
You can see the status of your job using:
qstat -n
Example Slurm job script
Here is an example job script for a Slurm cluster:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=64
# The above sets 4 nodes with 64 cores each
# source the version of metawards we want to use
# (assumes your python environments are in $HOME/envs)
source $HOME/envs/metawards-0.6.0/bin/activate
# if you need to change the path to the MetaWardsData repository,
# then update the below line and uncomment
#export METAWARDSDATA="$HOME/GitHub/MetaWardsData"
metawards --additional ExtraSeedsBrighton.dat \
--input ncovparams.csv --repeats 8 --nthreads 16
This script does the same job as the PBS job script above. Assuming you name this script submit.slm, you can submit the job using:
sbatch submit.slm
You can check the status of your job using:
squeue -u USER_NAME
where USER_NAME is your cluster username.