Dask & sklearn

Dask

This tutorial assumes you have already completed the basic Dask guide.

Dask is an alternative scheduling and job distribution system that can be deployed across a wide range of HPC environments. Writing your workflow in Dask makes it easier to move a script between different HPC environments, at the cost of extra overhead in some scenarios.

One advantage of Dask is that it has built-in capabilities that let it work well with sklearn. This tutorial shows one such example: a random search across a set of hyperparameters that relies almost entirely on built-in tools.

The Job Script

The job script requests only a single node, which acts as the master node. The Python script running on that node then spawns the other processes. Do not request the total resources for the whole search here; this job should only ever hold one node.

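#!/bin/tcsh
#PBS -N demojob
#PBS -l nodes=1:vortex:ppn=12
#PBS -l walltime=00:30:00
#PBS -j oe
# The directives above: name the job, request one node with the "vortex"
# property and 12 processors on it, allow 30 minutes of walltime, and
# merge stderr into stdout.

# Make conda available in this shell, then load the required modules.
source "/usr/local/anaconda3-2021.05/etc/profile.d/conda.csh"
module load anaconda3/2021.05
module load python/usermodules

# Clear PYTHONPATH so only the conda environment's packages are used.
unsetenv PYTHONPATH

conda activate aml35

# Run the script from its own directory; ">&" sends stdout and stderr to the file.
cd /sciclone/home20/dsmillerrunfol/myPythonFileDirectory
python dask_example.py >& out_dask.out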

The Python Script

Much of our logic for submitting jobs now lives in our Python script. Below, you will see a heavily commented version of a Python script that searches across a wide range of hyperparameters to identify the best combination, using a Dask backend.
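As a rough illustration of the pattern, and not the original script, the sketch below runs sklearn's RandomizedSearchCV through joblib's Dask backend. Several things here are assumptions made so the example can run anywhere: a thread-based LocalCluster stands in for whatever cluster object your site uses to spawn workers, the data is synthetic, and the estimator (SVC), the parameter ranges, and the function name run_random_search are purely illustrative.

```python
import joblib
from dask.distributed import Client, LocalCluster
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC


def run_random_search(n_iter=5, seed=0):
    """Random hyperparameter search whose fits are scheduled by Dask.

    Hypothetical helper for illustration; not part of Dask or sklearn.
    """
    # Synthetic data stands in for a real dataset (assumption).
    X, y = make_classification(n_samples=200, n_features=8, random_state=seed)

    # Distributions to sample from; these ranges are illustrative only.
    param_distributions = {
        "C": loguniform(1e-2, 1e2),
        "gamma": loguniform(1e-4, 1e0),
    }

    search = RandomizedSearchCV(
        SVC(),
        param_distributions=param_distributions,
        n_iter=n_iter,      # number of sampled parameter combinations
        cv=3,               # 3-fold cross-validation per combination
        random_state=seed,
    )

    # Assumption: a small in-process LocalCluster replaces the real HPC
    # cluster object. processes=False keeps the workers as threads.
    with LocalCluster(n_workers=2, threads_per_worker=1, processes=False) as cluster:
        with Client(cluster):
            # Route sklearn's internal joblib parallelism to the Dask client.
            with joblib.parallel_backend("dask"):
                search.fit(X, y)

    return search


search = run_random_search()
print("Best parameters:", search.best_params_)
print("Best CV score:", search.best_score_)
```

On an actual cluster, the LocalCluster would typically be swapped for a cluster object that submits worker jobs to the batch scheduler (the dask-jobqueue package provides such classes); the search code inside the joblib context would stay the same.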
