Random Forest
This presumes that you have already completed the examples in the Random Search section of the guide.
Introduction & Data
A Random Forest Classifier is made up of many independent classification (or regression) trees, making it a perfect candidate for parallelization across multiple processes. Accordingly, scikit-learn has built-in options that let you fit these trees across multiple processors, if they are available.
We will write a script that fits the model twice, first on a single core, and then across 12 cores. We will time each run and report the total runtimes to the user.
To facilitate this comparison, we have created a bigger version of the student alcohol dataset, copying the rows to generate a larger number of observations. This is helpful, as it slows the algorithm down enough that you can more easily see the difference between the single- and multi-core implementations.
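If you want to reproduce that enlargement yourself, a minimal sketch is below. The file names and the duplication factor are assumptions, not the guide's exact script; substitute whatever copy of the student alcohol dataset you used in the Random Search section.

```python
import pandas as pd

# Load the original student alcohol dataset (the path is an assumption;
# use whatever copy you worked with in the Random Search section).
df = pd.read_csv("student-mat.csv")

# Stack 100 copies of the rows to create a larger dataset. The factor of
# 100 is arbitrary -- pick whatever makes the fit slow enough to time.
df_big = pd.concat([df] * 100, ignore_index=True)

df_big.to_csv("student-mat-big.csv", index=False)
```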
Job File
First, we'll need to set up our job file. Most things here are similar to what you saw in the Random Search example, but with some key differences.
In this job file, you'll want to note two things. First, we're requesting only one node. This is because sklearn's built-in parallelism only extends to a single node - i.e., it can only spin up processes on one computer at a time. We'll cover more advanced techniques for multi-node distribution later.
Second, note the -c 1 option in the mvp2run command. This tells the scheduler to start only one copy of simplePar.py on the node; the logic inside simplePar.py then creates the additional processes that run on the machine's other cores. This is different from the Random Search example, where we intentionally launched 12 separate copies of the Python file.
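For reference, a job file along these lines might look like the sketch below. The job name, walltime, module name, conda environment, and working directory are all assumptions; adjust them to match your own cluster and the environment you used in the Random Search section.

```bash
#PBS -N random_forest
#PBS -l nodes=1:ppn=12
#PBS -l walltime=00:30:00
#PBS -j oe
#PBS -o output.out

# Load Python and activate your conda environment
# (the module and environment names here are assumptions).
module load python
source activate my_env

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Launch a single copy of the script; simplePar.py spawns its own workers.
mvp2run -c 1 python simplePar.py
```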
Python File
In this Python file, you'll note the "n_jobs" option passed to each of the RandomForestClassifiers. This determines the number of processes that are created; each process then fits some share of the estimators (i.e., the individual classification trees).
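If you don't have the original simplePar.py handy, a minimal sketch of the comparison is below. The enlarged CSV name, the choice of target column (Walc, weekend alcohol consumption), and the feature selection are assumptions based on the description above, not the guide's exact script.

```python
import time

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load the enlarged dataset (file name is an assumption; see the
# enlargement sketch in the Introduction & Data section).
df = pd.read_csv("student-mat-big.csv")

# For illustration, predict weekend alcohol consumption (Walc) from the
# numeric columns -- the exact features used in the guide may differ.
X = df.select_dtypes("number").drop(columns=["Walc"])
y = df["Walc"]

# Fit the same forest on 1 core and then on 12, timing each run.
for n_jobs in (1, 12):
    clf = RandomForestClassifier(n_estimators=500, n_jobs=n_jobs)
    start = time.time()
    clf.fit(X, y)
    print(f"n_jobs={n_jobs}: fit took {time.time() - start:.1f} seconds")
```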
Submitting the files & Results
Once you've created your files, you can run qsub j, where j is the name of your job file. Make sure that your job file specifies the correct conda environment and .py file path, as they will likely be different from mine.
Once the job has run, you should find a new output.out file in your directory. Running a cat output.out, you should see a few things. The first lines simply show what time the job was processed, along with the actual command that mvp2run created for you. Until you get to much more complex models, you won't need this for debugging, but it is nice to have!
Ignoring a few deprecation warnings, after that you will see the output from the test itself. The multi-core run is much faster, but the speedup is not linear: as you introduce more cores (and, eventually, nodes), communication and startup times become non-trivial costs. So, make sure your problem is big enough to warrant multiple cores before jumping in the deep end!