Run code on the supercomputer
Under construction....
Introduction
At some point in this course you may find that your current setup cannot produce results for your experiments in a reasonable amount of time, or at all. You may be running on an older machine, or just one that doesn't have enough RAM to handle the data sets and/or all of the intermediate results. Alternatively, your system might impose constraints that are too limiting; for example, it is difficult to reserve more than about 1 GB of Java heap space on a 32-bit version of Windows. For this reason, I recommend that you learn how to run experiments on the supercomputer soon and make use of that resource.
There are actually several supercomputers, all in the engineering department. Specifically, we will be using Marylou4, as it is a relatively standard Linux cluster on an Intel architecture. There are 630 nodes on Marylou4, each with two dual-core Intel Xeon EM64T processors (2.6 GHz) and 8 GB of memory. Most likely, you will use only one core of one processor per process, but you will have access to a large amount of memory and may run multiple processes simultaneously.
Another advantage to running on the supercomputer is that you will not need to download your own copy of the data sets, as I will have a copy of each in a shared folder available for your use on Marylou4.
Step 1: Get an account
Go to this page and follow the directions to sign up for a supercomputer account. When asked for your faculty sponsor, name your academic advisor. For the justification of need, explain that you are conducting computationally intensive experiments as part of your research into document clustering and classification.
Step 2: Write your code
If you want to use an IDE, write your code locally on your own machine. Marylou4 is pretty much a shell-only environment, so any editing you do on the supercomputer itself will need to be done in vim or emacs. Before moving anything over, run a small test locally with the ant script, using the same targets that you will use to run your experiments on Marylou4. This keeps the amount of debugging you need to do on the supercomputer itself to a minimum.
Step 3: Move your code to Marylou4
Run the ant package target first, then use scp to copy the packaged archive to your directory on Marylou4.
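As a rough sketch of the copy step, the snippet below assembles an scp command in Python and runs it only when asked to. The archive name, user name, host name, and remote directory are all placeholders rather than values from the course setup; substitute the ones for your own account.

```python
import subprocess


def build_scp_command(archive, user, host, remote_dir):
    """Return the argument list for copying a local archive to a remote directory."""
    return ["scp", archive, "%s@%s:%s" % (user, host, remote_dir)]


def copy_archive(archive, user, host, remote_dir):
    """Run scp and raise CalledProcessError if the copy fails."""
    subprocess.check_call(build_scp_command(archive, user, host, remote_dir))


# Hypothetical values, for illustration only
command = build_scp_command("experiments.tar.gz", "myuser",
                            "marylou4.example.edu", "~/compute")
print(" ".join(command))
```

Calling copy_archive(...) with the same arguments performs the actual transfer; scp will prompt for your password unless you have set up SSH keys.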
Step 4: Submitting and managing jobs
Option 1: Manually Submitting Each Job
To submit jobs on Marylou4, you should use the PBS job scheduler. The job scheduler takes as input a shell script with special meta-information encoded in comment lines. Included in the code from subversion, you will find a file named runPBS.sh. This is a very rudimentary PBS submission script that will run one ant target from your build.xml file.
There are several places in runPBS.sh that you should change. These are all marked with the tag TODO. Search for that word and fill in the information asked for.
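For orientation, the sketch below prints the general shape of a PBS submission script: ordinary shell, with scheduler options on comment lines that begin with #PBS. It is not the contents of runPBS.sh. The resource request, walltime, and ant target are placeholder assumptions, and the helper name render_pbs_script is made up for this example.

```python
def render_pbs_script(job_name, email, walltime, ant_target):
    """Build the text of a minimal PBS script that runs one ant target."""
    lines = [
        "#!/bin/bash",
        "#PBS -N %s" % job_name,                         # job name
        "#PBS -l nodes=1:ppn=1,walltime=%s" % walltime,  # resources requested
        "#PBS -m abe",                                   # mail on abort, begin, end
        "#PBS -M %s" % email,                            # address for that mail
        "cd $PBS_O_WORKDIR",                             # directory qsub was run from
        "ant %s" % ant_target,
    ]
    return "\n".join(lines) + "\n"


# Placeholder values, for illustration only
script = render_pbs_script("cs601R.experiment", "somebody@somewhere.com",
                           "01:00:00", "my.target")
print(script)
```

The same options can also be given on the qsub command line instead of in the script.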
To submit a job, use the qsub command:
qsub runPBS.sh
You should be in the root directory of your code (the one containing the build.xml file) when you run this command. This will submit your job to the scheduling queue. It will also produce some output that looks like this:
[NUMBER].m4bi
where the number represents a job number for your reference. When the job completes, two output files will be placed in the directory from which you ran qsub: cs601R.experiment.e[NUMBER] and cs601R.experiment.o[NUMBER]. These contain the standard error and standard output produced by your job, respectively.
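If you script your submissions, the job number can be pulled out of the qsub output and used to predict the output file names. PBS conventionally names them with the job name followed by .o (standard output) or .e (standard error) and the job number. The sketch below assumes the output has the [NUMBER].m4bi form shown above; the function names and the job number 12345 are made up for illustration.

```python
def parse_job_number(qsub_output):
    """Extract the numeric job number from qsub output such as '12345.m4bi'."""
    number, _, _server = qsub_output.strip().partition(".")
    if not number.isdigit():
        raise ValueError("unexpected qsub output: %r" % qsub_output)
    return int(number)


def output_files(job_name, job_number):
    """Return the (standard output, standard error) file names for a job."""
    return ("%s.o%d" % (job_name, job_number),
            "%s.e%d" % (job_name, job_number))


number = parse_job_number("12345.m4bi\n")  # 12345 is a made-up job number
print(output_files("cs601R.experiment", number))
```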
To view the status of your job, use the qstat command. Supplying the "-a" switch will produce more detailed information about the jobs in the queue.
To stop your job from running, use the qdel [NUMBER] command, where the number is the job number associated with the job you want to remove from the scheduling queue.
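The two commands above can also be driven from Python, which is handy when you are managing many jobs. This is only a sketch: the helper names are invented here, and the run function will work only on a machine where qstat and qdel are installed.

```python
import subprocess


def qstat_command(detailed=True):
    """Argument list for listing jobs; '-a' asks for the more detailed view."""
    return ["qstat", "-a"] if detailed else ["qstat"]


def qdel_command(job_number):
    """Argument list for removing one job from the scheduling queue."""
    return ["qdel", str(job_number)]


def run(command):
    """Run a scheduler command and return its output as text."""
    return subprocess.check_output(command).decode()


# Only the command lines are printed here; call run(...) on the cluster itself.
print(" ".join(qstat_command()))
print(" ".join(qdel_command(12345)))  # 12345 is a made-up job number
```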
Option 2: Automatically Generating a Set of Jobs
There is a Python script in the codebase that can help you get started with running experiments on the supercomputer. Here is an updated version of that script, with some of the paths corrected. You need to change the emailAddress and workingDir values. You may also want to change the parameters being passed to your binary. For example, the script has -DN=10 hardcoded; you may want to adjust that for the particular brand of feature selection you are using (if any). You may also need to increase the walltime parameter if your jobs are being killed prematurely.
This script will submit runsPerK jobs for every value of K that you specify. The way the code is now, it will run 3 jobs each for K in {5, 10, 15, 20}.
#!/usr/bin/env python
import os
import time
# TODO: Change these parameters to customize your experiments
emailAddress="somebody@somewhere.com"
maxK=20  # not used below; startK, endK, and increment control the K values
runsPerK=3
startK = 5
endK = 20
increment = 5
workingDir="/fslhome/pathtomyworkingdirectory"
# Make sure the correct parameters get used with the correct data sets
# TODO: If you comment out or remove a dataset from this list, then
# no experiments will be run for it
data = ["del.icio.us", "reut", "news"]
clusterers = ["NB_EM"]
# These dictionaries map the dataset name to the appropriate parameters
dataSets = {"reut" : "/fslhome/ddw28/compute/data/reuters",
            "del.icio.us" : "/fslhome/ddw28/compute/data/del.icio.us",
            "news" : "/fslhome/ddw28/compute/data/20_newsgroups",
            "movies" : "/fslhome/ddw28/compute/data/MovieReviews",
            "enron" : "/fslhome/ddw28/compute/data/enron"}
dataSplits = {"del.icio.us" : "/fslhome/ddw28/compute/data/del.icio.us/new_indices/reducedSplit",
              "reut" : "/fslhome/ddw28/compute/data/reuters/indices/reduced_set",
              "news" : "/fslhome/ddw28/compute/data/20_newsgroups/indices/reduced_set",
              "movies" : "/fslhome/ddw28/compute/data/MovieReviews/indices",
              "enron" : "/fslhome/ddw28/compute/data/enron/indices/ldc_split"}
dataParsers = {"del.icio.us" : "HTML", "reut" : "REUTERS1",
               "news" : "GROUPS", "movies" : "MOVIE", "enron" : "GROUPS"}
ant = "/fslhome/ddw28/compute/apps/ant-1.7.0/bin/ant"
for currentData in data:
    # range() excludes its stop value, so add 1 to include endK itself
    for k in range(startK, endK + 1, increment):
        for i in range(1, runsPerK + 1):
            for currentClusterer in clusterers:
                # The ant command that the job will execute
                cmd = '%s clustering.generic -DH=.1 -DDATA=%s -DSPLIT=%s -DN=10 -DLENGTH=10000 -DK=%s -DP=%s -DC=%s -DTYPE=NB_EM' % \
                    (ant, dataSets[currentData], dataSplits[currentData], k, dataParsers[currentData], currentClusterer)
                qsub = 'qsub -l nodes=1:ppn=4,walltime=5:00:00 -m abe -M %s -N %s-%d-%s -v %s -d %s' % \
                    (emailAddress, currentClusterer, k, currentData, "JAVA_HOME=/fslhome/ddw28/compute/apps/jdk1.6.0_03", workingDir)
                # No script file is named, so qsub reads the job script from
                # standard input; pipe the ant command to it
                pipe = os.popen(qsub, "w")
                pipe.write(cmd)
                pipe.close()
                # Pause briefly between submissions
                time.sleep(1)
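Before submitting anything, it can help to list the jobs that a set of loop bounds would create. The sketch below is a dry run that submits nothing; the helper names are illustrative and not part of the codebase. Note that Python's range() excludes its stop value, so a stop of endK leaves out K = endK, while a stop of endK + 1 includes it.

```python
def k_values(start_k, end_k, increment, inclusive=True):
    """List the K values a loop over range() would visit."""
    stop = end_k + 1 if inclusive else end_k
    return list(range(start_k, stop, increment))


def planned_jobs(data, clusterers, ks, runs_per_k):
    """Enumerate (dataset, K, run, clusterer) tuples, one per submitted job."""
    return [(d, k, run, c)
            for d in data
            for k in ks
            for run in range(1, runs_per_k + 1)
            for c in clusterers]


data = ["del.icio.us", "reut", "news"]
clusterers = ["NB_EM"]

print(k_values(5, 20, 5, inclusive=False))  # [5, 10, 15]
print(k_values(5, 20, 5))                   # [5, 10, 15, 20]
print(len(planned_jobs(data, clusterers, k_values(5, 20, 5), 3)))  # 36
```

With three data sets, four values of K, three runs per K, and one clusterer, the script queues 36 jobs, so check the count against what you intend before running it.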
