Using command-line parameters

From NLPWiki

Many command-line parameters are already built into the code. This page explains the purpose of the existing parameters and how you can use them in your own code and experiments. Running the help target of the build file prints each built-in switch, along with a description of what it does.

Internally, for each valid command-line parameter specified, a (key, value) pair is added to a properties map. You have direct access to this map in the ClassifierFactory.create method and can probe it for the parameters that were passed in. The keys used to access the parameters are defined in the edu.byu.cs.util.nlp.fileio.Constants class and are bound to specific switches in the edu.byu.cs.util.CommandLineParser class. Most switches take an argument; a few, such as -e, take none.
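As a rough illustration of probing the map, the sketch below assumes the map is a java.util.Properties object. The class name PropsProbeSketch and the key strings "DO_EM" and "CLASSIFIERTYPE" are placeholders invented for this example; real code would use the fields of edu.byu.cs.util.nlp.fileio.Constants, and the actual signature of ClassifierFactory.create may differ.

```java
import java.util.Properties;

// Illustrative only: the key strings are placeholders, not the real values
// of the Constants fields.
class PropsProbeSketch {
    static final String DO_EM = "DO_EM";                   // stands in for Constants.DO_EM
    static final String CLASSIFIERTYPE = "CLASSIFIERTYPE"; // stands in for Constants.CLASSIFIERTYPE

    // Assumption: a switch without an argument (such as -e) is recorded
    // simply by its key being present in the map.
    static boolean useEm(Properties props) {
        return props.containsKey(DO_EM);
    }

    // Returns the classifier abbreviation passed with -t, or the fallback
    // if the switch was not given.
    static String classifierType(Properties props, String fallback) {
        return props.getProperty(CLASSIFIERTYPE, fallback);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(DO_EM, "true");
        props.setProperty(CLASSIFIERTYPE, "NB_MN");
        System.out.println("EM requested: " + useEm(props));
        System.out.println("Classifier:   " + classifierType(props, "NB_MV"));
    }
}
```

Checking for a missing key before using a value keeps the factory usable when a switch is omitted.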

Switches are single characters specified on the command line with a preceding dash (e.g., "-e"). If a switch takes an argument, the argument is the next whitespace-delimited sequence of characters on the command line. Note that when passing command-line parameters through ant, the switch and its argument are written together, without the separating whitespace. For example, -n 15 becomes <arg value="-n15" />.
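The two forms described above (separated and attached) can be sketched as a small parser. This is not the code in edu.byu.cs.util.CommandLineParser; the class name SwitchParseSketch and the FLAG_SWITCHES list are hypothetical, and the sketch assumes that -e and -S are the only switches without an argument.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the project's CommandLineParser.
class SwitchParseSketch {
    // Assumption: these are the only switches that take no argument.
    static final String FLAG_SWITCHES = "eS";

    // Maps each switch character to its argument ("" for flag switches).
    static Map<Character, String> parse(String[] args) {
        Map<Character, String> result = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String token = args[i];
            if (token.length() < 2 || token.charAt(0) != '-') {
                throw new IllegalArgumentException("Expected a switch, got: " + token);
            }
            char sw = token.charAt(1);
            if (FLAG_SWITCHES.indexOf(sw) >= 0) {
                result.put(sw, "");                 // flag switch, no argument
            } else if (token.length() > 2) {
                result.put(sw, token.substring(2)); // attached form: -n15
            } else if (i + 1 < args.length) {
                result.put(sw, args[++i]);          // separated form: -n 15
            } else {
                throw new IllegalArgumentException("Missing argument for -" + sw);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse(new String[] {"-n", "15", "-e"}));
        System.out.println(parse(new String[] {"-n15", "-e"}));
    }
}
```

Both calls in main yield the same mapping, which is why the attached form works as a single ant <arg> value.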

There is a special command-line parameter, -O (which stands for "other"). It lets you add arbitrary entries to the properties map that is created from the command-line parameters. For example, if you specify the option

-O bob=hope,alice=cooper

then props.getProperty("bob") returns "hope" and props.getProperty("alice") returns "cooper".
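A minimal sketch of how a key=value[,key=value]* string could be unpacked into the map is shown here. The class and method names (OtherOptionSketch, addOtherOptions) are invented for illustration, and the real parser may treat malformed input differently.

```java
import java.util.Properties;

// Hypothetical sketch of handling the argument of -O.
class OtherOptionSketch {
    // Splits "key=value[,key=value]*" on commas and stores each pair in props.
    static void addOtherOptions(String spec, Properties props) {
        for (String pair : spec.split(",")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                // Assumption: a pair without a key or without '=' is an error.
                throw new IllegalArgumentException("Expected key=value, got: " + pair);
            }
            props.setProperty(pair.substring(0, eq), pair.substring(eq + 1));
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        addOtherOptions("bob=hope,alice=cooper", props);
        System.out.println(props.getProperty("bob"));
        System.out.println(props.getProperty("alice"));
    }
}
```

Because both keys and values are stored as strings, any numeric value passed through -O has to be converted by the code that reads it.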


You will probably want or need to use at least some of these parameters.

| Switch | Arguments | Properties Key | Description |
|---|---|---|---|
| C | double [0,1] | Constants.COOLING_RATE | When using simulated annealing, the cooling rate will be set to this value |
| d | path | Constants.DATAROOTDIR | Path to root directory of a data set |
| e | None | Constants.DO_EM | Indicates that EM should be used for training, if possible |
| i | See Constants.DISTRIBUTION_INITIALIZER_TYPES | Constants.INITIALIZER | Specifies which class should be used to generate an initial distribution when EM is used |
| k | integer > 0 | Constants.K | Tells how many clusters the clusterer should create |
| l | path | Constants.LISTROOTDIR | Directory containing data split |
| M | path | Constants.SERIALIZE_IN_FILE | File containing a serialized classifier |
| n | integer > 0 | Constants.N | Tells what value of n to use for the TopN data transformer |
| o | path | Constants.SERIALIZE_OUT_FILE | File to which a serialized classifier should be written |
| O | key=value[,key=value]* | key | This is a list of parameters to be added to the properties map. The String value will be put in the map with the String key as its lookup key. |
| p | REUTERS1, REUTERS2, or GROUPS | Constants.FILEPARSER | Specifies which parser should be used to tokenize the data set |
| P | See Constants.DISTRIBUTION_PERTURBER_TYPES | Constants.PERTURBER | Specifies which class should be used to perturb distributions when simulated annealing is used |
| t | NB_MV, NB_MN, SVM, or MFL | Constants.CLASSIFIERTYPE | Abbreviation of the name of the classifier to test |
| T | double [0,1] | Constants.TEMPERATURE | When using simulated annealing, the starting temperature will be set to this value |
| u | double [0,1] | Constants.UNLABELPORTION | Portion of documents to be "unlabeled" for a semi-supervised experiment |
| v | COUNT, TFIDF, or TOPN | Constants.TRANSFORMER | The name of the vectorizer to be used to convert data to a new format |

These parameters might or might not be broken, but could yield interesting results. Test them before you rely on them in code you submit.

| Switch | Arguments | Properties Key | Description |
|---|---|---|---|
| D | path | Constants.DEBUG_DIR | Turn debugging on and save data to this directory |
| f | IDF, TF, TFIDF, MI, CUTOFF, or TOPN | Constants.SERIALIZE_OUT_FILE | NOT SUPPORTED (The name of the feature selector to use. This might or might not be broken. Not for use with serialized models.) |
| K | integer > 0 | Constants.MINIMUM_FEATURES | The minimum number of features to leave per datum during filtering |
| m | path | Constants.LOGROOTDIR | Write log output to this directory |
| s | path | Constants.STOPWORDSFILE | Path to a stop words file, which will be used to prune common words |
| S | None | Constants.STEMMING | NOT SUPPORTED (Indicates that all features should be stemmed.) |

These parameters are almost surely broken. Use at your own risk.

| Switch | Arguments | Properties Key | Description |
|---|---|---|---|
| a | path | Constants.ARFFROOTDIR | NOT SUPPORTED (Save the data in the weka ARFF format in the given directory.) |
| r | path | Constants.REUSEDATAFILES | NOT SUPPORTED (Re-use saved data files trainingData.xml, testData.xml and classes.xml from the given directory.) |
| W | path | Constants.DISTRIBUTIONAL_WORD_CLUSTERING | NOT SUPPORTED (Enables distributional word clustering using clusters stored in the given directory.) |
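
Since every value in the properties map is a string, code that reads numeric parameters may want to check them against the ranges listed in the Arguments column. The sketch below shows one way to do that. ParamRangeSketch and its methods are hypothetical helpers, and the keys "TEMPERATURE" and "K" are placeholder strings standing in for Constants.TEMPERATURE and Constants.K; whether the built-in parser already performs such checks is not stated here.

```java
import java.util.Properties;

// Hypothetical helpers for reading range-restricted parameters.
class ParamRangeSketch {
    // Reads a "double [0,1]" parameter such as the one set by -T, -C, or -u.
    static double unitInterval(Properties props, String key, double fallback) {
        String raw = props.getProperty(key);
        if (raw == null) {
            return fallback;
        }
        double value = Double.parseDouble(raw.trim());
        // Written as a negated conjunction so that NaN is rejected as well.
        if (!(value >= 0.0 && value <= 1.0)) {
            throw new IllegalArgumentException(key + " must be in [0,1], got: " + raw);
        }
        return value;
    }

    // Reads an "integer > 0" parameter such as the one set by -k, -n, or -K.
    static int positiveInt(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        if (raw == null) {
            return fallback;
        }
        int value = Integer.parseInt(raw.trim());
        if (value <= 0) {
            throw new IllegalArgumentException(key + " must be > 0, got: " + raw);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("TEMPERATURE", "0.9"); // placeholder key
        props.setProperty("K", "5");             // placeholder key
        System.out.println(unitInterval(props, "TEMPERATURE", 1.0));
        System.out.println(positiveInt(props, "K", 2));
    }
}
```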


