Using command-line parameters
From NLPWiki
Many command-line parameters are already built into the code. This page explains the purpose of the existing parameters and how you can use them in your own code and experiments. Running the help target of the build file prints each built-in switch along with a description of what it does. Internally, for each valid command-line parameter specified, a (key, value) pair is added to a properties map. You have direct access to this map in the ClassifierFactory.create method and can probe it for the parameters that were passed in. The keys used to access the parameters are defined in the edu.byu.cs.util.nlp.fileio.Constants class and are bound to specific switches in the edu.byu.cs.util.CommandLineParser class. Most switches take one argument; a few, such as -e, take none.
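As a rough sketch of how the properties map might be probed inside a factory method, the example below reads a cluster count and an EM flag. The class name `PropsProbe`, the key strings `"k"` and `"doEM"`, and the default of 2 clusters are all hypothetical; in the real code the keys come from `Constants` (for example `Constants.K` and `Constants.DO_EM`), whose actual values are not given on this page. The use of `java.util.Properties` and the idea that an argument-less switch is recorded by the presence of its key are also assumptions.

```java
import java.util.Properties;

// Hypothetical stand-ins for keys defined in Constants; the real string values may differ.
class PropsProbe {
    static final String K = "k";         // assumed value of Constants.K
    static final String DO_EM = "doEM";  // assumed value of Constants.DO_EM

    // Returns the requested cluster count, or defaultK if -k was not given.
    static int clusterCount(Properties props, int defaultK) {
        String raw = props.getProperty(K);
        if (raw == null) {
            return defaultK;
        }
        int k = Integer.parseInt(raw.trim());
        if (k <= 0) {
            throw new IllegalArgumentException("k must be > 0, got " + k);
        }
        return k;
    }

    // Assumes an argument-less switch such as -e is recorded by the presence of its key.
    static boolean useEm(Properties props) {
        return props.containsKey(DO_EM);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(K, "5");
        props.setProperty(DO_EM, "true");
        System.out.println("clusters = " + clusterCount(props, 2));
        System.out.println("use EM   = " + useEm(props));
    }
}
```

Reading each value through a small helper with an explicit default keeps the factory code from failing with a null when a switch is omitted.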
Switches are single characters specified on the command line with a preceding dash (e.g., "-e"). If a switch takes an argument, the argument is the next whitespace-delimited sequence of characters on the command line. Note that when passing command-line parameters through ant, the switch and its argument are written together with no whitespace between them. For example, -n 15 becomes <arg value="-n15" />.
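To make the two spellings concrete, here is a minimal sketch of a parser that accepts both the separated form (-n 15) and the attached form used with ant (-n15). It is not the actual CommandLineParser implementation; the class name `SwitchSplitter` and its behavior are assumptions for illustration. The only detail taken from this page is that the -e and -S switches take no argument.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: not the real edu.byu.cs.util.CommandLineParser.
class SwitchSplitter {
    // Switches that take no argument, per the table on this page.
    static final String NO_ARG = "eS";

    // Maps each switch character to its argument ("" for argument-less switches).
    static Map<Character, String> parse(String[] args) {
        Map<Character, String> out = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String tok = args[i];
            if (tok.length() < 2 || tok.charAt(0) != '-') {
                throw new IllegalArgumentException("Expected a switch, got: " + tok);
            }
            char sw = tok.charAt(1);
            if (NO_ARG.indexOf(sw) >= 0) {
                out.put(sw, "");
            } else if (tok.length() > 2) {
                out.put(sw, tok.substring(2));   // attached form: -n15
            } else if (i + 1 < args.length) {
                out.put(sw, args[++i]);          // separated form: -n 15
            } else {
                throw new IllegalArgumentException("Missing argument for -" + sw);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse(new String[] {"-n15", "-k", "3", "-e"}));
    }
}
```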
There is a special command-line parameter, -O (which stands for "other"). It lets you add arbitrary parameters to the properties map that is created from the command-line parameters. For example, if you specify the option
-O bob=hope,alice=cooper
then props.getProperty("bob") returns "hope" and props.getProperty("alice") returns "cooper".
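The key=value[,key=value]* format accepted by -O can be sketched as follows. This is an illustrative parser, not the code in CommandLineParser; the class name `OtherOptions` and its error handling are assumptions. Note that with this simple splitting approach a value cannot itself contain a comma.

```java
import java.util.Properties;

// Illustrative only: shows one way the -O argument could be turned into map entries.
class OtherOptions {
    // Adds each key=value pair from a comma-separated -O argument to props.
    static Properties addAll(String arg, Properties props) {
        for (String pair : arg.split(",")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value, got: " + pair);
            }
            // Everything before the first '=' is the key; the rest is the value.
            props.setProperty(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = addAll("bob=hope,alice=cooper", new Properties());
        System.out.println(props.getProperty("bob"));    // prints "hope"
        System.out.println(props.getProperty("alice"));  // prints "cooper"
    }
}
```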
| Switch | Arguments | Properties Key | Description |
|---|---|---|---|
| You will probably want/need to use at least some of these parameters. | |||
| C | double [0,1] | Constants.COOLING_RATE | When using simulated annealing, the cooling rate will be set to this value |
| d | path | Constants.DATAROOTDIR | Path to root directory of a data set |
| e | None | Constants.DO_EM | Indicates that EM should be used for training, if possible |
| i | See Constants.DISTRIBUTION_INITIALIZER_TYPES | Constants.INITIALIZER | Specifies which class should be used to generate an initial distribution when EM is used |
| k | integer > 0 | Constants.K | Tells how many clusters the clusterer should create |
| l | path | Constants.LISTROOTDIR | Directory containing data split |
| M | path | Constants.SERIALIZE_IN_FILE | File containing a serialized classifier |
| n | integer > 0 | Constants.N | Tells what value of n to use for the TopN data transformer |
| o | path | Constants.SERIALIZE_OUT_FILE | File to which a serialized classifier should be written |
| O | key=value[,key=value]* | key | This is a list of parameters to be added to the properties map. The String value will be put in the map with the String key as its lookup key. |
| p | REUTERS1, REUTERS2, or GROUPS | Constants.FILEPARSER | Specifies which parser should be used to tokenize the data set |
| P | See Constants.DISTRIBUTION_PERTURBER_TYPES | Constants.PERTURBER | Specifies which class should be used to perturb distributions when simulated annealing is used |
| t | NB_MV, NB_MN, SVM, or MFL | Constants.CLASSIFIERTYPE | Abbreviation of the name of the classifier to test. |
| T | double [0,1] | Constants.TEMPERATURE | When using simulated annealing, the starting temperature will be set to this value |
| u | double [0,1] | Constants.UNLABELPORTION | Portion of documents to be "unlabeled" for a semi-supervised experiment |
| v | COUNT, TFIDF, or TOPN | Constants.TRANSFORMER | The name of the vectorizer to be used to convert data to a new format. |
| These parameters might or might not be broken, but could yield interesting results. Test them before relying on them in code you submit. | |||
| D | path | Constants.DEBUG_DIR | Turn Debug on and save data to this directory. |
| f | IDF, TF, TFIDF, MI, CUTOFF, or TOPN | Constants.SERIALIZE_OUT_FILE | NOT SUPPORTED (The name of the feature selector to use. This might or might not be broken. Not for use with serialized models.) |
| K | integer > 0 | Constants.MINIMUM_FEATURES | The minimum number of features to leave per datum during filtering |
| m | path | Constants.LOGROOTDIR | Write log output to this directory. |
| s | path | Constants.STOPWORDSFILE | Path to a stop words file, which will be used to prune common words |
| S | None | Constants.STEMMING | NOT SUPPORTED (Indicates that all features should be stemmed.) |
| These parameters are almost surely broken. Use at your own risk. | |||
| a | path | Constants.ARFFROOTDIR | NOT SUPPORTED (Save the data in the weka ARFF format in the given directory.) |
| r | path | Constants.REUSEDATAFILES | NOT SUPPORTED (Re-use saved data file trainingData.xml, testData.xml and classes.xml from the given directory.) |
| W | path | Constants.DISTRIBUTIONAL_WORD_CLUSTERING | NOT SUPPORTED (Enables distributional word clustering using clusters stored in the given directory.) |
