FAQ

From NLPWiki

If you call "at least once" frequently, then here is a list of frequently asked questions, along with their answers:


  • MFLClassifier has the method getProbabilities(), but I don't see that this is called by ClassifierTester#testClassifier. Is it necessary to implement getProbabilities for the two Naive Bayes classifiers as a public method? (I understand it may be needed internally, but does it need to be public?)

The only requirements are that you implement the Classifier and Serializable interfaces. Neither of these has a getProbabilities() method.


  • I notice that ClassifierFactory#create assumes that the Naive Bayes classifiers will be named NaiveBayesClassifierMultinomial and NaiveBayesClassifierMVBernoulli. I suppose there's no reason to change that, but should we leave those names as is?

There are already dummy classes in the edu.byu.cs.nlp.classify package for these two classifiers, and that is what they are called. Those are there for your convenience. If you want to start fresh with a new class called something different, that's fine. Just make sure the factory instantiates the correct classes.


  • How do I read parameter values passed through an Ant script?

When you pass a parameter to Ant with the -D option, it goes into a variable (an Ant property) with the same name as the one specified with the -D option. For example, if you pass in the parameter:

-DBOB=MARLY

then you will have a variable named BOB with the value MARLY, accessible inside your Ant script. You have probably already used this: when you type -DDATA=somepath, you create a variable called DATA with somepath as the value, which the Ant script uses to do something useful:

<arg value="-d${DATA}" />

This passes the single argument "-dsomepath" (the -d flag followed immediately by the value of DATA) to the classifier tester.

The classifier tester also takes arguments; these are passed through Ant with the <arg> tag. There are many examples in the build.xml file. Look them over and ask me if you have any other questions about how they are used.

Once parameters are passed into the codebase, they are put into a properties map, most often under a key specified in the Constants class. For more information on how the codebase handles command-line parameters, you should read this tutorial page.

Also, you can look at the file script.sh in the codebase root directory. It gives an example of how you can use command-line parameters to script tasks like sweeping the vocab size.
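To make the properties-map idea above concrete, here is a minimal sketch of reading a value back out of such a map. It assumes the map behaves like a java.util.Properties object; the ExampleConstants class, the "vocab_size" key, and the getInt helper are hypothetical names made up for illustration, so check the real Constants class for the actual keys.

```java
import java.util.Properties;

// Hypothetical stand-in for the codebase's Constants class; the real key names may differ.
final class ExampleConstants {
    static final String VOCAB_SIZE = "vocab_size"; // assumed key name

    private ExampleConstants() {}
}

class PropertyLookupSketch {
    // Returns the integer stored under key, or fallback when the key is absent.
    static int getInt(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Simulates what the command-line parser would have stored.
        props.setProperty(ExampleConstants.VOCAB_SIZE, "5000");

        System.out.println(getInt(props, ExampleConstants.VOCAB_SIZE, 1000));
        System.out.println(getInt(props, "missing_key", 1000));
    }
}
```

Supplying a fallback keeps the classifier usable when a script does not set every parameter.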


  • Is there a problem with implementing more command line arguments (looks like this would be done in CommandLineParser.java and Constants.java)?

Actually, this is not really a problem at all. If you would like to look at how other command-line options are implemented in the CommandLineParser and Constants classes, then you are welcome to do something similar. We have also added a -O option that takes an arbitrary list of parameters and values and adds them to the properties map. For more details, see the command-line parameters tutorial page.
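As a rough illustration of what an option like -O has to do, the sketch below copies a list of parameters and values into a properties map. The input format used here ("name=value" pairs separated by commas) is purely an assumption, as are the class and method names; the tutorial page describes the syntax the real option expects.

```java
import java.util.Properties;

class ExtraOptionsSketch {
    // Adds each "name=value" pair from a comma-separated list to props.
    // The "name=value,name2=value2" format is assumed, not taken from the codebase.
    static void addExtraOptions(String optionList, Properties props) {
        for (String pair : optionList.split(",")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected name=value but got: " + pair);
            }
            String name = pair.substring(0, eq).trim();
            String value = pair.substring(eq + 1).trim();
            props.setProperty(name, value);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        addExtraOptions("alpha=0.5,vocab_size=2000", props);
        System.out.println(props.getProperty("alpha"));
        System.out.println(props.getProperty("vocab_size"));
    }
}
```

Rejecting malformed pairs up front makes a typo on the command line fail loudly instead of silently dropping a parameter.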


  • On smoothing, which is mentioned in the project description document, I've wondered why we need to do that? What would happen if we ignored unknown words and considered the words found in training to be the entire universe of words that we care about? Any comments on that? I suppose you could do Good-Turing smoothing to come up with the probability of an unknown word for each class, but I wonder if that is necessary and whether that is actually going to help.

Imagine that word a is in document 1 of the test set, which is of class B. If a and B never co-occurred in the training data, then we would calculate that p(a | B) = 0, so the product over all features for that document becomes 0 and p(B | document 1) = 0. Note that a is not necessarily an "unknown" word. It might be in our vocabulary; it just never co-occurred with class B in training. That's why we smooth. The paper we discussed in class uses Laplace (add-one) smoothing, which would be just fine for this lab. It is not obvious to me whether, or how much, a Naive Bayes model would benefit from a better smoothing algorithm, but it would be an interesting result. I would recommend first getting both classifiers working with the math just as described in the paper, though, and then implementing any improvements you're considering (like a better smoothing model). That way, if you get stuck on the math, you'll still have something to turn in at the deadline.
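The zero-probability problem described above can be seen in a few lines of code. This is only a sketch under stated assumptions: it uses the multinomial-style add-one estimate (count + 1) / (total + vocabulary size), works in log space, and all counts and names are invented for illustration. Follow the formulas in the paper for your actual implementation, since the Bernoulli model's estimate has a different denominator.

```java
class SmoothingSketch {
    // Unsmoothed estimate of p(word | class): count / total words seen in the class.
    static double unsmoothed(int count, int total) {
        return (double) count / total;
    }

    // Add-one (Laplace) estimate: every vocabulary word gets one extra pseudo-count.
    static double addOne(int count, int total, int vocabSize) {
        return (count + 1.0) / (total + vocabSize);
    }

    // Log of the product of the word probabilities, computed as a sum of logs.
    static double logLikelihood(double[] wordProbs) {
        double sum = 0.0;
        for (double p : wordProbs) {
            sum += Math.log(p); // Math.log(0.0) is negative infinity
        }
        return sum;
    }

    public static void main(String[] args) {
        int totalInB = 100;      // made-up number of word tokens seen with class B
        int vocabSize = 50;      // made-up vocabulary size
        int[] counts = {10, 5, 0}; // the third word never co-occurred with class B

        double[] raw = new double[counts.length];
        double[] smoothed = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            raw[i] = unsmoothed(counts[i], totalInB);
            smoothed[i] = addOne(counts[i], totalInB, vocabSize);
        }

        // One zero factor wipes out the whole document score without smoothing.
        System.out.println("unsmoothed log-likelihood: " + logLikelihood(raw));
        System.out.println("smoothed log-likelihood:   " + logLikelihood(smoothed));
    }
}
```

Summing logs instead of multiplying probabilities is a separate choice from smoothing; it is used here because products of many small probabilities tend to underflow.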
