CS601R:Project 1 Guidelines
From NLPWiki
Back to CS601R Main Page
Contents |
Project #1: Document Classification with Naïve Bayes
Deadlines
- Early: 1/23/07
- Due Date: 1/25/07
Objectives
This assignment is designed to:
- solidify the ideas of text classification by classifying Usenet articles in the 20 Newsgroups dataset
- build understanding necessary for problems that use text classification (especially spam classification, document sorting, routing, and filtering, language identification, etc.)
- provide hands-on experience with Naïve Bayes, a simple graphical model
- implement two approaches to Naïve Bayes
- give some experience with smoothing probability distributions
- set the stage for semi-supervised and unsupervised learning
- give you a chance to compete for honor and glory in the Project #1 Hall of Fame
Setup
1. Set up your environment by following the directions for acquiring the source code and verifying that it works:
2. Acquire the code for the course as explained here:
3. Download the 20 Newsgroups data set for this assignment from the following URL into a directory adjacent to your code, and follow the directions given above to unpack the data directory.
4. Next, one step will build the codebase, train the most-frequent-label classifier (a simple baseline) on the training set, and run this simple classifier on the development test set (“dev”).
ant lab0 -DDATA=<location of 20_newsgroups> -DSPLIT=<location of 20_newsgroups>/indices/reduced_set
If successful, you will see accuracy figures and a confusion matrix. For reference, the accuracy should be 0.1000.
5. Now you are ready to fill in the methods of the NaïveBayes classifier class. You will actually implement two such classifiers, one with a multivariate Bernoulli event model and one with a multinomial (actually: multivariate categorical) event model, as described in the McCallum and Nigam (1998) paper.
Document Classification
In this project, you are comparing two types of Naïve Bayes models as described in the McCallum and Nigam (1998) paper. Hopefully your results will corroborate the results in that paper. Perhaps more importantly, you will gain experience with the course codebase and be prepared to implement more sophisticated classifiers and clusterers.
We provide a Reader for the data which reads in the “split” of the 20 Newsgroups data set, including both the training and development test. You should only evaluate on the development test set.
The reader creates a Collection of LabeledDatum objects from the provided news files. The difference between a Datum and a LabeledDatum, is simply that a LabeledDatum provides a method for obtaining the label (aka Class) during training. For this project, each LabeledDatum object represents a news item such as the post we examined on the first day of class along with its corresponding label “rec.motorcycles”. The features of these Datums are the ordered lists of tokens comprising the message. Tokenization was accomplished by first discarding all Usenet headers. Next all contiguous sequences of alphabetical characters were culled from the remaining text.
In this project, you’re engaging in supervised learning, so all of the data available for training is labeled. A classifier is trained using the provided framework and the pre-split training and development test sets. The resultant model will then be evaluated for classification accuracy. The trivial MostFrequentLabelClassifier (edu.berkeley.nlp.classify) always chooses sci.space.
Your job here is to build two better classifiers as Java classes which implement ProbabilisticClassifier as follows:
- A Naïve-Bayes classifier using the multinomial (actually: multivariate categorical) event model, smoothed using your choice of smoothing algorithm. Complete the methods in edu.byu.cs.nlp.classify.NaiveBayesClassifierMultinomial .
- A Naïve-Bayes classifier using the multivariate Bernoulli event model, smoothed using your choice of smoothing algorithm. Complete the methods in edu.byu.cs.nlp.classify.NaïveBayesClassifierMultivariateBernoulli .
You most certainly can apply your knowledge of good smoothing algorithms from previous coursework, or you may apply the simple Laplace (add-one) strategy.
There are three pre-built ant targets that will help you debug and test your implementations of these methods. Provided that you place your implementations inside the skeletal classes provided you can use the following commands to run your code:
ant lab1.mn -DDATA=<location of 20_newsgroups> -DSPLIT=<location of 20_newsgroups>/indices/[reduced|full]_set
will train a Multinomial naive bayes model on the full or reduced set (make sure you only specify one). And evaluate it against the development test set.
ant lab1.mvb -DDATA=<location of 20_newsgroups> -DSPLIT=<location of 20_newsgroups>/indices/[reduced|full]_set
will do the same, but for the Multivariate Bernoulli model.
ant lab1 -DDATA=<location of 20_newsgroups> -DSPLIT=<location of 20_newsgroups>/indices/[reduced|full]_set
is the equivalent of running the above two commands in succession. The reduced_set split is provided for your convenience while debugging and tuning your code. There are fewer documents in this split, so you can run experiments with it quickly. All of the experiments you run for inclusion in your final report should be run against the full_set split.
One free parameter with which you will need to experiment is the size of the vocabulary (the feature set) to use. As did McCallum & Nigam, you will also want to "sweep the curve" for several vocabulary sizes. How to choose your vocabulary automatically is up to you. One possibility is to select the top N most frequent vocabulary items. See the file script.sh for an example of how to accomplish this using ant. Note that this solution requires modifying the appropriate targets in your build.xml file.
Once you have identified the trends, pick your best classifier. Use your best classifier to classify the blind
test set using the command:
ant lab1Test -DDATA=<location of newsgroupsCS601R> -DSPLIT=<location of 20_newsgroups>/indices/full_set -DTEST=blind
This will use the last serialized models for the two event models to evaluate to classify the blind test set. Report your accuracy on the blind set. NOTE: You should only test against the blind test set one time. It is considered cheating in the machine learning world to refine your implementation to get a better score on the blind test set.
The highest accuracy on your best classifiers will win a title in the hall of fame, but accuracy is only a secondary goal. Another primary goal of this project is to understand what your best classifier is doing well, what it's doing badly, and why. We provide evaluation code that produces a confusion matrix, showing which label pairs are most often confused.
Report
Turn in a clear, well-structured report that discusses your implementations, describes your experiments with appropriate tables or graphs and accompanying interpretation, and addresses the following questions:
- How are the results surprising?
- Why do you think some pairs are easier or harder?
- Identify any individual errors that are particularly revealing as to why or how the system makes errors.
- Are there cases where you as a human would have trouble? If so, identify a few and discuss them.
- If you were going to invest a large amount of effort into raising the accuracy of this system, what would you do and why?
There is no set length requirement, but I estimate a ball-park of 4 pages for the report. Your report will be graded based on the rubric presented on the following page.
Submit your report in .pdf format as an email to the instructor and the TA.
Rubric
Project #1: Document Classification with Naïve Bayes Name ______________________ Date _______________________ 100 points total: ______ of 25 Discussion and implementation of a Naïve Bayes model with multinomial (actually: multivariate categorical) event model ______ of 25 Discussion and implementation of a Naïve Bayes model with multivariate Bernoulli event model ______ of 20 Presentation and interpretation of experimental results on the development test set. ______ of 25 Analysis of confusion matrix and discussion of questions ______ of 5 Presentation of experimental result on the blind test set Total: ______ of 100 Other Feedback: Notes: ______ Early credit earned on this project ______ Late days used on this project ______ Total late days remaining as of the grading of this project
Acknowledgements
Thanks go to Dan Klein of U.C. Berkeley for the original version of this guidelines document and some of the source code. Thanks also go to BYU students Mark Gulbrandsen, Scott Chun, and Dan Walker for some of the source code.
