CS601R:Project 1 Guidelines

From NLPWiki

Jump to: navigation, search

Back to CS601R Main Page

Contents

Project #1: Document Classification with Naïve Bayes

Deadlines

  • Early: 1/23/07
  • Due Date: 1/25/07


Objectives

This assignment is designed to:

  • solidify the ideas of text classification by classifying Usenet articles in the 20 Newsgroups dataset
  • build understanding necessary for problems that use text classification (especially spam classification, document sorting, routing, and filtering, language identification, etc.)
  • provide hands-on experience with Naïve Bayes, a simple graphical model
  • implement two approaches to Naïve Bayes
  • give some experience with smoothing probability distributions
  • set the stage for semi-supervised and unsupervised learning
  • give you a chance to compete for honor and glory in the Project #1 Hall of Fame


Setup

1. Set up your environment by following the directions for acquiring the source code and verifying that it works:

How to prepare your system

2. Acquire the code for the course as explained here:

Get a copy of the code

3. Download the 20 Newsgroups data set for this assignment from the following URL into a directory adjacent to your code, and follow the directions given above to unpack the data directory.

Data

4. Next, one step will build the codebase, train the most-frequent-label classifier (a simple baseline) on the training set, and run this simple classifier on the development test set (“dev”).

ant lab0 -DDATA=<location of 20_newsgroups> -DSPLIT=<location of 20_newsgroups>/indices/reduced_set

If successful, you will see accuracy figures and a confusion matrix. For reference, the accuracy should be 0.1000.

5. Now you are ready to fill in the methods of the NaïveBayes classifier class. You will actually implement two such classifiers, one with a multivariate Bernoulli event model and one with a multinomial (actually: multivariate categorical) event model, as described in the McCallum and Nigam (1998) paper.

Document Classification

In this project, you are comparing two types of Naïve Bayes models as described in the McCallum and Nigam (1998) paper. Hopefully your results will corroborate the results in that paper. Perhaps more importantly, you will gain experience with the course codebase and be prepared to implement more sophisticated classifiers and clusterers.


We provide a Reader for the data which reads in the “split” of the 20 Newsgroups data set, including both the training and development test. You should only evaluate on the development test set.

The reader creates a Collection of LabeledDatum objects from the provided news files. The difference between a Datum and a LabeledDatum, is simply that a LabeledDatum provides a method for obtaining the label (aka Class) during training. For this project, each LabeledDatum object represents a news item such as the post we examined on the first day of class along with its corresponding label “rec.motorcycles”. The features of these Datums are the ordered lists of tokens comprising the message. Tokenization was accomplished by first discarding all Usenet headers. Next all contiguous sequences of alphabetical characters were culled from the remaining text.


In this project, you’re engaging in supervised learning, so all of the data available for training is labeled. A classifier is trained using the provided framework and the pre-split training and development test sets. The resultant model will then be evaluated for classification accuracy. The trivial MostFrequentLabelClassifier (edu.berkeley.nlp.classify) always chooses sci.space.


Your job here is to build two better classifiers as Java classes which implement ProbabilisticClassifier as follows:

  • A Naïve-Bayes classifier using the multinomial (actually: multivariate categorical) event model, smoothed using your choice of smoothing algorithm. Complete the methods in edu.byu.cs.nlp.classify.NaiveBayesClassifierMultinomial .
  • A Naïve-Bayes classifier using the multivariate Bernoulli event model, smoothed using your choice of smoothing algorithm. Complete the methods in edu.byu.cs.nlp.classify.NaïveBayesClassifierMultivariateBernoulli .

You most certainly can apply your knowledge of good smoothing algorithms from previous coursework, or you may apply the simple Laplace (add-one) strategy.


There are three pre-built ant targets that will help you debug and test your implementations of these methods. Provided that you place your implementations inside the skeletal classes provided you can use the following commands to run your code:

ant  lab1.mn -DDATA=<location of 20_newsgroups> -DSPLIT=<location of 20_newsgroups>/indices/[reduced|full]_set

will train a Multinomial naive bayes model on the full or reduced set (make sure you only specify one). And evaluate it against the development test set.

ant  lab1.mvb -DDATA=<location of 20_newsgroups> -DSPLIT=<location of 20_newsgroups>/indices/[reduced|full]_set

will do the same, but for the Multivariate Bernoulli model.

ant  lab1 -DDATA=<location of 20_newsgroups> -DSPLIT=<location of 20_newsgroups>/indices/[reduced|full]_set

is the equivalent of running the above two commands in succession. The reduced_set split is provided for your convenience while debugging and tuning your code. There are fewer documents in this split, so you can run experiments with it quickly. All of the experiments you run for inclusion in your final report should be run against the full_set split.


One free parameter with which you will need to experiment is the size of the vocabulary (the feature set) to use. As did McCallum & Nigam, you will also want to "sweep the curve" for several vocabulary sizes. How to choose your vocabulary automatically is up to you. One possibility is to select the top N most frequent vocabulary items. See the file script.sh for an example of how to accomplish this using ant. Note that this solution requires modifying the appropriate targets in your build.xml file.


Once you have identified the trends, pick your best classifier. Use your best classifier to classify the blind test set using the command:

ant  lab1Test -DDATA=<location of newsgroupsCS601R> -DSPLIT=<location of 20_newsgroups>/indices/full_set -DTEST=blind

This will use the last serialized models for the two event models to evaluate to classify the blind test set. Report your accuracy on the blind set. NOTE: You should only test against the blind test set one time. It is considered cheating in the machine learning world to refine your implementation to get a better score on the blind test set.


The highest accuracy on your best classifiers will win a title in the hall of fame, but accuracy is only a secondary goal. Another primary goal of this project is to understand what your best classifier is doing well, what it's doing badly, and why. We provide evaluation code that produces a confusion matrix, showing which label pairs are most often confused.

Report

Turn in a clear, well-structured report that discusses your implementations, describes your experiments with appropriate tables or graphs and accompanying interpretation, and addresses the following questions:

  • How are the results surprising?
  • Why do you think some pairs are easier or harder?
  • Identify any individual errors that are particularly revealing as to why or how the system makes errors.
  • Are there cases where you as a human would have trouble? If so, identify a few and discuss them.
  • If you were going to invest a large amount of effort into raising the accuracy of this system, what would you do and why?


There is no set length requirement, but I estimate a ball-park of 4 pages for the report. Your report will be graded based on the rubric presented on the following page.

Submit your report in .pdf format as an email to the instructor and the TA.


Rubric

Project #1:  Document Classification with Naïve Bayes 

Name ______________________

Date _______________________


100 points total:

______ of 25	Discussion and implementation of a Naïve Bayes model with multinomial (actually: multivariate categorical) event model
	
______ of 25	Discussion and implementation of a Naïve Bayes model with multivariate Bernoulli event model

______ of 20	Presentation and interpretation of experimental results on the development test set.

______ of 25	Analysis of confusion matrix and discussion of questions

______ of 5    Presentation of experimental result on the blind test set

Total:  ______ of 100


Other Feedback:







Notes:

______ Early credit earned on this project

______ Late days used on this project

______ Total late days remaining as of the grading of this project

Acknowledgements

Thanks go to Dan Klein of U.C. Berkeley for the original version of this guidelines document and some of the source code. Thanks also go to BYU students Mark Gulbrandsen, Scott Chun, and Dan Walker for some of the source code.

Personal tools