CS601R:Project 2 Guidelines

Project #2: Document Classification with Support Vector Machines

Deadlines

  • Early: 2/6/07
  • Due Date: 2/8/07

Objectives

This assignment is designed to:

  • provide hands-on experience with Support Vector Machines, a state-of-the-art approach to classification
  • provide a point of comparison in text classification with Naïve Bayes (from Project #1) by classifying Usenet articles in the 20 Newsgroups dataset
  • engage in the feature engineering process to identify features that are most useful for high accuracy text classification (at least for this data set)
  • build additional understanding of text classification for real problems (e.g., spam classification; document sorting, routing, and filtering; and language identification)
  • Optional (not required): begin to engage in "kernel engineering" by experimenting with multiple kernels
  • give you a chance to compete for honor and glory in the Project #2 Hall of Fame

Setup

1. You should have a working version of the class codebase as a result of your completion of Project #1. If necessary, consult the following directions to get going:

How to prepare your system

Get a copy of the code

2. You should also have a copy of the 20 Newsgroups data set for this assignment. If necessary, follow the directions given above to unpack the data directory.

Data

3. Finally, you should acquire a copy of the LIBSVM toolkit from the following URL:

LIBSVM

You will use the appropriate link on that page to download the software for your platform.

4. Make a copy of the Classifier tester class, and uncomment the lines that transform the data and write the indexed and transformed data out to libsvm-compatible files.

5. Now you are ready to train SVMs from the serialized vectors using the external tool.

6. Although it has been suggested that you could use the supercomputer to accelerate your experimentation, it is not strictly necessary. We have actually simplified the lab significantly with an eye toward enabling you to complete the lab successfully on a normal workstation.

Background

In this project, you are experimenting with features extracted from the documents in the 20 Newsgroups data set using a linear kernel in the framework of Support Vector Machines. This constitutes a simplified set of experiments, along the lines of the experiments described in the paper by Joachims (1998). Hopefully your results will corroborate some of the results in that paper. You will also compare the SVM results with your Project #1 results from Naïve Bayes on the same data sets.


In the course codebase, we provide a Reader for the data to read in the "split" of the 20 Newsgroups data set, including both the training and development test sets. As discussed in class, the training set is for training models, data inspection, and for feature engineering; the development test set is for error analysis and evaluation. We allow ourselves to conduct error analysis on the development test set because we have a blind test set. If there were no blind test set, we would maintain a stricter discipline and avoid error analysis on the development test set. You will also evaluate your models on the blind test set. We trust you to not even look at the blind test set, and to only run against it once.


As in Project #1, you are engaging in supervised learning, so all of the data available for training is labeled. The Reader creates a Collection of LabeledDatum objects from the provided news files. Each LabeledDatum object represents a single news item or document. The features of these Datums are the ordered lists of tokens comprising the message. Tokenization was accomplished by first discarding all Usenet headers. Next, all contiguous sequences of alphabetic characters were extracted from the remaining text and converted to lower case.


In this project, we are only using the Reader and a FeatureTransformer in the course codebase to extract the desired features into feature vectors for subsequent processing outside of this codebase.

Feature Engineering

As discussed in class, feature engineering consists of the following steps:

  1. define an initial set of features
  2. extract feature vectors for the training and development test sets
  3. train a model from the training set
  4. evaluate on the training set and the development test set
  5. characterize the performance of the model on the dev. test set using suitable metrics; if satisfied with performance, quit
  6. conduct error analysis; for classification, a confusion matrix is a useful tool for identifying classes of errors
  7. identify additional features that will be useful in discriminating amongst the confused cases, perhaps focusing on the cell in the confusion matrix where greatest confusion occurs
  8. return to step #2


You will accomplish steps #1 and #2 in our class codebase. Write your code in such a way that it creates consistently indexed, libsvm-formatted data files for each of the 3 parts of the data split. That is, you should have a training data file, a dev-test data file, and a blind-test data file. "Consistently indexed" means that if class "alt.atheism" is mapped to 1 in the training set, it needs to map to 1 in both of the test sets. Likewise, if a feature X maps to n in training, it needs to map to n in both test sets as well. See the "Codebase Resources" section below for some hints and help on implementing the indexing functionality and writing data out in the appropriate format. You will evaluate on the blind-test data only once.
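The following is a minimal sketch of what consistent indexing can look like, independent of the class codebase. The class name LibsvmIndexSketch and its methods are hypothetical and are not part of the course code. It assumes indices are assigned while processing the training set and then frozen, so that the test sets reuse the same mapping and features never seen in training are dropped. Output lines follow the libsvm layout of a label followed by index:value pairs in ascending index order.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the class codebase.
class LibsvmIndexSketch {
    // Maps each string (a label or a feature) to a stable 1-based integer index.
    private final Map<String, Integer> index = new LinkedHashMap<>();
    private boolean frozen = false;

    // Stop assigning new indices; call this once the training set has been indexed.
    public void freeze() {
        frozen = true;
    }

    // Returns the index for key, assigning the next free index if still allowed.
    // Returns -1 for a key that is first seen after freeze().
    public int indexOf(String key) {
        Integer existing = index.get(key);
        if (existing != null) {
            return existing;
        }
        if (frozen) {
            return -1;
        }
        int next = index.size() + 1;
        index.put(key, next);
        return next;
    }

    // Formats one document as "label idx:count idx:count ..." with ascending indices.
    // Tokens without an index (unseen in training) are skipped.
    public static String toLibsvmLine(LibsvmIndexSketch labels, LibsvmIndexSketch features,
                                      String label, List<String> tokens) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            int f = features.indexOf(token);
            if (f > 0) {
                counts.merge(f, 1, Integer::sum);
            }
        }
        StringBuilder sb = new StringBuilder().append(labels.indexOf(label));
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```

In this sketch you would index every training document first, call freeze() on both the label index and the feature index, and only then format the dev-test and blind-test documents. It relies on every class appearing in the training set, since a label first seen after freeze() would be written as -1.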


You will need to define feature extraction functions by extending the FeatureTransformer for steps #1 and #7. You may also decide to expose additional information about the original Usenet articles by modifying the Reader. For example, you may wish to include capitalization information, or you may wish to include aspects of the article header (excluding the true label, of course!).


As advised by Joachims, we recommend that you normalize your feature vectors to unit length.
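As an illustration, the sketch below scales a sparse count vector to unit length, assuming "unit length" means a Euclidean (L2) norm of 1. The class UnitLengthSketch is hypothetical and stands apart from the class codebase.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the class codebase.
class UnitLengthSketch {
    // Returns a copy of the sparse vector scaled so that its Euclidean (L2) norm is 1.
    // An all-zero (or empty) vector is returned unchanged to avoid dividing by zero.
    public static Map<Integer, Double> normalize(Map<Integer, Double> vector) {
        double sumOfSquares = 0.0;
        for (double value : vector.values()) {
            sumOfSquares += value * value;
        }
        double norm = Math.sqrt(sumOfSquares);
        Map<Integer, Double> result = new TreeMap<>();
        for (Map.Entry<Integer, Double> e : vector.entrySet()) {
            result.put(e.getKey(), norm == 0.0 ? e.getValue() : e.getValue() / norm);
        }
        return result;
    }
}
```

For example, a vector with counts 3 and 4 has norm 5, so its normalized entries are 0.6 and 0.8.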


Use the linear kernel (-t 0 in LIBSVM) with a slack parameter of 10 (-c 10) as the basis for your feature engineering process. Describe your feature engineering process, including your efforts at error analysis, and your final feature set. These will be important elements of your project report.


Steps #3-#5 will be accomplished with the LIBSVM tools. In particular, you will use the following options for the LIBSVM tool:

-s   0   denotes an SVM for classification (C-SVC)

-t   {0, 1, 2, 3}   set type of kernel function (default 2)
* 0 -- linear: u'*v
* 1 -- polynomial: (gamma*u'*v + coef0)^degree
* 2 -- radial basis function: exp(-gamma*|u-v|^2)
* 3 -- sigmoid: tanh(gamma*u'*v + coef0)

-c    cost   this is the "slack" parameter


Note that as you change your feature set, you may benefit from tuning your slack parameter. Do not spend excessive time on this. Consider orders of magnitude (e.g., 10, 1, 0.1, 0.01, …).
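A coarse sweep over the slack parameter can be scripted. The sketch below only builds and prints the command lines for each cost value, using the options listed above. The class CostSweepSketch and the file names train.libsvm and devtest.libsvm are placeholders, and the executable names svm-train and svm-predict are an assumption that may differ on your platform.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: prints one svm-train / svm-predict pair per cost value.
class CostSweepSketch {
    // Arguments for training a C-SVC (-s 0) with a linear kernel (-t 0) at the given cost.
    public static List<String> trainCommand(double cost, String trainFile, String modelFile) {
        return Arrays.asList("svm-train", "-s", "0", "-t", "0",
                "-c", String.valueOf(cost), trainFile, modelFile);
    }

    // Arguments for predicting labels of testFile with a trained model.
    public static List<String> predictCommand(String testFile, String modelFile, String outFile) {
        return Arrays.asList("svm-predict", testFile, modelFile, outFile);
    }

    public static void main(String[] args) {
        double[] costs = {10, 1, 0.1, 0.01};  // a few orders of magnitude
        for (double cost : costs) {
            String model = "model_c" + cost;           // placeholder model file name
            String output = "devtest_c" + cost + ".pred";  // placeholder prediction file name
            System.out.println(String.join(" ", trainCommand(cost, "train.libsvm", model)));
            System.out.println(String.join(" ", predictCommand("devtest.libsvm", model, output)));
        }
    }
}
```

The printed lines can be pasted into a shell, or the same argument lists can be handed to a ProcessBuilder if you prefer to run the tools from Java.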


The crux of step #6 is to understand what your classifiers are doing well, what they’re doing badly, and why. Although we provided evaluation code in our class codebase that produces a confusion matrix, we are not using our codebase’s implementation of a classifier. Classification is being performed by LIBSVM. Consequently, you will need to digest the output of LIBSVM (using any programming language you prefer) and generate a confusion matrix to facilitate your feature engineering efforts.
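One possible way to build the confusion matrix is sketched below. It assumes the prediction output file holds one predicted label per line, in the same order as the test file, and that labels are integers from 1 to the number of classes. The class ConfusionMatrixSketch is hypothetical.

```java
import java.util.List;

// Hypothetical helper, not part of the class codebase.
class ConfusionMatrixSketch {
    // Builds a matrix whose rows are true labels and whose columns are predicted labels.
    // goldLines: lines of the libsvm-format test file (the label is the first token).
    // predictedLines: lines of the prediction output (assumed one label per line).
    public static int[][] build(List<String> goldLines, List<String> predictedLines, int numClasses) {
        if (goldLines.size() != predictedLines.size()) {
            throw new IllegalArgumentException("gold and predicted files differ in length");
        }
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < goldLines.size(); i++) {
            int gold = parseLabel(goldLines.get(i).trim().split("\\s+")[0]);
            int predicted = parseLabel(predictedLines.get(i).trim());
            matrix[gold - 1][predicted - 1]++;
        }
        return matrix;
    }

    // Accepts labels written either as "3" or as "3.0".
    private static int parseLabel(String text) {
        return (int) Math.round(Double.parseDouble(text));
    }

    // Fraction of documents on the diagonal, i.e. classified correctly.
    public static double accuracy(int[][] matrix) {
        int correct = 0;
        int total = 0;
        for (int row = 0; row < matrix.length; row++) {
            for (int col = 0; col < matrix[row].length; col++) {
                total += matrix[row][col];
                if (row == col) {
                    correct += matrix[row][col];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }
}
```

The largest off-diagonal cell then points at the pair of newsgroups most often confused, which is a natural place to look for new features.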


Each step of the cycle should be well documented in your report. You should identify the following for each of at least four cycles:

  • features introduced
  • experimental results
  • error analysis
  • reasons for new features

Codebase Resources

This section will help you use the resources of the class codebase to accomplish the feature engineering portion of the lab.

Extracting Data

The NewsGroupParser class is responsible for extracting text from the 20 Newsgroups data files and creating individual feature tokens. You may want to consider modifying this class (or creating a new class with similar functionality) so that you can gain access to more or different features in the document. Here are some things you might want to try:

  • Include capitalization information in tokens
  • Include tokens from the file headers
  • Include information about number tokens; for example, you might create a token <NUMBER> that you include every time a number is mentioned in a document
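As a rough illustration of the first and third ideas, the sketch below tokenizes text into lower-cased words, replaces each run of digits with <NUMBER>, and adds a marker token after words written entirely in capitals. The class TokenizerSketch and the <ALLCAPS> marker are hypothetical. This is not how NewsGroupParser is implemented, only one way such a tokenizer could look.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer, independent of NewsGroupParser.
class TokenizerSketch {
    // Matches either a run of ASCII letters or a run of digits.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z]+|[0-9]+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            String raw = matcher.group();
            if (Character.isDigit(raw.charAt(0))) {
                // Collapse every number into a single shared token.
                tokens.add("<NUMBER>");
            } else {
                tokens.add(raw.toLowerCase());
                // Keep capitalization information as a separate marker token.
                if (raw.length() > 1 && raw.equals(raw.toUpperCase())) {
                    tokens.add("<ALLCAPS>");
                }
            }
        }
        return tokens;
    }
}
```

Emitting capitalization as a separate marker keeps the lower-cased word counts unchanged, so the effect of the new feature can be measured on its own.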

Vectorizing and Indexing Data

Once you have a set of LabeledDatum objects that contain all of the features that you wanted to extract from the data files, there is support code that will help you to accomplish indexing and vectorizing simply.

First, to convert a LabeledDatum<String,String> to an indexed version of type LabeledDatum<String, Integer>, you can use the StringtoIntegerIndexingTransformer class. Then, to convert an indexed set of datums to count-vectorized VectorLabeledDatums, you can use the CountVectorizer class. Each resulting datum has a toString() method that, assuming the labels and features have been converted to integer indexes, will output the vector in the format libsvm uses for its training and test data files. For example, given a collection Collection<LabeledDatum<String,String>> trainingData, you can convert it to integer-indexed vectors and print them to the screen in libsvm format using the following code:

	// Turns indexed datums into count vectors.
	DataTransformer vectorizer = new CountVectorizer<Integer,Integer>();
	// Maps string labels and features to integer indexes.
	DataTransformer integerDT = new StringtoIntegerIndexingTransformer();
	// Index first, then vectorize.
	List<LabeledDatum<Integer, String>> processedData = vectorizer.transform( integerDT.transform( trainingData));

	// toString() yields the vector in libsvm format.
	for(LabeledDatum<Integer, String> currentDatum : processedData)
	{
		System.out.printf("Current Vector: %s\n", currentDatum.toString());
	}

Note that VectorLabeledDatum also has methods that will normalize the vectors to unit length. See the documentation of the two normalize methods in VectorLabeledDatum for more information.

Choosing Parameters etc.

You may want to use the easynoscale.py script in the codebase root for an easy way to run libsvm that automatically selects parameters for the kernel you are using.

Report

In addition to documenting the feature engineering process in the report, address the following questions:

  • What raw features did you add and why?
  • How did the feature engineering process improve accuracy on this dataset?
  • Can you propose a kernel that might be helpful in computing similarity between two documents?


Turn in a clear, well-structured report that discusses your implementations, describes your experiments with appropriate tables or graphs and accompanying interpretation, and addresses the above questions. There is no set length requirement, but a ballpark estimate is 4 pages. Your report will be graded according to the rubric below.


Rubric

Project #2:  Document Classification with Support Vector Machines

Name ______________________

Date _______________________


100 points total:

______ of 40    Discussion of each step of the feature engineering process using an SVM with the linear kernel, including:
                your features (20)
                your error analysis and your reasons for introducing each feature (20)

______ of 30    Experimental results from each step of the feature engineering process (probably interleaved with the above discussion)

______ of 10    Presentation and interpretation of final experimental results on the development test set, compared with your Project #1 Naïve Bayes results

______ of 5     Final experimental result on the blind test set

______ of 15    Discussion of questions


Total:
______ of 100

Other Feedback:







Notes:

______ Early credit earned on this project

______ Late days used on this project

______ Total late days remaining as of the grading of this project
