CS601R:Project 4 Guidelines
From NLPWiki
Back to CS601R Main Page
Contents |
Project #4: Document Clustering Using EM on a Mixture of Multinomials
Deadlines
- Early: 3/12/07
- Due Date: 3/14/07
Objectives
This assignment is designed to:
- give you a hands-on introduction to document clustering
- provide experience with expectation maximization on a straightforward probabilistic model, the mixture of multinomials (i.e., multinomial Naïve Bayes with an unobserved class label)
- provide comparative experience with "classification" expectation maximization (CEM) or "hard" EM
- give experience with the vector-space k-means clustering algorithm as a baseline for comparison
- get you involved in evaluating cluster quality by algorithmic means
- introduce a new data set: the Reuters news corpus
- help you assess the time efficiency of these algorithms
- give you a chance to compete for honor and glory in the Project #4 Hall of Fame
Setup
1. You should have a working version of the class codebase as a result of your work on the earlier projects. If necessary, consult the following directions to get going:
2. Be sure to incorporate the k-means update from Dan in your codebase.
3. You will need a copy of the Reuters data set and one other (the given reduced set of 20 Newsgroups or the given reduced set of del.icio.us) for this assignment. You are not allowed to redistribute the Reuters data. You must read the license terms and submit a signed copy of the signature form to Dr. Ringger in order to have permission to use the data. Both documents are available in the "Course Materials" section of this course’s Blackboard pages. Submit the signature form in class ASAP.
4. Retrieve the latest copy of the data from the following URLs into directories adjacent to your code, and follow the directions given on the above wiki page to unpack the data directory. Note that the data directories now require authentication.
5. Out of the box, one build target will simply cluster the given documents using k-means. If successful, you will see clustering metrics and a "supervised" confusion matrix; this will serve as your baseline. Note that there is no "-DTEST" argument, since there is no test set to speak of.
ant cluster.example -DDATA=<location of 20 Newsgroups Data> -DN=2 -DSPLIT=<location of 20 Newsgroups data>/indices/[full|reduced]_set
In addition, there are targets that will run the cluster test harness against the Reuters data set and the other data sets, respectively:
ant lab4.reuters -DDATA=<location of reuters data> -DSPLIT=<location of reuters data>/indices/[full|reduced]_set
ant lab4.newsgroups -DDATA=<location of 20 Newsgroups data> -DSPLIT=<location of 20 Newsgroups data>/indices
ant lab4.del.icio.us -DDATA=<path to del.icio.us data> -DSPLIT=<path to del.icio.us data>/new_indices/[allDevSplit|reducedSplit]
These targets are set up to create an instance of the stub class edu.byu.cs.nlp.classify.NaiveBayesEMClusterer which you may use as a starting point for this lab. You will probably need to modify these targets, depending on the command-line parameters you use.
6. For this lab, you will implement EM on a mixture of multinomials (i.e., multinomial Naïve Bayes with unobserved classes). All work will be done in our class codebase. The build target for each dataset should build the codebase, read the data, extract features into document feature vectors, cluster the data using your EM algorithm using your best value of k (number of clusters), and compute the "supervised" confusion matrix as well as the "unsupervised" divergence matrix, divergence vector, and summary divergence metrics (described below).
Background
In this project, you are working with labeled data; however, you are going to pretend that it is unlabeled for the purpose of clustering. The labels will come back later to help us do cluster analysis. More on that below.
You will work with 1990s news data from Reuters; you are not allowed to redistribute this data and must read the license terms and submit a signed copy of the signature form to Dr. Ringger in order to have permission to use the data. Both documents are available in the "Course Materials" section of this course’s Blackboard pages.
You will also work with a given subset of the 20 Newsgroups data or the del.icio.us data set from our previous projects. Another option is the del.icio.us data. You choose which second set you want to work with.
You will want to employ some reasonable (preferably simple) kind of feature selection or reduction. Make a choice, advocate your choice, and describe your choice. Note that techniques like distributional mutual information (aka information gain) are not legal, since they would require looking at the labels. Looking at the labels is cheating. Even more so than in document classification, features matter in document clustering, because they establish which patterns will be salient to the algorithm. If you find that your clusterer is forcing everything into just a cluster or two, then you probably need to reconsider the features you are extracting from your documents.
Your goal is to build a good document clustering system. You will implement EM on a mixture of multinomials. What do we mean by a "mixture of multinomials"? That’s another name for the familiar "Naïve Bayes" graphical model with the top node (representing the random variable for document class) being unobserved for every document. Consequently, EM will allow for P(c_j | d_i) to be potentially non-zero for each document d_i and each class c_j. The multinomial distribution given a specific class c_j is also called a "mixture component".
How many mixture components should you assume? We will use the variable k to represent this number. It is unclear what k should be, so you will run experiments beginning with k=2. You should do a line search for the value of k that yields the maximal log likelihood of a held-out set of data, P(D | model parameters, including k) in EM. Do so for each of the two data sets independently. Capture this process and the result in suitable graphs. Note that you may or may not observe a peak in this curve. If this is the case, then you may choose k equal to the number of labeled classes in the given data set.
Regarding initialization methods, Meila and Heckerman give 3 ways that they initialized their distributions, but their results showed that it was basically a wash (on the real data set) as to which provided the best cluster quality. You should pick one of their methods and describe your approach thoroughly. Robbie Haertel has made the following observation: "I think there is an issue with Meila & Heckerman’s data-dependent initialization scheme ("Noisy Marginal", aka "Marginal"): it very readily introduces 0s into the parameter vectors for the multinomials, causing all sorts of problems, as far as I can tell." You may want to stick with the random or HAC initialization schemes in order to avoid this issue until we figure out a way around this.
After you have implemented EM, also create a "Classification EM" algorithm, by making the necessary changes to your EM implementation, as discussed in class.
Baseline
Once you have settled on a value of k for a particular data set, you will run the baseline k-means algorithm with the same value of k. Having done so, you will be ready to evaluate the clusters. Note that our implementation of k-means is in document vector space and regards the documents as vectors in that space; as such it is not possible to tune k directly on k-means, since there is no notion of data likelihood in that paradigm.
Automatic Evaluation
For each data set, you should now have three clusterings, one using EM for the optimal value of k (with respect to log-likelihood), one using CEM, and one using k-means (using that same k value). You may use the reduced data set splits in all cases.
The code provided for you implements several metrics to evaluate the clusterings. Report the values of the metrics for each value of k, using EM, and for the chosen value of k, using CEM and k-means. The metrics are:
- F-Measure (Steinbach, et al., 2000)
- Entropy (Steinbach, et al., 2000)
- Variation of Information (Meila, 2002)
- Adjusted Rand Index (Yeung, Ruzzo, 2001)
- V-Measure (Rosenberg, and Hirschberg, 2007)
- Q_2 (Byron Dom, 2001)
In addition to these, the code generates a "divergence matrix" and total convergence/divergence values. The divergence matrix may exist in the literature, but if so, we are not aware of it. To our knowledge (currently), this is an idea original to this project. It’s actually quite straightforward. Lay out a matrix with cluster ids along the columns and the rows. In cell D[i, j] of the divergence matrix D, set D[i,j] = KL(P_i || P_j), the K-L divergence of distribution P_i from distribution P_j. Note that D is not symmetric. This metric is part of on-going research, and you can ignore it for this lab.
With regard to experimental procedures, since the clustering algorithms start out with initializations having a random element, running an experiment once doesn't give an accurate picture of the algorithm's performance. For each experiment, run it at least three times and report the averages for each metric along with the metric values for each run. Thus, your tables should include data for:
- F-Measure
- Entropy
- Variation of Information
- Adjusted Rand Index
- V-Measure
- Q_2
- Log likelihood of the data (EM/CEM only)
- Time to completion
Qualitative Analysis
You will also engage in extensive qualitative evaluation of the best clustering of one of your data sets. You should use any of the following means to get a feel for your output:
- Cluster browser
- Confusion matrix
- The top N words according to p(c_i | word) for each class
Discuss what you find and explain the degree to which this is a satisfying clustering of the data.
Report
Turn in a clear, well-structured, self-contained report that discusses your implementations, describes your experiments with appropriate tables or graphs and accompanying interpretation, and shares any other insights you think will be helpful in interpreting your work. There is no set length requirement, but I estimate a ball-park of 4-5 pages for the report. Your report will be graded based on the rubric presented on the following page.
Hints
Drawing from a Dirichlet
There is a library in the codebase that makes sampling from the Dirichlet easy. The documentation for that library can be found here: http://www.arbylon.net/projects/knowceans-tools/doc/.
This code should make it easier to implement the initialization techniques from the Meila and Heckerman paper. Here's some sample code for using that library to draw 10 vectors of 20 elements in length from an uninformative (uniform) Dirichlet:
import org.knowceans.util.Cokus; import org.knowceans.util.Samplers; ... // These are the "alphas" double [] allOnes = new double[20]; for(int i = 0; i < allOnes.length; i++) allOnes[i] = 1; // Important to seed so that numbers are different on each run Cokus.seed((int)new Date().getTime()); // These are the thetas double [][] samples = Samplers.randDir(allOnes, 10);
Rubric v2
Project #4: Document Clustering Using EM on a Mixture of Multinomials Name ______________________ Date _______________________ 100 points total: ______ of 30 Implementation design * Feature selection * Feature reduction * Model initialization * Brief description of learning algorithm * Mathematical equations representing distributions and other quantities calculated/estimated as part of the operation of the algorithm ______ of 20 Experimental methodology * Correct use of data (training/dev test/blind test/held-out * Comparison to reasonable baseline * Experiment design choices are explained and justified/justifiable. ______ or 20 Evaluation * Trends are presented * Aberrations are noted and causes are investigated * Conclusions drawn are logical and justified by the experimental data ______ of 20 Follow lab instructions * It should be clear from your write-up that you followed the directions given in the lab specifications. For lab 4, this includes, but is not limited to, the following: * Discussion of your cluster initialization method * Discussion of your approach to feature selection and document vector representation * Implementation of Expectation Maximization (EM) and Classification EM (CEM) * Discussion and experimental results, including metrics, on the Reuters data set * Discussion and experimental results, including metrics, on a second data set (subset of 20 Newsgroups, or subset of del.icio.us) * Qualitative analysis of best clustering on one data set (your choice) ______ of 10 Clear/organized writing * Grammar and spelling mistakes should be few and far-between * All figures and tables should be labeled with descriptive captions. They should be easy to read/interpret. Keep tables reasonably sized, and well organized. Make sure that curves in figures are clearly labeled and distinguishable, even when printed in black-and-white. * Paper organization should be logical, and should help lead the reader through work, helping the reader to see how conclusions follow from the methods and experimental results. * Writing should be professional in tone, avoiding: colloquialisms, contractions, first and second person references to the the author or reader, and the passive voice. Approach the report as though it were an academic paper to be submitted for publication in a conference or as a technical report. Total: ______ of 100 Other Feedback: Notes: ______ Early credit earned on this project ______ Late days used on this project ______ Total late days remaining as of the grading of this project
Rubric v1
Project #4: Document Clustering Using EM on a Mixture of Multinomials
Name ______________________
Date _______________________
100 points total:
______ of 10 Discussion of your cluster initialization method
______ of 10 Discussion of your approach to feature selection and document vector representation
______ of 40 Implementation of Expectation Maximization (EM) and Classification EM (CEM)
______ of 10 Discussion and experimental results, including metrics, on the Reuters data set
______ of 10 Discussion and experimental results, including metrics, on a second data set
(subset of 20 Newsgroups, or subset of del.icio.us)
______ of 20 Qualitative analysis of best clustering on one data set (your choice)
Total:
______ of 100
Other Feedback:
Notes:
______ Early credit earned on this project
______ Late days used on this project
______ Total late days remaining as of the grading of this project
Rubric v1
Project #4: Document Clustering Using EM on a Mixture of Multinomials
Name ______________________
Date _______________________
100 points total:
______ of 10 Discussion of your cluster initialization method
______ of 10 Discussion of your approach to feature selection and document vector representation
______ of 40 Implementation of Expectation Maximization (EM) and Classification EM (CEM)
______ of 10 Discussion and experimental results, including metrics, on the Reuters data set
______ of 10 Discussion and experimental results, including metrics, on a second data set
(subset of 20 Newsgroups, or subset of del.icio.us)
______ of 20 Qualitative analysis of best clustering on one data set (your choice)
Total:
______ of 100
Other Feedback:
Notes:
______ Early credit earned on this project
______ Late days used on this project
______ Total late days remaining as of the grading of this project
Acknowledgments
Thanks to Dan Klein of U.C. Berkeley for much of the code in the edu.berkeley hierarchy. Thanks to our own Dan Walker for most of the rest of the code.
