Readings
Week 1: Text Classification with Naive Bayes
- "A Comparison of Event Models for Naive Bayes Text Classification", by Andrew McCallum and Kamal Nigam. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48. Technical Report WS-98-05. AAAI Press. 1998. PDF.
- (optional) "Naive Bayes Text Classification: A Statistical Natural Language Processing Project", by Chris Monson. PDF (media:Chris_Monson.pdf)
Week 2: Semi-Supervised Learning with Naive Bayes and Expectation Maximization
- "Learning to Classify Text from Labeled and Unlabeled Documents", by Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. PDF (8 pages)
- (optional) "Text Classification from Labeled and Unlabeled Documents using EM", by Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Machine Learning, 39(2/3). pp. 103-134. 2000. PDF (34 pages)
Week 3: Text Classification with Maximum Entropy
- "Using Maximum Entropy for Text Classification", by Kamal Nigam, John Lafferty, Andrew McCallum. PDF (7 pages)
- (optional) "A Maximum Entropy Approach to Natural Language Processing", by Adam Berger, Vincent Della Pietra, Stephen Della Pietra. PDF (34 pages)
Week 4: Feature Selection
- Sections 5.1-5.4 of Manning & Schütze, "Foundations of Statistical Natural Language Processing", covering mutual information and the log-likelihood ratio.
- (optional) "A comparative study on feature selection for text categorization", by Yiming Yang and Jan Pedersen. PDF
Week 5: Feature Selection in the Learning Loop
- "A Maximum Entropy Approach to Natural Language Processing", by Adam Berger, Vincent Della Pietra, Stephen Della Pietra. Focus on Section 4, which covers feature selection in the learning loop. PDF
Week 6: Feature Selection as Word Clustering
- "Distributional Clustering of Words for Text Classification", by Douglas Baker and Andrew McCallum. PDF
Week 7: Text Classification with Support Vector Machines
- Work through as much of the SVM Tutorial by Nello Cristianini as you can. I don't expect you to get all the way through this. Presentation slides from ICML 2001 Tutorial: PDF
- "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", by Thorsten Joachims. PDF
Moving on to text clustering ...
Weeks 8 & 9: Clustering with Naive Bayes
- "An Experimental Comparison of Several Clustering and Initialization Methods", by Marina Meila and David Heckerman. Try to fight through the whole thing. PS
Week 10: Bayesian Smoothing
- "Bayesian smoothing through text classification", by Tom Griffiths. [3]
Week 11: Going Beyond Naive Bayes
- "Latent Dirichlet Allocation", by D. Blei, A. Ng, and M. Jordan. This is dense. Read as much of this as you can. PDF
- Blei's code is also available here: [4]
Extra reading:
Clustering Email
- "Inferring Ongoing Activities of Workstation Users by Clustering Email". PDF. Shorter version: PDF
- "Automatic Discovery of Personal Topics To Organize Email", by Arun C. Surendran, John C. Platt and Erin Renshaw. Conference on Email and Anti-Spam, 21-22 July 2005, Stanford University. PDF
- "Restrictive Clustering and Metaclustering for Self-Organizing Document Collections". [5]
