CS601R:Project 5 Guidelines
Project #5: Topic Identification
Deadlines
- Early Day for Proposal: 3/21/08
- Due Date for Proposal: 3/24/08
- Early Day for Project Report: 4/7/08
- Due Date for Project Report: 4/8/08
- Presentations: 4/9/08, 4/11/08, 4/14/08
Objectives
This assignment is designed to:
- give you an opportunity to either implement a topic identification algorithm using a technique of your choice from the literature OR apply an existing topic identification algorithm in a compelling application (subject to instructor approval of your proposal)
- involve you in the quantitative evaluation of your topic identification algorithm/application
- help you conduct a qualitative assessment of the results of your project in order to understand the degree to which your project is successful on two different data sets
- introduce a new data set: the Enron email corpus
- serve as a springboard for a project you could continue beyond this class and publish in a top-tier conference
Setup
1. Consult the list of readings on the course schedule.
- Option A: select a published topic identification algorithm you would like to implement for this project. You are also welcome to introduce a novel idea as an extension of a published algorithm.
- Option B: select a topic identification algorithm for which there is an existing implementation. Formulate a compelling application involving the topic ID technique as an important component.
2. Prepare a brief proposal (no more than one page) summarizing the algorithm or application, potential advantages of the technique, possible ways to evaluate your technique, your sources, and any other information you think will clarify your thinking so that you can get to implementation quickly. Submit the proposal early for an easy early bonus!
3. Preferably, you will implement your project in the class codebase, so that we can make further use of your algorithm in the future. If you want an exception to this requirement, please consult with the instructor. You should have a working version of the class codebase as a result of your work on the earlier projects. If necessary, consult the following directions to get going:
4. You will need a copy of two data sets from the following list: the reduced Reuters data set, the (labeled subset of the) Enron data set, the reduced set of 20 Newsgroups, or the reduced del.icio.us set. Remember: You are not allowed to redistribute the Reuters data.
5. As necessary, retrieve the latest copy of the data from the following URLs into directories adjacent to your code, and follow the directions given on the above wiki page to unpack the data directory. Note that the data directories now require authentication.
6. For this project, you will submit all of your code.
Background
Topic identification is about the discovery of topics in documents. Among the possible techniques for topic identification are LSA, pLSI, LDA, and the variations on LDA to be found in the literature and to be discussed in class. Applications include dimensionality reduction, text mining, automatic tagging (like the manual tagging in Gmail and blogs), etc.
Propose something realistically doable in the two weeks you have to work on this. Your instructor and TA will give you quick advice on the proposal to help you succeed. Once you have the green light on your proposal, be sure to incorporate the feedback you received into your final project.
You have a lot of freedom in how to proceed with this project. Be sure to make reasonable design choices, describe your choices, and advocate for them. In your project report, be sure to document your algorithms, your sources, your initialization methods, and any other insights you think will help to communicate the approach you have implemented.
Data
In this project, you will work with two data sets, chosen from the four available. The Enron email data set is the newest addition. As you may recall, the entire Enron data set is a large corpus of email that was subpoenaed by the Federal Energy Regulatory Commission (FERC) in the wake of the Enron collapse. The subset provided to you has been labeled, although you will not need the labels. The second set is 1990s news data from Reuters; you are not allowed to redistribute this data per the license terms. Third, you may work with the given reduced set of the 20 Newsgroups data. Finally, you may work with the del.icio.us data set from our previous projects. You choose which two of these four you want to work with. Your project should not look at the labels of any of these data sets.
Quantitative Evaluation
You should define and implement at least one sensible quantitative metric for evaluating your project. You may find that some of the ideas behind our clustering metrics are reusable for your purposes. You should also justify the metric and convince your readers that it can be trusted. You will use the metric(s) to evaluate the topic identification technique or the application you have implemented. Do so on both of the data sets you have selected. Report your results and discuss the implications.
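As an illustration only, one metric that appears often in the topic modeling literature is perplexity on held-out documents. The following is a minimal sketch, not a required or recommended choice. It assumes your model can supply P(w | z) and, for each evaluated document, P(z | d); the names `perplexity`, `docs`, `doc_topic`, and `topic_word` are hypothetical and are not part of the class codebase.

```python
import math


def perplexity(docs, doc_topic, topic_word):
    """Perplexity of tokenized documents under a mixture-of-topics model.

    docs:       list of documents, each a list of integer word ids
    doc_topic:  doc_topic[d][z] = P(z | d)   (assumed model output)
    topic_word: topic_word[z][w] = P(w | z)  (assumed model output)
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # P(w | d) = sum over topics z of P(z | d) * P(w | z)
            p_w = sum(p_z * topic_word[z][w]
                      for z, p_z in enumerate(doc_topic[d]))
            log_likelihood += math.log(p_w)
            n_tokens += 1
    # Lower is better; a uniform model over V words scores exactly V.
    return math.exp(-log_likelihood / n_tokens)
```

Note that this sketch requires every evaluated word to have nonzero probability, so the model's distributions need to be smoothed. How P(z | d) is obtained for held-out documents is a design choice you would need to describe and defend in your report.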
For algorithms whose results depend on random initialization or sampling (like EM), be sure to run at least three times and report averages, as we have discussed in class.
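The averaging step can be sketched as follows. This is a minimal example under stated assumptions: `run_experiment` is a hypothetical function standing in for "train your model with the given random number generator, then compute your metric", and `toy_experiment` is only a placeholder so that the sketch runs on its own.

```python
import random
import statistics


def evaluate_over_seeds(run_experiment, seeds=(0, 1, 2)):
    """Run the experiment once per seed; return (mean, stdev, scores)."""
    # A separate, explicitly seeded generator per run keeps runs reproducible.
    scores = [run_experiment(random.Random(seed)) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores), scores


def toy_experiment(rng):
    # Placeholder for a real train-and-evaluate run (hypothetical).
    return 10.0 + rng.random()
```

Reporting the standard deviation alongside the mean lets readers judge whether a difference between two configurations is larger than the run-to-run variation. Fixing the seeds makes your reported numbers reproducible from the submitted code.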
Qualitative Evaluation
You should engage in qualitative evaluation on one of the data sets you employ. Qualitative evaluation aims to answer the question “are the results of this program any good?” In other words, “would this be useful to someone?” Your approach to qualitative evaluation should go considerably beyond a casual inspection or anecdotal evidence. Find a way to characterize trends and patterns.
If you are evaluating a topic identification technique, then report the top N words according to P(w | z), where z is a topic.
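Extracting the top N words per topic might look like the sketch below. It assumes the model exposes P(w | z) as a list of per-topic probability lists indexed by word id, plus a vocabulary list mapping ids to strings; `top_words`, `topic_word`, and `vocab` are hypothetical names.

```python
def top_words(topic_word, vocab, n=10):
    """For each topic z, return the n (word, P(w | z)) pairs with highest probability."""
    result = []
    for probs in topic_word:
        # Rank word ids by descending probability under this topic.
        ranked = sorted(range(len(probs)), key=lambda w: probs[w], reverse=True)
        result.append([(vocab[w], probs[w]) for w in ranked[:n]])
    return result
```

Including the probabilities next to the words in your report shows how sharply each topic is peaked, which a bare word list hides.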
You might also answer the following kinds of questions:
- Q: To what degree are the topics internally coherent?
- Q: To what degree are the topics distinct from one another?
- Q: Can you give a name to each topic, based on your inspection?
- Q: What clues lead you to these conclusions?
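One possible way to put a number on how distinct the topics are is to compare their word distributions pairwise. The sketch below uses the Jensen-Shannon divergence, which is one choice among several and is an assumption of this example rather than a requirement. The names `js_divergence`, `pairwise_distinctness`, and `topic_word` are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pairwise_distinctness(topic_word):
    """Map each topic pair (i, j) with i < j to its JS divergence, a value in [0, 1]."""
    k = len(topic_word)
    return {(i, j): js_divergence(topic_word[i], topic_word[j])
            for i in range(k) for j in range(i + 1, k)}
```

A value near 0 means two topics are nearly identical, and a value near 1 means they share almost no probability mass. Numbers like these support, but do not replace, your own reading of the topics.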
Finally, for the data set you are focusing on here, consider the quantitative metric together with your qualitative assessment:
- Q: What trends do you observe?
- Q: To what degree does the metric agree with your qualitative inspection of the clusters?
- Q: To what degree do the metrics correlate with one another?
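If you have scores from two metrics over the same set of runs or configurations, one simple way to check how they relate is the Pearson correlation coefficient. The following is a minimal sketch; `pearson` is a hypothetical helper name, and the choice of Pearson over a rank-based measure is an assumption of this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Undefined (division by zero) if either list is constant.
    return cov / (spread_x * spread_y)
```

With only a handful of runs, a correlation coefficient is weak evidence, so treat it as a prompt for closer inspection rather than a conclusion.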
Report
Turn in a clear, well-structured report that discusses your implementations, describes your experiments with appropriate tables or graphs and accompanying interpretation, includes examples, and shares any other insights you think will be helpful in interpreting your work. There is no set length requirement, but I estimate a ballpark of 5-6 pages for the report. Your report will be graded based on the rubric presented on the following page. You will also present your work to the class on the final days of class.
Rubric
Project #5: Topic Identification
Name ______________________ Date _______________________
100 points total:
- ______ of 25: Pseudo-code and discussion of your topic identification algorithm or application; examples of the performance of the algorithm/application
- ______ of 20: Detailed explanation of your quantitative metric
- ______ of 15: Experimental results from your algorithm on chosen data set #1
- ______ of 15: Experimental results from your algorithm on chosen data set #2
- ______ of 15: Qualitative evaluation of your project and discussion of relevant questions
- ______ of 10: Submission of code for your algorithm with instructions for running an experiment on the data sets
Total: ______ of 100
Other Feedback:
Notes:
- ______ Early credit earned on this project
- ______ Late days used on this project
- ______ Total late days remaining as of the grading of this project
