Projects:ALFA

Active Learning for Annotation

About the project

Linguistically annotated corpora have proven useful in many applications in Natural Language Processing and in the Humanities. Resources for extensive human annotation are often lacking, so automatic annotation is frequently required. But what should we do when we have too little annotated data from which to train an automatic annotator? Members of the ALFA project are implementing a system that relies on minimal amounts of hand-annotated data provided by human experts within the framework of active learning. Active learning invites human experts to help the system improve its annotation ability by labeling the data that the system deems especially useful, while the easy cases are typically left to the machine. The learned model then improves substantially as these examples are added. One of our primary contributions has been to make the active learner sensitive to the predicted cost that each annotation incurs for the expert.

Publications

Active Learning for Part-of-Speech Tagging: Accelerating Corpus Annotation

  • Eric Ringger, Peter McClanahan, Robbie Haertel, George Busby, Marc Carmen, James Carroll, Kevin Seppi, Deryle Lonsdale
  • ACL 2007 Linguistic Annotation Workshop (LAW)
In the construction of a part-of-speech annotated corpus, we are constrained by a fixed budget. A fully annotated corpus is required, but we can afford to label only a subset. We train a Maximum Entropy Markov Model tagger from a labeled subset and automatically tag the remainder. This paper addresses the question of where to focus our manual tagging efforts in order to deliver an annotation of highest quality. In this context, we find that active learning is always helpful. We focus on Query by Uncertainty (QBU) and Query by Committee (QBC) and report on experiments with several baselines and new variations of QBC and QBU, inspired by weaknesses particular to their use in this application. Experiments on English prose and poetry test these approaches and evaluate their robustness. The results allow us to make recommendations for both types of text and raise questions that will lead to further inquiry.
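The two selection strategies named in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names are hypothetical, and it assumes that sentence-level scores are obtained by summing per-token entropies, which is only one of several possible scoring choices.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def qbu_score(tag_distributions):
    """Query by Uncertainty (assumed form): sum over tokens of the entropy
    of a single model's tag distribution for that token."""
    return sum(entropy(dist) for dist in tag_distributions)


def qbc_score(committee_taggings):
    """Query by Committee (assumed form): sum over tokens of the vote
    entropy among the tag sequences proposed by the committee members."""
    n_members = len(committee_taggings)
    score = 0.0
    for votes in zip(*committee_taggings):  # one tuple of votes per token
        counts = Counter(votes)
        score += entropy([c / n_members for c in counts.values()])
    return score


def select_next(candidates, score_fn):
    """Pick the unlabeled candidate with the highest score."""
    return max(candidates, key=score_fn)
```

In this sketch, `qbu_score` takes one list of tag probabilities per token, while `qbc_score` takes one complete tag sequence per committee member. Either score can be passed to `select_next` to choose the sentence sent to the human annotator.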

Modeling the Annotation Process for Ancient Corpus Creation

  • James L. Carroll, Robbie Haertel, Peter McClanahan, Eric Ringger, Kevin Seppi
  • Electronic Corpora of Ancient Languages 2007
The ideas in this paper arose from a project to develop an electronic corpus and concordance of ancient Syriac literature. We will use this project to illustrate many of the ideas in this paper. The Syriac project at Brigham Young University involves many individuals from several departments, including Linguistics, Computer Science, and the Center for the Preservation of Ancient Religious Texts. The project team also includes scholars from Oxford and Princeton Universities. Syriac texts have been transcribed manually by teams of Maronite, West Syrian and East Syrian Christians and monks located in Lebanon, Rome, Iraq, Chicago and Oxford. The proximate goal of this project is to produce a corpus tagged with part-of-speech data for the writings of the fourth-century Syriac poet-theologian Ephrem the Syrian (d. 373). This initial corpus is approximately half a million words in size. A further four million words have been added to the corpus in draft format. These texts originate from the third to the thirteenth century. However, the majority of the texts are from the fourth to the seventh centuries, the so-called Classical period of Syriac literature. It is the long-term aim of the project to build a comprehensive corpus of Syriac literature, working diachronically through the available texts. Much of Syriac literature has already been published, and these published texts are used in the corpus. However, a great deal of Syriac literature is available only in manuscripts. It is impossible to precisely estimate the size of the corpus; however, it is not improbable that the corpus extends to over 30,000,000 words.
We do not have the resources to fully annotate a corpus of this size with morphological tags. We are taking a pragmatic approach to annotating texts for the corpus. The first stage is to prepare a draft transcription with machine annotation. Texts will then be proofread and annotated by hand as scholarly interest is raised to a sufficiently high level to complete the work. Many texts in the corpus may never be fully proofread or annotated. Some text collections, beginning with Ephrem, will, however, be thoroughly proofed and tagged, sufficient to produce a full print concordance. A higher level of accuracy will be required for the print portion of the corpus than for the remainder of the corpus which will be published on the internet. (James L. Carroll et al.)
The production of electronic corpora for ancient languages involves several “annotation” tasks. Transcription, morphological and part-of-speech tagging, grammatical parsing, and semantic tagging can all be seen as annotation tasks. For example, in transcription the user takes an image and labels (or annotates) the image with transcribed text. In part-of-speech tagging the user takes a transcribed text and annotates the text with parts of speech, and so on. Thus annotation is central to each step in the creation of a useful electronic corpus. The goal of our part of the Syriac literature project is to reduce human annotation cost as much as possible through the appropriate use of machine learning and active learning techniques. We also seek to achieve lower error rates than could be achieved through human annotation alone and to appropriately balance the value of annotator time on the print corpus with the value of annotator time on the internet corpus.

Assessing the Costs of Machine-Assisted Corpus Annotation through a User Study

  • Eric Ringger, Marc Carmen, Robbie Haertel, Kevin Seppi, Deryle Lonsdale, Peter McClanahan, James Carroll, Noel Ellison
  • LREC 2008
Fixed, limited budgets often constrain the amount of expert annotation that can go into the construction of annotated corpora. Estimating the cost of annotation is the first step toward using annotation resources wisely. We present here a study of the cost of annotation. This study includes the participation of annotators at various skill levels and with varying backgrounds. Conducted over the web, the study consists of tests that simulate machine-assisted pre-annotation, requiring correction by the annotator rather than annotation from scratch. The study also includes tests representative of an annotation scenario involving Active Learning as it progresses from a naïve model to a knowledgeable model; in particular, annotators encounter pre-annotation of varying degrees of accuracy. The annotation interface lists tags considered likely by the annotation model in preference to other tags. We present the experimental parameters of the study and report both descriptive and inferential statistics on the results of the study. We conclude with a model for estimating the hourly cost of annotation for annotators of various skill levels. We also present models for two granularities of annotation: sentence at a time and word at a time.

Assessing the Costs of Sampling Methods in Active Learning for Annotation

  • Robbie Haertel, Eric Ringger, Kevin Seppi, James Carroll, Peter McClanahan
  • ACL 2008
Traditional Active Learning (AL) techniques assume that the annotation of each datum costs the same. This is not the case when annotating sequences; some sequences will take longer than others. We show that the AL technique which performs best depends on how cost is measured. Applying an hourly cost model based on the results of an annotation user study, we approximate the amount of time necessary to annotate a given sentence. This model allows us to evaluate the effectiveness of AL sampling methods in terms of time spent in annotation. We achieve a 77% reduction in hours from a random baseline to achieve 96.5% tag accuracy on the Penn Treebank. More significantly, we make the case for measuring cost in assessing AL methods.
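To make the idea of measuring cost in time rather than in number of items concrete, here is a minimal sketch of a per-sentence cost estimate. It assumes a linear model in sentence length and in the number of pre-annotated tags the annotator must correct. The class name and the coefficients are hypothetical placeholders, not the values fitted from the user study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    # Hypothetical placeholder coefficients, NOT the values from the paper.
    seconds_per_token: float = 4.0
    seconds_per_correction: float = 5.0
    overhead_seconds: float = 12.0

    def hours(self, n_tokens, n_corrections):
        """Estimated annotation time for one sentence, in hours."""
        seconds = (
            self.seconds_per_token * n_tokens
            + self.seconds_per_correction * n_corrections
            + self.overhead_seconds
        )
        return seconds / 3600.0


def total_hours(model, batch):
    """Sum the estimated cost over (n_tokens, n_corrections) pairs."""
    return sum(model.hours(n, c) for n, c in batch)
```

Under such a model, two sentences that each count as one item can differ widely in estimated cost. Learning curves plotted against `total_hours` can therefore rank sampling methods differently from curves plotted against the number of sentences.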

Return on Investment for Active Learning

  • Robbie A. Haertel, Kevin D. Seppi, Eric K. Ringger, James L. Carroll
  • NIPS 2008 Workshop on Cost-Sensitive Learning
Active Learning (AL) can be defined as a selectively supervised learning protocol intended to present to an oracle for labeling those data which will be most enlightening for machine learning. While AL traditionally accounts for the value of the information obtained, it often ignores the cost of obtaining the information, thus causing it to perform sub-optimally with respect to total cost. We present a framework for AL that accounts for this cost and discuss optimality and tractability in this framework. Using this framework, we motivate Return On Investment (ROI), a practical, cost-sensitive heuristic that can be used to convert existing algorithms into cost-conscious active learners. We demonstrate the validity of ROI in a simulated AL part-of-speech tagging task on the Penn Treebank in which ROI achieves as high as a 73% reduction in hourly cost over random selection.
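One simple reading of a return-on-investment heuristic is to divide an existing selection score (the estimated benefit) by the estimated annotation cost. The sketch below follows that reading. The exact definition used in the paper may differ, and `roi_rank`, `benefit_fn`, and `cost_fn` are hypothetical names introduced only for illustration.

```python
def roi_rank(candidates, benefit_fn, cost_fn):
    """Rank candidates by estimated benefit per unit of estimated cost.

    benefit_fn: any existing active-learning score (e.g. an uncertainty score).
    cost_fn: estimated annotation cost, e.g. in hours; must be positive.
    """
    def roi(item):
        cost = cost_fn(item)
        if cost <= 0:
            raise ValueError("estimated cost must be positive")
        return benefit_fn(item) / cost

    return sorted(candidates, key=roi, reverse=True)


# Illustrative data only: a long, uncertain sentence and a short one.
candidates = [
    {"id": "long", "uncertainty": 6.0, "hours": 0.03},
    {"id": "short", "uncertainty": 3.0, "hours": 0.01},
]
ranked = roi_rank(
    candidates,
    benefit_fn=lambda s: s["uncertainty"],
    cost_fn=lambda s: s["hours"],
)
```

Ranking by uncertainty alone would choose the long sentence first. Dividing by cost favors the short sentence, because it offers more estimated benefit per hour of annotator time.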

NAACL HLT 2009 Workshop on Active Learning for NLP

  • Organized by: Eric Ringger, Robbie Haertel, Katrin Tomanek


Questions?

Please contact Eric Ringger or Kevin Seppi, or visit the Natural Language Processing research lab in room 3346 TMCB.
