James Pustejovsky, Amber Stubbs

How to Develop Language Annotations for Machine Learning Algorithms

Date: This event took place live on October 16, 2012

Presented by: James Pustejovsky, Amber Stubbs

Duration: Approximately 60 minutes.

Cost: Free

Text-based data mining and information extraction systems that make use of machine learning techniques require annotated datasets for training the algorithms. In this webcast presented by James Pustejovsky and Amber Stubbs, we will discuss the steps involved in creating your own training corpus for such machine learning algorithms. We walk you through:

  • The annotation cycle
  • Selecting an annotation task
  • Creating the annotation specification
  • Designing the guidelines
  • Creating a "gold standard" corpus
  • Beginning the actual data creation with the annotation process

We then mention the most relevant machine learning algorithms for natural language data and tasks, and provide hints for how to choose the right one for your learning task and your own dataset.

Finally, we discuss testing and evaluation of the algorithm, along with suggestions for how to revise your system depending on the resulting performance. This is a unique, up-close, step-by-step look at the entire development cycle for NLP system design, from your initial idea, to spec, through annotation and corpus development, to training and testing your algorithm. Don't miss this informative webcast.

About James Pustejovsky

James Pustejovsky holds the TJX/Felberg Chair in Computer Science at Brandeis University, where he directs the Lab for Linguistics and Computation, and chairs both the Program in Language and Linguistics and the Computational Linguistics MA Program. He has conducted research in computational linguistics, AI, lexical semantics, temporal reasoning, corpus linguistics, and language annotation. He is currently head of a working group within ISO/TC37/SC4 to develop a Semantic Annotation Framework, and is the author of the recently approved ISO specification for time annotation (SemAF-Time, ISO-TimeML) and the draft specification for space annotation (SemAF-Space, ISO-Space). Pustejovsky was PI of a large NSF-funded effort, "Towards a Comprehensive Linguistic Annotation of Language," that involved merging several diverse linguistic annotations (PropBank, NomBank, the Discourse Treebank, TimeBank, and Opinion Corpus) into a unified representation. Currently, he is Co-PI of a major NSF-funded project to address interoperability for NLP data and tools. He has taught computational linguistics to both graduate and undergraduate students for 20 years, and corpus linguistics for eight years.

About Amber Stubbs

Amber Stubbs recently completed her Ph.D. in Computer Science at Brandeis University, and is currently a Postdoctoral Associate at SUNY Albany. Her dissertation focused on creating an annotation methodology to aid in extracting high-level information from natural language files, particularly biomedical texts. Her website can be found at http://pages.cs.brandeis.edu/~astubbs/

Questions? Please send email to