Natural Language Annotation for Machine Learning

by James Pustejovsky, Amber Stubbs

October 2012

Beginner to intermediate

342 pages

9h 55m

English

O'Reilly Media, Inc.

Preface
Natural Language Annotation for Machine Learning | Audience | Organization of This Book | Software Requirements | Conventions Used in This Book | Using Code Examples | Safari® Books Online | How to Contact Us | Acknowledgments | James Adds: | Amber Adds:
1. The Basics
The Importance of Language Annotation | The Layers of Linguistic Description | What Is Natural Language Processing? | A Brief History of Corpus Linguistics | What Is a Corpus? | Early Use of Corpora | Corpora Today | Kinds of Annotation | Language Data and Machine Learning | Classification | Clustering | Structured Pattern Induction | The Annotation Development Cycle | Model the Phenomenon | Annotate with the Specification | Train and Test the Algorithms over the Corpus | Evaluate the Results | Revise the Model and Algorithms | Summary
2. Defining Your Goal and Dataset
Defining Your Goal | The Statement of Purpose | Refining Your Goal: Informativity Versus Correctness | The scope of the annotation task | What will the annotation be used for? | What will the overall outcome be? | Where will the corpus come from? | How will the result be achieved? | Background Research | Language Resources | Organizations and Conferences | NLP Challenges | Assembling Your Dataset | The Ideal Corpus: Representative and Balanced | Collecting Data from the Internet | Eliciting Data from People | Read speech | Spontaneous speech | The Size of Your Corpus | Existing Corpora | Distributions Within Corpora | Summary
3. Corpus Analytics
Basic Probability for Corpus Analytics | Joint Probability Distributions | Bayes Rule | Counting Occurrences | Zipf’s Law | N-grams | Language Models | Summary
4. Building Your Model and Specification
Some Example Models and Specs | Film Genre Classification | Adding Named Entities | Semantic Roles | Adopting (or Not Adopting) Existing Models | Creating Your Own Model and Specification: Generality Versus Specificity | Using Existing Models and Specifications | Using Models Without Specifications | Different Kinds of Standards | ISO Standards | Annotation format standards | Annotation specification standards | Community-Driven Standards | Other Standards Affecting Annotation | Summary
5. Applying and Adopting Annotation Standards
Metadata Annotation: Document Classification | Unique Labels: Movie Reviews | Multiple Labels: Film Genres | Text Extent Annotation: Named Entities | Inline Annotation | Stand-off Annotation by Tokens | Stand-off Annotation by Character Location | Linked Extent Annotation: Semantic Roles | ISO Standards and You | Summary
6. Annotation and Adjudication
The Infrastructure of an Annotation Project | Specification Versus Guidelines | Be Prepared to Revise | Preparing Your Data for Annotation | Metadata | Preprocessed Data | Splitting Up the Files for Annotation | Writing the Annotation Guidelines | Example 1: Single Labels—Movie Reviews | Example 2: Multiple Labels—Film Genres | Example 3: Extent Annotations—Named Entities | Example 4: Link Tags—Semantic Roles | Annotators | Choosing an Annotation Environment | Evaluating the Annotations | Cohen’s Kappa (κ) | Fleiss’s Kappa (κ) | Interpreting Kappa Coefficients | Calculating κ in Other Contexts | Creating the Gold Standard (Adjudication) | Summary
7. Training: Machine Learning
What Is Learning? | Defining Our Learning Task | Classifier Algorithms | Decision Tree Learning | Gender Identification | Naïve Bayes Learning | Movie genre identification | Sentiment classification | Maximum Entropy Classifiers | Other Classifiers to Know About | Sequence Induction Algorithms | Clustering and Unsupervised Learning | Semi-Supervised Learning | Matching Annotation to Algorithms | Summary
8. Testing and Evaluation
Testing Your Algorithm | Evaluating Your Algorithm | Confusion Matrices | Calculating Evaluation Scores | Percentage accuracy | Precision and recall | F-measure | Other evaluation metrics | Interpreting Evaluation Scores | Problems That Can Affect Evaluation | Dataset Is Too Small | Algorithm Fits the Development Data Too Well | Too Much Information in the Annotation | Final Testing Scores | Summary
9. Revising and Reporting
Revising Your Project | Corpus Distributions and Content | Model and Specification | Annotation | Guidelines | Annotators | Tools | Training and Testing | Reporting About Your Work | About Your Corpus | About Your Model and Specifications | About Your Annotation Task and Annotators | About Your ML Algorithm | About Your Revisions | Summary
10. Annotation: TimeML
The Goal of TimeML | Related Research | Building the Corpus | Model: Preliminary Specifications | Times | Signals | Events | Links | Annotation: First Attempts | Model: The TimeML Specification Used in TimeBank | Time Expressions | Events | Signals | Links | Confidence | Annotation: The Creation of TimeBank | TimeML Becomes ISO-TimeML | Modeling the Future: Directions for TimeML | Narrative Containers | Expanding TimeML to Other Domains | Event Structures | Summary
11. Automatic Annotation: Generating TimeML
The TARSQI Components | GUTime: Temporal Marker Identification | EVITA: Event Recognition and Classification | GUTenLINK | Slinket | SputLink | Machine Learning in the TARSQI Components | Improvements to the TTK | Structural Changes | Improvements to Temporal Entity Recognition: BTime | Temporal Relation Identification | Temporal Relation Validation | Temporal Relation Visualization | TimeML Challenges: TempEval-2 | TempEval-2: System Summaries | Overview of Results | Future of the TTK | New Input Formats | Narrative Containers/Narrative Times | Medical Documents | Cross-Document Analysis | Summary
12. Afterword: The Future of Annotation
Crowdsourcing Annotation | Amazon’s Mechanical Turk | Games with a Purpose (GWAP) | User-Generated Content | Handling Big Data | Boosting | Active Learning | Semi-Supervised Learning | NLP Online and in the Cloud | Distributed Computing | Shared Language Resources | Shared Language Applications | And Finally...
A. List of Available Corpora and Specifications
Corpora | Specifications, Guidelines, and Other Resources | Representation Standards
B. List of Software Resources
Annotation and Adjudication Software | Multipurpose Tools | Corpus Creation and Exploration Tools | Manual Annotation Tools | Automated Annotation Tools | Multipurpose tools | Phonetic annotation | Part-of-speech taggers/syntactic parsers | Tokenizers/chunkers/stemmers | Other | Machine Learning Resources
C. MAE User Guide
Installing and Running MAE | Loading Tasks and Files | Loading a Task | Loading a File | Annotating Entities | Attribute information | Nonconsuming tags | Annotating Links | Deleting Tags | Saving Files | Defining Your Own Task | Task Name | Elements (a.k.a. Tags) | Attributes | id attributes | start attribute | Attribute types | Default attribute values | Frequently Asked Questions
D. MAI User Guide
Installing and Running MAI | Loading Tasks and Files | Loading a Task | Loading Files | Adjudicating | The MAI Window | Adjudicating a Tag | Extent Tags | Link Tags | Nonconsuming Tags | Adding New Tags | Deleting tags | Saving Files
E. Bibliography
References for Using Amazon’s Mechanical Turk/Crowdsourcing
Index
About the Authors
Colophon
Copyright

Content preview from Natural Language Annotation for Machine Learning

Chapter 11. Automatic Annotation: Generating TimeML

As you can see from the preceding chapter, modeling events, times, and their temporal relationships in an annotation is a large and complicated task. This chapter discusses the TARSQI Toolkit, as well as other systems that were created to generate TimeML as part of the TempEval-2 challenge held in 2010. In this chapter, we will:

Discuss how a complicated annotation can be broken down into different components for easier processing
Provide an in-depth discussion of the first attempt to build a system for generating TimeML
Show examples of how that system has been improved over the years
Explain the approaches taken by other systems designed to create TimeML
Discuss the differences between rule-based and machine learning (ML) systems for complex annotation tasks
Provide examples of ways that the TARSQI Toolkit could be expanded in the future

Overall, in this chapter we won’t be going into detail about how each aspect of TimeML was automated; rather, we will provide a breakdown of how the task was approached, and give a sense of some of the different options available for tackling a complicated annotation.

Note

The TARSQI Toolkit is not the creation of a single person, and we would like to acknowledge all of the people who have contributed to its creation and improvement (in alphabetical order): Alex Baron, Russell Entrikin, Catherine Havasi, Jerry Hobbs, Seo-Hyun Im, Seok Bae Jang, Bob Knippen, Inderjeet Mani, Jessica ...

Publisher Resources

ISBN: 9781449332693
