Natural Language Annotation for Machine Learning

by James Pustejovsky, Amber Stubbs

October 2012

Beginner to intermediate

342 pages

9h 55m

English

O'Reilly Media, Inc.

Preface
Natural Language Annotation for Machine Learning · Audience · Organization of This Book · Software Requirements · Conventions Used in This Book · Using Code Examples · Safari® Books Online · How to Contact Us · Acknowledgments · James Adds: · Amber Adds:
1. The Basics
The Importance of Language Annotation · The Layers of Linguistic Description · What Is Natural Language Processing? · A Brief History of Corpus Linguistics · What Is a Corpus? · Early Use of Corpora · Corpora Today · Kinds of Annotation · Language Data and Machine Learning · Classification · Clustering · Structured Pattern Induction · The Annotation Development Cycle · Model the Phenomenon · Annotate with the Specification · Train and Test the Algorithms over the Corpus · Evaluate the Results · Revise the Model and Algorithms · Summary
2. Defining Your Goal and Dataset
Defining Your Goal · The Statement of Purpose · Refining Your Goal: Informativity Versus Correctness · The scope of the annotation task · What will the annotation be used for? · What will the overall outcome be? · Where will the corpus come from? · How will the result be achieved? · Background Research · Language Resources · Organizations and Conferences · NLP Challenges · Assembling Your Dataset · The Ideal Corpus: Representative and Balanced · Collecting Data from the Internet · Eliciting Data from People · Read speech · Spontaneous speech · The Size of Your Corpus · Existing Corpora · Distributions Within Corpora · Summary
3. Corpus Analytics
Basic Probability for Corpus Analytics · Joint Probability Distributions · Bayes Rule · Counting Occurrences · Zipf’s Law · N-grams · Language Models · Summary
4. Building Your Model and Specification
Some Example Models and Specs · Film Genre Classification · Adding Named Entities · Semantic Roles · Adopting (or Not Adopting) Existing Models · Creating Your Own Model and Specification: Generality Versus Specificity · Using Existing Models and Specifications · Using Models Without Specifications · Different Kinds of Standards · ISO Standards · Annotation format standards · Annotation specification standards · Community-Driven Standards · Other Standards Affecting Annotation · Summary
5. Applying and Adopting Annotation Standards
Metadata Annotation: Document Classification · Unique Labels: Movie Reviews · Multiple Labels: Film Genres · Text Extent Annotation: Named Entities · Inline Annotation · Stand-off Annotation by Tokens · Stand-off Annotation by Character Location · Linked Extent Annotation: Semantic Roles · ISO Standards and You · Summary
6. Annotation and Adjudication
The Infrastructure of an Annotation Project · Specification Versus Guidelines · Be Prepared to Revise · Preparing Your Data for Annotation · Metadata · Preprocessed Data · Splitting Up the Files for Annotation · Writing the Annotation Guidelines · Example 1: Single Labels—Movie Reviews · Example 2: Multiple Labels—Film Genres · Example 3: Extent Annotations—Named Entities · Example 4: Link Tags—Semantic Roles · Annotators · Choosing an Annotation Environment · Evaluating the Annotations · Cohen’s Kappa (κ) · Fleiss’s Kappa (κ) · Interpreting Kappa Coefficients · Calculating κ in Other Contexts · Creating the Gold Standard (Adjudication) · Summary
7. Training: Machine Learning
What Is Learning? · Defining Our Learning Task · Classifier Algorithms · Decision Tree Learning · Gender Identification · Naïve Bayes Learning · Movie genre identification · Sentiment classification · Maximum Entropy Classifiers · Other Classifiers to Know About · Sequence Induction Algorithms · Clustering and Unsupervised Learning · Semi-Supervised Learning · Matching Annotation to Algorithms · Summary
8. Testing and Evaluation
Testing Your Algorithm · Evaluating Your Algorithm · Confusion Matrices · Calculating Evaluation Scores · Percentage accuracy · Precision and recall · F-measure · Other evaluation metrics · Interpreting Evaluation Scores · Problems That Can Affect Evaluation · Dataset Is Too Small · Algorithm Fits the Development Data Too Well · Too Much Information in the Annotation · Final Testing Scores · Summary
9. Revising and Reporting
Revising Your Project · Corpus Distributions and Content · Model and Specification · Annotation · Guidelines · Annotators · Tools · Training and Testing · Reporting About Your Work · About Your Corpus · About Your Model and Specifications · About Your Annotation Task and Annotators · About Your ML Algorithm · About Your Revisions · Summary
10. Annotation: TimeML
The Goal of TimeML · Related Research · Building the Corpus · Model: Preliminary Specifications · Times · Signals · Events · Links · Annotation: First Attempts · Model: The TimeML Specification Used in TimeBank · Time Expressions · Events · Signals · Links · Confidence · Annotation: The Creation of TimeBank · TimeML Becomes ISO-TimeML · Modeling the Future: Directions for TimeML · Narrative Containers · Expanding TimeML to Other Domains · Event Structures · Summary
11. Automatic Annotation: Generating TimeML
The TARSQI Components · GUTime: Temporal Marker Identification · EVITA: Event Recognition and Classification · GUTenLINK · Slinket · SputLink · Machine Learning in the TARSQI Components · Improvements to the TTK · Structural Changes · Improvements to Temporal Entity Recognition: BTime · Temporal Relation Identification · Temporal Relation Validation · Temporal Relation Visualization · TimeML Challenges: TempEval-2 · TempEval-2: System Summaries · Overview of Results · Future of the TTK · New Input Formats · Narrative Containers/Narrative Times · Medical Documents · Cross-Document Analysis · Summary
12. Afterword: The Future of Annotation
Crowdsourcing Annotation · Amazon’s Mechanical Turk · Games with a Purpose (GWAP) · User-Generated Content · Handling Big Data · Boosting · Active Learning · Semi-Supervised Learning · NLP Online and in the Cloud · Distributed Computing · Shared Language Resources · Shared Language Applications · And Finally...
A. List of Available Corpora and Specifications
Corpora · Specifications, Guidelines, and Other Resources · Representation Standards
B. List of Software Resources
Annotation and Adjudication Software · Multipurpose Tools · Corpus Creation and Exploration Tools · Manual Annotation Tools · Automated Annotation Tools · Multipurpose tools · Phonetic annotation · Part-of-speech taggers/syntactic parsers · Tokenizers/chunkers/stemmers · Other · Machine Learning Resources
C. MAE User Guide
Installing and Running MAE · Loading Tasks and Files · Loading a Task · Loading a File · Annotating Entities · Attribute information · Nonconsuming tags · Annotating Links · Deleting Tags · Saving Files · Defining Your Own Task · Task Name · Elements (a.k.a. Tags) · Attributes · id attributes · start attribute · Attribute types · Default attribute values · Frequently Asked Questions
D. MAI User Guide
Installing and Running MAI · Loading Tasks and Files · Loading a Task · Loading Files · Adjudicating · The MAI Window · Adjudicating a Tag · Extent Tags · Link Tags · Nonconsuming Tags · Adding New Tags · Deleting tags · Saving Files
E. Bibliography
References for Using Amazon’s Mechanical Turk/Crowdsourcing
Index
About the Authors
Colophon
Copyright

Content preview from Natural Language Annotation for Machine Learning

Chapter 6. Annotation and Adjudication

Now that you have a corpus and a model, it’s time to start looking at the actual annotation process—the “A” in the MATTER cycle. Here is where you define the method by which your model is applied to your texts, both in theory (how your task is described to annotators) and in practice (what software and other tools are used to create the annotations). A critical part of this stage is adjudication—where you take your annotators’ work and use it to create the gold standard corpus that you will use for machine learning. In this chapter we will answer the following questions:

What are the components of an annotation task?
What is the difference between a model specification and annotation guidelines?
How do you create guidelines that fit your task?
What annotation tool should you use for your annotation task?
What skills do your annotators need to create your annotations?
How can you tell (qualitatively) if your annotation guidelines are good for your task?
What is involved in adjudicating the annotations?

The Infrastructure of an Annotation Project

It’s much easier to write annotation guidelines when you understand how annotation projects are usually run, so before getting into the details of guideline writing, we’re going to go over a few different ways that you can structure your annotation effort.

Currently, what we would call the “traditional” approach goes like this. Once a schema is developed and a corpus is collected, an investigator writes ...

Publisher Resources

ISBN: 9781449332693