Natural Language Annotation for Machine Learning

by James Pustejovsky, Amber Stubbs

October 2012

Beginner to intermediate

342 pages

9h 55m

English

O'Reilly Media, Inc.

Preface
Natural Language Annotation for Machine Learning | Audience | Organization of This Book | Software Requirements | Conventions Used in This Book | Using Code Examples | Safari® Books Online | How to Contact Us | Acknowledgments | James Adds: | Amber Adds:
1. The Basics
The Importance of Language Annotation | The Layers of Linguistic Description | What Is Natural Language Processing? | A Brief History of Corpus Linguistics | What Is a Corpus? | Early Use of Corpora | Corpora Today | Kinds of Annotation | Language Data and Machine Learning | Classification | Clustering | Structured Pattern Induction | The Annotation Development Cycle | Model the Phenomenon | Annotate with the Specification | Train and Test the Algorithms over the Corpus | Evaluate the Results | Revise the Model and Algorithms | Summary
2. Defining Your Goal and Dataset
Defining Your Goal | The Statement of Purpose | Refining Your Goal: Informativity Versus Correctness | The scope of the annotation task | What will the annotation be used for? | What will the overall outcome be? | Where will the corpus come from? | How will the result be achieved? | Background Research | Language Resources | Organizations and Conferences | NLP Challenges | Assembling Your Dataset | The Ideal Corpus: Representative and Balanced | Collecting Data from the Internet | Eliciting Data from People | Read speech | Spontaneous speech | The Size of Your Corpus | Existing Corpora | Distributions Within Corpora | Summary
3. Corpus Analytics
Basic Probability for Corpus Analytics | Joint Probability Distributions | Bayes Rule | Counting Occurrences | Zipf’s Law | N-grams | Language Models | Summary
4. Building Your Model and Specification
Some Example Models and Specs | Film Genre Classification | Adding Named Entities | Semantic Roles | Adopting (or Not Adopting) Existing Models | Creating Your Own Model and Specification: Generality Versus Specificity | Using Existing Models and Specifications | Using Models Without Specifications | Different Kinds of Standards | ISO Standards | Annotation format standards | Annotation specification standards | Community-Driven Standards | Other Standards Affecting Annotation | Summary
5. Applying and Adopting Annotation Standards
Metadata Annotation: Document Classification | Unique Labels: Movie Reviews | Multiple Labels: Film Genres | Text Extent Annotation: Named Entities | Inline Annotation | Stand-off Annotation by Tokens | Stand-off Annotation by Character Location | Linked Extent Annotation: Semantic Roles | ISO Standards and You | Summary
6. Annotation and Adjudication
The Infrastructure of an Annotation Project | Specification Versus Guidelines | Be Prepared to Revise | Preparing Your Data for Annotation | Metadata | Preprocessed Data | Splitting Up the Files for Annotation | Writing the Annotation Guidelines | Example 1: Single Labels—Movie Reviews | Example 2: Multiple Labels—Film Genres | Example 3: Extent Annotations—Named Entities | Example 4: Link Tags—Semantic Roles | Annotators | Choosing an Annotation Environment | Evaluating the Annotations | Cohen’s Kappa (κ) | Fleiss’s Kappa (κ) | Interpreting Kappa Coefficients | Calculating κ in Other Contexts | Creating the Gold Standard (Adjudication) | Summary
7. Training: Machine Learning
What Is Learning? | Defining Our Learning Task | Classifier Algorithms | Decision Tree Learning | Gender Identification | Naïve Bayes Learning | Movie genre identification | Sentiment classification | Maximum Entropy Classifiers | Other Classifiers to Know About | Sequence Induction Algorithms | Clustering and Unsupervised Learning | Semi-Supervised Learning | Matching Annotation to Algorithms | Summary
8. Testing and Evaluation
Testing Your Algorithm | Evaluating Your Algorithm | Confusion Matrices | Calculating Evaluation Scores | Percentage accuracy | Precision and recall | F-measure | Other evaluation metrics | Interpreting Evaluation Scores | Problems That Can Affect Evaluation | Dataset Is Too Small | Algorithm Fits the Development Data Too Well | Too Much Information in the Annotation | Final Testing Scores | Summary
9. Revising and Reporting
Revising Your Project | Corpus Distributions and Content | Model and Specification | Annotation | Guidelines | Annotators | Tools | Training and Testing | Reporting About Your Work | About Your Corpus | About Your Model and Specifications | About Your Annotation Task and Annotators | About Your ML Algorithm | About Your Revisions | Summary

10. Annotation: TimeML
The Goal of TimeML | Related Research | Building the Corpus | Model: Preliminary Specifications | Times | Signals | Events | Links | Annotation: First Attempts | Model: The TimeML Specification Used in TimeBank | Time Expressions | Events | Signals | Links | Confidence | Annotation: The Creation of TimeBank | TimeML Becomes ISO-TimeML | Modeling the Future: Directions for TimeML | Narrative Containers | Expanding TimeML to Other Domains | Event Structures | Summary
11. Automatic Annotation: Generating TimeML
The TARSQI Components | GUTime: Temporal Marker Identification | EVITA: Event Recognition and Classification | GUTenLINK | Slinket | SputLink | Machine Learning in the TARSQI Components | Improvements to the TTK | Structural Changes | Improvements to Temporal Entity Recognition: BTime | Temporal Relation Identification | Temporal Relation Validation | Temporal Relation Visualization | TimeML Challenges: TempEval-2 | TempEval-2: System Summaries | Overview of Results | Future of the TTK | New Input Formats | Narrative Containers/Narrative Times | Medical Documents | Cross-Document Analysis | Summary
12. Afterword: The Future of Annotation
Crowdsourcing Annotation | Amazon’s Mechanical Turk | Games with a Purpose (GWAP) | User-Generated Content | Handling Big Data | Boosting | Active Learning | Semi-Supervised Learning | NLP Online and in the Cloud | Distributed Computing | Shared Language Resources | Shared Language Applications | And Finally...
A. List of Available Corpora and Specifications
Corpora | Specifications, Guidelines, and Other Resources | Representation Standards
B. List of Software Resources
Annotation and Adjudication Software | Multipurpose Tools | Corpus Creation and Exploration Tools | Manual Annotation Tools | Automated Annotation Tools | Multipurpose tools | Phonetic annotation | Part-of-speech taggers/syntactic parsers | Tokenizers/chunkers/stemmers | Other | Machine Learning Resources
C. MAE User Guide
Installing and Running MAE | Loading Tasks and Files | Loading a Task | Loading a File | Annotating Entities | Attribute information | Nonconsuming tags | Annotating Links | Deleting Tags | Saving Files | Defining Your Own Task | Task Name | Elements (a.k.a. Tags) | Attributes | id attributes | start attribute | Attribute types | Default attribute values | Frequently Asked Questions
D. MAI User Guide
Installing and Running MAI | Loading Tasks and Files | Loading a Task | Loading Files | Adjudicating | The MAI Window | Adjudicating a Tag | Extent Tags | Link Tags | Nonconsuming Tags | Adding New Tags | Deleting tags | Saving Files
E. Bibliography
References for Using Amazon’s Mechanical Turk/Crowdsourcing
Index
About the Authors
Colophon
Copyright

Content preview from Natural Language Annotation for Machine Learning

Chapter 12. Afterword: The Future of Annotation

In this book we have endeavored to give you a taste of what it’s like to go through the entire process of doing annotation for training machine learning (ML) algorithms. The MATTER development cycle provides a tested and well-understood methodology for all the steps required in this endeavor, but it doesn’t tell you everything there is to know about annotation. In this chapter we look toward the future of annotation projects and ML algorithms, and show you some ways that the field of Natural Language Processing (NLP) is changing, as well as how those changes can help (or hurt) your own annotation and ML projects.

Crowdsourcing Annotation

As you have learned from working your way through the MATTER cycle, annotation is an expensive and time-consuming task. Therefore, you want to maximize the utility of your corpus to make the most of the time and energy you put into your task.

One way that people have tried to reduce the cost of large annotation projects is crowdsourcing: making the task available to a large group of (usually untrained) people makes annotated data both cheaper and faster to obtain, because the work is spread across many contributors rather than a handful of selected annotators.

If the concept of crowdsourcing seems strange, think about asking your friends on Facebook to recommend a restaurant, or consider what happens when a famous person uses Twitter to ask her followers for ...

Publisher Resources

ISBN: 9781449332693
