Natural Language Annotation for Machine Learning

by James Pustejovsky, Amber Stubbs

October 2012

Beginner to intermediate

342 pages

9h 55m

English

O'Reilly Media, Inc.

Preface
Natural Language Annotation for Machine Learning · Audience · Organization of This Book · Software Requirements · Conventions Used in This Book · Using Code Examples · Safari® Books Online · How to Contact Us · Acknowledgments · James Adds: · Amber Adds:
1. The Basics
The Importance of Language Annotation · The Layers of Linguistic Description · What Is Natural Language Processing? · A Brief History of Corpus Linguistics · What Is a Corpus? · Early Use of Corpora · Corpora Today · Kinds of Annotation · Language Data and Machine Learning · Classification · Clustering · Structured Pattern Induction · The Annotation Development Cycle · Model the Phenomenon · Annotate with the Specification · Train and Test the Algorithms over the Corpus · Evaluate the Results · Revise the Model and Algorithms · Summary
2. Defining Your Goal and Dataset
Defining Your Goal · The Statement of Purpose · Refining Your Goal: Informativity Versus Correctness · The scope of the annotation task · What will the annotation be used for? · What will the overall outcome be? · Where will the corpus come from? · How will the result be achieved? · Background Research · Language Resources · Organizations and Conferences · NLP Challenges · Assembling Your Dataset · The Ideal Corpus: Representative and Balanced · Collecting Data from the Internet · Eliciting Data from People · Read speech · Spontaneous speech · The Size of Your Corpus · Existing Corpora · Distributions Within Corpora · Summary
3. Corpus Analytics
Basic Probability for Corpus Analytics · Joint Probability Distributions · Bayes Rule · Counting Occurrences · Zipf’s Law · N-grams · Language Models · Summary
4. Building Your Model and Specification
Some Example Models and Specs · Film Genre Classification · Adding Named Entities · Semantic Roles · Adopting (or Not Adopting) Existing Models · Creating Your Own Model and Specification: Generality Versus Specificity · Using Existing Models and Specifications · Using Models Without Specifications · Different Kinds of Standards · ISO Standards · Annotation format standards · Annotation specification standards · Community-Driven Standards · Other Standards Affecting Annotation · Summary
5. Applying and Adopting Annotation Standards
Metadata Annotation: Document Classification · Unique Labels: Movie Reviews · Multiple Labels: Film Genres · Text Extent Annotation: Named Entities · Inline Annotation · Stand-off Annotation by Tokens · Stand-off Annotation by Character Location · Linked Extent Annotation: Semantic Roles · ISO Standards and You · Summary
6. Annotation and Adjudication
The Infrastructure of an Annotation Project · Specification Versus Guidelines · Be Prepared to Revise · Preparing Your Data for Annotation · Metadata · Preprocessed Data · Splitting Up the Files for Annotation · Writing the Annotation Guidelines · Example 1: Single Labels—Movie Reviews · Example 2: Multiple Labels—Film Genres · Example 3: Extent Annotations—Named Entities · Example 4: Link Tags—Semantic Roles · Annotators · Choosing an Annotation Environment · Evaluating the Annotations · Cohen’s Kappa (κ) · Fleiss’s Kappa (κ) · Interpreting Kappa Coefficients · Calculating κ in Other Contexts · Creating the Gold Standard (Adjudication) · Summary
7. Training: Machine Learning
What Is Learning? · Defining Our Learning Task · Classifier Algorithms · Decision Tree Learning · Gender Identification · Naïve Bayes Learning · Movie genre identification · Sentiment classification · Maximum Entropy Classifiers · Other Classifiers to Know About · Sequence Induction Algorithms · Clustering and Unsupervised Learning · Semi-Supervised Learning · Matching Annotation to Algorithms · Summary
8. Testing and Evaluation
Testing Your Algorithm · Evaluating Your Algorithm · Confusion Matrices · Calculating Evaluation Scores · Percentage accuracy · Precision and recall · F-measure · Other evaluation metrics · Interpreting Evaluation Scores · Problems That Can Affect Evaluation · Dataset Is Too Small · Algorithm Fits the Development Data Too Well · Too Much Information in the Annotation · Final Testing Scores · Summary
9. Revising and Reporting
Revising Your Project · Corpus Distributions and Content · Model and Specification · Annotation · Guidelines · Annotators · Tools · Training and Testing · Reporting About Your Work · About Your Corpus · About Your Model and Specifications · About Your Annotation Task and Annotators · About Your ML Algorithm · About Your Revisions · Summary
10. Annotation: TimeML
The Goal of TimeML · Related Research · Building the Corpus · Model: Preliminary Specifications · Times · Signals · Events · Links · Annotation: First Attempts · Model: The TimeML Specification Used in TimeBank · Time Expressions · Events · Signals · Links · Confidence · Annotation: The Creation of TimeBank · TimeML Becomes ISO-TimeML · Modeling the Future: Directions for TimeML · Narrative Containers · Expanding TimeML to Other Domains · Event Structures · Summary
11. Automatic Annotation: Generating TimeML
The TARSQI Components · GUTime: Temporal Marker Identification · EVITA: Event Recognition and Classification · GUTenLINK · Slinket · SputLink · Machine Learning in the TARSQI Components · Improvements to the TTK · Structural Changes · Improvements to Temporal Entity Recognition: BTime · Temporal Relation Identification · Temporal Relation Validation · Temporal Relation Visualization · TimeML Challenges: TempEval-2 · TempEval-2: System Summaries · Overview of Results · Future of the TTK · New Input Formats · Narrative Containers/Narrative Times · Medical Documents · Cross-Document Analysis · Summary
12. Afterword: The Future of Annotation
Crowdsourcing Annotation · Amazon’s Mechanical Turk · Games with a Purpose (GWAP) · User-Generated Content · Handling Big Data · Boosting · Active Learning · Semi-Supervised Learning · NLP Online and in the Cloud · Distributed Computing · Shared Language Resources · Shared Language Applications · And Finally...
A. List of Available Corpora and Specifications
Corpora · Specifications, Guidelines, and Other Resources · Representation Standards
B. List of Software Resources
Annotation and Adjudication Software · Multipurpose Tools · Corpus Creation and Exploration Tools · Manual Annotation Tools · Automated Annotation Tools · Multipurpose tools · Phonetic annotation · Part-of-speech taggers/syntactic parsers · Tokenizers/chunkers/stemmers · Other · Machine Learning Resources
C. MAE User Guide
Installing and Running MAE · Loading Tasks and Files · Loading a Task · Loading a File · Annotating Entities · Attribute information · Nonconsuming tags · Annotating Links · Deleting Tags · Saving Files · Defining Your Own Task · Task Name · Elements (a.k.a. Tags) · Attributes · id attributes · start attribute · Attribute types · Default attribute values · Frequently Asked Questions
D. MAI User Guide
Installing and Running MAI · Loading Tasks and Files · Loading a Task · Loading Files · Adjudicating · The MAI Window · Adjudicating a Tag · Extent Tags · Link Tags · Nonconsuming Tags · Adding New Tags · Deleting tags · Saving Files
E. Bibliography
References for Using Amazon’s Mechanical Turk/Crowdsourcing
Index
About the Authors
Colophon
Copyright

Content preview from Natural Language Annotation for Machine Learning

Preface

This book is intended as a resource for people who are interested in using computers to help process natural language. A natural language refers to any language spoken by humans, either currently (e.g., English, Chinese, Spanish) or in the past (e.g., Latin, ancient Greek, Sanskrit). Annotation refers to the process of adding metadata information to the text in order to augment a computer’s capability to perform Natural Language Processing (NLP). In particular, we examine how information can be added to natural language text through annotation in order to increase the performance of machine learning algorithms—computer programs designed to extrapolate rules from the information provided over texts in order to apply those rules to unannotated texts later on.
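
To make the idea of annotation as added metadata concrete, here is a minimal sketch in Python. It is an illustration only, not code from the book: the sentence, the `Annotation` class, the `labeled_spans` helper, and the `PERSON`/`LOCATION` labels are all hypothetical choices made for this example. It stores labels separately from the text and points back into it by character position, one common way of attaching metadata without altering the text itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One piece of metadata attached to a span of text (hypothetical schema)."""
    start: int   # index of the first character of the span
    end: int     # index one past the last character of the span
    label: str   # category assigned by the annotator


# The raw text is left untouched; the annotations live alongside it.
text = "Ada Lovelace was born in London."

# Labels and offsets here are assumptions chosen for illustration.
annotations = [
    Annotation(start=0, end=12, label="PERSON"),
    Annotation(start=25, end=31, label="LOCATION"),
]


def labeled_spans(text, annotations):
    """Return (span text, label) pairs by slicing the text at each offset pair."""
    return [(text[a.start:a.end], a.label) for a in annotations]


for span, label in labeled_spans(text, annotations):
    print(f"{label}: {span}")
```

A machine learning algorithm trained on many such labeled spans could then try to assign the same kinds of labels to text that has not been annotated.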

Natural Language Annotation for Machine Learning

This book details the multistage process for building your own annotated natural language dataset (known as a corpus) in order to train machine learning (ML) algorithms for language-based data and knowledge discovery. The overall goal of this book is to show readers how to create their own corpus, starting with selecting an annotation task, creating the annotation specification, designing the guidelines, creating a “gold standard” corpus, and then beginning the actual data creation with the annotation process.

Because the annotation process is not linear, multiple iterations can be required for defining the tasks, annotations, and evaluations, in order to achieve the best ...

Publisher Resources

ISBN: 9781449332693
