Natural Language Annotation for Machine Learning

by James Pustejovsky, Amber Stubbs

October 2012

Beginner to intermediate

342 pages

9h 55m

English

O'Reilly Media, Inc.

Preface
Natural Language Annotation for Machine Learning | Audience | Organization of This Book | Software Requirements | Conventions Used in This Book | Using Code Examples | Safari® Books Online | How to Contact Us | Acknowledgments | James Adds: | Amber Adds:
1. The Basics
The Importance of Language Annotation | The Layers of Linguistic Description | What Is Natural Language Processing? | A Brief History of Corpus Linguistics | What Is a Corpus? | Early Use of Corpora | Corpora Today | Kinds of Annotation | Language Data and Machine Learning | Classification | Clustering | Structured Pattern Induction | The Annotation Development Cycle | Model the Phenomenon | Annotate with the Specification | Train and Test the Algorithms over the Corpus | Evaluate the Results | Revise the Model and Algorithms | Summary
2. Defining Your Goal and Dataset
Defining Your Goal | The Statement of Purpose | Refining Your Goal: Informativity Versus Correctness | The scope of the annotation task | What will the annotation be used for? | What will the overall outcome be? | Where will the corpus come from? | How will the result be achieved? | Background Research | Language Resources | Organizations and Conferences | NLP Challenges | Assembling Your Dataset | The Ideal Corpus: Representative and Balanced | Collecting Data from the Internet | Eliciting Data from People | Read speech | Spontaneous speech | The Size of Your Corpus | Existing Corpora | Distributions Within Corpora | Summary
3. Corpus Analytics
Basic Probability for Corpus Analytics | Joint Probability Distributions | Bayes Rule | Counting Occurrences | Zipf’s Law | N-grams | Language Models | Summary
4. Building Your Model and Specification
Some Example Models and Specs | Film Genre Classification | Adding Named Entities | Semantic Roles | Adopting (or Not Adopting) Existing Models | Creating Your Own Model and Specification: Generality Versus Specificity | Using Existing Models and Specifications | Using Models Without Specifications | Different Kinds of Standards | ISO Standards | Annotation format standards | Annotation specification standards | Community-Driven Standards | Other Standards Affecting Annotation | Summary
5. Applying and Adopting Annotation Standards
Metadata Annotation: Document Classification | Unique Labels: Movie Reviews | Multiple Labels: Film Genres | Text Extent Annotation: Named Entities | Inline Annotation | Stand-off Annotation by Tokens | Stand-off Annotation by Character Location | Linked Extent Annotation: Semantic Roles | ISO Standards and You | Summary
6. Annotation and Adjudication
The Infrastructure of an Annotation Project | Specification Versus Guidelines | Be Prepared to Revise | Preparing Your Data for Annotation | Metadata | Preprocessed Data | Splitting Up the Files for Annotation | Writing the Annotation Guidelines | Example 1: Single Labels—Movie Reviews | Example 2: Multiple Labels—Film Genres | Example 3: Extent Annotations—Named Entities | Example 4: Link Tags—Semantic Roles | Annotators | Choosing an Annotation Environment | Evaluating the Annotations | Cohen’s Kappa (κ) | Fleiss’s Kappa (κ) | Interpreting Kappa Coefficients | Calculating κ in Other Contexts | Creating the Gold Standard (Adjudication) | Summary
7. Training: Machine Learning
What Is Learning? | Defining Our Learning Task | Classifier Algorithms | Decision Tree Learning | Gender Identification | Naïve Bayes Learning | Movie genre identification | Sentiment classification | Maximum Entropy Classifiers | Other Classifiers to Know About | Sequence Induction Algorithms | Clustering and Unsupervised Learning | Semi-Supervised Learning | Matching Annotation to Algorithms | Summary
8. Testing and Evaluation
Testing Your Algorithm | Evaluating Your Algorithm | Confusion Matrices | Calculating Evaluation Scores | Percentage accuracy | Precision and recall | F-measure | Other evaluation metrics | Interpreting Evaluation Scores | Problems That Can Affect Evaluation | Dataset Is Too Small | Algorithm Fits the Development Data Too Well | Too Much Information in the Annotation | Final Testing Scores | Summary
9. Revising and Reporting
Revising Your Project | Corpus Distributions and Content | Model and Specification | Annotation | Guidelines | Annotators | Tools | Training and Testing | Reporting About Your Work | About Your Corpus | About Your Model and Specifications | About Your Annotation Task and Annotators | About Your ML Algorithm | About Your Revisions | Summary

10. Annotation: TimeML
The Goal of TimeML | Related Research | Building the Corpus | Model: Preliminary Specifications | Times | Signals | Events | Links | Annotation: First Attempts | Model: The TimeML Specification Used in TimeBank | Time Expressions | Events | Signals | Links | Confidence | Annotation: The Creation of TimeBank | TimeML Becomes ISO-TimeML | Modeling the Future: Directions for TimeML | Narrative Containers | Expanding TimeML to Other Domains | Event Structures | Summary
11. Automatic Annotation: Generating TimeML
The TARSQI Components | GUTime: Temporal Marker Identification | EVITA: Event Recognition and Classification | GUTenLINK | Slinket | SputLink | Machine Learning in the TARSQI Components | Improvements to the TTK | Structural Changes | Improvements to Temporal Entity Recognition: BTime | Temporal Relation Identification | Temporal Relation Validation | Temporal Relation Visualization | TimeML Challenges: TempEval-2 | TempEval-2: System Summaries | Overview of Results | Future of the TTK | New Input Formats | Narrative Containers/Narrative Times | Medical Documents | Cross-Document Analysis | Summary
12. Afterword: The Future of Annotation
Crowdsourcing Annotation | Amazon’s Mechanical Turk | Games with a Purpose (GWAP) | User-Generated Content | Handling Big Data | Boosting | Active Learning | Semi-Supervised Learning | NLP Online and in the Cloud | Distributed Computing | Shared Language Resources | Shared Language Applications | And Finally...
A. List of Available Corpora and Specifications
Corpora | Specifications, Guidelines, and Other Resources | Representation Standards
B. List of Software Resources
Annotation and Adjudication Software | Multipurpose Tools | Corpus Creation and Exploration Tools | Manual Annotation Tools | Automated Annotation Tools | Multipurpose tools | Phonetic annotation | Part-of-speech taggers/syntactic parsers | Tokenizers/chunkers/stemmers | Other | Machine Learning Resources
C. MAE User Guide
Installing and Running MAE | Loading Tasks and Files | Loading a Task | Loading a File | Annotating Entities | Attribute information | Nonconsuming tags | Annotating Links | Deleting Tags | Saving Files | Defining Your Own Task | Task Name | Elements (a.k.a. Tags) | Attributes | id attributes | start attribute | Attribute types | Default attribute values | Frequently Asked Questions
D. MAI User Guide
Installing and Running MAI | Loading Tasks and Files | Loading a Task | Loading Files | Adjudicating | The MAI Window | Adjudicating a Tag | Extent Tags | Link Tags | Nonconsuming Tags | Adding New Tags | Deleting tags | Saving Files
E. Bibliography
References for Using Amazon’s Mechanical Turk/Crowdsourcing
Index
About the Authors
Colophon
Copyright

Content preview from Natural Language Annotation for Machine Learning

Appendix A. List of Available Corpora and Specifications

This appendix was compiled primarily from the LRE Resource Map. Many thanks to Nicoletta Calzolari and Riccardo del Gratta for their help in creating this appendix, and for allowing us to reprint this information here.

Please note that this appendix is not a complete list of all existing corpora and specifications. It is intended as a general overview of what is available, to give you an idea of the resources you can use in your own annotation and machine learning (ML) tasks. For the most up-to-date list of resources, check the LRE Resource Map, or do a web search to see what else is available.

Corpora

A Reference Dependency Bank for Analyzing Complex Predicates

Modality: Written

Languages: Hindi/Urdu

Annotation: Semantic dependencies

URL: http://ling.uni-konstanz.de/pages/home/pargram_urdu/main/Resources.html

A Treebank for Finnish (FinnTreeBank)

Modality: Written

Language: Finnish

Annotation: Treebank

Size: 17,000 model sentences

URL: http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/index.shtml

ALLEGRA (ALigned press reLEases of the GRisons Administration)

Modality: Written

Languages: German, Romansh, Italian

URL: http://www.latl.unige.ch/allegra/

AnCora

Modality: Written

Language: Catalan

Annotations: Lemma and part of speech, syntactic constituents and functions, argument structure and thematic roles, semantic classes of the verb, Named Entities, coreference ...

Publisher Resources

ISBN: 9781449332693