Natural Language Processing with Java

ISBN: 9781784391799

by Richard M. Reese

March 2015

Beginner to intermediate

262 pages

5h 28m

English

Packt Publishing

Table of Contents
Natural Language Processing with Java
Credits
About the Author
About the Reviewers
www.PacktPub.com
  Support files, eBooks, discount offers, and more
  Why subscribe?
  Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for

Conventions
Reader feedback
Customer support
  Downloading the example code
  Errata
  Piracy
  Questions
1. Introduction to NLP
What is NLP?
Why use NLP?
Why is NLP so hard?
Survey of NLP tools
  Apache OpenNLP
  Stanford NLP
  LingPipe
  GATE
  UIMA
Overview of text processing tasks
  Finding parts of text
  Finding sentences
  Finding people and things
  Detecting Parts of Speech
  Classifying text and documents
  Extracting relationships
  Using combined approaches
Understanding NLP models
  Identifying the task
  Selecting a model
  Building and training the model
  Verifying the model
  Using the model
Preparing data
Summary
2. Finding Parts of Text
Understanding the parts of text
What is tokenization?
Uses of tokenizers
Simple Java tokenizers
  Using the Scanner class
  Specifying the delimiter
  Using the split method
  Using the BreakIterator class
  Using the StreamTokenizer class
  Using the StringTokenizer class
  Performance considerations with java core tokenization
NLP tokenizer APIs
  Using the OpenNLPTokenizer class
  Using the SimpleTokenizer class
  Using the WhitespaceTokenizer class
  Using the TokenizerME class
  Using the Stanford tokenizer
  Using the PTBTokenizer class
  Using the DocumentPreprocessor class
  Using a pipeline
  Using LingPipe tokenizers
  Training a tokenizer to find parts of text
  Comparing tokenizers
Understanding normalization
  Converting to lowercase
  Removing stopwords
  Creating a StopWords class
  Using LingPipe to remove stopwords
  Using stemming
  Using the Porter Stemmer
  Stemming with LingPipe
  Using lemmatization
  Using the StanfordLemmatizer class
  Using lemmatization in OpenNLP
  Normalizing using a pipeline
Summary
3. Finding Sentences
The SBD process
What makes SBD difficult?
Understanding SBD rules of LingPipe's HeuristicSentenceModel class
Simple Java SBDs
  Using regular expressions
  Using the BreakIterator class
Using NLP APIs
  Using OpenNLP
  Using the SentenceDetectorME class
  Using the sentPosDetect method
  Using the Stanford API
  Using the PTBTokenizer class
  Using the DocumentPreprocessor class
  Using the StanfordCoreNLP class
  Using LingPipe
  Using the IndoEuropeanSentenceModel class
  Using the SentenceChunker class
  Using the MedlineSentenceModel class
Training a Sentence Detector model
  Using the Trained model
  Evaluating the model using the SentenceDetectorEvaluator class
Summary
4. Finding People and Things
Why NER is difficult?
Techniques for name recognition
  Lists and regular expressions
  Statistical classifiers
Using regular expressions for NER
  Using Java's regular expressions to find entities
  Using LingPipe's RegExChunker class
Using NLP APIs
  Using OpenNLP for NER
  Determining the accuracy of the entity
  Using other entity types
  Processing multiple entity types
  Using the Stanford API for NER
  Using LingPipe for NER
  Using LingPipe's name entity models
  Using the ExactDictionaryChunker class
Training a model
Evaluating a model
Summary
5. Detecting Part of Speech
The tagging process
  Importance of POS taggers
  What makes POS difficult?
Using the NLP APIs
  Using OpenNLP POS taggers
  Using the OpenNLP POSTaggerME class for POS taggers
  Using OpenNLP chunking
  Using the POSDictionary class
  Obtaining the tag dictionary for a tagger
  Determining a word's tags
  Changing a word's tags
  Adding a new tag dictionary
  Creating a dictionary from a file
  Using Stanford POS taggers
  Using Stanford MaxentTagger
  Using the MaxentTagger class to tag textese
  Using Stanford pipeline to perform tagging
  Using LingPipe POS taggers
  Using the HmmDecoder class with Best_First tags
  Using the HmmDecoder class with NBest tags
  Determining tag confidence with the HmmDecoder class
  Training the OpenNLP POSModel
Summary
6. Classifying Texts and Documents
How classification is used
Understanding sentiment analysis
Text classifying techniques
Using APIs to classify text
  Using OpenNLP
  Training an OpenNLP classification model
  Using DocumentCategorizerME to classify text
  Using Stanford API
  Using the ColumnDataClassifier class for classification
  Using the Stanford pipeline to perform sentiment analysis
  Using LingPipe to classify text
  Training text using the Classified class
  Using other training categories
  Classifying text using LingPipe
  Sentiment analysis using LingPipe
  Language identification using LingPipe
Summary
7. Using Parser to Extract Relationships
Relationship types
Understanding parse trees
Using extracted relationships
Extracting relationships
Using NLP APIs
  Using OpenNLP
  Using the Stanford API
  Using the LexicalizedParser class
  Using the TreePrint class
  Finding word dependencies using the GrammaticalStructure class
  Finding coreference resolution entities
Extracting relationships for a question-answer system
  Finding the word dependencies
  Determining the question type
  Searching for the answer
Summary
8. Combined Approaches
Preparing data
  Using Boilerpipe to extract text from HTML
  Using POI to extract text from Word documents
  Using PDFBox to extract text from PDF documents
Pipelines
  Using the Stanford pipeline
  Using multiple cores with the Stanford pipeline
Creating a pipeline to search text
Summary
Index

Content preview from Natural Language Processing with Java

What is tokenization?

Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Tokens are split based on a set of delimiters, which are frequently whitespace characters. Whitespace in Java is defined by the Character class's isWhitespace method, and the characters it recognizes are listed in the following table. At times, however, a different set of delimiters is needed. For example, whitespace delimiters can obscure text breaks such as paragraph boundaries, so a different set is useful when detecting those breaks is important.

Character	Meaning
Unicode space character	(space_separator, line_separator, or paragraph_separator)
`\t`	U+0009 horizontal tabulation ...
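
The idea described above can be illustrated with a minimal sketch. It is not code from the book: the class name TokenizationSketch and the methods tokenize and splitParagraphs are hypothetical names chosen for this example. The first method splits text wherever Character.isWhitespace reports a whitespace character. The second uses a different delimiter, one or more blank lines, so that paragraph boundaries are kept instead of being absorbed into ordinary whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the book's example code.
class TokenizationSketch {

    // Splits text into tokens wherever Character.isWhitespace returns true.
    // Note that isWhitespace returns false for non-breaking spaces such as U+00A0.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // A delimiter ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // Flush the final token when the text does not end with whitespace.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Uses a different delimiter: two or more consecutive line breaks.
    // Assumption for this sketch: paragraphs are separated by empty lines.
    static String[] splitParagraphs(String text) {
        return text.split("(?:\\r?\\n){2,}");
    }

    public static void main(String[] args) {
        String text = "Tokens are split\ton delimiters.\n\n"
                + "A blank line starts a new paragraph.";
        System.out.println(tokenize(text));
        for (String paragraph : splitParagraphs(text)) {
            System.out.println("Paragraph: " + paragraph);
        }
    }
}
```

Whitespace tokenization discards the blank line between the two sentences, while splitParagraphs keeps it as a boundary. This is the situation in which a different delimiter set is useful.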

