Natural Language Processing with Java - Second Edition

Book description

Explore various approaches to organize and extract useful text from unstructured data using Java

Key Features

  • Use deep learning and NLP techniques in Java to discover hidden insights in text
  • Work with popular Java libraries such as CoreNLP, OpenNLP, and Mallet
  • Explore machine translation, part-of-speech identification, and topic modeling

Book Description

Natural Language Processing (NLP) allows you to take any sentence and identify patterns, special names, company names, and more. The second edition of Natural Language Processing with Java teaches you how to perform language analysis with Java libraries and how to draw insights from the results.

You'll start by understanding how NLP and its core concepts work. Having got to grips with the basics, you'll explore important Java tools and libraries for NLP, such as CoreNLP, OpenNLP, Neuroph, and Mallet. You'll then apply NLP to different inputs and tasks, such as tokenization, model training, part-of-speech tagging, and parse trees. You'll also learn about statistical machine translation, summarization, dialog systems, complex searches, supervised and unsupervised NLP, and more.

By the end of this book, you'll have learned more about NLP, neural networks, and how to use various trained models in Java to improve the performance of NLP applications.

What you will learn

  • Understand basic NLP tasks and how they relate to one another
  • Discover and use the available tokenization engines
  • Apply search techniques to find people, as well as things, within a document
  • Construct solutions to identify parts of speech within sentences
  • Use parsers to extract relationships between elements of a document
  • Identify topics in a set of documents
  • Explore topic modeling from a document

Who this book is for

Natural Language Processing with Java is for you if you are a data analyst, data scientist, or machine learning engineer who wants to extract information from natural-language text using Java. Knowledge of Java programming is required; a basic understanding of statistics is useful but not mandatory.

Table of contents

  1. Title Page
  2. Copyright and Credits
    1. Natural Language Processing with Java Second Edition
  3. Dedication
  4. Packt Upsell
    1. Why subscribe?
    2. PacktPub.com
  5. Contributors
    1. About the authors
    2. About the reviewers
    3. Packt is searching for authors like you
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  7. Introduction to NLP
    1. What is NLP?
    2. Why use NLP?
    3. Why is NLP so hard?
    4. Survey of NLP tools
      1. Apache OpenNLP
      2. Stanford NLP
      3. LingPipe
      4. GATE
      5. UIMA
      6. Apache Lucene Core
    5. Deep learning for Java
    6. Overview of text-processing tasks
      1. Finding parts of text
      2. Finding sentences
      3. Feature-engineering
      4. Finding people and things
      5. Detecting parts of speech
      6. Classifying text and documents
      7. Extracting relationships
      8. Using combined approaches
    7. Understanding NLP models
      1. Identifying the task
      2. Selecting a model
      3. Building and training the model
      4. Verifying the model
      5. Using the model
    8. Preparing data
    9. Summary
  8. Finding Parts of Text
    1. Understanding the parts of text
    2. What is tokenization?
      1. Uses of tokenizers
    3. Simple Java tokenizers
      1. Using the Scanner class
        1. Specifying the delimiter
      2. Using the split method
      3. Using the BreakIterator class
      4. Using the StreamTokenizer class
      5. Using the StringTokenizer class
      6. Performance considerations with Java core tokenization
    4. NLP tokenizer APIs
      1. Using the OpenNLPTokenizer class
        1. Using the SimpleTokenizer class
        2. Using the WhitespaceTokenizer class
        3. Using the TokenizerME class
      2. Using the Stanford tokenizer
        1. Using the PTBTokenizer class
        2. Using the DocumentPreprocessor class
        3. Using a pipeline
        4. Using LingPipe tokenizers
      3. Training a tokenizer to find parts of text
      4. Comparing tokenizers
    5. Understanding normalization
      1. Converting to lowercase
      2. Removing stopwords
        1. Creating a StopWords class
        2. Using LingPipe to remove stopwords
      3. Using stemming
        1. Using the Porter Stemmer
        2. Stemming with LingPipe
      4. Using lemmatization
        1. Using the StanfordLemmatizer class
        2. Using lemmatization in OpenNLP
      5. Normalizing using a pipeline
    6. Summary
  9. Finding Sentences
    1. The SBD process
    2. What makes SBD difficult?
    3. Understanding the SBD rules of LingPipe's HeuristicSentenceModel class
    4. Simple Java SBDs
      1. Using regular expressions
      2. Using the BreakIterator class
    5. Using NLP APIs
      1. Using OpenNLP
        1. Using the SentenceDetectorME class
        2. Using the sentPosDetect method
      2. Using the Stanford API
        1. Using the PTBTokenizer class
        2. Using the DocumentPreprocessor class
        3. Using the StanfordCoreNLP class
      3. Using LingPipe
        1. Using the IndoEuropeanSentenceModel class
        2. Using the SentenceChunker class
        3. Using the MedlineSentenceModel class
    6. Training a sentence-detector model
      1. Using the Trained model
      2. Evaluating the model using the SentenceDetectorEvaluator class
    7. Summary
  10. Finding People and Things
    1. Why is NER difficult?
    2. Techniques for name recognition
      1. Lists and regular expressions
      2. Statistical classifiers
    3. Using regular expressions for NER
      1. Using Java's regular expressions to find entities
      2. Using the RegExChunker class of LingPipe
    4. Using NLP APIs
      1. Using OpenNLP for NER
        1. Determining the accuracy of the entity
        2. Using other entity types
        3. Processing multiple entity types
      2. Using the Stanford API for NER
      3. Using LingPipe for NER
        1. Using LingPipe's named entity models
        2. Using the ExactDictionaryChunker class
    5. Building a new dataset with the NER annotation tool
    6. Training a model
      1. Evaluating a model
    7. Summary
  11. Detecting Part of Speech
    1. The tagging process
      1. The importance of POS taggers
      2. What makes POS difficult?
    2. Using the NLP APIs
      1. Using OpenNLP POS taggers
        1. Using the OpenNLP POSTaggerME class for POS taggers
        2. Using OpenNLP chunking
        3. Using the POSDictionary class
          1. Obtaining the tag dictionary for a tagger
          2. Determining a word's tags
          3. Changing a word's tags
          4. Adding a new tag dictionary
          5. Creating a dictionary from a file
      2. Using Stanford POS taggers
        1. Using Stanford MaxentTagger
        2. Using the MaxentTagger class to tag textese
        3. Using the Stanford pipeline to perform tagging
      3. Using LingPipe POS taggers
        1. Using the HmmDecoder class with Best_First tags
        2. Using the HmmDecoder class with NBest tags
        3. Determining tag confidence with the HmmDecoder class
      4. Training the OpenNLP POSModel
    3. Summary
  12. Representing Text with Features
    1. N-grams
    2. Word embedding
    3. GloVe
    4. Word2vec
    5. Dimensionality reduction
    6. Principal component analysis
    7. Distributed stochastic neighbor embedding
    8. Summary
  13. Information Retrieval
    1. Boolean retrieval
    2. Dictionaries and tolerant retrieval
      1. Wildcard queries
      2. Spelling correction
      3. Soundex
    3. Vector space model
    4. Scoring and term weighting
    5. Inverse document frequency
    6. TF-IDF weighting
    7. Evaluation of information retrieval systems
    8. Summary
  14. Classifying Texts and Documents
    1. How classification is used
    2. Understanding sentiment analysis
    3. Text-classifying techniques
    4. Using APIs to classify text
      1. Using OpenNLP
        1. Training an OpenNLP classification model
        2. Using DocumentCategorizerME to classify text
      2. Using the Stanford API
        1. Using the ColumnDataClassifier class for classification
        2. Using the Stanford pipeline to perform sentiment analysis
      3. Using LingPipe to classify text
        1. Training text using the Classified class
        2. Using other training categories
        3. Classifying text using LingPipe
        4. Sentiment analysis using LingPipe
        5. Language identification using LingPipe
    5. Summary
  15. Topic Modeling
    1. What is topic modeling?
    2. The basics of LDA
    3. Topic modeling with MALLET
      1. Training
      2. Evaluation
    4. Summary
  16. Using Parsers to Extract Relationships
    1. Relationship types
    2. Understanding parse trees
    3. Using extracted relationships
    4. Extracting relationships
    5. Using NLP APIs
      1. Using OpenNLP
      2. Using the Stanford API
        1. Using the LexicalizedParser class
        2. Using the TreePrint class
        3. Finding word dependencies using the GrammaticalStructure class
      3. Finding coreference resolution entities
    6. Extracting relationships for a question-answer system
      1. Finding the word dependencies
      2. Determining the question type
      3. Searching for the answer
    7. Summary
  17. Combined Pipeline
    1. Preparing data
    2. Using boilerpipe to extract text from HTML
    3. Using POI to extract text from Word documents
    4. Using PDFBox to extract text from PDF documents
    5. Using Apache Tika for content analysis and extraction
    6. Pipelines
    7. Using the Stanford pipeline
    8. Using multiple cores with the Stanford pipeline
    9. Creating a pipeline to search text
    10. Summary
  18. Creating a Chatbot
    1. Chatbot architecture
    2. Artificial Linguistic Internet Computer Entity
      1. Understanding AIML
      2. Developing a chatbot using ALICE and AIML
    3. Summary
  19. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Natural Language Processing with Java - Second Edition
  • Author(s): Richard M. Reese, AshishSingh Bhatia
  • Release date: July 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781788993494