5 Text Analytics

This chapter gives a brief description of text analysis, making use of the basic informatics tools introduced thus far. It begins with how to obtain the text data of interest, covering a variety of internet sources and website “scraping” methods. Python‐based code is then used to perform text analysis at the level of words (Section 5.1), short phrases (Section 5.2), and long phrases (Section 5.3). These are very basic tools, but when coupled with Google Translate and other Deep Learning tools (Chapter 13), as well as Natural Language Processing (NLP) tools, they provide an analytical basis for automated language processing.

5.1 Words

Words, the basic units of language, appear in text as clearly separated word‐units. In this section, we explore methods for text analysis at the (single) word level, starting with word‐frequency analysis (Section 5.1.2). This is followed by a “meta‐analysis” that uses a sentiment table, which scores individual words according to positive or negative “sentiment,” to identify the sentiment of a passage and associate it with the keywords identified in that passage (Section 5.1.3). Before getting into the implementation, however, we need some data to analyze. Section 5.1.1 gives examples of basic text acquisition from repositories of classic written works (Project Gutenberg), from a local, rapidly searchable copy of Wikipedia (using Kiwix), or by text‐scraping data from any website ...
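As a rough illustration of the word‐level analysis previewed above, the following minimal Python sketch counts word frequencies in a short passage and sums per‐word sentiment scores. It is not the implementation developed in Sections 5.1.2 and 5.1.3; the sentiment table, the sample passage, and the function names (tokenize, word_frequencies, sentiment_score) are hypothetical, chosen only for illustration, and the tokenizer is a deliberately simple regular expression.

```python
import re
from collections import Counter

# Hypothetical sentiment table (an assumption for this sketch):
# positive words score > 0, negative words score < 0.
SENTIMENT = {
    "good": 1, "great": 2, "happy": 1,
    "bad": -1, "terrible": -2, "sad": -1,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Return a Counter mapping each word to its number of occurrences."""
    return Counter(tokenize(text))


def sentiment_score(text, table=SENTIMENT):
    """Sum the table scores of all words; words not in the table score 0."""
    return sum(table.get(word, 0) for word in tokenize(text))


# Made-up sample passage, used only to exercise the functions.
passage = "The food was great, but the service was bad. The food made us happy."

freqs = word_frequencies(passage)
print(freqs.most_common(3))      # the three most frequent words
print(sentiment_score(passage))  # net sentiment of the passage
```

A real sentiment table would be far larger, and a practical tokenizer would also need to handle punctuation, hyphenation, and common “stop” words; this sketch leaves all of that out.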
