Blueprints for Text Analytics Using Python
by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
Chapter 1. Gaining Early Insights from Textual Data
One of the first tasks in every data analytics and machine learning project is to become familiar with the data. In fact, a basic understanding of the data is essential for achieving robust results. Descriptive statistics provide reliable insights and help to assess data quality and distribution.
When considering texts, frequency analysis of words and phrases is one of the main methods for data exploration. Though absolute word frequencies are usually not very interesting, relative or weighted frequencies are. When analyzing text about politics, for example, the most common words will probably include many obvious and unsurprising terms such as people, country, and government. But if you compare relative word frequencies in text from different political parties or even from politicians in the same party, you can learn a lot from the differences.
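To make the idea of comparing relative frequencies concrete, here is a minimal sketch. It is not the chapter's blueprint: the two toy texts, the whitespace-based word splitting, and the helper name `relative_frequencies` are all assumptions chosen for illustration.

```python
from collections import Counter

def relative_frequencies(text):
    """Return each word's share of all words in the text (hypothetical helper)."""
    words = text.lower().split()  # naive whitespace split, for illustration only
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical toy texts standing in for speeches from two parties
party_a = "people need jobs and people need security"
party_b = "people need climate action and climate justice"

freq_a = relative_frequencies(party_a)
freq_b = relative_frequencies(party_b)

# Difference in relative frequency: positive values are more typical of party A,
# negative values more typical of party B
vocabulary = set(freq_a) | set(freq_b)
diff = {w: freq_a.get(w, 0.0) - freq_b.get(w, 0.0) for w in vocabulary}

for word, delta in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:10s} {delta:+.3f}")
```

Words used equally often in both texts, such as "and", end up with a difference of zero, while the words that separate the two texts move to the top and bottom of the list.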
What You’ll Learn and What We’ll Build
This chapter presents blueprints for the statistical analysis of text. It gets you started quickly and introduces basic concepts that you will need to know in subsequent chapters. We will start by analyzing categorical metadata and then focus on word frequency analysis and visualization.
After studying this chapter, you will have basic knowledge about text processing and analysis. You will know how to tokenize text, filter stop words, and analyze textual content with frequency diagrams and word clouds. We will also introduce ...
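As a rough preview of the tokenization, stop word filtering, and frequency counting mentioned above, the following sketch uses only the Python standard library. The tiny stop word list, the ASCII-only regular expression, and the function names are assumptions for illustration, not the approach the chapter itself develops.

```python
import re
from collections import Counter

# Small illustrative stop word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}

def tokenize(text):
    """Split text into lowercase word tokens (ASCII letters only in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stop_words]

sample = "The people of the country and the government of the people."
tokens = remove_stop_words(tokenize(sample))

# Count the remaining tokens and look at the most frequent ones
top_words = Counter(tokens).most_common(2)
print(tokens)
print(top_words)
```

After filtering, only the content words remain, and "people" is the most frequent token in the sample sentence.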