Chapter 12. Text Analysis and Generation

At this point we have covered Python’s core data structures—lists, dictionaries, and tuples—and some algorithms that use them. In this chapter, we’ll use them to explore text analysis and Markov generation:

  • Text analysis is a way to describe the statistical relationships between the words in a document, like the probability that one word is followed by another.

  • Markov generation is a way to generate new text with words and phrases similar to the original text.
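To make "the probability that one word is followed by another" concrete, here is a minimal sketch that estimates it by counting adjacent word pairs. The sample text and the function name `follow_probability` are assumptions made for illustration; they do not come from the book's own code.

```python
# Hypothetical sample text standing in for a real document.
text = "the cat sat on the mat and the cat slept"
words = text.split()

# Count how often each word starts a pair, and how often each pair occurs.
first_counts = {}
pair_counts = {}
for first, second in zip(words, words[1:]):
    first_counts[first] = first_counts.get(first, 0) + 1
    pair = (first, second)
    pair_counts[pair] = pair_counts.get(pair, 0) + 1

def follow_probability(first, second):
    """Estimate the probability that `second` comes right after `first`."""
    total = first_counts.get(first, 0)
    if total == 0:
        return 0.0  # `first` never appears with a following word
    return pair_counts.get((first, second), 0) / total

print(follow_probability("the", "cat"))
```

In this sample, "the" is followed by "cat" twice and by "mat" once, so the estimate for "cat" is 2/3.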

These algorithms are similar to parts of a large language model (LLM), which is the key component of a chatbot.

We’ll start by counting the number of times each word appears in a book. Then we’ll look at pairs of words and make a list of the words that can follow each word. We’ll make a simple version of a Markov generator, and as an exercise, you’ll have a chance to make a more general version.
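The plan above can be sketched in a few lines: build a dictionary that maps each word to the list of words that follow it, then walk through it by choosing a random successor at each step. This is only a rough illustration under assumed names (`successors`, `generate`) and a made-up sample text, not necessarily the version the chapter builds.

```python
import random

# Hypothetical sample text standing in for a book.
text = "the cat sat on the mat and the cat slept on the rug"
words = text.split()

# Map each word to the list of words that follow it in the text.
successors = {}
for first, second in zip(words, words[1:]):
    successors.setdefault(first, []).append(second)

def generate(start, n):
    """Return up to n words, each chosen at random from the
    successors of the previous word."""
    result = [start]
    word = start
    for _ in range(n - 1):
        options = successors.get(word)
        if not options:
            break  # the word has no known successor, so stop early
        word = random.choice(options)
        result.append(word)
    return result

print(' '.join(generate('the', 8)))
```

Because a successor list keeps duplicates, a word that follows another more often is more likely to be chosen.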

Unique Words

As a first step toward text analysis, let’s read a book—The Strange Case of Dr. Jekyll and Mr. Hyde by Robert Louis Stevenson—and count the number of unique words. Instructions for downloading the book are in the notebook for this chapter:

filename = 'dr_jekyll.txt'

We’ll use a for loop to read lines from the file and the split method to divide each line into words. Then, to keep track of unique words, we’ll store each word as a key in a dictionary:

unique_words = {}
for line in open(filename):
    seq = line.split()
    for word in seq:
        unique_words[word] = 1

len(unique_words)

 6040 ...
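To see what this loop counts as a "unique" word, it may help to run the same logic on a couple of lines that can be checked by hand. The list `lines` below is a made-up stand-in for the lines of the file.

```python
# Hypothetical lines standing in for the lines read from the book.
lines = [
    "The door was shut.\n",
    "the door was very old\n",
]

unique_words = {}
for line in lines:
    # split() breaks the line on whitespace and drops the newline
    for word in line.split():
        unique_words[word] = 1  # the value is unused; only the key matters

print(len(unique_words))
print(sorted(unique_words))
```

Here the count is 7, because split only separates on whitespace: 'The' and 'the' are different keys, and 'shut.' keeps its period.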
