Data Structures and Algorithms in Python
by Michael T. Goodrich, Roberto Tamassia, Michael H. Goldwasser
Chapter 13
Text Processing

Contents
13.1 Abundance of Digitized Text
13.1.1 Notations for Strings and the Python str Class
13.2 Pattern-Matching Algorithms
13.2.2 The Boyer-Moore Algorithm
13.2.3 The Knuth-Morris-Pratt Algorithm
13.3.2 DNA and Text Sequence Alignment
13.4 Text Compression and the Greedy Method
13.1 Abundance of Digitized Text
Despite the wealth of multimedia information, text processing remains one of the dominant functions of computers. Computer are used to edit, store, and display documents, and to transport documents over the Internet. Furthermore, digital systems are used to archive a wide range of textual information, and new data is being generated at a rapidly increasing pace. A large corpus can readily surpass a petabyte of data (which is equivalent to a thousand terabytes, or a million gigabytes). Common examples of digital collections that include textual information are:
- Snapshots of the World Wide Web, as Internet document formats HTML and XML are primarily text formats, with added tags for multimedia content
- All documents stored locally on a user's computer
- Email archives
- Customer reviews
- Compilations ...