Chapter 3. Text
Introduction
Credit: Fred L. Drake, Jr., PythonLabs
Text-processing applications form a substantial part of the application space for any scripting language, if only because everyone can agree that text processing is useful. Everyone has bits of text that need to be reformatted or transformed in various ways. The catch, of course, is that every application is just a little bit different from every other application, so it can be difficult to find just the right reusable code to work with different file formats, no matter how similar they are.
What Is Text?
Sounds like an easy question, doesn’t it? After all, we know it when we see it, don’t we? Text is a sequence of characters, and it is distinguished from binary data by that very fact. Binary data, after all, is a sequence of bytes.
Unfortunately, all data enters our applications as a sequence of bytes. There’s no library function we can call that will tell us whether a particular sequence of bytes represents text, although we can create some useful heuristics that tell us whether data can safely (not necessarily correctly) be handled as text.
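One such heuristic can be sketched in a few lines. The example below is only an illustration, written for Python 3 `bytes` objects: the name `looks_like_text`, the set of bytes treated as "texty", and the 30% threshold are all assumptions chosen for the sketch, not part of any standard library. It answers the question "is this safe to treat as text?" and makes no claim that the data really is text.

```python
# Hypothetical helper: a rough heuristic, not a library function.
# Bytes we are willing to call "texty": printable ASCII plus common whitespace.
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\f\b"


def looks_like_text(data, threshold=0.30):
    """Guess whether a bytes object can safely be handled as text.

    The threshold (an arbitrary assumption) is the largest fraction of
    non-text bytes we tolerate before calling the data binary.
    """
    if not data:
        return True   # empty data is trivially safe to treat as text
    if b"\0" in data:
        return False  # NUL bytes almost never appear in text files
    # Delete every byte we consider texty; whatever remains is suspicious.
    suspicious = data.translate(None, TEXT_BYTES)
    return len(suspicious) / len(data) <= threshold


if __name__ == "__main__":
    print(looks_like_text(b"Hello, world!\n"))
    print(looks_like_text(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
```

A check like this is deliberately conservative about ASCII and will misjudge text in encodings that rely heavily on high-bit bytes, so the byte set and threshold should be tuned to the data an application actually sees.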
Python strings are immutable sequences of bytes or characters. Most of the ways we create and process strings treat them as sequences of characters, but many are just as applicable to sequences of bytes. Unicode strings are immutable sequences of Unicode characters: transformations of Unicode strings into and from plain strings use codecs (coder-decoder) objects that embody knowledge ...