Chapter 9
Combinatorics of Strings
Strings are sequences of symbols such as letters or integers. This is a convention. The
most famous string—at least it should be the most famous—is the DNA molecule. To be
technical, Adenine, Guanine, Cytosine, and Thymine are nitrogenous bases that together
with a phosphate group each comprise nucleotide. It is a string of nucleotides. Still, nothing
stops us from referring to these four components as four letters combination of which creates
DNA. True, DNA has an additional structure, the double helix, which could bring us quite far
from our topic. In practice, the strand approach is the one taken in much of genomics, which
is all about algorithms dealing with the giant molecule. This is the abstraction we employ
from now on. The reason for our treatment is that a rich variety of algorithms can be seen as
performing string processing, such as program compilation, book typesetting, and genomics.
A set of letters, symbols, or characters, is called an alphabet. We only consider finite
alphabets—and usually of only two symbols. Finite sequences of letters drawn from a fixed
alphabet are called strings or words. A set of strings or words is called a language—a con-
cept of great importance in computer science. The common notation of a string in terms of
its letters is simply to juxtapose them, as we saw in §7.6.4. For example, let a and b be the
letters of our alphabet, then abaab is a word.
In this chapter, we formalize these concepts, and show how strings can be manipulated and
counted. In particular, word equations are used to count elements from regular languages.
Also we present a powerful concept—the finite state machine as an aid to approaching prob-
abilistic and combinatorial problems. In particular, we pay a special attention to waiting time
problems. The last section gives an introduction to one of the most powerful techniques in
applied probability—Markov chains.
489
490
Chapter 9. Combinatorics of Strings
9.1 Operations on Languages
The significance of strings in the analysis of algorithms is due to the fact that a surprising
fraction of our most important algorithms may be viewed as m erely string processing. Indeed,
computer programs are ordered collections of finite sequences of letters. Hence compilation
is a form of string processing. T he next couple of pages are devoted to a different kind of
a sequence, a sequence of definitions, which are expected to present nothing new, but will
establish our terminology and notation.
Definition 9.1 A nonempty finite set of symbols is called an alphabet. The elements of an
alphabet are called letters.
In some contexts you will find people use the term vocabulary instead of alphabet, with
identical meaning. Because the elements of vocabulary are called words, which is our usual
term for strings of letters, we do not use this terminology here.
Definition 9.2 A word over the alphabet Σ is a finite sequence of letters, which are elements
of Σ. The number of letters in a word w is called its length and is denoted by |w|. The empty
string or empty word, denoted by
ε
, is a special string: it contains no letters. ⊳
Note that the empty string,
ε
, is different from ∅, the empty set. Here is a nonempty set: {
ε
};
it has one element, the empty word. An explicit notation for the empty set is {}.
Our standard notation for a word of length n is w = a
1
a
2
···a
n
, with n letters a
1
, a
2
, .. . , a
n
.
Exercise 9.3 [1–] (a) What is the length of
ε
, the empty string?
(b) H ow many words of length k exist, if the alphabet has r letters?
Definition 9.4 Let a = a
1
a
2
···a
n
and b = b
1
b
2
···b
m
be two words, of lengths n and m,
respectively. The catenation (also called concatenation) of the two strings a and b is a
string
a.b = a
1
a
2
···a
n
b
1
b
2
···b
m
,
of length n + m. The dot symbol denotes the catenation operator; it is usually omitted. We
denote the catenation of two strings a and b by a simple juxtaposition, ab. A word is not
changed when catenated with
ε
: w
ε
=
ε
w = w. The catenation of two
ε
’s introduces nothing
new, naturally. ⊳
Note that the definition does not require the words to be over the same alphabet. However, in
most applications this does not matter.
Definition 9.5 Let Σ be an alphabet and n be a positive integer. We define the powers of Σ
(which are Cartesian products of Σ with itself) recursively as follows:
1. Σ
0
= {
ε
},
2. Σ
n+1
= Σ ×Σ
n
= {a.y |a ∈ Σ, y ∈ Σ
n
},
where we chose to use the catenation operator (dot here) explicitly in a.y. ⊳
Get Methods in Algorithmic Analysis now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.