Methods in Algorithmic Analysis

Chapter 9

Combinatorics of Strings

Strings are sequences of symbols such as letters or integers. This is a convention. The

most famous string—at least it should be the most famous—is the DNA molecule. To be

technical, Adenine, Guanine, Cytosine, and Thymine are nitrogenous bases that together

with a phosphate group each comprise nucleotide. It is a string of nucleotides. Still, nothing

stops us from referring to these four components as four letters combination of which creates

DNA. True, DNA has an additional structure, the double helix, which could bring us quite far

from our topic. In practice, the strand approach is the one taken in much of genomics, which

is all about algorithms dealing with the giant molecule. This is the abstraction we employ

from now on. The reason for our treatment is that a rich variety of algorithms can be seen as

performing string processing, such as program compilation, book typesetting, and genomics.

A set of letters, symbols, or characters, is called an alphabet. We only consider ﬁnite

alphabets—and usually of only two symbols. Finite sequences of letters drawn from a ﬁxed

alphabet are called strings or words. A set of strings or words is called a language—a con-

cept of great importance in computer science. The common notation of a string in terms of

its letters is simply to juxtapose them, as we saw in §7.6.4. For example, let a and b be the

letters of our alphabet, then abaab is a word.

In this chapter, we formalize these concepts, and show how strings can be manipulated and

counted. In particular, word equations are used to count elements from regular languages.

Also we present a powerful concept—the ﬁnite state machine as an aid to approaching prob-

abilistic and combinatorial problems. In particular, we pay a special attention to waiting time

problems. The last section gives an introduction to one of the most powerful techniques in

applied probability—Markov chains.

489

490

Chapter 9. Combinatorics of Strings

9.1 Operations on Languages

The signiﬁcance of strings in the analysis of algorithms is due to the fact that a surprising

fraction of our most important algorithms may be viewed as m erely string processing. Indeed,

computer programs are ordered collections of ﬁnite sequences of letters. Hence compilation

is a form of string processing. T he next couple of pages are devoted to a different kind of

a sequence, a sequence of deﬁnitions, which are expected to present nothing new, but will

establish our terminology and notation.

Deﬁnition 9.1 A nonempty ﬁnite set of symbols is called an alphabet. The elements of an

alphabet are called letters.

In some contexts you will ﬁnd people use the term vocabulary instead of alphabet, with

identical meaning. Because the elements of vocabulary are called words, which is our usual

term for strings of letters, we do not use this terminology here.

Deﬁnition 9.2 A word over the alphabet Σ is a ﬁnite sequence of letters, which are elements

of Σ. The number of letters in a word w is called its length and is denoted by |w|. The empty

string or empty word, denoted by

, is a special string: it contains no letters. ⊳

Note that the empty string,

, is different from ∅, the empty set. Here is a nonempty set: {

};

it has one element, the empty word. An explicit notation for the empty set is {}.

Our standard notation for a word of length n is w = a

···a

, with n letters a

, a

, .. . , a

Exercise 9.3 [1–] (a) What is the length of

, the empty string?

(b) H ow many words of length k exist, if the alphabet has r letters?

Deﬁnition 9.4 Let a = a

···a

and b = b

···b

be two words, of lengths n and m,

respectively. The catenation (also called concatenation) of the two strings a and b is a

string

a.b = a

···a

···b

of length n + m. The dot symbol denotes the catenation operator; it is usually omitted. We

denote the catenation of two strings a and b by a simple juxtaposition, ab. A word is not

changed when catenated with

: w

w = w. The catenation of two

’s introduces nothing

new, naturally. ⊳

Note that the deﬁnition does not require the words to be over the same alphabet. However, in

most applications this does not matter.

Deﬁnition 9.5 Let Σ be an alphabet and n be a positive integer. We deﬁne the powers of Σ

(which are Cartesian products of Σ with itself) recursively as follows:

1. Σ

= {

2. Σ

n+1

= Σ ×Σ

= {a.y |a ∈ Σ, y ∈ Σ

where we chose to use the catenation operator (dot here) explicitly in a.y. ⊳

Get Methods in Algorithmic Analysis now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Methods in Algorithmic Analysis by Vladimir A. Dobrushkin

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly