1 Guiding Ideas

This book concerns the mathematical foundations of statistical language modeling, i.e., the question of what kind of probability distribution should be assigned to particular utterances of human languages. In this chapter, we describe the core ideas of the book in a way that is less formalized mathematically but more motivated linguistically. Based on the intuitions sketched here, the following chapters build rigorous mathematical constructions. The general goal is to develop a theory of discrete stochastic processes that can account for certain statistical phenomena exhibited by human texts. These phenomena take the form of several power laws. We hope that if we succeed in modeling these power laws better, then in the long run we may also obtain probabilistic models of language that perform better on the measures used by engineers in computational linguistics. In other words, we hope that our quest for stochastic processes may prove fruitful not only as a matter of purely theoretical interest but also for practical applications in engineering. We also hope that the problems considered are interesting enough on the theoretical side to attract the attention of professional mathematicians.

1.1 The Motivating Question

The fundamental question that motivates this book is:

What kind of a statistical model may explain generation of texts in natural language, such as books, our ...
