Chapter 3

Logistic Regression and Text Classification

3.1. Introduction

Machine learning has been the focus of many studies in recent years. Given an unknown source that generates data, of which only one sample is available, learning is the induction process that aims to model the source from the available sample. The model can then be used to generate new data and to reason about it. This setting occurs in many situations. For example, information about a client (the source) is often only partially available through a questionnaire, and the behavior of a system is observed through a set of physical measurements captured by sensors. The generating source is sometimes modeled by an expert, but the more complex the source, the more difficult and error-prone the modeling task. In this case, machine learning is an elegant solution that automates the work of the expert and can process large volumes of data.

Supervised learning is a branch of machine learning in which the data generated by the source consist of so-called independent data x and dependent data y. The dependent data are correlated with the independent data and constitute the “annotation”, hence the name “supervised”. Both independent and dependent data are observed in the available sample. In many cases, the independent data x take on real values and can be represented as a vector. This particular setting is often referred to as statistical learning and is one of the most active branches of machine ...
