Chapter 6

Conditional Random Fields for Information Extraction

6.1. Introduction

In Natural Language Processing, the original ideal goal of enabling computers to understand all texts has gradually given way to more modest and pragmatic goals, expressed as specific tasks. Information extraction is a typical example. It aims to identify factual information elements within a document that can fill the fields of a predefined form. In this way, it bridges the gap between the way humans apprehend information, where the understanding of natural language plays a large part, and the way computers do, in the form of typed data organized in structured files or databases. In a review article on the subject, McCallum describes this as an information distillation process [MCC 05].
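To make this gap concrete, here is a minimal sketch in Python of what the input and output of such a task might look like; the seminar announcement, its field names and values are illustrative assumptions, not an example taken from the chapter.

# Hypothetical input: the unstructured text a human reads.
text = ("The seminar on statistical models will take place "
        "on March 12 at 2 pm in room B-204.")

# Hypothetical output: the typed, database-ready record obtained by
# filling the fields of a predefined form.
seminar_form = {
    "topic":    "statistical models",
    "date":     "March 12",
    "time":     "2 pm",
    "location": "room B-204",
}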

Several methods have been used to achieve this task. As is increasingly the case for most other natural language engineering tasks, approaches based on statistical models are currently the most effective. However, this holds only when the task is properly reformulated as an annotation, or labeling, problem. The best statistical models capable of learning such data annotation are conditional random fields (CRFs).
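As a rough illustration of this reformulation, the sketch below recasts the seminar example above as a token-labeling problem using the common BIO (Begin/Inside/Outside) convention; the tag set and the helper function are illustrative assumptions, not the chapter's own notation. A CRF would then be trained to predict one label per token.

# Minimal sketch: information extraction recast as sequence labeling
# with an illustrative BIO tag set (DATE, LOC).
tokens = ["The", "seminar", "will", "take", "place", "on",
          "March", "12", "in", "room", "B-204", "."]
labels = ["O", "O", "O", "O", "O", "O",
          "B-DATE", "I-DATE", "O", "B-LOC", "I-LOC", "O"]

def spans(tokens, labels):
    """Read the filled form fields back off a BIO-labeled token sequence."""
    out, field, buf = [], None, []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if field:
                out.append((field, " ".join(buf)))
            field, buf = lab[2:], [tok]
        elif lab.startswith("I-") and field == lab[2:]:
            buf.append(tok)
        else:
            if field:
                out.append((field, " ".join(buf)))
            field, buf = None, []
    if field:
        out.append((field, " ".join(buf)))
    return out

print(spans(tokens, labels))  # [('DATE', 'March 12'), ('LOC', 'room B-204')]

Once the labels are predicted, filling the form amounts to reading off the labeled spans, as the helper above shows.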

This chapter is thus an opportunity to present the task of information extraction and the statistical labeling models able to handle it. The first two sections concentrate on the task itself, discussing its issues and the specific problems it poses. The following four sections focus on statistical ...
