Chapter 5. Information Extraction

What’s in a name? A rose by any other name would smell as sweet.

William Shakespeare

We deal with a lot of textual content every day, be it short messages on the phone, daily emails, or longer texts we read for fun, at work, or to catch up on current affairs. Such text documents are a rich source of information. Depending on the context, “information” can mean many things, such as key events, people, or relationships between people, places, or organizations. Information extraction (IE) refers to the NLP task of extracting relevant information from text documents. One real-world example of IE is the short blurb that appears to the right of the results when we search for a popular figure’s name on Google.

Compared to structured information sources like databases or tables, or semi-structured sources such as webpages (which have some markup), text is a form of unstructured data. For example, in a database, we know where to look for something based on its schema. Text documents, by contrast, typically comprise free-flowing text without a set schema, which makes IE a challenging problem. Texts may contain various kinds of information. In most cases, extracting information that follows a fixed pattern (e.g., addresses, phone numbers, or dates) is relatively straightforward using pattern-based extraction techniques like regular expressions, even though the text itself is considered unstructured data. However, ...
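
As a rough illustration of the pattern-based extraction mentioned above, the following minimal sketch pulls phone numbers and dates out of free text with Python's standard `re` module. The sample text, the `extract` helper, and both patterns are assumptions made for this example: they cover only US-style phone numbers and two common date layouts, and real-world text would likely need broader patterns.

```python
import re

# Hypothetical sample text, invented for this illustration.
TEXT = (
    "Call our office at (415) 555-0132 or 415-555-0199 before 03/15/2024. "
    "The follow-up meeting is on 2024-04-02."
)

# US-style phone numbers: optional parentheses around the area code,
# an optional separator after it, then NNN, a separator, and NNNN.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]\d{4}\b")

# Two date layouts only: MM/DD/YYYY (or M/D/YYYY) and YYYY-MM-DD.
# The pattern checks shape, not validity (it would accept 99/99/2024).
DATE_RE = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")


def extract(text):
    """Return every phone number and date found in text, in order of appearance."""
    return {
        "phones": PHONE_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


if __name__ == "__main__":
    for kind, matches in extract(TEXT).items():
        print(kind, matches)
```

Because the patterns describe only the surface shape of each item, they work regardless of where the item appears in the text, which is what makes fixed-pattern information comparatively easy to extract from otherwise unstructured documents.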
