Named entity recognition

Building a web scraper that enriches an input dataset of URLs with the HTML content found behind them is of great business value within a big data ingestion service. An average data scientist should be able to study the returned content using basic clustering and classification techniques, but an expert data scientist will take this enrichment process to the next level by adding further value in post-processing. Commonly, these value-added post-processing steps include disambiguating the external text content, extracting entities (such as people, places, and dates), and converting raw text into its simplest grammatical form. We will explain in this section how to leverage the Spark framework ...
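
To make the entity extraction step more concrete, here is a minimal sketch in Python. It is not the approach the book goes on to describe: it swaps a trained named entity recognition model for two naive regular expressions, purely to show the shape of the enrichment (URL and text in, URL and labeled entities out). The names `extract_entities` and `enrich`, the labels `DATE` and `PROPER_NOUN`, and the ISO date format are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Naive patterns for illustration only; a real pipeline would use a trained
# NER model rather than regular expressions.
# Assumption: dates appear in ISO form, e.g. 1843-10-01.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Two or more consecutive capitalized words, e.g. "Ada Lovelace".
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_entities(text):
    """Return candidate entities found in text, grouped by a coarse label."""
    entities = defaultdict(list)
    for match in DATE_RE.finditer(text):
        entities["DATE"].append(match.group())
    for match in NAME_RE.finditer(text):
        entities["PROPER_NOUN"].append(match.group())
    return dict(entities)


def enrich(records):
    """Lazily map an iterable of (url, text) pairs to (url, entities) pairs."""
    for url, text in records:
        yield url, extract_entities(text)
```

Because `enrich` consumes and produces an iterator, a function of this shape could in principle be passed to PySpark's `RDD.mapPartitions` over an RDD of `(url, text)` pairs, which would also allow a heavier model to be loaded once per partition instead of once per record.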
