Chapter 4. Data (Science) Pipelines

This chapter covers the nitty-gritty of data work, not from the data side but from the data scientist’s side. Ben Lorica examines the combinations of tools available off the shelf (and the platforms that enable combining them), as well as the process of feature discovery and selection.

Verticalized Big Data Solutions

General-purpose platforms can come across as hammers in search of nails

by Ben Lorica

As much as I love talking about general-purpose big data platforms and data science frameworks, I’m the first to admit that many of the interesting startups I talk to are focused on specific verticals. At their core, big data applications merge large amounts of real-time and static data to improve decision-making:

[Figure: data fusion diagram]

This simple idea can be hard to execute in practice (think volume, variety, velocity). Unlocking value from disparate data sources entails some familiarity with domain-specific[1] data sources, requirements, and business problems.

It’s difficult enough to solve a specific problem, let alone a generic one. Consider the case of Guavus, a successful startup that builds big data solutions for the telecom industry (“communication service providers”). Its founder[2] was very familiar with the data sources in telecom and knew the types of applications that would resonate within that industry. Once they solve one set of problems for a telecom company (network ...
