Description:

Are you polluting your data lake?

Modern data infrastructures are fed by vast volumes of data, streamed from an ever-changing variety of sources. Standard practice has been to store the data as ingested and push data cleaning onto each consuming application. This approach saddles data scientists and analysts with substantial work, delays insights, and makes real-time or near-real-time analysis practically impossible.

In this webcast you will discover:

Recipes for building automated ingest pipelines that implement continual in-stream sanitization so that data lands in stores ready to consume, regardless of the complexity of collecting it.
Methods for making your pipelines resistant to data drift - the inevitable changes in schema, semantics and infrastructure that break pipelines.
Open source tools that allow you to create and maintain these pipelines with little to no hand coding.

About Arvind Prabhakar, CTO & Co-founder — StreamSets

Arvind Prabhakar is CTO and Co-Founder of StreamSets, a Big Data startup headquartered in San Francisco. He is an Apache Software Foundation member, a former PMC Chair for the Flume and Sqoop projects, and a PMC member on the Storm and MetaModel projects. Prior to StreamSets, Arvind was director of engineering at Cloudera and a software architect on the core platform engineering team at Informatica.
