Managing the Data Lake

Book description

Organizations across many industries have recently created fast-growing repositories to deal with an influx of new data from many sources and often in multiple formats. To manage these data lakes, companies have begun to leave the familiar confines of relational databases and data warehouses for Hadoop and various big data solutions. But adopting new technology alone won’t solve the problem.

Based on interviews with several experts in data management, author Andy Oram provides an in-depth look at common issues you’re likely to encounter as you consider how to manage business data. You’ll explore five key topic areas:

  • Acquisition and ingestion: how to bring in new data with a degree of automation.
  • Metadata: how to keep track of when data came in and how it was formatted, and how to make it available at later stages of processing.
  • Data preparation and cleaning: what you need to know before you prepare and clean your data, and what needs to be cleaned up and how.
  • Organizing workflows: what you should do to combine your tasks—ingestion, cataloging, and data preparation—into an end-to-end workflow.
  • Access control: how to address security and access controls at all stages of data handling.

Andy Oram, an editor at O’Reilly Media since 1992, currently specializes in programming. His work for O’Reilly includes the first books on Linux ever published commercially in the United States.

Product information

  • Title: Managing the Data Lake
  • Author(s): Andy Oram
  • Release date: September 2015
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491941676