conference

From flat files to deconstructed databases: The evolution and future of the big data ecosystem

by Julien Le Dem

October 2019

Intermediate to advanced

43m

English

O'Reilly Media, Inc.

Closed Captioning available in German, English, Spanish, French, Japanese, Korean, Portuguese (Portugal, Brazil), Chinese (Simplified), Chinese (Traditional)

Overview

Over the past 10 years, big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. With Hadoop, we started from a system that was good at looking for a needle in a haystack using snowplows. We had a lot of horsepower and scalability but lacked the subtlety and efficiency of relational databases. But since Hadoop provided the ultimate flexibility compared to the more constrained and rigid RDBMSs, we didn’t mind and plowed through.

However, machine learning, recommendations, matching, abuse detection, and data-driven products in general require a more flexible infrastructure. Over time, we started applying everything that had been known to the database world for decades to this new environment. We’d been told loudly enough that Hadoop was a huge step backward, and that was true to some degree. The key difference was the flexibility of the Hadoop stack. A relational database has many highly integrated components, and decoupling them took some time.

Today, we see the emergence of key components, such as optimizers, columnar storage, in-memory representation, table abstraction, and batch and streaming execution, as standards that provide the glue between the options available to process, analyze, and learn from our data. We’ve been deconstructing the tightly integrated relational database into flexible reusable open source components. Storage, compute, multitenancy, and batch or streaming execution are all decoupled and can be modified independently to fit every use case.

Julien Le Dem (WeWork) discusses the key open source components of the big data ecosystem—including Apache Calcite, Parquet, Arrow, Avro, and Kafka as well as batch and streaming systems—and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem. (Parquet is the columnar data layout to optimize data at rest for querying. Arrow is the in-memory representation for maximum throughput execution and overhead-free data exchange. Calcite is the optimizer to make the most of our infrastructure capabilities.) Julien also explores the emerging components that are still missing or haven’t become standard yet to fully materialize the transformation to an extremely flexible database that lets you innovate with your data.

This session was recorded at the 2019 O'Reilly Strata Data Conference in San Francisco.



Publisher Resources

ISBN: 0636920339847