Data Superstream: Data Lakes and Warehouses
Published by O'Reilly Media, Inc.
Storing, processing, and moving data in the cloud efficiently and cost-effectively is a must for working with today’s enormous datasets. These expert-led sessions will help you gain insight into how to increase the scalability, speed, and availability of your data, along with best practices for utilizing your data warehouse, data lake, or data lakehouse.
About the Data Superstream Series: This three-part Superstream series is designed to help your organization maximize the business impact of your data. Each day covers different topics, with unique sessions lasting no more than four hours. And they’re packed with insights from key innovators and the latest tools and technologies to help you stay ahead of it all.
What you’ll learn and how you can apply it
- Get an overview of the latest technologies for storing and managing your data
- Learn cutting-edge strategies for optimizing and deploying your cloud data lake
- Understand how to implement access control to maintain data privacy in your cloud data warehouse
- Find out how to utilize the lakehouse architecture to support ML and AI applications
- Discover the benefits of a data mesh approach for addressing data ownership challenges in your organization
This live event is for you because...
- You're a data or software engineer or solution architect interested in learning about the latest trends in storing, processing, and managing data.
- You want to improve the scalability, speed, and availability of your data.
- You want to better understand the systems that you already use and learn how to take full advantage of their capabilities.
Prerequisites
- Come with your questions
- Have a pen and paper handy to capture notes, insights, and inspiration
Recommended follow-up:
- Read The Enterprise Big Data Lake (book)
- Read Automating the Modern Data Warehouse (report)
- Take Data Mesh in Practice (live online training course with Max Schultze and Arif Wider)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Alistair Croll: Introduction (5 minutes) - 8:00am PT | 11:00am ET | 4:00pm UTC/GMT
- Alistair Croll welcomes you to the Data Superstream.
Lena Hall: Keynote—The Continuously Evolving Data Landscape and What It Means for Your Cloud Architecture (20 minutes) - 8:05am PT | 11:05am ET | 4:05pm UTC/GMT
- It’s a fascinating time to be an architect, data engineer, or cloud practitioner. A decade ago, the universe of architectural patterns and technical solutions was a lot smaller. Many things were harder to implement, though it was easier to evaluate options and keep track of newly emerging topics. Modern practitioners and decision-makers need to understand the current ecosystem of choices deeply enough and broadly enough to make the right decision. In her keynote address, Lena Hall walks you through the evolution of data and cloud architecture and explores its future trajectory. You’ll learn pragmatic and practical lessons that will help you navigate uncertainty when considering the adoption of rising concepts or solutions. It’s hard to operate in a world of hundreds of trade-offs. Join in to learn how to shape your mindset and decision framework to best handle it.
- Lena Hall is head of AWS developer relations in North America, influencing the future of cloud technologies and developer experience for hands-on builders. She drives engineering initiatives that advance and accelerate cloud services. Lena has more than 10 years of experience in solution architecture and software engineering with a focus on distributed cloud programming, real-time system design, highly scalable and performant systems, big data analysis, data science, functional programming, and machine learning. Previously, she was a director of engineering for Azure at Microsoft, where she focused on large-scale distributed systems and modern architectures. She co-organizes the ML4ALL conference and is often an invited member of program committees for conferences like Kafka Summit, Lambda World, and others.
Vini Jaiswal: Building Lakehouse Architecture for Artificial Intelligence (30 minutes) - 8:25am PT | 11:25am ET | 4:25pm UTC/GMT
- Join Vini Jaiswal as she reviews the data lakehouse paradigm and explores the role it can play in training and deploying AI applications. You'll learn how modern applications have evolved to use the data lakehouse, core concepts of designing data pipelines, under-the-hood mechanisms like ACID transactions and multiversion concurrency control (MVCC), and best practices for guaranteeing a pristine data lake to support successful ML applications.
- Vini Jaiswal is a developer advocate at Databricks, where she helps data practitioners build on Apache Spark, Delta Lake, Databricks, MLflow, and other open source technologies. Vini has over nine years of data and cloud experience working with unicorns, digital natives, and Fortune 500 companies. Previously, she was Citi’s VP engineering lead for data science, where she drove engineering efforts and led the deployment of highly scalable data science and ML architecture on the global cloud.
Einat Orr: Rethinking Data Deployment—CI/CD for Data Lakes (30 minutes) - 8:55am PT | 11:55am ET | 4:55pm UTC/GMT
- At first glance, deploying data in a data lake may seem like a one-step process: you simply add the dataset to the production location in the object store. What else is there to do? It turns out that there is more you should do, and blindly writing new data introduces a host of potential problems. For example, how do you know the data you write is accurate and conforms to the expected format and schema? The truth is, once you’ve written data to the production location of your lake, consumers can use it. In a sense, it’s already too late. Einat Orr presents a new strategy for data deployment, one where new data can be added in isolation, then tested and validated, before “going live” in a production table. She'll also demonstrate how data versioning tools like lakeFS and Project Nessie can support this deployment method seamlessly, with zero-copy operations.
- Einat Orr is the cofounder and CEO of Treeverse, the company behind lakeFS, an open source platform that delivers a Git-like experience to object-storage-based data lakes. Einat previously led several engineering organizations, most recently as the CTO of SimilarWeb. She holds a PhD in mathematics in the field of optimization in graph theory from Tel Aviv University.
- Break (10 minutes)
Wannes Rosiers: Implementing a Data Mesh (30 minutes) - 9:35am PT | 12:35pm ET | 5:35pm UTC/GMT
- In the landscape of data architectures and models, the data mesh is the new kid on the block. But what exactly is it? Wannes Rosiers offers an overview of this new organizational approach to addressing data ownership issues, then discusses the problems that the data mesh is trying to solve, its real-world impact, and how to implement it for your organization.
- Wannes Rosiers is the CTO of Golazo. An experienced IT and data manager, he was previously the head of data engineering and news personalization at DPG Media, where he was a strong advocate for the organization's implementation of the data mesh paradigm.
Jessica Larson: Enabling Data Privacy and Access Control in Your Cloud Data Warehouse (30 minutes) - 10:05am PT | 1:05pm ET | 6:05pm UTC/GMT
- Consumers are becoming increasingly concerned about data privacy. As a result, we’re seeing new regulations crop up governing how sensitive personal data is used, stored, and accessed. To comply with these regulations, organizations must secure data from its point of origin through all of the intermediary systems and services to the final presentation layer. This creates tension between securing data and maintaining developer velocity, and between providing granular access controls and the administrative work they involve. Jessica Larson reviews best practices for implementing access control to maintain data privacy in your cloud data warehouse, as well as strategies for using your data warehouse as the source of truth for data access privileges in downstream systems.
- Jessica Larson is a data engineer at Pinterest, where she served as the first engineer on the enterprise data warehouse team. She’s also writing a book on Snowflake access control, to be published in the spring of 2022. Previously, she was a data engineer at Eaze and Flexport. Jessica thrives on mentorship, solving data puzzles, and equipping colleagues with new technical skills. She’s also passionate about helping women and nonbinary people find their place in the technology industry.
- Break (5 minutes)
Ryan Blue: Solving Data Lake Challenges with Apache Iceberg (30 minutes) - 10:40am PT | 1:40pm ET | 6:40pm UTC/GMT
- Apache Iceberg—an open source table format for huge analytical datasets—has become an industry standard for storing data in object stores and distributed file systems. In addition to ensuring the correctness of your data, Iceberg allows you to substantially simplify existing architectures as well as unlock fundamentally new use cases on top of data lakes. Join Ryan Blue to explore how Iceberg uniquely solves challenges that data practitioners encounter in their daily work and learn why data engineers from companies like Apple and Stripe are using this tool.
- Ryan Blue is the cocreator of Apache Iceberg and the CEO of Tabular. Ryan’s spent the last decade working on big data formats and infrastructure at Netflix and Cloudera. He’s an ASF member and committer on Apache Parquet, Avro, and Spark.
Tejas Chopra: Architecting for Cloud Data Lakes—Compression, Deduplication, and Encryption (30 minutes) - 11:10am PT | 2:10pm ET | 7:10pm UTC/GMT
- Cloud data lake footprints are now measured in exabytes and are growing exponentially, with companies paying billions of dollars to store and retrieve data. While compression and deduplication can reduce the amount of storage used by applications, and encryption can protect that data, employing these techniques for objects stored in the cloud can be challenging due to the nature of overwrites and versioning. Tejas Chopra demonstrates how space and time optimizations (those that have historically been applied to on-premises file storage) can be applied to objects stored in cloud data lakes. You’ll learn strategies for employing compression, deduplication, and encryption techniques, then see how companies like Netflix successfully employ a subset of these techniques to reduce their cloud footprint and provide agility in their cloud operations.
- Tejas Chopra is a senior software engineer on the data storage platform team at Netflix, where he’s responsible for architecting storage solutions to support Netflix Studios and the Netflix streaming platform. Previously, he helped design and implement the storage infrastructure at Box. Tejas is an international keynote speaker and periodically conducts seminars on microservices, NFTs, software development, and cloud computing.
Alistair Croll: Closing Remarks (5 minutes) - 11:40am PT | 2:40pm ET | 7:40pm UTC/GMT
- Alistair Croll closes out today’s event.
Upcoming Data Superstream events:
- Analytics Engineering - May 25, 2022
- Building Data Pipelines and Connectivity - August 10, 2022
Your Host
Alistair Croll
Alistair Croll is an entrepreneur, author, and conference organizer. He's written four books on technology and society, including the best-selling Lean Analytics, which has been translated into eight languages. He's the cofounder of web performance startup Coradiant (acquired by BMC), the Year One Labs startup accelerator, and a number of other early-stage companies. A prolific speaker, Alistair was a visiting executive at Harvard Business School, where he helped create a course on data science and critical thinking. He's founded and chaired a number of the world's leading technology events, including Cloud Connect, Strata, Startupfest, Scaletech, and the FWD50 Digital Government conference. He's currently working on Just Evil Enough, the subversive marketing playbook. Alistair lives in Montreal, Canada, and writes at acroll.substack.com.