Live Online Training

Apache Hadoop, Spark, and Kafka Foundations: Effective Data Pipelines

Learn How the Major Tools in the Scalable Analytics Ecosystems Interoperate

Topic: Data
Douglas Eadline

This course covers the essential introductory aspects of Hadoop, Spark, Kafka, and Big Data, with a concise overview of the ecosystem around them. After completing the workshop, attendees will have a working understanding of the Hadoop/Spark/Kafka value proposition for their organization and a clear background in scalable Big Data technologies and effective data pipelines.

What you'll learn, and how you can apply it

  • Understand Hadoop as a data platform
  • Learn how the "Data Lake" and Big Data are changing data analytics
  • Understand the basic differences and similarities between Hadoop, Spark, and Kafka
  • Navigate market congestion and understand how these technologies can work for your organization
  • If you are a developer, build a solid foundation for learning the tools mentioned in the presentation in follow-up courses

This training course is for you because...

  • You are a CIO or other manager who needs to "get up to speed" quickly on scalable big data technologies
  • You are a developer or administrator (DevOps) who wants to learn how all the key pieces of the Hadoop, Spark, and Kafka ecosystem fit together
  • You are a data scientist who does not have experience with scalable tools like Hadoop, Spark, or Kafka

Prerequisites

  • Basic understanding of data center operations (servers, storage, networks, databases).

Recommended Preparation:

About your instructor

  • Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written instructional documents covering many aspects of Linux HPC (High Performance Computing) and Hadoop computing. Currently, Doug serves as editor of the ClusterMonkey.net website; he was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also an active writer and consultant to the HPC/Analytics industry. His recent video tutorials and books include the Hadoop and Spark Fundamentals LiveLessons video (Addison-Wesley), Hadoop 2 Quick Start Guide (Addison-Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison-Wesley).

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Why is Hadoop Such a Big Deal? (45 mins)

  • A Brief History of Apache Hadoop
  • What is Big Data?
  • Hadoop as a Data Lake
  • Apache Hadoop V2 is a Platform
  • The Apache Hadoop Project Ecosystem

Segment 2: Hadoop Distributed File System (HDFS) Basics (25 mins)

  • How HDFS works

Segment 3: Hadoop MapReduce Framework (25 mins)

  • The MapReduce Model
  • MapReduce Data Flow
  • Break (10 mins)

Segment 4: Making Life Easier: Spark (30 mins)

  • Apache Spark Basics and Components
  • Spark RDDs and DataFrames
  • Spark vs MapReduce

Segment 5: The Kafka Data Sponge (30 mins)

  • Why Do We Need Apache Kafka?
  • How Kafka Operates (Producers/Consumers)
  • Integration with Hadoop and Spark Pipelines

Segment 6: Real World Applications/Wrap-up (15 mins)

  • Successful Use Cases
  • Course Takeaways