
Big Data on Kubernetes

Name: Big Data on Kubernetes
Author: Neylson Crepalde
ISBN: 9781835462140

by Neylson Crepalde

July 2024

Intermediate to advanced

296 pages

7h 4m

English

Packt Publishing


Big Data on Kubernetes
Contributors
  About the author
  About the reviewer
Preface
  Who this book is for
  What this book covers
  To get the most out of this book
  Download the example code files
  Conventions used
  Get in touch
  Share Your Thoughts
  Download a free PDF copy of this book
Part 1: Docker and Kubernetes
Chapter 1: Getting Started with Containers
  Technical requirements
  Container architecture
  Installing Docker
  Windows
  macOS
  Linux
  Getting started with Docker images
  hello-world
  NGINX
  Julia
  Building your own image
  Batch processing job
  API service
  Summary
Chapter 2: Kubernetes Architecture
  Technical requirements
  Kubernetes architecture
  Control plane
  Node components
  Pods
  Deployments
  StatefulSets
  Jobs
  Services
  ClusterIP Service
  NodePort Service
  LoadBalancer Service
  Ingress and Ingress Controller
  Gateway
  Persistent Volumes
  StorageClasses
  ConfigMaps and Secrets
  ConfigMaps
  Secrets
  Summary
Chapter 3: Getting Hands-On with Kubernetes
  Technical requirements
  Installing kubectl
  Deploying a local cluster using Kind
  Installing kind
  Deploying the cluster
  Deploying an AWS EKS cluster
  Deploying a Google Cloud GKE cluster
  Deploying an Azure AKS cluster
  Running your API on Kubernetes
  Creating the deployment
  Creating a service
  Using an ingress to access the API
  Running a data processing job in Kubernetes
  Summary
Part 2: Big Data Stack
Chapter 4: The Modern Data Stack
  Data architectures
  The Lambda architecture
  The Kappa architecture
  Comparing Lambda and Kappa
  Data lake design for big data
  Data warehouses
  The rise of big data and data lakes
  The rise of the data lakehouse
  Implementing the lakehouse architecture
  Batch ingestion
  Storage
  Batch processing
  Orchestration
  Batch serving
  Data visualization
  Real-time ingestion
  Real-time processing
  Real-time serving
  Real-time data visualization
  Summary
Chapter 5: Big Data Processing with Apache Spark
  Technical requirements
  Getting started with Spark
  Installing Spark locally
  Spark architecture
  Spark executors
  Components of execution
  Starting a Spark program
  The DataFrame API and the Spark SQL API
  Transformations
  Actions
  Lazy evaluation
  Data partitioning
  Narrow versus wide transformations
  Analyzing the titanic dataset
  Working with real data
  How Spark performs joins
  Joining IMDb tables
  Summary
Chapter 6: Building Pipelines with Apache Airflow
  Technical requirements
  Getting started with Airflow
  Installing Airflow with Astro
  Airflow architecture
  Airflow’s distributed architecture
  Building a data pipeline
  Airflow integration with other tools
  Summary
Chapter 7: Apache Kafka for Real-Time Events and Data Ingestion
  Technical requirements
  Getting started with Kafka
  Exploring the Kafka architecture
  The PubSub design
  How Kafka delivers exactly-once semantics
  First producer and consumer
  Streaming from a database with Kafka Connect
  Real-time data processing with Kafka and Spark
  Summary
Part 3: Connecting It All Together
Chapter 8: Deploying the Big Data Stack on Kubernetes
  Technical requirements
  Deploying Spark on Kubernetes
  Deploying Airflow on Kubernetes
  Deploying Kafka on Kubernetes
  Summary
Chapter 9: Data Consumption Layer
  Technical requirements
  Getting started with SQL query engines
  The limitations of traditional data warehouses
  The rise of SQL query engines
  The architecture of SQL query engines
  Deploying Trino in Kubernetes
  Connecting DBeaver with Trino
  Deploying Elasticsearch in Kubernetes
  How Elasticsearch stores, indexes and manages data
  Elasticsearch deployment
  Summary
Chapter 10: Building a Big Data Pipeline on Kubernetes
  Technical requirements
  Checking the deployed tools
  Building a batch pipeline
  Building the Airflow DAG
  Creating SparkApplication jobs
  Creating a Glue crawler
  Building a real-time pipeline
  Deploying Kafka Connect and Elasticsearch
  Real-time processing with Spark
  Deploying the Elasticsearch sink connector
  Summary
Chapter 11: Generative AI on Kubernetes
  Technical requirements
  What generative AI is and what it is not
  The power of large neural networks
  Challenges and limitations
  Using Amazon Bedrock to work with foundational models
  Building a generative AI application on Kubernetes
  Deploying the Streamlit app
  Building RAG with Knowledge Bases for Amazon Bedrock
  Adjusting the code for RAG retrieval
  Building action models with agents
  Creating a DynamoDB table
  Configuring the agent
  Deploying the application on Kubernetes
  Summary
Chapter 12: Where to Go from Here
  Important topics for big data in Kubernetes
  Kubernetes monitoring and application monitoring
  Building a service mesh
  Security considerations
  Automated scalability
  GitOps and CI/CD for Kubernetes
  Kubernetes cost control
  What about team skills?
  Key skills for monitoring
  Building a service mesh
  Security considerations
  Automated scalability
  Skills for GitOps and CI/CD
  Cost control skills
  Summary
Index
Why subscribe?
Other Books You May Enjoy
  Packt is searching for authors like you
  Share Your Thoughts
  Download a free PDF copy of this book

Content preview from Big Data on Kubernetes

8 Deploying the Big Data Stack on Kubernetes

In this chapter, we will cover the deployment of key big data technologies – Spark, Airflow, and Kafka – on Kubernetes. As container orchestration and management have become critical for running data workloads efficiently, Kubernetes has emerged as the de facto standard. By the end of this chapter, you will be able to successfully deploy and manage big data stacks on Kubernetes for building robust data pipelines and applications.

We will start by deploying Apache Spark on Kubernetes using the Spark operator. You will learn how to configure and monitor Spark jobs running as Spark applications on your Kubernetes cluster. Being able to run Spark workloads on Kubernetes brings important benefits such ...



