
Data Engineering with Python

Name: Data Engineering with Python
Author: Paul Crickard
ISBN: 9781839214189


October 2020

Beginner to intermediate

356 pages

6h 50m

English

Packt Publishing

Data Engineering with Python
Why subscribe?; Contributors; About the author; About the reviewers; Packt is searching for authors like you
Preface
Who this book is for; What this book covers; To get the most out of this book; Download the example code files; Download the color images; Conventions used; Get in touch; Reviews
Section 1: Building Data Pipelines – Extract Transform, and Load
Chapter 1: What is Data Engineering?
What data engineers do; Required skills and knowledge to be a data engineer; Data engineering versus data science; Data engineering tools; Programming languages; Databases; Data processing engines; Data pipelines; Summary
Chapter 2: Building Our Data Engineering Infrastructure
Installing and configuring Apache NiFi; A quick tour of NiFi; PostgreSQL driver; Installing and configuring Apache Airflow; Installing and configuring Elasticsearch; Installing and configuring Kibana; Installing and configuring PostgreSQL; Installing pgAdmin 4; A tour of pgAdmin 4; Summary
Chapter 3: Reading and Writing Files
Writing and reading files in Python; Writing and reading CSVs; Reading and writing CSVs using pandas DataFrames; Writing JSON with Python; Building data pipelines in Apache Airflow; Handling files using NiFi processors; Working with CSV in NiFi; Working with JSON in NiFi; Summary
Chapter 4: Working with Databases
Inserting and extracting relational data in Python; Inserting data into PostgreSQL; Inserting and extracting NoSQL database data in Python; Installing Elasticsearch; Inserting data into Elasticsearch; Building data pipelines in Apache Airflow; Setting up the Airflow boilerplate; Running the DAG; Handling databases with NiFi processors; Extracting data from PostgreSQL; Running the data pipeline; Summary
Chapter 5: Cleaning, Transforming, and Enriching Data
Performing exploratory data analysis in Python; Downloading the data; Basic data exploration; Handling common data issues using pandas; Drop rows and columns; Creating and modifying columns; Enriching data; Cleaning data using Airflow; Summary
Chapter 6: Building a 311 Data Pipeline
Building the data pipeline; Mapping a data type; Triggering a pipeline; Querying SeeClickFix; Transforming the data for Elasticsearch; Getting every page; Backfilling data; Building a Kibana dashboard; Creating visualizations; Creating a dashboard; Summary
Section 2: Deploying Data Pipelines in Production
Chapter 7: Features of a Production Pipeline
Staging and validating data; Staging data; Validating data with Great Expectations; Building idempotent data pipelines; Building atomic data pipelines; Summary
Chapter 8: Version Control with the NiFi Registry
Installing and configuring the NiFi Registry; Installing the NiFi Registry; Configuring the NiFi Registry; Using the Registry in NiFi; Adding the Registry to NiFi; Versioning your data pipelines; Using git-persistence with the NiFi Registry; Summary
Chapter 9: Monitoring Data Pipelines
Monitoring NiFi using the GUI; Monitoring NiFi with the status bar; Monitoring NiFi with processors; Using Python with the NiFi REST API; Summary
Chapter 10: Deploying Data Pipelines
Finalizing your data pipelines for production; Backpressure; Improving processor groups; Using the NiFi variable registry; Deploying your data pipelines; Using the simplest strategy; Using the middle strategy; Using multiple registries; Summary
Chapter 11: Building a Production Data Pipeline
Creating a test and production environment; Creating the databases; Populating a data lake; Building a production data pipeline; Reading the data lake; Scanning the data lake; Inserting the data into staging; Querying the staging database; Validating the staging data; Insert Warehouse; Deploying a data pipeline in production; Summary
Section 3: Beyond Batch – Building Real-Time Data Pipelines
Chapter 12: Building a Kafka Cluster
Creating ZooKeeper and Kafka clusters; Downloading Kafka and setting up the environment; Configuring ZooKeeper and Kafka; Starting the ZooKeeper and Kafka clusters; Testing the Kafka cluster; Testing the cluster with messages; Summary
Chapter 13: Streaming Data with Apache Kafka
Understanding logs; Understanding how Kafka uses logs; Topics; Kafka producers and consumers; Building data pipelines with Kafka and NiFi; The Kafka producer; The Kafka consumer; Differentiating stream processing from batch processing; Producing and consuming with Python; Writing a Kafka producer in Python; Writing a Kafka consumer in Python; Summary
Chapter 14: Data Processing with Apache Spark
Installing and running Spark; Installing and configuring PySpark; Processing data with PySpark; Spark for data engineering; Summary
Chapter 15: Real-Time Edge Data with MiNiFi, Kafka, and Spark
Setting up MiNiFi; Building a MiNiFi task in NiFi; Summary
Appendix
Building a NiFi cluster; The basics of NiFi clustering; Building a NiFi cluster; Building a distributed data pipeline; Managing the distributed data pipeline; Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think

Overview

Discover the inner workings of data pipelines with 'Data Engineering with Python', a practical guide to mastering the art of data engineering. Through hands-on examples, you'll explore the process of designing data models, implementing data pipelines, and automating data flows, all within the context of Python.

What this book will help me do

Understand the fundamentals of designing data architectures and capturing data requirements.
Extract, clean, and transform data from various sources, refining it for precise applications.
Implement end-to-end data pipelines, including staging, validation, and production deployment.
Leverage Python to connect with databases, perform data manipulations, and build analytics workflows.
Monitor and log data pipelines to ensure smooth, real-time operations and high quality.

Author(s)

Paul Crickard is a seasoned expert in data engineering and analytics, bringing years of practical experience to this technical guide. His unique ability to make complex technical concepts accessible makes this book invaluable for learners and professionals alike. A lifelong technologist, Paul focuses on actionable skills and building confidence to work with data pipelines and models.

Who is it for?

This book is ideal for aspiring data engineers, data analysts aiming to elevate their technical skill sets, and IT professionals transitioning into data-driven roles. Whether you're just stepping into the field or looking to enhance your Python-based data capabilities, this book is tailored to provide a solid grounding and practical expertise. Beginners in data engineering will find it accessible and easy to get started with, while those refreshing their knowledge will benefit from its focused projects.



