Data Engineering with Python

Name: Data Engineering with Python
Author: Paul Crickard
ISBN: 9781839214189

October 2020

Beginner to intermediate

356 pages

6h 50m

English

Packt Publishing

Data Engineering with Python
Why subscribe?; Contributors; About the author; About the reviewers; Packt is searching for authors like you
Preface
Who this book is for; What this book covers; To get the most out of this book; Download the example code files; Download the color images; Conventions used; Get in touch; Reviews
Section 1: Building Data Pipelines – Extract Transform, and Load
Chapter 1: What is Data Engineering?
What data engineers do; Required skills and knowledge to be a data engineer; Data engineering versus data science; Data engineering tools; Programming languages; Databases; Data processing engines; Data pipelines; Summary
Chapter 2: Building Our Data Engineering Infrastructure
Installing and configuring Apache NiFi; A quick tour of NiFi; PostgreSQL driver; Installing and configuring Apache Airflow; Installing and configuring Elasticsearch; Installing and configuring Kibana; Installing and configuring PostgreSQL; Installing pgAdmin 4; A tour of pgAdmin 4; Summary
Chapter 3: Reading and Writing Files
Writing and reading files in Python; Writing and reading CSVs; Reading and writing CSVs using pandas DataFrames; Writing JSON with Python; Building data pipelines in Apache Airflow; Handling files using NiFi processors; Working with CSV in NiFi; Working with JSON in NiFi; Summary
Chapter 4: Working with Databases
Inserting and extracting relational data in Python; Inserting data into PostgreSQL; Inserting and extracting NoSQL database data in Python; Installing Elasticsearch; Inserting data into Elasticsearch; Building data pipelines in Apache Airflow; Setting up the Airflow boilerplate; Running the DAG; Handling databases with NiFi processors; Extracting data from PostgreSQL; Running the data pipeline; Summary
Chapter 5: Cleaning, Transforming, and Enriching Data
Performing exploratory data analysis in Python; Downloading the data; Basic data exploration; Handling common data issues using pandas; Drop rows and columns; Creating and modifying columns; Enriching data; Cleaning data using Airflow; Summary
Chapter 6: Building a 311 Data Pipeline
Building the data pipeline; Mapping a data type; Triggering a pipeline; Querying SeeClickFix; Transforming the data for Elasticsearch; Getting every page; Backfilling data; Building a Kibana dashboard; Creating visualizations; Creating a dashboard; Summary
Section 2: Deploying Data Pipelines in Production
Chapter 7: Features of a Production Pipeline
Staging and validating data; Staging data; Validating data with Great Expectations; Building idempotent data pipelines; Building atomic data pipelines; Summary
Chapter 8: Version Control with the NiFi Registry
Installing and configuring the NiFi Registry; Installing the NiFi Registry; Configuring the NiFi Registry; Using the Registry in NiFi; Adding the Registry to NiFi; Versioning your data pipelines; Using git-persistence with the NiFi Registry; Summary
Chapter 9: Monitoring Data Pipelines
Monitoring NiFi using the GUI; Monitoring NiFi with the status bar; Monitoring NiFi with processors; Using Python with the NiFi REST API; Summary
Chapter 10: Deploying Data Pipelines
Finalizing your data pipelines for production; Backpressure; Improving processor groups; Using the NiFi variable registry; Deploying your data pipelines; Using the simplest strategy; Using the middle strategy; Using multiple registries; Summary
Chapter 11: Building a Production Data Pipeline
Creating a test and production environment; Creating the databases; Populating a data lake; Building a production data pipeline; Reading the data lake; Scanning the data lake; Inserting the data into staging; Querying the staging database; Validating the staging data; Insert Warehouse; Deploying a data pipeline in production; Summary
Section 3: Beyond Batch – Building Real-Time Data Pipelines
Chapter 12: Building a Kafka Cluster
Creating ZooKeeper and Kafka clusters; Downloading Kafka and setting up the environment; Configuring ZooKeeper and Kafka; Starting the ZooKeeper and Kafka clusters; Testing the Kafka cluster; Testing the cluster with messages; Summary
Chapter 13: Streaming Data with Apache Kafka
Understanding logs; Understanding how Kafka uses logs; Topics; Kafka producers and consumers; Building data pipelines with Kafka and NiFi; The Kafka producer; The Kafka consumer; Differentiating stream processing from batch processing; Producing and consuming with Python; Writing a Kafka producer in Python; Writing a Kafka consumer in Python; Summary
Chapter 14: Data Processing with Apache Spark
Installing and running Spark; Installing and configuring PySpark; Processing data with PySpark; Spark for data engineering; Summary
Chapter 15: Real-Time Edge Data with MiNiFi, Kafka, and Spark
Setting up MiNiFi; Building a MiNiFi task in NiFi; Summary
Appendix
Building a NiFi cluster; The basics of NiFi clustering; Building a NiFi cluster; Building a distributed data pipeline; Managing the distributed data pipeline; Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think

Content preview from Data Engineering with Python

Chapter 7: Features of a Production Pipeline

In this chapter, you will learn several features that make a data pipeline ready for production. You will learn how to build data pipelines that can be run multiple times without changing the results (idempotence), what to do when a transaction fails (atomicity), and how to validate data in a staging environment. This chapter uses a sample data pipeline that I currently run in production.

For me, this pipeline is a bonus, and I am not concerned with errors or missing data. Because of this, it lacks elements that should be present in a mission-critical, or production, pipeline. Every data pipeline will have different acceptable rates of ...
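
The two properties named in this preview, idempotence and atomicity, can be illustrated with a small sketch. It is not taken from the book: it assumes Python's built-in sqlite3 module as a stand-in database, and the table name issues, its columns, and the helper load_batch are hypothetical names chosen for illustration.

```python
import sqlite3


def load_batch(conn, rows):
    """Load (id, status) rows into the hypothetical issues table."""
    # Using the connection as a context manager commits on success and
    # rolls back if any statement raises, so a failed batch leaves the
    # table unchanged (atomicity).
    with conn:
        conn.executemany(
            # INSERT OR REPLACE keys on the primary key, so re-running the
            # same batch does not create duplicate rows (idempotence).
            "INSERT OR REPLACE INTO issues (id, status) VALUES (?, ?)",
            rows,
        )


# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE issues (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
)

batch = [(1, "open"), (2, "closed")]
load_batch(conn, batch)
load_batch(conn, batch)  # running the same batch again changes nothing
count_after_rerun = conn.execute("SELECT COUNT(*) FROM issues").fetchone()[0]

# The second row violates NOT NULL, so the whole batch is rolled back
# and the valid first row (id 3) is not kept either.
try:
    load_batch(conn, [(3, "open"), (4, None)])
except sqlite3.IntegrityError:
    pass
count_after_failure = conn.execute("SELECT COUNT(*) FROM issues").fetchone()[0]
ids_after_failure = [r[0] for r in conn.execute("SELECT id FROM issues ORDER BY id")]
```

Keying writes on a stable identifier is one common way to make a load repeatable, and wrapping each batch in a single transaction is one common way to make it all-or-nothing. A production pipeline would apply the same ideas to whatever database it actually targets.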
