Data Engineering with Python

by Paul Crickard

October 2020

Beginner to intermediate

356 pages

6h 50m

English

Packt Publishing

Why subscribe?; Contributors; About the author; About the reviewers; Packt is searching for authors like you
Who this book is for; What this book covers; To get the most out of this book; Download the example code files; Download the color images; Conventions used; Get in touch; Reviews
What data engineers do; Required skills and knowledge to be a data engineer; Data engineering versus data science; Data engineering tools; Programming languages; Databases; Data processing engines; Data pipelines; Summary
Installing and configuring Apache NiFi; A quick tour of NiFi; PostgreSQL driver; Installing and configuring Apache Airflow; Installing and configuring Elasticsearch; Installing and configuring Kibana; Installing and configuring PostgreSQL; Installing pgAdmin 4; A tour of pgAdmin 4; Summary
Writing and reading files in Python; Writing and reading CSVs; Reading and writing CSVs using pandas DataFrames; Writing JSON with Python; Building data pipelines in Apache Airflow; Handling files using NiFi processors; Working with CSV in NiFi; Working with JSON in NiFi; Summary
Inserting and extracting relational data in Python; Inserting data into PostgreSQL; Inserting and extracting NoSQL database data in Python; Installing Elasticsearch; Inserting data into Elasticsearch; Building data pipelines in Apache Airflow; Setting up the Airflow boilerplate; Running the DAG; Handling databases with NiFi processors; Extracting data from PostgreSQL; Running the data pipeline; Summary
Performing exploratory data analysis in Python; Downloading the data; Basic data exploration; Handling common data issues using pandas; Drop rows and columns; Creating and modifying columns; Enriching data; Cleaning data using Airflow; Summary
Building the data pipeline; Mapping a data type; Triggering a pipeline; Querying SeeClickFix; Transforming the data for Elasticsearch; Getting every page; Backfilling data; Building a Kibana dashboard; Creating visualizations; Creating a dashboard; Summary

Staging and validating data; Staging data; Validating data with Great Expectations; Building idempotent data pipelines; Building atomic data pipelines; Summary
Installing and configuring the NiFi Registry; Installing the NiFi Registry; Configuring the NiFi Registry; Using the Registry in NiFi; Adding the Registry to NiFi; Versioning your data pipelines; Using git-persistence with the NiFi Registry; Summary
Monitoring NiFi using the GUI; Monitoring NiFi with the status bar; Monitoring NiFi with processors; Using Python with the NiFi REST API; Summary
Finalizing your data pipelines for production; Backpressure; Improving processor groups; Using the NiFi variable registry; Deploying your data pipelines; Using the simplest strategy; Using the middle strategy; Using multiple registries; Summary
Creating a test and production environment; Creating the databases; Populating a data lake; Building a production data pipeline; Reading the data lake; Scanning the data lake; Inserting the data into staging; Querying the staging database; Validating the staging data; Insert Warehouse; Deploying a data pipeline in production; Summary
Creating ZooKeeper and Kafka clusters; Downloading Kafka and setting up the environment; Configuring ZooKeeper and Kafka; Starting the ZooKeeper and Kafka clusters; Testing the Kafka cluster; Testing the cluster with messages; Summary
Understanding logs; Understanding how Kafka uses logs; Topics; Kafka producers and consumers; Building data pipelines with Kafka and NiFi; The Kafka producer; The Kafka consumer; Differentiating stream processing from batch processing; Producing and consuming with Python; Writing a Kafka producer in Python; Writing a Kafka consumer in Python; Summary
Installing and running Spark; Installing and configuring PySpark; Processing data with PySpark; Spark for data engineering; Summary
Setting up MiNiFi; Building a MiNiFi task in NiFi; Summary
Building a NiFi cluster; The basics of NiFi clustering; Building a NiFi cluster; Building a distributed data pipeline; Managing the distributed data pipeline; Summary
Leave a review - let other readers know what you think

Content preview from Data Engineering with Python

Chapter 12: Building a Kafka Cluster

In this chapter, you will move beyond batch processing, which runs queries on a complete set of data, and learn about the tools used in stream processing. In stream processing, the data may be unbounded and incomplete at the time of a query. One of the leading tools for handling streaming data is Apache Kafka. Kafka allows you to send data in real time to topics, and consumers read those topics to process the data. This chapter will teach you how to build a three-node Apache Kafka cluster. You will also learn how to create and send messages (produce) and read data from topics (consume).
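
To make the produce and consume vocabulary concrete before any cluster exists, here is a minimal in-memory sketch of the idea. It is not Kafka and does not use any Kafka client library: `ToyBroker`, `produce`, and `consume` are hypothetical names invented for this illustration, and a topic is modeled as nothing more than an append-only Python list from which a consumer reads starting at a saved offset.

```python
from collections import defaultdict


class ToyBroker:
    """Hypothetical in-memory stand-in for a broker (not the Kafka API).

    Each topic is an append-only list of messages.
    """

    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, message):
        """Append a message to the topic and return its offset."""
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1

    def consume(self, topic, offset):
        """Return the messages at or after offset, plus the next offset to read."""
        messages = self.topics[topic][offset:]
        return messages, offset + len(messages)


broker = ToyBroker()
broker.produce("users", "alice")
broker.produce("users", "bob")

# The consumer sees only what has arrived so far: the data is incomplete.
first_batch, next_offset = broker.consume("users", 0)

# More data arrives later; the consumer resumes from its saved offset.
broker.produce("users", "carol")
second_batch, next_offset = broker.consume("users", next_offset)

print(first_batch)   # ['alice', 'bob']
print(second_batch)  # ['carol']
```

The second read returns only the new message, which is the practical difference from a batch query: the consumer never assumes the topic is complete, it simply continues from where it left off.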

In this chapter, we're going to cover the following main topics: