Python Data Cleaning and Preparation Best Practices

Name: Python Data Cleaning and Preparation Best Practices
Author: Maria Zervou
ISBN: 9781837634743

by Maria Zervou

September 2024

Beginner to intermediate

456 pages

11h 53m

English

Packt Publishing

Python Data Cleaning and Preparation Best Practices
Contributors
About the author
About the reviewers
Preface
  Who this book is for
  What this book covers
  To get the most out of this book
  Download the example code files
  Conventions used
  Get in touch
  Share your thoughts
  Download a free PDF copy of this book
Part 1: Upstream Data Ingestion and Cleaning
Chapter 1: Data Ingestion Techniques
  Technical requirements
  Ingesting data in batch mode
  Advantages and disadvantages
  Common use cases for batch ingestion
  Batch ingestion use cases
  Batch ingestion with an example
  Ingesting data in streaming mode
  Advantages and disadvantages
  Common use cases for streaming ingestion
  Streaming ingestion in an e-commerce platform
  Streaming ingestion with an example
  Real-time versus semi-real-time ingestion
  Common use cases for near-real-time ingestion
  Semi-real-time mode with an example
  Data source solutions
  Event data processing solution
  Ingesting event data with Apache Kafka
  Ingesting data from databases
  Performing data ingestion from cloud-based file systems
  APIs
  Summary
Chapter 2: Importance of Data Quality
  Technical requirements
  Why data quality is important
  Dimensions of data quality
  Completeness
  Accuracy
  Timeliness
  Consistency
  Uniqueness
  Duplication
  Data usage
  Data compliance
  Implementing quality controls throughout the data life cycle
  Data silos and the impact on data quality
  Summary
Chapter 3: Data Profiling – Understanding Data Structure, Quality, and Distribution
  Technical requirements
  Understanding data profiling
  Identifying goals of data profiling
  Exploratory data analysis options – profiler versus manual
  Profiling data with pandas’ ydata_profiling
  Overview
  Interactions
  Correlations
  Missing values
  Duplicate rows
  Sample dataset
  Profiling high volumes of data with the pandas data profiler
  Data validation with the Great Expectations library
  Configuring Great Expectations for your project
  Create your first Great Expectations data source
  Creating your first Great Expectations suite
  Great Expectations Suite report
  Manually edit Great Expectations
  Checkpoints
  Using pandas profiler to build your Great Expectations Suite
  Comparing Great Expectations and pandas profiler – when to use what
  Great Expectations and big data
  Summary
Chapter 4: Cleaning Messy Data and Data Manipulation
  Technical requirements
  Renaming columns
  Renaming a single column
  Renaming all columns
  Removing irrelevant or redundant columns
  Dealing with inconsistent and incorrect data types
  Inspecting columns
  Columnar type transformations
  Converting to numeric types
  Converting to string types
  Converting to categorical types
  Converting to Boolean types
  Working with dates and times
  Importing and parsing date and time data
  Extracting components from dates and times
  Calculating time differences and durations
  Handling time zones and daylight saving time
  Summary
Chapter 5: Data Transformation – Merging and Concatenating
  Technical requirements
  Joining datasets
  Choosing the correct merge strategy
  Handling duplicates when merging datasets
  Why handle duplication in rows and columns?
  Dropping duplicate rows
  Validating data before merging
  Aggregation
  Concatenation
  Handling duplication in columns
  Performance tricks for merging
  Set indexes
  Sorting indexes
  Merge versus join
  Concatenating DataFrames
  Row-wise concatenation
  Column-wise concatenation
  Summary
  References
Chapter 6: Data Grouping, Aggregation, Filtering, and Applying Functions
  Technical requirements
  Grouping data using one or multiple keys
  Grouping data using one key
  Grouping data using multiple keys
  Best practices for grouping
  Applying aggregate functions on grouped data
  Basic aggregate functions
  Advanced aggregation with multiple columns
  Applying custom aggregate functions
  Best practices for aggregate functions
  Using the apply function on grouped data
  Data filtering
  Multiple criteria for filtering
  Best practices for filtering
  Performance considerations as data grows
  Summary
Chapter 7: Data Sinks
  Technical requirements
  Choosing the right data sink for your use case
  Relational databases
  NoSQL databases
  Data warehouses
  Data lakes
  Streaming data sinks
  Which sink is the best for my use case?
  Decoding file types for optimal usage
  Navigating partitioning
  Horizontal versus vertical partitioning
  Time-based partitioning
  Geographic partitioning
  Hybrid partitioning
  Considerations for choosing partitioning strategies
  Designing an online retail data platform
  Summary
Part 2: Downstream Data Cleaning – Consuming Structured Data
Chapter 8: Detecting and Handling Missing Values and Outliers
  Technical requirements
  Detecting missing data
  Handling missing data
  Deletion of missing data
  Imputation of missing data
  Mean imputation
  Median imputation
  Creating indicator variables
  Comparison between imputation methods
  Detecting and handling outliers
  Impact of outliers
  Identifying univariate outliers
  Handling univariate outliers
  Identifying multivariate outliers
  Handling multivariate outliers
  Summary
Chapter 9: Normalization and Standardization
  Technical requirements
  Scaling features to a range
  Min-max scaling
  Z-score scaling
  When to use Z-score scaling
  Robust scaling
  Comparison between methods
  Summary
Chapter 10: Handling Categorical Features
  Technical requirements
  Label encoding
  Use case – employee performance analysis
  Considerations for label encoding
  One-hot encoding
  When to use one-hot encoding
  Use case – customer churn prediction
  Considerations for one-hot encoding
  Target encoding (mean encoding)
  When to use target encoding
  Use case – sales prediction for retail stores
  Considerations for target encoding
  Frequency encoding
  When to use frequency encoding
  Use case – customer product preference analysis
  Considerations for frequency encoding
  Binary encoding
  When to use binary encoding
  Use case – customer subscription prediction
  Considerations for binary encoding
  Summary
Chapter 11: Consuming Time Series Data
  Technical requirements
  Understanding the components of time series data
  Trend
  Seasonality
  Noise
  Types of time series data
  Univariate time series data
  Multivariate time series data
  Identifying missing values in time series data
  Checking for NaNs or null values
  Visual inspection
  Handling missing values in time series data
  Removing missing data
  Forward and backward fill
  Interpolation
  Comparing the different methods for missing values
  Analyzing time series data
  Autocorrelation and partial autocorrelation
  ACF and PACF in the stock market use case
  Dealing with outliers
  Identifying outliers with seasonal decomposition
  Handling outliers – model-based approaches – ARIMA
  Moving window techniques
  Feature engineering for time series data
  Lag features and their importance
  Differencing time series
  Applying time series techniques in different industries
  Summary
Part 3: Downstream Data Cleaning – Consuming Unstructured Data
Chapter 12: Text Preprocessing in the Era of LLMs
  Technical requirements
  Relearning text preprocessing in the era of LLMs
  Text cleaning
  Removing HTML tags and special characters
  Handling capitalization and letter case
  Dealing with numerical values and symbols
  Addressing whitespace and formatting issues
  Removing personally identifiable information
  Handling rare words and spelling variations
  Dealing with rare words
  Addressing spelling variations and typos
  Chunking
  Tokenization
  Word tokenization
  Subword tokenization
  Domain-specific data
  Turning tokens into embeddings
  BERT – Contextualized Embedding Models
  BGE
  GTE
  Selecting the right embedding model
  Solving real problems with embeddings
  Summary
Chapter 13: Image and Audio Preprocessing with LLMs
  Technical requirements
  The current era of image preprocessing
  Loading the images
  Resizing and cropping
  Normalizing and standardizing the dataset
  Data augmentation
  Noise reduction
  Extracting text from images
  PaddleOCR
  Using LLMs with OCR
  Creating image captions
  Handling audio data
  Using Whisper for audio-to-text conversion
  Extracting text from audio
  Future research in audio preprocessing
  Summary
  This concludes the book! You did it!
Index
Why subscribe?
Other Books You May Enjoy
  Packt is searching for authors like you
  Share your thoughts
  Download a free PDF copy of this book

Content preview from Python Data Cleaning and Preparation Best Practices

13 Image and Audio Preprocessing with LLMs

In this chapter, we delve into the preprocessing of unstructured data, specifically focusing on images and audio. We explore various techniques and models designed to extract meaningful information from these types of media. The discussion includes a detailed examination of image preprocessing methods, the use of optical character recognition (OCR) for extracting text from images, the capabilities of the BLIP model for generating image captions, and the application of the Whisper model for converting audio into text.

In this chapter, we’ll cover the following topics:

The current era of image preprocessing
Extracting text from images
Handling audio data

Technical requirements

The complete code for ...
