The Pandas Workshop

Name: The Pandas Workshop
ISBN: 9781800208933

by Blaine Bateman, Saikat Basak, Thomas Joseph, William So

June 2022

Beginner to intermediate

744 pages

15h 44m

English

Packt Publishing

The Pandas Workshop
Contributors; About the authors; About the reviewer
Preface
Who this book is for; What this book covers; To get the most out of this book; Download the example code files; Download the color images; Conventions used; Get in touch; Share Your Thoughts
Part 1 – Introduction to pandas
Chapter 1: Introduction to pandas
Introduction to the world of pandas; Exploring the history and evolution of pandas; Components and applications of pandas; Understanding the basic concepts of pandas; The Series object; The DataFrame object; Working with local files; Reading a CSV file; Displaying a snapshot of the data; Writing data to a file; Data types in pandas; Data selection; Data transformation; Data visualization; Time series data; Code optimization; Utility functions; Exercise 1.02 – basic numerical operations with pandas; Data modeling; Exercise 1.03 – comparing data from two DataFrames; Activity 1.01 – comparing sales data for two stores; Summary
Chapter 2: Working with Data Structures
Introduction to data structures; The need for data structures; Data structures; Creating DataFrames in pandas; Exercise 2.01 – Creating a DataFrame; Indexes and columns; Exercise 2.02 – Reading DataFrames and manipulating the index; Working with columns; Series; The Series index; Exercise 2.03 – Series to DataFrames; Using time as the index; Exercise 2.04 – DataFrame indices; Activity 2.01 – Working with pandas data structures; Summary
Chapter 3: Data I/O
The world of data; Exploring data sources; Text files and binary files; Online data sources; Exercise 3.01 – reading data from web pages; Fundamental formats; Text data; Exercise 3.02 – text character encoding and data separators; Binary data; Databases – SQL data; sqlite3; Additional text formats; Working with JSON; Working with HTML/XML; Working with XML data; Working with Excel; SAS data; SPSS data; Stata data; HDF5 data; Manipulating SQL data; Exercise 3.03 – working with SQL; Choosing a format for a project; Activity 3.01 – using SQL data for pandas analytics; Summary
Chapter 4: Pandas Data Types
Introducing pandas dtypes; Obtaining the underlying data types; Converting from one type into another; Exercise 4.01 – underlying data types and conversion; Missing data types; The missing alphabet soup; Nullable types; Exercise 4.02 – missing data and converting into non-nullable dtypes; Activity 4.01 – optimizing memory usage by converting into the appropriate dtypes; Subsetting by data types; Working with the dtype category; Working with dtype = datetime64[ns]; Working with dtype = timedelta64[ns]; Exercise 4.03 – working with text data using string methods; Selecting data in a DataFrame by its dtype; Summary
Part 2 – Working with Data
Chapter 5: Data Selection – DataFrames
Introduction to DataFrames; The need for data selection methods; Data selection in pandas DataFrames; The index and its forms; Exercise 5.01 – identifying the row and column indices in a dataset; Slicing and indexing methods; Exercise 5.02 – subsetting rows and columns; Using labels as the index and the pandas multi-index; Creating a multi-index from columns; Activity 5.01 – Creating a multi-index from columns; Bracket and dot notation; Bracket notation; Dot notation; Exercise 5.03 – integer row numbers versus labels; Using extended indexing; Type exceptions; Changing DataFrame values using bracket or dot notation; Exercise 5.04 – selecting data using bracket and dot notation; Summary
Chapter 6: Data Selection – Series
Introduction to pandas Series; The Series index; Data selection in a pandas Series; Brackets, dots, Series.loc, and Series.iloc; Exercise 6.01 – basic Series data selection; Preparing Series from DataFrames and vice versa; Exercise 6.02 – using a Series index to select values; Activity 6.01 – Series data selection; Understanding the differences between base Python and pandas data selection; Lists versus Series access; DataFrames versus dictionary access; Activity 6.02 – DataFrame data selection; Summary

Chapter 7: Data Exploration and Transformation
Introduction to data transformation; Dealing with messy data; Working on data without column headers; Multiple values in one column; Duplicate observations in both rows and columns; Exercise 7.01 – working with messy addresses; Multiple variables stored in one column; Multiple DataFrames with identical structures; Exercise 7.02 – storing sales by demographics; Dealing with missing data; What is missing data?; Strategies for missing data; Summarizing data; Grouping and aggregation; Exploring pivot tables; Activity 7.01 – data analysis using pivot tables; Summary
Chapter 8: Understanding Data Visualization
Introduction to data visualization; Understanding the basics of pandas visualization; Exercise 8.01 – Building histograms for the Titanic dataset; Exploring matplotlib; Visualizing data of different types; Visualizing numerical data; Visualizing categorical data; Visualizing statistical data; Exercise 8.02 – Boxplots for the Titanic dataset; Visualizing multiple data plots; Activity 8.01 – Using data visualization for exploratory data analysis; Summary
Part 3 – Data Modeling
Chapter 9: Data Modeling – Preprocessing
An introduction to data modeling; Exploring dependent and independent variables; Training, validation, and test splits of data; Exercise 9.01 – Creating training, validation, and test data; Avoiding information leakage; Complete model validation; Understanding data scaling and normalization; Different ways to Scale Data; Scaling data yourself; Min/max scaling; Standardization – addressing variance; Transforming back to real units; Exercise 9.02 – Scaling and normalizing data; Activity 9.01 – Data splitting, scaling, and modeling; Summary
Chapter 10: Data Modeling – Modeling Basics
Introduction to data modeling; Learning the modeling basics; Modeling tools; Pandas modeling tools; Predicting future values of time series; Exercise 10.01 – Smoothing data to discover patterns; Activity 10.01 – Normalizing and smoothing data; Summary
Chapter 11: Data Modeling – Regression Modeling
An introduction to regression modeling; Exploring regression modeling; Using linear models; Exercise 11.1 – Linear regression; Non-linear models; Model diagnostics; Comparing predicted and actual values; Using the Q-Q plot; Exercise 11.02 – Multiple regression and non-linear models; Activity 11.01 – Multiple regression with non-linear models; Summary
Part 4 – Additional Use Cases for pandas
Chapter 12: Using Time in pandas
Introduction to time series; What are datetimes?; Attributes of datetime objects; Exercise 12.01 – working with datetime; Creating and manipulating datetime objects/time series; Time periods in pandas; Information in pandas time-aware objects; Exercise 12.02 – math with datetimes; Timestamp formats; Activity 12.01 – understanding power usage; Datetime math operations; Date ranges; Timedeltas, offsets, and differences; Date offsets; Exercise 12.03 – timedeltas and date offsets; Summary
Chapter 13: Exploring Time Series
The time series as an index; Time series periods/frequencies; Shifting, lagging, and converting frequency; Resampling, grouping, and aggregation by time; Using the resample method; Exercise 13.01 – Aggregating and resampling; Windowing operations with the rolling method; Activity 13.01 – Creating a time series model; Summary
Chapter 14: Applying pandas Data Processing for Case Studies
Introduction to the case studies and datasets; Recap of the preprocessing steps; Preprocessing the German climate data; Exercise 14.01 – preprocessing the German climate data; Exercise 14.02 – merging DataFrames and renaming variables; Exercise 14.03 – data interpolation and answering questions after data preprocessing; Exercise 14.04 – using data visualizations to answer questions; Exercise 14.05 – using data visualizations to answer questions; Exercise 14.06 – analyzing data on bus trajectories; Activity 14.01 – analyzing air quality data; Summary
Chapter 15: Appendix
Solution 1.1; Solution 2.1; Solution 3.1; Solution 4.1; Solution 5.1; Solution 6.1; Solution 6.2; Solution 7.1; Solution 8.1; Solution 9.1; Solution 10.1; Solution 11.1; Solution 12.1; Solution 13.1; Solution 14.1
Why subscribe?
Other Books You May Enjoy; Packt is searching for authors like you; Share Your Thoughts

Content preview from The Pandas Workshop

Chapter 9: Data Modeling – Preprocessing

In this chapter, you will learn two important processes used to prepare data for modeling: splitting and scaling. You will learn how to use the sklearn StandardScaler and MinMaxScaler classes for scaling and the train_test_split function for splitting. You will also be introduced to the reasons behind scaling and exactly what these methods do. As part of exploring splitting and scaling, you will use sklearn LinearRegression and statsmodels to create simple linear regression models.
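
As a rough illustration of the splitting and scaling workflow described above, the following minimal sketch uses scikit-learn on a small synthetic dataset. It is not taken from the book: the column names (`area`, `rooms`, `price`), the generated values, and the 80/20 split are assumptions made only for this example. It also writes the scaling formulas out by hand in pandas to show what the two scalers compute.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: column names and values are invented for illustration.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "area": rng.uniform(50, 250, size=100),
    "rooms": rng.integers(1, 6, size=100).astype(float),
})
df["price"] = 1000 * df["area"] + 5000 * df["rooms"] + rng.normal(0, 1000, size=100)

X = df[["area", "rooms"]]  # independent variables
y = df["price"]            # dependent variable

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit each scaler on the training rows only, then reuse it on the test rows,
# so that no statistics from the test set leak into preprocessing.
std = StandardScaler().fit(X_train)
mm = MinMaxScaler().fit(X_train)
X_train_std = std.transform(X_train)
X_test_std = std.transform(X_test)
X_train_mm = mm.transform(X_train)

# The same computations written out by hand with pandas.
# StandardScaler divides by the population standard deviation (ddof=0).
manual_std = (X_train - X_train.mean()) / X_train.std(ddof=0)
manual_mm = (X_train - X_train.min()) / (X_train.max() - X_train.min())

# inverse_transform maps scaled values back to the original units.
X_train_back = std.inverse_transform(X_train_std)

# A simple linear regression fitted on the standardized training data.
model = LinearRegression().fit(X_train_std, y_train)
r2_test = model.score(X_test_std, y_test)
print(f"R^2 on the held-out rows: {r2_test:.3f}")
```

Fitting the scalers on the training split alone is one common way to avoid information leakage. The statsmodels route the chapter mentions is omitted here to keep the sketch short.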

By the end of this chapter, you will be comfortable preparing datasets to begin modeling. The main ideas you will learn in this chapter are as follows:

Exploring independent and dependent variables
Understanding data scaling and ...
