The Data Science Handbook, 2nd Edition

Name: The Data Science Handbook, 2nd Edition
Author: Field Cady
ISBN: 9781394234493

by Field Cady

December 2024

Beginner to intermediate

368 pages

11h 47m

English

Wiley

Cover
Table of Contents
Title Page
Copyright Page
Dedication Page
Preface to the First Edition
Preface to the Second Edition
1 Introduction
1.1 What Data Science Is and Isn’t
1.2 This Book’s Slogan: Simple Models Are Easier to Work With
1.3 How Is This Book Organized?
1.4 How to Use This Book?
1.5 Why Is It All in Python, Anyway?
1.6 Example Code and Datasets
1.7 Parting Words
Part I: The Stuff You’ll Always Use
2 The Data Science Road Map
2.1 Frame the Problem
2.2 Understand the Data: Basic Questions
2.3 Understand the Data: Data Wrangling
2.4 Understand the Data: Exploratory Analysis
2.5 Extract Features
2.6 Model
2.7 Present Results
2.8 Deploy Code
2.9 Iterating
2.10 Glossary

3 Programming Languages
3.1 Why Use a Programming Language? What Are the Other Options?
3.2 A Survey of Programming Languages for Data Science
3.3 Where to Write Code
3.4 Python Overview and Example Scripts
3.5 Python Data Types
3.6 GOTCHA: Hashable and Unhashable Types
3.7 Functions and Control Structures
3.8 Other Parts of Python
3.9 Python’s Technical Libraries
3.10 Other Python Resources
3.11 Further Reading
3.12 Glossary
Interlude: My Personal Toolkit
4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning
4.1 The Worst Dataset in the World
4.2 How to Identify Pathologies
4.3 Problems with Data Content
4.4 Formatting Issues
4.5 Example Formatting Script
4.6 Regular Expressions
4.7 Life in the Trenches
4.8 Glossary
5 Visualizations and Simple Metrics
5.1 A Note on Python’s Visualization Tools
5.2 Example Code
5.3 Pie Charts
5.4 Bar Charts
5.5 Histograms
5.6 Means, Standard Deviations, Medians, and Quantiles
5.7 Boxplots
5.8 Scatterplots
5.9 Scatterplots with Logarithmic Axes
5.10 Scatter Matrices
5.11 Heatmaps
5.12 Correlations
5.13 Anscombe’s Quartet and the Limits of Numbers
5.14 Time Series
5.15 Further Reading
5.16 Glossary
6 Overview: Machine Learning and Artificial Intelligence
6.1 Historical Context
6.2 The Central Paradigm: Learning a Function from Example
6.3 Machine Learning Data: Vectors and Feature Extraction
6.4 Supervised, Unsupervised, and In‐Between
6.5 Training Data, Testing Data, and the Great Boogeyman of Overfitting
6.6 Reinforcement Learning
6.7 ML Models as Building Blocks for AI Systems
6.8 ML Engineering as a New Job Role
6.9 Further Reading
6.10 Glossary
7 Interlude: Feature Extraction Ideas
7.1 Standard Features
7.2 Features that Involve Grouping
7.3 Preview of More Sophisticated Features
7.4 You Get What You Measure: Defining the Target Variable
8 Machine‐Learning Classification
8.1 What Is a Classifier, and What Can You Do with It?
8.2 A Few Practical Concerns
8.3 Binary Versus Multiclass
8.4 Example Script
8.5 Specific Classifiers
8.6 Evaluating Classifiers
8.7 Selecting Classification Cutoffs
8.8 Further Reading
8.9 Glossary
9 Technical Communication and Documentation
9.1 Several Guiding Principles
9.2 Slide Decks
9.3 Written Reports
9.4 Speaking: What Has Worked for Me
9.5 Code Documentation
9.6 Further Reading
9.7 Glossary
Part II: Stuff You Still Need to Know
10 Unsupervised Learning: Clustering and Dimensionality Reduction
10.1 The Curse of Dimensionality
10.2 Example: Eigenfaces for Dimensionality Reduction
10.3 Principal Component Analysis and Factor Analysis
10.4 Skree Plots and Understanding Dimensionality
10.5 Factor Analysis
10.6 Limitations of PCA
10.7 Clustering
10.8 Further Reading
10.9 Glossary
11 Regression
11.1 Example: Predicting Diabetes Progression
11.2 Fitting a Line with Least Squares
11.3 Alternatives to Least Squares
11.4 Fitting Nonlinear Curves
11.5 Goodness of Fit: R² and Correlation
11.6 Correlation of Residuals
11.7 Linear Regression
11.8 LASSO Regression and Feature Selection
11.9 Further Reading
11.10 Glossary
12 Data Encodings and File Formats
12.1 Typical File Format Categories
12.2 CSV Files
12.3 JSON Files
12.4 XML Files
12.5 HTML Files
12.6 Tar Files
12.7 GZip Files
12.8 Zip Files
12.9 Image Files: Rasterized, Vectorized, and/or Compressed
12.10 It’s All Bytes at the End of the Day
12.11 Integers
12.12 Floats
12.13 Text Data
12.14 Further Reading
12.15 Glossary
13 Big Data
13.1 What Is Big Data?
13.2 When to Use – And not Use – Big Data
13.3 Hadoop: The File System and the Processor
13.4 Example PySpark Script
13.5 Spark Overview
13.6 Spark Operations
13.7 PySpark Data Frames
13.8 Two Ways to Run PySpark
13.9 Configuring Spark
13.10 Under the Hood
13.11 Spark Tips and Gotchas
13.12 The MapReduce Paradigm
13.13 Performance Considerations
13.14 Further Reading
13.15 Glossary
14 Databases
14.1 Relational Databases and MySQL®
14.2 Key–Value Stores
14.3 Wide‐Column Stores
14.4 Document Stores
14.5 Further Reading
14.6 Glossary
15 Software Engineering Best Practices
15.1 Coding Style
15.2 Version Control and Git for Data Scientists
15.3 Testing Code
15.4 Test‐Driven Development
15.5 AGILE Methodology
15.6 Further Reading
15.7 Glossary
16 Traditional Natural Language Processing
16.1 Do I Even Need NLP?
16.2 The Great Divide: Language Versus Statistics
16.3 Example: Sentiment Analysis on Stock Market Articles
16.4 Software and Datasets
16.5 Tokenization
16.6 Central Concept: Bag‐of‐Words
16.7 Word Weighting: TF‐IDF
16.8 n‐Grams
16.9 Stop Words
16.10 Lemmatization and Stemming
16.11 Synonyms
16.12 Part of Speech Tagging
16.13 Common Problems
16.14 Advanced Linguistic NLP: Syntax Trees, Knowledge, and Understanding
16.15 Further Reading
16.16 Glossary
17 Time Series Analysis
17.1 Example: Predicting Wikipedia Page Views
17.2 A Typical Workflow
17.3 Time Series Versus Time‐Stamped Events
17.4 Resampling and Interpolation
17.5 Smoothing Signals
17.6 Logarithms and Other Transformations
17.7 Trends and Periodicity
17.8 Windowing
17.9 Brainstorming Simple Features
17.10 Better Features: Time Series as Vectors
17.11 Fourier Analysis: Sometimes a Magic Bullet
17.12 Time Series in Context: The Whole Suite of Features
17.13 Further Reading
17.14 Glossary
18 Probability
18.1 Flipping Coins: Bernoulli Random Variables
18.2 Throwing Darts: Uniform Random Variables
18.3 The Uniform Distribution and Pseudorandom Numbers
18.4 Nondiscrete, Noncontinuous Random Variables
18.5 Notation, Expectations, and Standard Deviation
18.6 Dependence, Marginal, and Conditional Probability
18.7 Understanding the Tails
18.8 Binomial Distribution
18.9 Poisson Distribution
18.10 Normal Distribution
18.11 Multivariate Gaussian
18.12 Exponential Distribution
18.13 Log‐Normal Distribution
18.14 Entropy
18.15 Further Reading
18.16 Glossary
19 Statistics
19.1 Statistics in Perspective
19.2 Bayesian Versus Frequentist: Practical Tradeoffs and Differing Philosophies
19.3 Hypothesis Testing: Key Idea and Example
19.4 Multiple Hypothesis Testing
19.5 Parameter Estimation
19.6 Hypothesis Testing: t‐Test
19.7 Confidence Intervals
19.8 Bayesian Statistics
19.9 Naive Bayesian Statistics
19.10 Bayesian Networks
19.11 Choosing Priors: Maximum Entropy or Domain Knowledge
19.12 Further Reading
19.13 Glossary
20 Programming Language Concepts
20.1 Programming Paradigms
20.2 Compilation and Interpretation
20.3 Type Systems
20.4 Further Reading
20.5 Glossary
21 Performance and Computer Memory
21.1 A Word of Caution
21.2 Example Script
21.3 Algorithm Performance and Big‐O Notation
21.4 Some Classic Problems: Sorting a List and Binary Search
21.5 Amortized Performance and Average Performance
21.6 Two Principles: Reducing Overhead and Managing Memory
21.7 Performance Tip: Use Numerical Libraries When Applicable
21.8 Performance Tip: Delete Large Structures You Don’t Need
21.9 Performance Tip: Use Built‐In Functions When Possible
21.10 Performance Tip: Avoid Superfluous Function Calls
21.11 Performance Tip: Avoid Creating Large New Objects
21.12 Further Reading
21.13 Glossary
Part III: Specialized or Advanced Topics
22 Computer Memory and Data Structures
22.1 Virtual Memory, the Stack, and the Heap
22.2 Example C Program
22.3 Data Types and Arrays in Memory
22.4 Structs
22.5 Pointers, the Stack, and the Heap
22.6 Key Data Structures
22.7 Further Reading
22.8 Glossary
23 Maximum‐Likelihood Estimation and Optimization
23.1 Maximum‐Likelihood Estimation
23.2 A Simple Example: Fitting a Line
23.3 Another Example: Logistic Regression
23.4 Optimization
23.5 Gradient Descent
23.6 Convex Optimization
23.7 Stochastic Gradient Descent
23.8 Further Reading
23.9 Glossary
24 Deep Learning and AI
24.1 A Note on Libraries and Hardware
24.2 A Note on Training Data
24.3 Simple Deep Learning: Perceptrons
24.4 What Is a Tensor?
24.5 Convolutional Neural Networks
24.6 Example: The MNIST Handwriting Dataset
24.7 Autoencoders and Latent Vectors
24.8 Generative AI and GANs
24.9 Diffusion Models
24.10 RNNs, Hidden State, and the Encoder–Decoder
24.11 Attention and Transformers
24.12 Stable Diffusion: Bringing the Parts Together
24.13 Large Language Models and Prompt Engineering
24.14 Further Reading
24.15 Glossary
25 Stochastic Modeling
25.1 Markov Chains
25.2 Two Kinds of Markov Chain, Two Kinds of Questions
25.3 Hidden Markov Models and the Viterbi Algorithm
25.4 The Viterbi Algorithm
25.5 Random Walks
25.6 Brownian Motion
25.7 ARIMA Models
25.8 Continuous‐Time Markov Processes
25.9 Poisson Processes
25.10 Further Reading
25.11 Glossary
26 Parting Words
Index
End User License Agreement

Content preview from The Data Science Handbook, 2nd Edition

4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning

This chapter is about the pathologies that you will see in real‐world data. It covers some of the most common (and notorious!) ones, where they come from, and how they can be addressed.

Data pathologies come in roughly two types. The first is formatting issues, such as inconsistent capitalization, extraneous whitespace, and things of that nature. These are often straightforward to solve with appropriate preprocessing of the data. The second type involves the actual content of the data: duplicate entries, major outliers, and NULL values are all examples. It often requires some detective work to figure out what these issues mean in a particular situation and, hence, how they should be addressed.
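
To make the two types concrete, here is a minimal sketch in Python. It is not taken from the book: the sample records, the `NULL_MARKERS` set, and the `normalize` helper are all hypothetical, chosen only to illustrate the idea. It first fixes formatting (whitespace and capitalization) and then checks content (missing values and duplicates).

```python
import re
from collections import Counter

# Hypothetical raw records, made up for illustration.
raw_names = ["  Alice Smith", "alice   smith ", "BOB JONES", None, "Bob Jones", "NULL"]

# Assumed spellings of "missing" placeholders; real data may use others.
NULL_MARKERS = {"", "null", "none", "n/a", "na"}


def normalize(value):
    """Return a trimmed, lowercased string, or None for missing values."""
    if value is None:
        return None
    # Collapse runs of whitespace, trim both ends, and lowercase.
    cleaned = re.sub(r"\s+", " ", value).strip().lower()
    return None if cleaned in NULL_MARKERS else cleaned


# Formatting pass: every record goes through the same normalization.
cleaned_names = [normalize(v) for v in raw_names]

# Content pass: count missing values and list entries that repeat.
missing_count = sum(1 for v in cleaned_names if v is None)
counts = Counter(v for v in cleaned_names if v is not None)
duplicates = sorted(name for name, n in counts.items() if n > 1)

print(cleaned_names)
print(missing_count, duplicates)
```

Note that the duplicates only become visible after the formatting pass, which is one reason to normalize before looking at content. Whether a repeated entry is a genuine duplicate or two different people with the same name is exactly the kind of detective work the code cannot do for you.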

My goals in this chapter are twofold. First, I want to give you an appreciation for the breadth of issues that can be present in real‐world data and equip you to quickly identify and diagnose problems. Second, I want to teach you tools that can be used to solve the problems. Specifically, I will discuss various types of string manipulation.

Manipulating strings of text might seem boring at first glance, but it’s one of the most powerful tools a data scientist can have. I would put it on par with machine learning itself. String manipulation can be used to address any data formatting problem, and in many cases, it is the only suitable solution. But it is also invaluable for creating scripts ...
