The Data Science Handbook

Name: The Data Science Handbook
Author: Field Cady
ISBN: 9781119092940

February 2017

Beginner to intermediate

416 pages

10h 39m

English

Wiley

Cover
Title Page
Copyright
Dedication
Preface
Chapter 1: Introduction: Becoming a Unicorn
1.1 Aren't Data Scientists Just Overpaid Statisticians?
1.2 How Is This Book Organized?
1.3 How to Use This Book?
1.4 Why Is It All in Python™, Anyway?
1.5 Example Code and Datasets
1.6 Parting Words
Part I: The Stuff You'll Always Use
Chapter 2: The Data Science Road Map
2.1 Frame the Problem
2.2 Understand the Data: Basic Questions
2.3 Understand the Data: Data Wrangling
2.4 Understand the Data: Exploratory Analysis
2.5 Extract Features
2.6 Model
2.7 Present Results
2.8 Deploy Code
2.9 Iterating
2.10 Glossary
Chapter 3: Programming Languages
3.1 Why Use a Programming Language? What Are the Other Options?
3.2 A Survey of Programming Languages for Data Science
3.3 Python Crash Course
3.4 Strings
3.5 Defining Functions
3.6 Python's Technical Libraries
3.7 Other Python Resources
3.8 Further Reading
3.9 Glossary
Interlude: My Personal Toolkit

Chapter 4: Data Munging: String Manipulation, Regular Expressions, and Data Cleaning
4.1 The Worst Dataset in the World
4.2 How to Identify Pathologies
4.3 Problems with Data Content
4.4 Formatting Issues
4.5 Example Formatting Script
4.6 Regular Expressions
4.7 Life in the Trenches
4.8 Glossary
Chapter 5: Visualizations and Simple Metrics
5.1 A Note on Python's Visualization Tools
5.2 Example Code
5.3 Pie Charts
5.4 Bar Charts
5.5 Histograms
5.6 Means, Standard Deviations, Medians, and Quantiles
5.7 Boxplots
5.8 Scatterplots
5.9 Scatterplots with Logarithmic Axes
5.10 Scatter Matrices
5.11 Heatmaps
5.12 Correlations
5.13 Anscombe's Quartet and the Limits of Numbers
5.14 Time Series
5.15 Further Reading
5.16 Glossary
Chapter 6: Machine Learning Overview
6.1 Historical Context
6.2 Supervised versus Unsupervised
6.3 Training Data, Testing Data, and the Great Boogeyman of Overfitting
6.4 Further Reading
6.5 Glossary
Chapter 7: Interlude: Feature Extraction Ideas
7.1 Standard Features
7.2 Features That Involve Grouping
7.3 Preview of More Sophisticated Features
7.4 Defining the Feature You Want to Predict
Chapter 8: Machine Learning Classification
8.1 What Is a Classifier, and What Can You Do with It?
8.2 A Few Practical Concerns
8.3 Binary versus Multiclass
8.4 Example Script
8.5 Specific Classifiers
8.6 Evaluating Classifiers
8.7 Selecting Classification Cutoffs
8.8 Further Reading
8.9 Glossary
Chapter 9: Technical Communication and Documentation
9.1 Several Guiding Principles
9.2 Slide Decks
9.3 Written Reports
9.4 Speaking: What Has Worked for Me
9.5 Code Documentation
9.6 Further Reading
9.7 Glossary
Part II: Stuff You Still Need to Know
Chapter 10: Unsupervised Learning: Clustering and Dimensionality Reduction
10.1 The Curse of Dimensionality
10.2 Example: Eigenfaces for Dimensionality Reduction
10.3 Principal Component Analysis and Factor Analysis
10.4 Skree Plots and Understanding Dimensionality
10.5 Factor Analysis
10.6 Limitations of PCA
10.7 Clustering
10.8 Further Reading
10.9 Glossary
Chapter 11: Regression
11.1 Example: Predicting Diabetes Progression
11.2 Least Squares
11.3 Fitting Nonlinear Curves
11.4 Goodness of Fit: R² and Correlation
11.5 Correlation of Residuals
11.6 Linear Regression
11.7 LASSO Regression and Feature Selection
11.8 Further Reading
11.9 Glossary
Chapter 12: Data Encodings and File Formats
12.1 Typical File Format Categories
12.2 CSV Files
12.3 JSON Files
12.4 XML Files
12.5 HTML Files
12.6 Tar Files
12.7 GZip Files
12.8 Zip Files
12.9 Image Files: Rasterized, Vectorized, and/or Compressed
12.10 It's All Bytes at the End of the Day
12.11 Integers
12.12 Floats
12.13 Text Data
12.14 Further Reading
12.15 Glossary
Chapter 13: Big Data
13.1 What Is Big Data?
13.2 Hadoop: The File System and the Processor
13.3 Using HDFS
13.4 Example PySpark Script
13.5 Spark Overview
13.6 Spark Operations
13.7 Two Ways to Run PySpark
13.8 Configuring Spark
13.9 Under the Hood
13.10 Spark Tips and Gotchas
13.11 The MapReduce Paradigm
13.12 Performance Considerations
13.13 Further Reading
13.14 Glossary
Chapter 14: Databases
14.1 Relational Databases and MySQL®
14.2 Key-Value Stores
14.3 Wide Column Stores
14.4 Document Stores
14.5 Further Reading
14.6 Glossary
Chapter 15: Software Engineering Best Practices
15.1 Coding Style
15.2 Version Control and Git for Data Scientists
15.3 Testing Code
15.4 Test-Driven Development
15.5 AGILE Methodology
15.6 Further Reading
15.7 Glossary
Chapter 16: Natural Language Processing
16.1 Do I Even Need NLP?
16.2 The Great Divide: Language versus Statistics
16.3 Example: Sentiment Analysis on Stock Market Articles
16.4 Software and Datasets
16.5 Tokenization
16.6 Central Concept: Bag-of-Words
16.7 Word Weighting: TF-IDF
16.8 n-Grams
16.9 Stop Words
16.10 Lemmatization and Stemming
16.11 Synonyms
16.12 Part of Speech Tagging
16.13 Common Problems
16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding
16.15 Further Reading
16.16 Glossary
Chapter 17: Time Series Analysis
17.1 Example: Predicting Wikipedia Page Views
17.2 A Typical Workflow
17.3 Time Series versus Time-Stamped Events
17.4 Resampling an Interpolation
17.5 Smoothing Signals
17.6 Logarithms and Other Transformations
17.7 Trends and Periodicity
17.8 Windowing
17.9 Brainstorming Simple Features
17.10 Better Features: Time Series as Vectors
17.11 Fourier Analysis: Sometimes a Magic Bullet
17.12 Time Series in Context: The Whole Suite of Features
17.13 Further Reading
17.14 Glossary
Chapter 18: Probability
18.1 Flipping Coins: Bernoulli Random Variables
18.2 Throwing Darts: Uniform Random Variables
18.3 The Uniform Distribution and Pseudorandom Numbers
18.4 Nondiscrete, Noncontinuous Random Variables
18.5 Notation, Expectations, and Standard Deviation
18.6 Dependence, Marginal and Conditional Probability
18.7 Understanding the Tails
18.8 Binomial Distribution
18.9 Poisson Distribution
18.10 Normal Distribution
18.11 Multivariate Gaussian
18.12 Exponential Distribution
18.13 Log-Normal Distribution
18.14 Entropy
18.15 Further Reading
18.16 Glossary
Chapter 19: Statistics
19.1 Statistics in Perspective
19.2 Bayesian versus Frequentist: Practical Tradeoffs and Differing Philosophies
19.3 Hypothesis Testing: Key Idea and Example
19.4 Multiple Hypothesis Testing
19.5 Parameter Estimation
19.6 Hypothesis Testing: t-Test
19.7 Confidence Intervals
19.8 Bayesian Statistics
19.9 Naive Bayesian Statistics
19.10 Bayesian Networks
19.11 Choosing Priors: Maximum Entropy or Domain Knowledge
19.12 Further Reading
19.13 Glossary
Chapter 20: Programming Language Concepts
20.1 Programming Paradigms
20.2 Compilation and Interpretation
20.3 Type Systems
20.4 Further Reading
20.5 Glossary
Chapter 21: Performance and Computer Memory
21.1 Example Script
21.2 Algorithm Performance and Big-O Notation
21.3 Some Classic Problems: Sorting a List and Binary Search
21.4 Amortized Performance and Average Performance
21.5 Two Principles: Reducing Overhead and Managing Memory
21.6 Performance Tip: Use Numerical Libraries When Applicable
21.7 Performance Tip: Delete Large Structures You Don't Need
21.8 Performance Tip: Use Built-In Functions When Possible
21.9 Performance Tip: Avoid Superfluous Function Calls
21.10 Performance Tip: Avoid Creating Large New Objects
21.11 Further Reading
21.12 Glossary
Part III: Specialized or Advanced Topics
Chapter 22: Computer Memory and Data Structures
22.1 Virtual Memory, the Stack, and the Heap
22.2 Example C Program
22.3 Data Types and Arrays in Memory
22.4 Structs
22.5 Pointers, the Stack, and the Heap
22.6 Key Data Structures
22.7 Further Reading
22.8 Glossary
Chapter 23: Maximum Likelihood Estimation and Optimization
23.1 Maximum Likelihood Estimation
23.2 A Simple Example: Fitting a Line
23.3 Another Example: Logistic Regression
23.4 Optimization
23.5 Gradient Descent and Convex Optimization
23.6 Convex Optimization
23.7 Stochastic Gradient Descent
23.8 Further Reading
23.9 Glossary
Chapter 24: Advanced Classifiers
24.1 A Note on Libraries
24.2 Basic Deep Learning
24.3 Convolutional Neural Networks
24.4 Different Types of Layers. What the Heck Is a Tensor?
24.5 Example: The MNIST Handwriting Dataset
24.6 Recurrent Neural Networks
24.7 Bayesian Networks
24.8 Training and Prediction
24.9 Markov Chain Monte Carlo
24.10 PyMC Example
24.11 Further Reading
24.12 Glossary
Chapter 25: Stochastic Modeling
25.1 Markov Chains
25.2 Two Kinds of Markov Chain, Two Kinds of Questions
25.3 Markov Chain Monte Carlo
25.4 Hidden Markov Models and the Viterbi Algorithm
25.5 The Viterbi Algorithm
25.6 Random Walks
25.7 Brownian Motion
25.8 ARIMA Models
25.9 Continuous-Time Markov Processes
25.10 Poisson Processes
25.11 Further Reading
25.12 Glossary
Parting Words: Your Future as a Data Scientist
Index
End User License Agreement

Content preview from The Data Science Handbook

Chapter 12: Data Encodings and File Formats

Coming from a background in academic physics, my first years in data science were one big exercise in discovering new data formats that I probably should have already known about. It was a bit demoralizing at the time, so let me make something clear up front: people are always dreaming up new data types and formats, and you will forever be playing catch-up on them. However, several formats are common enough that you should know them. Every new format that comes out seems to be easily understood as a variation of a previous format, so knowing the common ones will put you on good footing going forward. There are also some broad principles that underlie all formats, and I hope to give you a flavor of them.

First, I will talk about specific file formats that you are likely to encounter as a data scientist. This will include sample code for parsing them, discussions about when they are useful, and some thoughts about the future of data formats.
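As a small taste of the kind of parsing code this half of the chapter refers to, here is a minimal sketch that uses only Python's standard library. The sample records and the function names (`parse_csv_text`, `parse_json_text`) are hypothetical, invented for illustration, and are not taken from the book.

```python
import csv
import io
import json


def parse_csv_text(text):
    """Parse CSV text with a header row into a list of dicts.

    Every value comes back as a string; CSV carries no type information.
    """
    return list(csv.DictReader(io.StringIO(text)))


def parse_json_text(text):
    """Parse JSON text into native Python objects (numbers stay numbers)."""
    return json.loads(text)


# Hypothetical sample data, invented for this sketch.
CSV_SAMPLE = "name,age\nAlice,34\nBob,27\n"
JSON_SAMPLE = '[{"name": "Alice", "age": 34}, {"name": "Bob", "age": 27}]'

csv_rows = parse_csv_text(CSV_SAMPLE)
json_rows = parse_json_text(JSON_SAMPLE)

# The same record, as seen through each format.
print(type(csv_rows[0]["age"]).__name__)   # str
print(type(json_rows[0]["age"]).__name__)  # int
```

The contrast in the last two lines is one reason format choice matters: the CSV reader hands back the text `"34"`, while the JSON parser hands back the integer `34`.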

For the second half of the chapter, I will switch gears to a discussion of how data is laid out in the physical memory of a computer. This will involve peeking under the hood of the computer to look at performance considerations and give you a deeper understanding of the file formats we just discussed. This section will come in handy when you are dealing with particularly gnarly data pathologies or writing code that aims for speed when you are chugging through a dataset.
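To make "peeking under the hood" concrete, the following sketch shows how an integer and a short piece of text turn into raw bytes. It uses Python's standard `struct` module and built-in string encoding; the specific values (`1025`, `"é"`) are arbitrary choices for illustration, not examples from the book.

```python
import struct

value = 1025  # 0x00000401 in hexadecimal

# The same 4-byte signed integer, written in two byte orders.
little_endian = struct.pack("<i", value)  # least significant byte first
big_endian = struct.pack(">i", value)     # most significant byte first

print(little_endian.hex())  # 01040000
print(big_endian.hex())     # 00000401

# Reading the bytes back with the wrong byte order gives a different number.
misread = struct.unpack(">i", little_endian)[0]
print(misread)  # 17039360

# Text is also just bytes, and the bytes depend on the encoding.
utf8_bytes = "é".encode("utf-8")      # two bytes in UTF-8
latin1_bytes = "é".encode("latin-1")  # one byte in Latin-1
print(len(utf8_bytes), len(latin1_bytes))  # 2 1
```

Misreading byte order or text encoding is a common source of the "gnarly data pathologies" mentioned above: the bytes are intact, but the interpretation is wrong.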

12.1 Typical File Format Categories

There ...
