Book description
Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.
In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.
Topics include:
- Statistical inference, exploratory data analysis, and the data science process
- Algorithms
- Spam filters, Naive Bayes, and data wrangling
- Logistic regression
- Financial modeling
- Recommendation engines and causality
- Data visualization
- Social networks and data journalism
- Data engineering, MapReduce, Pregel, and Hadoop
Doing Data Science is a collaboration between course instructor Rachel Schutt, Senior VP of Data Science at News Corp, and data science consultant Cathy O’Neil, a senior data scientist at Johnson Research Labs, who attended and blogged about the course.
Table of contents
- Preface
- Motivation
- Origins of the Class
- Origins of the Book
- What to Expect from This Book
- How This Book Is Organized
- How to Read This Book
- How Code Is Used in This Book
- Who This Book Is For
- Prerequisites
- Supplemental Reading
- About the Contributors
- Conventions Used in This Book
- Using Code Examples
- O’Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Introduction: What Is Data Science?
- 2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
- 3. Algorithms
- 4. Spam Filters, Naive Bayes, and Wrangling
- 5. Logistic Regression
- 6. Time Stamps and Financial Modeling
- Kyle Teague and GetGlue
- Timestamps
- Cathy O’Neil
- Thought Experiment
- Financial Modeling
- In-Sample, Out-of-Sample, and Causality
- Preparing Financial Data
- Log Returns
- Example: The S&P Index
- Working out a Volatility Measurement
- Exponential Downweighting
- The Financial Modeling Feedback Loop
- Why Regression?
- Adding Priors
- A Baby Model
- Exercise: GetGlue and Timestamped Event Data
- Exercise: Financial Data
- 7. Extracting Meaning from Data
- 8. Recommendation Engines: Building a User-Facing Data Product at Scale
- A Real-World Recommendation Engine
- Nearest Neighbor Algorithm Review
- Some Problems with Nearest Neighbors
- Beyond Nearest Neighbor: Machine Learning Classification
- The Dimensionality Problem
- Singular Value Decomposition (SVD)
- Important Properties of SVD
- Principal Component Analysis (PCA)
- Alternating Least Squares
- Fix V and Update U
- Last Thoughts on These Algorithms
- Thought Experiment: Filter Bubbles
- Exercise: Build Your Own Recommendation System
- 9. Data Visualization and Fraud Detection
- 10. Social Networks and Data Journalism
- 11. Causality
- 12. Epidemiology
- 13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
- 14. Data Engineering: MapReduce, Pregel, and Hadoop
- 15. The Students Speak
- 16. Next-Generation Data Scientists, Hubris, and Ethics
- Index
Product information
- Title: Doing Data Science
- Author(s): Cathy O’Neil, Rachel Schutt
- Release date: October 2013
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781449358655