Book description
Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.
In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.
Topics include:
 Statistical inference, exploratory data analysis, and the data science process
 Algorithms
 Spam filters, Naive Bayes, and data wrangling
 Logistic regression
 Financial modeling
 Recommendation engines and causality
 Data visualization
 Social networks and data journalism
 Data engineering, MapReduce, Pregel, and Hadoop
Doing Data Science is a collaboration between course instructor Rachel Schutt, Senior VP of Data Science at News Corp, and data science consultant Cathy O’Neil, a senior data scientist at Johnson Research Labs, who attended and blogged about the course.
Table of contents

Preface
 Motivation
 Origins of the Class
 Origins of the Book
 What to Expect from This Book
 How This Book Is Organized
 How to Read This Book
 How Code Is Used in This Book
 Who This Book Is For
 Prerequisites
 Supplemental Reading
 About the Contributors
 Conventions Used in This Book
 Using Code Examples
 O’Reilly Online Learning
 How to Contact Us
 Acknowledgments
 1. Introduction: What Is Data Science?
 2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
 3. Algorithms
 4. Spam Filters, Naive Bayes, and Wrangling
 5. Logistic Regression

 6. Time Stamps and Financial Modeling
 Kyle Teague and GetGlue
 Timestamps
 Cathy O’Neil
 Thought Experiment
 Financial Modeling
 In-Sample, Out-of-Sample, and Causality
 Preparing Financial Data
 Log Returns
 Example: The S&P Index
 Working out a Volatility Measurement
 Exponential Downweighting
 The Financial Modeling Feedback Loop
 Why Regression?
 Adding Priors
 A Baby Model
 Exercise: GetGlue and Timestamped Event Data
 Exercise: Financial Data
 7. Extracting Meaning from Data

 8. Recommendation Engines: Building a User-Facing Data Product at Scale
 A Real-World Recommendation Engine
 Nearest Neighbor Algorithm Review
 Some Problems with Nearest Neighbors
 Beyond Nearest Neighbor: Machine Learning Classification
 The Dimensionality Problem
 Singular Value Decomposition (SVD)
 Important Properties of SVD
 Principal Component Analysis (PCA)
 Alternating Least Squares
 Fix V and Update U
 Last Thoughts on These Algorithms
 Thought Experiment: Filter Bubbles
 Exercise: Build Your Own Recommendation System

 9. Data Visualization and Fraud Detection
 10. Social Networks and Data Journalism
 11. Causality
 12. Epidemiology
 13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
 14. Data Engineering: MapReduce, Pregel, and Hadoop
 15. The Students Speak
 16. Next-Generation Data Scientists, Hubris, and Ethics
 Index
Product information
 Title: Doing Data Science
 Author(s): Cathy O'Neil, Rachel Schutt
 Release date: October 2013
 Publisher(s): O'Reilly Media, Inc.
 ISBN: 9781449358655