Book description
A comprehensive overview of data science covering the analytics, programming, and business skills necessary to master the discipline
Finding a good data scientist has been likened to hunting for a unicorn: the required combination of technical skills is simply very hard to find in one person. In addition, good data science is not just rote application of trainable skill sets; it requires the ability to think flexibly about all these areas and understand the connections between them. This book provides a crash course in data science, combining all the necessary skills into a unified discipline.
Unlike many analytics books, this one gives computer science and software engineering extensive coverage, since they play such a central role in the daily work of a data scientist. The author also describes classic machine learning algorithms, from their mathematical foundations to real-world applications. Visualization tools are reviewed, and their central importance in data science is highlighted. Classical statistics is addressed to help readers think critically about the interpretation of data and its common pitfalls. The clear communication of technical results, which is perhaps the most undertrained of data science skills, is given its own chapter, and all topics are explained in the context of solving real-world data problems. The book also features:
• Extensive sample code and tutorials using Python™ along with its technical libraries
• Core technologies of “Big Data,” including their strengths and limitations and how they can be used to solve real-world problems
• Coverage of the practical realities of the tools, keeping theory to a minimum; however, when theory is presented, it is done in an intuitive way to encourage critical thinking and creativity
• A wide variety of case studies from industry
• Practical advice on the realities of being a data scientist today, including the overall workflow, where time is spent, the types of datasets worked on, and the skill sets needed
The Data Science Handbook is an ideal resource for data analysis methodology and big data software tools. The book is appropriate for people who want to practice data science but lack the required skill sets. This includes software professionals who need to better understand analytics and statisticians who need to understand software. Modern data science is a unified discipline, and it is presented as such. This book is also an appropriate reference for researchers and entry-level graduate students who need to learn real-world analytics and expand their skill set.
FIELD CADY is the data scientist at the Allen Institute for Artificial Intelligence, where he develops tools that use machine learning to mine scientific literature. He has also worked at Google and several Big Data startups. He has a BS in physics and math from Stanford University, and an MS in computer science from Carnegie Mellon.
Table of contents
 Cover
 Title Page
 Preface

Part I: The Stuff You'll Always Use
 Chapter 2: The Data Science Road Map
 Chapter 3: Programming Languages
 Interlude: My Personal Toolkit
 Chapter 4: Data Munging: String Manipulation, Regular Expressions, and Data Cleaning

Chapter 5: Visualizations and Simple Metrics
 5.1 A Note on Python's Visualization Tools
 5.2 Example Code
 5.3 Pie Charts
 5.4 Bar Charts
 5.5 Histograms
 5.6 Means, Standard Deviations, Medians, and Quantiles
 5.7 Boxplots
 5.8 Scatterplots
 5.9 Scatterplots with Logarithmic Axes
 5.10 Scatter Matrices
 5.11 Heatmaps
 5.12 Correlations
 5.13 Anscombe's Quartet and the Limits of Numbers
 5.14 Time Series
 5.15 Further Reading
 5.16 Glossary
 Chapter 6: Machine Learning Overview
 Chapter 7: Interlude: Feature Extraction Ideas
 Chapter 8: Machine Learning Classification
 Chapter 9: Technical Communication and Documentation

Part II: Stuff You Still Need to Know
 Chapter 10: Unsupervised Learning: Clustering and Dimensionality Reduction
 Chapter 11: Regression

Chapter 12: Data Encodings and File Formats
 12.1 Typical File Format Categories
 12.2 CSV Files
 12.3 JSON Files
 12.4 XML Files
 12.5 HTML Files
 12.6 Tar Files
 12.7 GZip Files
 12.8 Zip Files
 12.9 Image Files: Rasterized, Vectorized, and/or Compressed
 12.10 It's All Bytes at the End of the Day
 12.11 Integers
 12.12 Floats
 12.13 Text Data
 12.14 Further Reading
 12.15 Glossary

Chapter 13: Big Data
 13.1 What Is Big Data?
 13.2 Hadoop: The File System and the Processor
 13.3 Using HDFS
 13.4 Example PySpark Script
 13.5 Spark Overview
 13.6 Spark Operations
 13.7 Two Ways to Run PySpark
 13.8 Configuring Spark
 13.9 Under the Hood
 13.10 Spark Tips and Gotchas
 13.11 The MapReduce Paradigm
 13.12 Performance Considerations
 13.13 Further Reading
 13.14 Glossary
 Chapter 14: Databases
 Chapter 15: Software Engineering Best Practices

Chapter 16: Natural Language Processing
 16.1 Do I Even Need NLP?
 16.2 The Great Divide: Language versus Statistics
 16.3 Example: Sentiment Analysis on Stock Market Articles
 16.4 Software and Datasets
 16.5 Tokenization
 16.6 Central Concept: Bag-of-Words
 16.7 Word Weighting: TF-IDF
 16.8 n-Grams
 16.9 Stop Words
 16.10 Lemmatization and Stemming
 16.11 Synonyms
 16.12 Part of Speech Tagging
 16.13 Common Problems
 16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding
 16.15 Further Reading
 16.16 Glossary

Chapter 17: Time Series Analysis
 17.1 Example: Predicting Wikipedia Page Views
 17.2 A Typical Workflow
 17.3 Time Series versus Time-Stamped Events
 17.4 Resampling and Interpolation
 17.5 Smoothing Signals
 17.6 Logarithms and Other Transformations
 17.7 Trends and Periodicity
 17.8 Windowing
 17.9 Brainstorming Simple Features
 17.10 Better Features: Time Series as Vectors
 17.11 Fourier Analysis: Sometimes a Magic Bullet
 17.12 Time Series in Context: The Whole Suite of Features
 17.13 Further Reading
 17.14 Glossary

Chapter 18: Probability
 18.1 Flipping Coins: Bernoulli Random Variables
 18.2 Throwing Darts: Uniform Random Variables
 18.3 The Uniform Distribution and Pseudorandom Numbers
 18.4 Nondiscrete, Noncontinuous Random Variables
 18.5 Notation, Expectations, and Standard Deviation
 18.6 Dependence, Marginal and Conditional Probability
 18.7 Understanding the Tails
 18.8 Binomial Distribution
 18.9 Poisson Distribution
 18.10 Normal Distribution
 18.11 Multivariate Gaussian
 18.12 Exponential Distribution
 18.13 Log-Normal Distribution
 18.14 Entropy
 18.15 Further Reading
 18.16 Glossary

Chapter 19: Statistics
 19.1 Statistics in Perspective
 19.2 Bayesian versus Frequentist: Practical Tradeoffs and Differing Philosophies
 19.3 Hypothesis Testing: Key Idea and Example
 19.4 Multiple Hypothesis Testing
 19.5 Parameter Estimation
 19.6 Hypothesis Testing: t-Test
 19.7 Confidence Intervals
 19.8 Bayesian Statistics
 19.9 Naive Bayesian Statistics
 19.10 Bayesian Networks
 19.11 Choosing Priors: Maximum Entropy or Domain Knowledge
 19.12 Further Reading
 19.13 Glossary
 Chapter 20: Programming Language Concepts

Chapter 21: Performance and Computer Memory
 21.1 Example Script
 21.2 Algorithm Performance and Big-O Notation
 21.3 Some Classic Problems: Sorting a List and Binary Search
 21.4 Amortized Performance and Average Performance
 21.5 Two Principles: Reducing Overhead and Managing Memory
 21.6 Performance Tip: Use Numerical Libraries When Applicable
 21.7 Performance Tip: Delete Large Structures You Don't Need
 21.8 Performance Tip: Use Built-In Functions When Possible
 21.9 Performance Tip: Avoid Superfluous Function Calls
 21.10 Performance Tip: Avoid Creating Large New Objects
 21.11 Further Reading
 21.12 Glossary

Part III: Specialized or Advanced Topics
 Chapter 22: Computer Memory and Data Structures
 Chapter 23: Maximum Likelihood Estimation and Optimization

Chapter 24: Advanced Classifiers
 24.1 A Note on Libraries
 24.2 Basic Deep Learning
 24.3 Convolutional Neural Networks
 24.4 Different Types of Layers. What the Heck Is a Tensor?
 24.5 Example: The MNIST Handwriting Dataset
 24.6 Recurrent Neural Networks
 24.7 Bayesian Networks
 24.8 Training and Prediction
 24.9 Markov Chain Monte Carlo
 24.10 PyMC Example
 24.11 Further Reading
 24.12 Glossary

Chapter 25: Stochastic Modeling
 25.1 Markov Chains
 25.2 Two Kinds of Markov Chain, Two Kinds of Questions
 25.3 Markov Chain Monte Carlo
 25.4 Hidden Markov Models and the Viterbi Algorithm
 25.5 The Viterbi Algorithm
 25.6 Random Walks
 25.7 Brownian Motion
 25.8 ARIMA Models
 25.9 Continuous-Time Markov Processes
 25.10 Poisson Processes
 25.11 Further Reading
 25.12 Glossary
 Parting Words: Your Future as a Data Scientist
 Index
 End User License Agreement
Product information
 Title: The Data Science Handbook
 Author(s): Field Cady
 Release date: February 2017
 Publisher(s): Wiley
 ISBN: 9781119092940