Principles of Data Science

Book description

Learn the techniques and math you need to start making sense of your data

About This Book

  • Enhance your knowledge of coding with data science theory for practical insight into data science and analysis
  • Go beyond a math class and learn how to perform real-world data science tasks with R and Python
  • Create actionable insights and transform raw data into tangible value

Who This Book Is For

You should be fairly well acquainted with basic algebra and should feel comfortable reading snippets of R/Python as well as pseudocode. You should have the urge to learn and apply the techniques put forth in this book on either your own datasets or those provided to you. If you have the basic math skills but want to apply them in data science, or you have good programming skills but lack math, then this book is for you.

What You Will Learn

  • Get to know the five most important steps of data science
  • Use your data intelligently and learn how to handle it with care
  • Bridge the gap between mathematics and programming
  • Learn about probability, calculus, and how to use statistical models to control and clean your data and drive actionable results
  • Build and evaluate baseline machine learning models
  • Explore the most effective metrics to determine the success of your machine learning models
  • Create data visualizations that communicate actionable insights
  • Read and apply machine learning concepts to your problems and make actual predictions

In Detail

Need to turn your skills at programming into effective data science skills? Principles of Data Science is created to help you join the dots between mathematics, programming, and business analysis. With this book, you’ll feel confident about asking—and answering—complex and sophisticated questions of your data to move from abstract and raw statistics to actionable ideas.

With a unique approach that bridges the gap between mathematics and computer science, this book takes you through the entire data science pipeline. Beginning with cleaning and preparing data, and effective data mining strategies and techniques, you’ll move on to build a comprehensive picture of how every piece of the data science puzzle fits together. Learn the fundamentals of computational mathematics and statistics, as well as some pseudocode being used today by data scientists and analysts. You’ll get to grips with machine learning, discover the statistical models that help you take control and navigate even the densest datasets, and find out how to create powerful visualizations that communicate what your data means.

Style and approach

This is an easy-to-understand and accessible tutorial. It is a step-by-step guide with use cases, examples, and illustrations to get you well versed in the concepts of data science. Along with explaining the fundamentals, the book also introduces slightly more advanced concepts later on and helps you implement these techniques in the real world.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your Packt account. If you purchased this book elsewhere, you can register with Packt to get the code files.

Table of contents

  1. Principles of Data Science
    1. Table of Contents
    2. Principles of Data Science
    3. Credits
    4. About the Author
    5. About the Reviewers
      1. eBooks, discount offers, and more
        1. Why subscribe?
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    8. 1. How to Sound Like a Data Scientist
      1. What is data science?
        1. Basic terminology
        2. Why data science?
        3. Example – Sigma Technologies
      2. The data science Venn diagram
        1. The math
          1. Example – spawner-recruit models
        2. Computer programming
        3. Why Python?
          1. Python practices
          2. Example of basic Python
        4. Example – parsing a single tweet
        5. Domain knowledge
      3. Some more terminology
      4. Data science case studies
        1. Case study – automating government paper pushing
          1. Fire all humans, right?
        2. Case study – marketing dollars
        3. Case study – what's in a job description?
      5. Summary
    9. 2. Types of Data
      1. Flavors of data
      2. Why look at these distinctions?
      3. Structured versus unstructured data
        1. Example of data preprocessing
          1. Word/phrase counts
          2. Presence of certain special characters
          3. Relative length of text
          4. Picking out topics
      4. Quantitative versus qualitative data
        1. Example – coffee shop data
        2. Example – world alcohol consumption data
        3. Digging deeper
      5. The road thus far…
      6. The four levels of data
        1. The nominal level
          1. Mathematical operations allowed
          2. Measures of center
          3. What data is like at the nominal level
        2. The ordinal level
          1. Examples
          2. Mathematical operations allowed
          3. Measures of center
          4. Quick recap and check
        3. The interval level
          1. Example
          2. Mathematical operations allowed
          3. Measures of center
          4. Measures of variation
            1. Standard deviation
        4. The ratio level
          1. Examples
          2. Measures of center
          3. Problems with the ratio level
      7. Data is in the eye of the beholder
      8. Summary
    10. 3. The Five Steps of Data Science
      1. Introduction to data science
      2. Overview of the five steps
        1. Ask an interesting question
        2. Obtain the data
        3. Explore the data
        4. Model the data
        5. Communicate and visualize the results
      3. Explore the data
        1. Basic questions for data exploration
        2. Dataset 1 – Yelp
          1. Dataframes
          2. Series
          3. Exploration tips for qualitative data
            1. Nominal level columns
            2. Filtering in Pandas
            3. Ordinal level columns
        3. Dataset 2 – titanic
      4. Summary
    11. 4. Basic Mathematics
      1. Mathematics as a discipline
      2. Basic symbols and terminology
        1. Vectors and matrices
          1. Quick exercises
          2. Answers
        2. Arithmetic symbols
          1. Summation
          2. Proportional
          3. Dot product
        3. Graphs
        4. Logarithms/exponents
        5. Set theory
      3. Linear algebra
        1. Matrix multiplication
          1. How to multiply matrices
      4. Summary
    12. 5. Impossible or Improbable – A Gentle Introduction to Probability
      1. Basic definitions
      2. Probability
      3. Bayesian versus Frequentist
        1. Frequentist approach
          1. The law of large numbers
      4. Compound events
      5. Conditional probability
      6. The rules of probability
        1. The addition rule
        2. Mutual exclusivity
        3. The multiplication rule
        4. Independence
        5. Complementary events
      7. A bit deeper
      8. Summary
    13. 6. Advanced Probability
      1. Collectively exhaustive events
      2. Bayesian ideas revisited
        1. Bayes theorem
        2. More applications of Bayes theorem
          1. Example – Titanic
          2. Example – medical studies
      3. Random variables
        1. Discrete random variables
          1. Types of discrete random variables
            1. Binomial random variables
            2. Poisson random variable
            3. Continuous random variables
      4. Summary
    14. 7. Basic Statistics
      1. What are statistics?
      2. How do we obtain and sample data?
        1. Obtaining data
          1. Observational
          2. Experimental
      3. Sampling data
        1. Probability sampling
        2. Random sampling
        3. Unequal probability sampling
      4. How do we measure statistics?
        1. Measures of center
        2. Measures of variation
          1. Definition
          2. Example – employee salaries
        3. Measures of relative standing
          1. The insightful part – correlations in data
      5. The Empirical rule
      6. Summary
    15. 8. Advanced Statistics
      1. Point estimates
      2. Sampling distributions
      3. Confidence intervals
      4. Hypothesis tests
        1. Conducting a hypothesis test
        2. One sample t-tests
          1. Example of a one sample t-tests
          2. Assumptions of the one sample t-tests
        3. Type I and type II errors
        4. Hypothesis test for categorical variables
          1. Chi-square goodness of fit test
            1. Assumptions of the chi-square goodness of fit test
            2. Example of a chi-square test for goodness of fit
          2. Chi-square test for association/independence
          3. Assumptions of the chi-square independence test
      5. Summary
    16. 9. Communicating Data
      1. Why does communication matter?
      2. Identifying effective and ineffective visualizations
        1. Scatter plots
        2. Line graphs
        3. Bar charts
        4. Histograms
        5. Box plots
      3. When graphs and statistics lie
        1. Correlation versus causation
        2. Simpson's paradox
        3. If correlation doesn't imply causation, then what does?
      4. Verbal communication
        1. It's about telling a story
        2. On the more formal side of things
      5. The why/how/what strategy of presenting
      6. Summary
    17. 10. How to Tell If Your Toaster Is Learning – Machine Learning Essentials
      1. What is machine learning?
      2. Machine learning isn't perfect
      3. How does machine learning work?
      4. Types of machine learning
        1. Supervised learning
          1. It's not only about predictions
          2. Types of supervised learning
            1. Regression
            2. Classification
          3. Data is in the eyes of the beholder
        2. Unsupervised learning
          1. Reinforcement learning
          2. Overview of the types of machine learning
      5. How does statistical modeling fit into all of this?
      6. Linear regression
        1. Adding more predictors
        2. Regression metrics
      7. Logistic regression
      8. Probability, odds, and log odds
        1. The math of logistic regression
      9. Dummy variables
      10. Summary
    18. 11. Predictions Don't Grow on Trees – or Do They?
      1. Naïve Bayes classification
      2. Decision trees
        1. How does a computer build a regression tree?
        2. How does a computer fit a classification tree?
      3. Unsupervised learning
        1. When to use unsupervised learning
      4. K-means clustering
        1. Illustrative example – data points
        2. Illustrative example – beer!
      5. Choosing an optimal number for K and cluster validation
        1. The Silhouette Coefficient
      6. Feature extraction and principal component analysis
      7. Summary
    19. 12. Beyond the Essentials
      1. The bias variance tradeoff
        1. Error due to bias
        2. Error due to variance
        3. Two extreme cases of bias/variance tradeoff
          1. Underfitting
          2. Overfitting
        4. How bias/variance play into error functions
      2. K folds cross-validation
      3. Grid searching
        1. Visualizing training error versus cross-validation error
      4. Ensembling techniques
        1. Random forests
        2. Comparing Random forests with decision trees
      5. Neural networks
        1. Basic structure
      6. Summary
    20. 13. Case Studies
      1. Case study 1 – predicting stock prices based on social media
        1. Text sentiment analysis
        2. Exploratory data analysis
          1. Regression route
          2. Classification route
        3. Going beyond with this example
      2. Case study 2 – why do some people cheat on their spouses?
      3. Case study 3 – using tensorflow
        1. Tensorflow and neural networks
      4. Summary
    21. Index

Product information

  • Title: Principles of Data Science
  • Author(s): Sinan Ozdemir
  • Release date: December 2016
  • Publisher(s): Packt Publishing
  • ISBN: 9781785887918