Data Science Bookcamp

Book description

Learn data science with Python by building five real-world projects! Experiment with card game predictions, disease outbreak tracking, and more as you build a flexible and intuitive understanding of data science.

In Data Science Bookcamp you will learn:

  • Techniques for computing and plotting probabilities
  • Statistical analysis using SciPy
  • How to organize datasets with clustering algorithms
  • How to visualize complex multi-variable datasets
  • How to train a decision tree machine learning algorithm
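
As a taste of the first technique on the list, the book's opening case study computes probabilities with an equation-free, sample-space approach. Here is a minimal sketch of that idea; the two-coin sample space and the `prob` helper are illustrative, not taken from the book's code.

```python
from fractions import Fraction

# Sample space for two fair coin flips: every outcome is equally likely,
# so a probability is just the fraction of outcomes satisfying a condition.
sample_space = {'HH', 'HT', 'TH', 'TT'}

def prob(condition):
    """Probability that a randomly drawn outcome satisfies `condition`."""
    matching = {outcome for outcome in sample_space if condition(outcome)}
    return Fraction(len(matching), len(sample_space))

# Probability of seeing at least one heads: 3 of 4 outcomes qualify.
p_at_least_one_heads = prob(lambda outcome: 'H' in outcome)
print(p_at_least_one_heads)  # 3/4
```

Using `Fraction` rather than floats keeps the answer exact, which matches the counting-based spirit of sample-space analysis.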

In Data Science Bookcamp you’ll test and build your knowledge of Python with the kind of open-ended problems that professional data scientists work on every day. Downloadable datasets and thoroughly explained solutions help you lock in what you’ve learned, building your confidence and making you ready for an exciting new data science career.

About the Technology
A data science project has a lot of moving parts, and it takes practice and skill to get all the code, algorithms, datasets, formats, and visualizations working together harmoniously. This unique book guides you through five realistic projects, including tracking disease outbreaks from news headlines, analyzing social networks, and finding relevant patterns in ad click data.

About the Book
Data Science Bookcamp doesn’t stop with surface-level theory and toy examples. As you work through each project, you’ll learn how to troubleshoot common problems like missing data, messy data, and algorithms that don’t quite fit the model you’re building. You’ll appreciate the detailed setup instructions and the fully explained solutions that highlight common failure points. In the end, you’ll be confident in your skills because you can see the results.

What's Inside
  • Web scraping
  • Organizing datasets with clustering algorithms
  • Visualizing complex multi-variable datasets
  • Training a decision tree machine learning algorithm
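
To give a flavor of the clustering material, here is a from-scratch sketch of K-means, the algorithm taught in chapter 10. The book itself uses scikit-learn's implementation; this toy version, with made-up 2D points and hand-picked starting centroids, just shows the assign-then-update loop at the algorithm's core.

```python
# Six made-up 2D points and K=2 clusters; centroids start at two of the points
# so the sketch stays deterministic.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
K = 2
centroids = [points[0], points[3]]

def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean(cluster):
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

for _ in range(10):  # a fixed iteration cap keeps the sketch short
    # Assignment step: attach each point to its nearest centroid.
    clusters = [[] for _ in range(K)]
    for p in points:
        nearest = min(range(K), key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster
    # (this toy version assumes no cluster ever empties out).
    centroids = [mean(c) for c in clusters]

print(clusters[0])  # the three lower-left points group together
```

Real use would reach for `sklearn.cluster.KMeans`, which adds smarter initialization and a convergence check, but the loop above is the whole idea.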


About the Reader
For readers who know the basics of Python. No prior data science or machine learning skills required.

About the Author
Leonard Apeltsin is the Head of Data Science at Anomaly, where his team applies advanced analytics to uncover healthcare fraud, waste, and abuse.

Quotes
Valuable and accessible… a solid foundation for anyone aspiring to be a data scientist.
- Amaresh Rajasekharan, IBM Corporation

Really good introduction of statistical data science concepts. A must-have for every beginner!
- Simone Sguazza, University of Applied Sciences and Arts of Southern Switzerland

A full-fledged tutorial in data science including common Python libraries and language tricks!
- Jean-François Morin, Laval University

This book is a complete package for understanding how the data science process works end to end.
- Ayon Roy, Internshala

Table of contents

  1. inside front cover
  2. Data Science Bookcamp
  3. Copyright
  4. dedication
  5. brief contents
  6. contents
  7. front matter
    1. preface
    2. acknowledgments
    3. about this book
      1. Who should read this book
      2. How this book is organized
      3. About the code
    4. about the author
    5. about the cover illustration
  8. Part 1. Case study 1: Finding the winning strategy in a card game
    1. Problem statement
    2. Overview
  9. 1 Computing probabilities using Python
    1. 1.1 Sample space analysis: An equation-free approach for measuring uncertainty in outcomes
      1. 1.1.1 Analyzing a biased coin
    2. 1.2 Computing nontrivial probabilities
      1. 1.2.1 Problem 1: Analyzing a family with four children
      2. 1.2.2 Problem 2: Analyzing multiple die rolls
      3. 1.2.3 Problem 3: Computing die-roll probabilities using weighted sample spaces
    3. 1.3 Computing probabilities over interval ranges
      1. 1.3.1 Evaluating extremes using interval analysis
    4. Summary
  10. 2 Plotting probabilities using Matplotlib
    1. 2.1 Basic Matplotlib plots
    2. 2.2 Plotting coin-flip probabilities
      1. 2.2.1 Comparing multiple coin-flip probability distributions
    3. Summary
  11. 3 Running random simulations in NumPy
    1. 3.1 Simulating random coin flips and die rolls using NumPy
      1. 3.1.1 Analyzing biased coin flips
    2. 3.2 Computing confidence intervals using histograms and NumPy arrays
      1. 3.2.1 Binning similar points in histogram plots
      2. 3.2.2 Deriving probabilities from histograms
      3. 3.2.3 Shrinking the range of a high confidence interval
      4. 3.2.4 Computing histograms in NumPy
    3. 3.3 Using confidence intervals to analyze a biased deck of cards
    4. 3.4 Using permutations to shuffle cards
    5. Summary
  12. 4 Case study 1 solution
    1. 4.1 Predicting red cards in a shuffled deck
      1. 4.1.1 Estimating the probability of strategy success
    2. 4.2 Optimizing strategies using the sample space for a 10-card deck
    3. Summary
  13. Part 2. Case study 2: Assessing online ad clicks for significance
    1. Problem statement
    2. Dataset description
    3. Overview
  14. 5 Basic probability and statistical analysis using SciPy
    1. 5.1 Exploring the relationships between data and probability using SciPy
    2. 5.2 Mean as a measure of centrality
      1. 5.2.1 Finding the mean of a probability distribution
    3. 5.3 Variance as a measure of dispersion
      1. 5.3.1 Finding the variance of a probability distribution
    4. Summary
  15. 6 Making predictions using the central limit theorem and SciPy
    1. 6.1 Manipulating the normal distribution using SciPy
      1. 6.1.1 Comparing two sampled normal curves
    2. 6.2 Determining the mean and variance of a population through random sampling
    3. 6.3 Making predictions using the mean and variance
      1. 6.3.1 Computing the area beneath a normal curve
      2. 6.3.2 Interpreting the computed probability
    4. Summary
  16. 7 Statistical hypothesis testing
    1. 7.1 Assessing the divergence between sample mean and population mean
    2. 7.2 Data dredging: Coming to false conclusions through oversampling
    3. 7.3 Bootstrapping with replacement: Testing a hypothesis when the population variance is unknown
    4. 7.4 Permutation testing: Comparing means of samples when the population parameters are unknown
    5. Summary
  17. 8 Analyzing tables using Pandas
    1. 8.1 Storing tables using basic Python
    2. 8.2 Exploring tables using Pandas
    3. 8.3 Retrieving table columns
    4. 8.4 Retrieving table rows
    5. 8.5 Modifying table rows and columns
    6. 8.6 Saving and loading table data
    7. 8.7 Visualizing tables using Seaborn
    8. Summary
  18. 9 Case study 2 solution
    1. 9.1 Processing the ad-click table in Pandas
    2. 9.2 Computing p-values from differences in means
    3. 9.3 Determining statistical significance
    4. 9.4 41 shades of blue: A real-life cautionary tale
    5. Summary
  19. Part 3. Case study 3: Tracking disease outbreaks using news headlines
    1. Problem statement
      1. Dataset description
    2. Overview
  20. 10 Clustering data into groups
    1. 10.1 Using centrality to discover clusters
    2. 10.2 K-means: A clustering algorithm for grouping data into K central groups
      1. 10.2.1 K-means clustering using scikit-learn
      2. 10.2.2 Selecting the optimal K using the elbow method
    3. 10.3 Using density to discover clusters
    4. 10.4 DBSCAN: A clustering algorithm for grouping data based on spatial density
      1. 10.4.1 Comparing DBSCAN and K-means
      2. 10.4.2 Clustering based on non-Euclidean distance
    5. 10.5 Analyzing clusters using Pandas
    6. Summary
  21. 11 Geographic location visualization and analysis
    1. 11.1 The great-circle distance: A metric for computing the distance between two global points
    2. 11.2 Plotting maps using Cartopy
      1. 11.2.1 Manually installing GEOS and Cartopy
      2. 11.2.2 Utilizing the Conda package manager
      3. 11.2.3 Visualizing maps
    3. 11.3 Location tracking using GeoNamesCache
      1. 11.3.1 Accessing country information
      2. 11.3.2 Accessing city information
      3. 11.3.3 Limitations of the GeoNamesCache library
    4. 11.4 Matching location names in text
    5. Summary
  22. 12 Case study 3 solution
    1. 12.1 Extracting locations from headline data
    2. 12.2 Visualizing and clustering the extracted location data
    3. 12.3 Extracting insights from location clusters
    4. Summary
  23. Part 4. Case study 4: Using online job postings to improve your data science resume
    1. Problem statement
      1. Dataset description
    2. Overview
  24. 13 Measuring text similarities
    1. 13.1 Simple text comparison
      1. 13.1.1 Exploring the Jaccard similarity
      2. 13.1.2 Replacing words with numeric values
    2. 13.2 Vectorizing texts using word counts
      1. 13.2.1 Using normalization to improve TF vector similarity
      2. 13.2.2 Using unit vector dot products to convert between relevance metrics
    3. 13.3 Matrix multiplication for efficient similarity calculation
      1. 13.3.1 Basic matrix operations
      2. 13.3.2 Computing all-by-all matrix similarities
    4. 13.4 Computational limits of matrix multiplication
    5. Summary
  25. 14 Dimension reduction of matrix data
    1. 14.1 Clustering 2D data in one dimension
      1. 14.1.1 Reducing dimensions using rotation
    2. 14.2 Dimension reduction using PCA and scikit-learn
    3. 14.3 Clustering 4D data in two dimensions
      1. 14.3.1 Limitations of PCA
    4. 14.4 Computing principal components without rotation
      1. 14.4.1 Extracting eigenvectors using power iteration
    5. 14.5 Efficient dimension reduction using SVD and scikit-learn
    6. Summary
  26. 15 NLP analysis of large text datasets
    1. 15.1 Loading online forum discussions using scikit-learn
    2. 15.2 Vectorizing documents using scikit-learn
    3. 15.3 Ranking words by both post frequency and count
      1. 15.3.1 Computing TFIDF vectors with scikit-learn
    4. 15.4 Computing similarities across large document datasets
    5. 15.5 Clustering texts by topic
      1. 15.5.1 Exploring a single text cluster
    6. 15.6 Visualizing text clusters
      1. 15.6.1 Using subplots to display multiple word clouds
    7. Summary
  27. 16 Extracting text from web pages
    1. 16.1 The structure of HTML documents
    2. 16.2 Parsing HTML using Beautiful Soup
    3. 16.3 Downloading and parsing online data
    4. Summary
  28. 17 Case study 4 solution
    1. 17.1 Extracting skill requirements from job posting data
      1. 17.1.1 Exploring the HTML for skill descriptions
    2. 17.2 Filtering jobs by relevance
    3. 17.3 Clustering skills in relevant job postings
      1. 17.3.1 Grouping the job skills into 15 clusters
      2. 17.3.2 Investigating the technical skill clusters
      3. 17.3.3 Investigating the soft-skill clusters
      4. 17.3.4 Exploring clusters at alternative values of K
      5. 17.3.5 Analyzing the 700 most relevant postings
    4. 17.4 Conclusion
    5. Summary
  29. Part 5. Case study 5: Predicting future friendships from social network data
    1. Problem statement
      1. Introducing the friend-of-a-friend recommendation algorithm
      2. Predicting user behavior
    2. Dataset description
      1. The Profiles table
      2. The Observations table
      3. The Friendships table
    3. Overview
  30. 18 An introduction to graph theory and network analysis
    1. 18.1 Using basic graph theory to rank websites by popularity
      1. 18.1.1 Analyzing web networks using NetworkX
    2. 18.2 Utilizing undirected graphs to optimize the travel time between towns
      1. 18.2.1 Modeling a complex network of towns and counties
      2. 18.2.2 Computing the fastest travel time between nodes
    3. Summary
  31. 19 Dynamic graph theory techniques for node ranking and social network analysis
    1. 19.1 Uncovering central nodes based on expected traffic in a network
      1. 19.1.1 Measuring centrality using traffic simulations
    2. 19.2 Computing travel probabilities using matrix multiplication
      1. 19.2.1 Deriving PageRank centrality from probability theory
      2. 19.2.2 Computing PageRank centrality using NetworkX
    3. 19.3 Community detection using Markov clustering
    4. 19.4 Uncovering friend groups in social networks
    5. Summary
  32. 20 Network-driven supervised machine learning
    1. 20.1 The basics of supervised machine learning
    2. 20.2 Measuring predicted label accuracy
      1. 20.2.1 Scikit-learn’s prediction measurement functions
    3. 20.3 Optimizing KNN performance
    4. 20.4 Running a grid search using scikit-learn
    5. 20.5 Limitations of the KNN algorithm
    6. Summary
  33. 21 Training linear classifiers with logistic regression
    1. 21.1 Linearly separating customers by size
    2. 21.2 Training a linear classifier
      1. 21.2.1 Improving perceptron performance through standardization
    3. 21.3 Improving linear classification with logistic regression
      1. 21.3.1 Running logistic regression on more than two features
    4. 21.4 Training linear classifiers using scikit-learn
      1. 21.4.1 Training multiclass linear models
    5. 21.5 Measuring feature importance with coefficients
    6. 21.6 Linear classifier limitations
    7. Summary
  34. 22 Training nonlinear classifiers with decision tree techniques
    1. 22.1 Automated learning of logical rules
      1. 22.1.1 Training a nested if/else model using two features
      2. 22.1.2 Deciding which feature to split on
      3. 22.1.3 Training if/else models with more than two features
    2. 22.2 Training decision tree classifiers using scikit-learn
      1. 22.2.1 Studying cancerous cells using feature importance
    3. 22.3 Decision tree classifier limitations
    4. 22.4 Improving performance using random forest classification
    5. 22.5 Training random forest classifiers using scikit-learn
    6. Summary
  35. 23 Case study 5 solution
    1. 23.1 Exploring the data
      1. 23.1.1 Examining the profiles
      2. 23.1.2 Exploring the experimental observations
      3. 23.1.3 Exploring the Friendships linkage table
    2. 23.2 Training a predictive model using network features
    3. 23.3 Adding profile features to the model
    4. 23.4 Optimizing performance across a steady set of features
    5. 23.5 Interpreting the trained model
      1. 23.5.1 Why are generalizable models so important?
    6. Summary
  36. index
  37. inside back cover

Product information

  • Title: Data Science Bookcamp
  • Author(s): Leonard Apeltsin
  • Release date: November 2021
  • Publisher(s): Manning Publications
  • ISBN: 9781617296253