
Data Science Bookcamp

Name: Data Science Bookcamp
Author: Leonard Apeltsin
ISBN: 9781617296253

by Leonard Apeltsin

November 2021

Beginner to intermediate

704 pages

20h 16m

English

Manning Publications

Audiobook available


inside front cover
Data Science Bookcamp
Copyright
dedication
brief contents
contents
front matter
preface
acknowledgments
about this book
Who should read this book
How this book is organized
About the code
about the author
about the cover illustration
Part 1. Case study 1: Finding the winning strategy in a card game
Problem statement
Overview
1 Computing probabilities using Python
1.1 Sample space analysis: An equation-free approach for measuring uncertainty in outcomes
1.1.1 Analyzing a biased coin
1.2 Computing nontrivial probabilities
1.2.1 Problem 1: Analyzing a family with four children
1.2.2 Problem 2: Analyzing multiple die rolls
1.2.3 Problem 3: Computing die-roll probabilities using weighted sample spaces
1.3 Computing probabilities over interval ranges
1.3.1 Evaluating extremes using interval analysis
Summary
2 Plotting probabilities using Matplotlib
2.1 Basic Matplotlib plots
2.2 Plotting coin-flip probabilities
2.2.1 Comparing multiple coin-flip probability distributions
Summary
3 Running random simulations in NumPy
3.1 Simulating random coin flips and die rolls using NumPy
3.1.1 Analyzing biased coin flips
3.2 Computing confidence intervals using histograms and NumPy arrays
3.2.1 Binning similar points in histogram plots
3.2.2 Deriving probabilities from histograms
3.2.3 Shrinking the range of a high confidence interval
3.2.4 Computing histograms in NumPy
3.3 Using confidence intervals to analyze a biased deck of cards
3.4 Using permutations to shuffle cards
Summary
4 Case study 1 solution
4.1 Predicting red cards in a shuffled deck
4.1.1 Estimating the probability of strategy success
4.2 Optimizing strategies using the sample space for a 10-card deck
Summary
Part 2. Case study 2: Assessing online ad clicks for significance
Problem statement
Dataset description
Overview
5 Basic probability and statistical analysis using SciPy
5.1 Exploring the relationships between data and probability using SciPy
5.2 Mean as a measure of centrality
5.2.1 Finding the mean of a probability distribution
5.3 Variance as a measure of dispersion
5.3.1 Finding the variance of a probability distribution
Summary
6 Making predictions using the central limit theorem and SciPy
6.1 Manipulating the normal distribution using SciPy
6.1.1 Comparing two sampled normal curves
6.2 Determining the mean and variance of a population through random sampling
6.3 Making predictions using the mean and variance
6.3.1 Computing the area beneath a normal curve
6.3.2 Interpreting the computed probability
Summary
7 Statistical hypothesis testing
7.1 Assessing the divergence between sample mean and population mean
7.2 Data dredging: Coming to false conclusions through oversampling
7.3 Bootstrapping with replacement: Testing a hypothesis when the population variance is unknown
7.4 Permutation testing: Comparing means of samples when the population parameters are unknown
Summary
8 Analyzing tables using Pandas
8.1 Storing tables using basic Python
8.2 Exploring tables using Pandas
8.3 Retrieving table columns
8.4 Retrieving table rows
8.5 Modifying table rows and columns
8.6 Saving and loading table data
8.7 Visualizing tables using Seaborn
Summary
9 Case study 2 solution
9.1 Processing the ad-click table in Pandas
9.2 Computing p-values from differences in means
9.3 Determining statistical significance
9.4 41 shades of blue: A real-life cautionary tale
Summary
Part 3. Case study 3: Tracking disease outbreaks using news headlines
Problem statement
Dataset description
Overview
10 Clustering data into groups
10.1 Using centrality to discover clusters
10.2 K-means: A clustering algorithm for grouping data into K central groups
10.2.1 K-means clustering using scikit-learn
10.2.2 Selecting the optimal K using the elbow method
10.3 Using density to discover clusters
10.4 DBSCAN: A clustering algorithm for grouping data based on spatial density
10.4.1 Comparing DBSCAN and K-means
10.4.2 Clustering based on non-Euclidean distance
10.5 Analyzing clusters using Pandas
Summary
11 Geographic location visualization and analysis
11.1 The great-circle distance: A metric for computing the distance between two global points
11.2 Plotting maps using Cartopy
11.2.1 Manually installing GEOS and Cartopy
11.2.2 Utilizing the Conda package manager
11.2.3 Visualizing maps
11.3 Location tracking using GeoNamesCache
11.3.1 Accessing country information
11.3.2 Accessing city information
11.3.3 Limitations of the GeoNamesCache library
11.4 Matching location names in text
Summary
12 Case study 3 solution
12.1 Extracting locations from headline data
12.2 Visualizing and clustering the extracted location data
12.3 Extracting insights from location clusters
Summary
Part 4. Case study 4: Using online job postings to improve your data science resume
Problem statement
Dataset description
Overview
13 Measuring text similarities
13.1 Simple text comparison
13.1.1 Exploring the Jaccard similarity
13.1.2 Replacing words with numeric values
13.2 Vectorizing texts using word counts
13.2.1 Using normalization to improve TF vector similarity
13.2.2 Using unit vector dot products to convert between relevance metrics
13.3 Matrix multiplication for efficient similarity calculation
13.3.1 Basic matrix operations
13.3.2 Computing all-by-all matrix similarities
13.4 Computational limits of matrix multiplication
Summary
14 Dimension reduction of matrix data
14.1 Clustering 2D data in one dimension
14.1.1 Reducing dimensions using rotation
14.2 Dimension reduction using PCA and scikit-learn
14.3 Clustering 4D data in two dimensions
14.3.1 Limitations of PCA
14.4 Computing principal components without rotation
14.4.1 Extracting eigenvectors using power iteration
14.5 Efficient dimension reduction using SVD and scikit-learn
Summary
15 NLP analysis of large text datasets
15.1 Loading online forum discussions using scikit-learn
15.2 Vectorizing documents using scikit-learn
15.3 Ranking words by both post frequency and count
15.3.1 Computing TFIDF vectors with scikit-learn
15.4 Computing similarities across large document datasets
15.5 Clustering texts by topic
15.5.1 Exploring a single text cluster
15.6 Visualizing text clusters
15.6.1 Using subplots to display multiple word clouds
Summary
16 Extracting text from web pages
16.1 The structure of HTML documents
16.2 Parsing HTML using Beautiful Soup
16.3 Downloading and parsing online data
Summary
17 Case study 4 solution
17.1 Extracting skill requirements from job posting data
17.1.1 Exploring the HTML for skill descriptions
17.2 Filtering jobs by relevance
17.3 Clustering skills in relevant job postings
17.3.1 Grouping the job skills into 15 clusters
17.3.2 Investigating the technical skill clusters
17.3.3 Investigating the soft-skill clusters
17.3.4 Exploring clusters at alternative values of K
17.3.5 Analyzing the 700 most relevant postings
17.4 Conclusion
Summary
Part 5. Case study 5: Predicting future friendships from social network data
Problem statement
Introducing the friend-of-a-friend recommendation algorithm
Predicting user behavior
Dataset description
The Profiles table
The Observations table
The Friendships table
Overview
18 An introduction to graph theory and network analysis
18.1 Using basic graph theory to rank websites by popularity
18.1.1 Analyzing web networks using NetworkX
18.2 Utilizing undirected graphs to optimize the travel time between towns
18.2.1 Modeling a complex network of towns and counties
18.2.2 Computing the fastest travel time between nodes
Summary
19 Dynamic graph theory techniques for node ranking and social network analysis
19.1 Uncovering central nodes based on expected traffic in a network
19.1.1 Measuring centrality using traffic simulations
19.2 Computing travel probabilities using matrix multiplication
19.2.1 Deriving PageRank centrality from probability theory
19.2.2 Computing PageRank centrality using NetworkX
19.3 Community detection using Markov clustering
19.4 Uncovering friend groups in social networks
Summary
20 Network-driven supervised machine learning
20.1 The basics of supervised machine learning
20.2 Measuring predicted label accuracy
20.2.1 Scikit-learn’s prediction measurement functions
20.3 Optimizing KNN performance
20.4 Running a grid search using scikit-learn
20.5 Limitations of the KNN algorithm
Summary
21 Training linear classifiers with logistic regression
21.1 Linearly separating customers by size
21.2 Training a linear classifier
21.2.1 Improving perceptron performance through standardization
21.3 Improving linear classification with logistic regression
21.3.1 Running logistic regression on more than two features
21.4 Training linear classifiers using scikit-learn
21.4.1 Training multiclass linear models
21.5 Measuring feature importance with coefficients
21.6 Linear classifier limitations
Summary
22 Training nonlinear classifiers with decision tree techniques
22.1 Automated learning of logical rules
22.1.1 Training a nested if/else model using two features
22.1.2 Deciding which feature to split on
22.1.3 Training if/else models with more than two features
22.2 Training decision tree classifiers using scikit-learn
22.2.1 Studying cancerous cells using feature importance
22.3 Decision tree classifier limitations
22.4 Improving performance using random forest classification
22.5 Training random forest classifiers using scikit-learn
Summary
23 Case study 5 solution
23.1 Exploring the data
23.1.1 Examining the profiles
23.1.2 Exploring the experimental observations
23.1.3 Exploring the Friendships linkage table
23.2 Training a predictive model using network features
23.3 Adding profile features to the model
23.4 Optimizing performance across a steady set of features
23.5 Interpreting the trained model
23.5.1 Why are generalizable models so important?
Summary
index
inside back cover

Content preview from Data Science Bookcamp

17 Case study 4 solution

This section covers

Parsing text from HTML
Computing text similarities
Clustering and exploring large text datasets

We have downloaded thousands of job postings by searching on this book’s table of contents for case studies 1 through 4 (see the problem statement for details). Besides the downloaded postings, we also have at our disposal two text files: resume.txt and table_of_contents.txt. The first file contains a resume draft, and the second contains the truncated table of contents used to query for job listing results. Our goal is to extract common data science skills from the downloaded job postings. Then we’ll compare these skills to our resume to determine which skills are missing. We will do so as follows:
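
One small piece of that goal, checking which skills a resume fails to mention, can be illustrated with the sketch below. It is an assumption-laden illustration, not the book's procedure: the helper names `tokenize` and `missing_skills`, the sample resume string, and the skill list are all hypothetical, and real postings would need the HTML parsing and clustering steps named in the chapter outline first.

```python
import re


def tokenize(text):
    """Lowercase the text and return the set of its word-like tokens."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def missing_skills(resume_text, skills):
    """Return the skills whose tokens do not all appear in the resume.

    A skill counts as present only if every one of its tokens occurs
    somewhere in the resume (a deliberately crude, hypothetical rule).
    """
    resume_words = tokenize(resume_text)
    return [skill for skill in skills if not tokenize(skill) <= resume_words]


# Made-up inputs for illustration; these are not taken from resume.txt
# or from the downloaded job postings.
resume = "Experienced with Python, Pandas, and statistical hypothesis testing."
skills = ["Python", "Pandas", "scikit-learn", "SQL"]

print(missing_skills(resume, skills))
```

Exact token matching like this misses synonyms and paraphrases, which is one reason a similarity-based comparison over vectorized text is usually preferred for larger datasets.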
