Hands-On Data Science and Python Machine Learning

Book Description

This book covers the fundamentals of machine learning with Python in a concise and dynamic manner. It covers data mining and large-scale machine learning using Apache Spark.

About This Book

  • Take your first steps in the world of data science by understanding the tools and techniques of data analysis
  • Train efficient machine learning models in Python using supervised and unsupervised learning methods
  • Learn how to use Apache Spark for processing Big Data efficiently

Who This Book Is For

If you are a budding data scientist or a data analyst who wants to analyze and gain actionable insights from data using Python, this book is for you. Programmers with some experience in Python who want to enter the lucrative world of Data Science will also find this book to be very useful, but you don't need to be an expert Python coder or mathematician to get the most from this book.

What You Will Learn

  • Learn how to clean your data and ready it for analysis
  • Implement the popular clustering and regression methods in Python
  • Train efficient machine learning models using decision trees and random forests
  • Visualize the results of your analysis using Python’s Matplotlib library
  • Use Apache Spark’s MLlib package to perform machine learning on large datasets

In Detail

Join Frank Kane, who worked on Amazon and IMDb’s machine learning algorithms, as he guides you on your first steps into the world of data science. Hands-On Data Science and Python Machine Learning gives you the tools that you need to understand and explore the core topics in the field, and the confidence and practice to build and analyze your own machine learning models. With the help of interesting and easy-to-follow practical examples, Frank Kane explains potentially complex topics such as Bayesian methods and K-means clustering in a way that anybody can understand.

Based on Frank’s successful data science course, Hands-On Data Science and Python Machine Learning empowers you to conduct data analysis and perform efficient machine learning using Python. Let Frank help you unearth the value in your data using the various data mining and data analysis techniques available in Python, and develop efficient models to predict future results. You will also learn how to perform large-scale machine learning on Big Data using Apache Spark. The book covers preparing your data for analysis, training machine learning models, and visualizing the final data analysis.

Style and approach

This comprehensive book is a perfect blend of theory and hands-on code examples in Python which can be used for your reference at any time.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files e-mailed directly to you.

Table of Contents

  1. Preface
    1. Who this book is for
    2. Conventions
    3. Reader feedback
    4. Customer support
      1. Downloading the example code
      2. Downloading the color images of this book
      3. Errata
      4. Piracy
      5. Questions
  2. Getting Started
    1. Installing Enthought Canopy
      1. Giving the installation a test run
        1. If you occasionally get problems opening your IPYNB files
    2. Using and understanding IPython (Jupyter) Notebooks
    3. Python basics - Part 1
    4. Understanding Python code
    5. Importing modules
      1. Data structures
      2. Experimenting with lists
        1. Pre colon
        2. Post colon
        3. Negative syntax
        4. Adding list to list
        5. The append function
        6. Complex data structures
        7. Dereferencing a single element
        8. The sort function
        9. Reverse sort
      3. Tuples
        1. Dereferencing an element
        2. List of tuples
        3. Dictionaries
      4. Iterating through entries
    6. Python basics - Part 2
      1. Functions in Python
        1. Lambda functions - functional programming
        2. Understanding boolean expressions
          1. The if statement
          2. The if-else loop
      2. Looping
        1. The while loop
      3. Exploring activity
    7. Running Python scripts
      1. More options than just the IPython/Jupyter Notebook
      2. Running Python scripts in command prompt
      3. Using the Canopy IDE
    8. Summary
  3. Statistics and Probability Refresher, and Python Practice
    1. Types of data
      1. Numerical data
        1. Discrete data
        2. Continuous data
      2. Categorical data
      3. Ordinal data
    2. Mean, median, and mode
      1. Mean
      2. Median
        1. The factor of outliers
      3. Mode
    3. Using mean, median, and mode in Python
      1. Calculating mean using the NumPy package
        1. Visualizing data using matplotlib
      2. Calculating median using the NumPy package
        1. Analyzing the effect of outliers
      3. Calculating mode using the SciPy package
        1. Some exercises
    4. Standard deviation and variance
      1. Variance
        1. Measuring variance
      2. Standard deviation
        1. Identifying outliers with standard deviation
      3. Population variance versus sample variance
        1. The Mathematical explanation
      4. Analyzing standard deviation and variance on a histogram
      5. Using Python to compute standard deviation and variance
      6. Try it yourself
    5. Probability density function and probability mass function
      1. The probability density function and probability mass functions
        1. Probability density functions
        2. Probability mass functions
    6. Types of data distributions
      1. Uniform distribution
      2. Normal or Gaussian distribution
      3. The exponential probability distribution or Power law
      4. Binomial probability mass function
      5. Poisson probability mass function
    7. Percentiles and moments
      1. Percentiles
        1. Quartiles
        2. Computing percentiles in Python
      2. Moments
        1. Computing moments in Python
    8. Summary
  4. Matplotlib and Advanced Probability Concepts
    1. A crash course in Matplotlib
      1. Generating multiple plots on one graph
      2. Saving graphs as images
      3. Adjusting the axes
      4. Adding a grid
      5. Changing line types and colors
      6. Labeling axes and adding a legend
      7. A fun example
      8. Generating pie charts
      9. Generating bar charts
      10. Generating scatter plots
      11. Generating histograms
      12. Generating box-and-whisker plots
      13. Try it yourself
    2. Covariance and correlation
      1. Defining the concepts
        1. Measuring covariance
      2. Correlation
      3. Computing covariance and correlation in Python
        1. Computing correlation – The hard way
        2. Computing correlation – The NumPy way
      4. Correlation activity
    3. Conditional probability
      1. Conditional probability exercises in Python
      2. Conditional probability assignment
      3. My assignment solution
    4. Bayes' theorem
    5. Summary
  5. Predictive Models
    1. Linear regression
      1. The ordinary least squares technique
      2. The gradient descent technique
      3. The co-efficient of determination or r-squared
        1. Computing r-squared
        2. Interpreting r-squared
      4. Computing linear regression and r-squared using Python
      5. Activity for linear regression
    2. Polynomial regression
      1. Implementing polynomial regression using NumPy
        1. Computing the r-squared error
      2. Activity for polynomial regression
    3. Multivariate regression and predicting car prices
      1. Multivariate regression using Python
      2. Activity for multivariate regression
    4. Multi-level models
    5. Summary
  6. Machine Learning with Python
    1. Machine learning and train/test
      1. Unsupervised learning
      2. Supervised learning
        1. Evaluating supervised learning
        2. K-fold cross validation
    2. Using train/test to prevent overfitting of a polynomial regression
      1. Activity
    3. Bayesian methods - Concepts
    4. Implementing a spam classifier with Naïve Bayes
      1. Activity
    5. K-Means clustering
      1. Limitations to k-means clustering
    6. Clustering people based on income and age
      1. Activity
    7. Measuring entropy
    8. Decision trees - Concepts
      1. Decision tree example
      2. Walking through a decision tree
      3. Random forests technique
    9. Decision trees - Predicting hiring decisions using Python
      1. Ensemble learning – Using a random forest
      2. Activity
    10. Ensemble learning
    11. Support vector machine overview
    12. Using SVM to cluster people by using scikit-learn
      1. Activity
    13. Summary
  7. Recommender Systems
    1. What are recommender systems?
      1. User-based collaborative filtering
        1. Limitations of user-based collaborative filtering
    2. Item-based collaborative filtering
      1. Understanding item-based collaborative filtering
    3. How item-based collaborative filtering works?
      1. Collaborative filtering using Python
    4. Finding movie similarities
      1. Understanding the code
        1. The corrwith function
    5. Improving the results of movie similarities
    6. Making movie recommendations to people
      1. Understanding movie recommendations with an example
        1. Using the groupby command to combine rows
        2. Removing entries with the drop command
    7. Improving the recommendation results
    8. Summary
  8. More Data Mining and Machine Learning Techniques
    1. K-nearest neighbors - concepts
    2. Using KNN to predict a rating for a movie
      1. Activity
    3. Dimensionality reduction and principal component analysis
      1. Dimensionality reduction
      2. Principal component analysis
    4. A PCA example with the Iris dataset
      1. Activity
    5. Data warehousing overview
      1. ETL versus ELT
    6. Reinforcement learning
      1. Q-learning
      2. The exploration problem
        1. The simple approach
        2. The better way
      3. Fancy words
        1. Markov decision process
        2. Dynamic programming
    7. Summary
  9. Dealing with Real-World Data
    1. Bias/variance trade-off
    2. K-fold cross-validation to avoid overfitting
      1. Example of k-fold cross-validation using scikit-learn
    3. Data cleaning and normalisation
    4. Cleaning web log data
      1. Applying a regular expression on the web log
      2. Modification one - filtering the request field
      3. Modification two - filtering post requests
      4. Modification three - checking the user agents
        1. Filtering the activity of spiders/robots
      5. Modification four - applying website-specific filters
      6. Activity for web log data
    5. Normalizing numerical data
    6. Detecting outliers
      1. Dealing with outliers
      2. Activity for outliers
    7. Summary
  10. Apache Spark - Machine Learning on Big Data
    1. Installing Spark
      1. Installing Spark on Windows
      2. Installing Spark on other operating systems
      3. Installing the Java Development Kit
      4. Installing Spark
    2. Spark introduction
      1. It's scalable
      2. It's fast
      3. It's young
      4. It's not difficult
      5. Components of Spark
      6. Python versus Scala for Spark
    3. Spark and Resilient Distributed Datasets (RDD)
      1. The SparkContext object
      2. Creating RDDs
        1. Creating an RDD using a Python list
        2. Loading an RDD from a text file
      3. More ways to create RDDs
      4. RDD operations
        1. Transformations
          1. Using map()
        2. Actions
    4. Introducing MLlib
      1. Some MLlib Capabilities
      2. Special MLlib data types
        1. The vector data type
        2. LabeledPoint data type
        3. Rating data type
    5. Decision Trees in Spark with MLlib
      1. Exploring decision trees code
        1. Creating the SparkContext
        2. Importing and cleaning our data
        3. Creating a test candidate and building our decision tree
        4. Running the script
    6. K-Means Clustering in Spark
      1. Within set sum of squared errors (WSSSE)
        1. Running the code
    7. TF-IDF
      1. TF-IDF in practice
      2. Using TF-IDF
    8. Searching Wikipedia with Spark MLlib
      1. Import statements
      2. Creating the initial RDD
      3. Creating and transforming a HashingTF object
      4. Computing the TF-IDF score
      5. Using the Wikipedia search engine algorithm
        1. Running the algorithm
    9. Using the Spark 2.0 DataFrame API for MLlib
      1. How Spark 2.0 MLlib works
        1. Implementing linear regression
    10. Summary
  11. Testing and Experimental Design
    1. A/B testing concepts
      1. A/B tests
      2. Measuring conversion for A/B testing
        1. How to attribute conversions
      3. Variance is your enemy
    2. T-test and p-value
      1. The t-statistic or t-test
      2. The p-value
    3. Measuring t-statistics and p-values using Python
      1. Running A/B test on some experimental data
        1. When there's no real difference between the two groups
      2. Does the sample size make a difference?
        1. Sample size increased to six-digits
        2. Sample size increased to seven-digits
        3. A/A testing
    4. Determining how long to run an experiment for
    5. A/B test gotchas
      1. Novelty effects
      2. Seasonal effects
      3. Selection bias
        1. Auditing selection bias issues
      4. Data pollution
      5. Attribution errors
    6. Summary