
Effective Amazon Machine Learning

Book Description

Learn to leverage Amazon's powerful platform for your predictive analytics needs

About This Book

  • Create great machine learning models that combine the power of algorithms with interactive tools without worrying about the underlying complexity
  • Discover the next step for machine learning, machine learning in the cloud, with this unique guide
  • Create web services that allow you to perform affordable and fast machine learning on the cloud

Who This Book Is For

This book is intended for data scientists and managers of predictive analytics projects; it will teach beginner- to advanced-level machine learning practitioners how to leverage Amazon Machine Learning and complement their existing Data Science toolbox.

No substantive prior knowledge of Machine Learning, Data Science, statistics, or coding is required.

What You Will Learn

  • Learn how to use the Amazon Machine Learning service from scratch for predictive analytics
  • Gain hands-on experience of key Data Science concepts
  • Solve classic regression and classification problems
  • Run projects programmatically via the command line and the Python SDK
  • Leverage the Amazon Web Service ecosystem to access extended data sources
  • Implement streaming and advanced projects
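As a taste of the programmatic workflow the book covers, the sketch below assembles the kind of request parameters the Python SDK (boto3) expects when declaring an S3 datasource for Amazon Machine Learning. The bucket name, file name, schema fields, and identifiers here are illustrative assumptions, not values from the book, and the actual AWS call is left commented out since it requires configured credentials.

```python
import json

# Hypothetical data schema for a Titanic-style CSV stored on S3
# (field names and types are illustrative).
schema = {
    "version": "1.0",
    "targetAttributeName": "survived",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "survived", "attributeType": "BINARY"},
        {"attributeName": "pclass", "attributeType": "CATEGORICAL"},
        {"attributeName": "sex", "attributeType": "CATEGORICAL"},
        {"attributeName": "age", "attributeType": "NUMERIC"},
        {"attributeName": "fare", "attributeType": "NUMERIC"},
    ],
}

# Parameters in the shape expected by the Amazon ML SDK when creating
# a datasource from S3; the schema is passed as a JSON string.
datasource_params = {
    "DataSourceId": "ds-titanic-training",          # assumed identifier
    "DataSourceName": "Titanic training set",
    "DataSpec": {
        "DataLocationS3": "s3://my-aml-bucket/titanic_train.csv",  # assumed path
        "DataSchema": json.dumps(schema),
    },
    "ComputeStatistics": True,
}

# With AWS credentials configured, the call would look like:
# import boto3
# client = boto3.client("machinelearning")
# client.create_data_source_from_s3(**datasource_params)
```

Keeping the schema as a plain Python dict and serializing it at the last moment makes it easy to inspect and reuse across datasources, a pattern the later SDK chapters build on.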

In Detail

Predictive analytics is a complex domain requiring coding skills, an understanding of the mathematical concepts underpinning machine learning algorithms, and the ability to create compelling data visualizations. Now that AWS has simplified machine learning, this book will help you bring predictive analytics projects to fruition in three easy steps: data preparation, model tuning, and model selection.

This book will introduce you to the Amazon Machine Learning platform and walk you through core data science concepts such as classification, regression, regularization, overfitting, model selection, and evaluation. Furthermore, you will learn to leverage the Amazon Web Services (AWS) ecosystem for extended access to data sources, implement real-time predictions, and run Amazon Machine Learning projects via the command line and the Python SDK.

Towards the end of the book, you will also learn how to apply these services to other problems, such as text mining, and to more complex datasets.

Style and approach

This book includes use cases you can relate to. In a very practical manner, you will explore the various capabilities of the Amazon Machine Learning service, allowing you to implement them in your environment with ease.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files sent to you.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Errata
      3. Piracy
      4. Questions
  2. Introduction to Machine Learning and Predictive Analytics
    1. Introducing Amazon Machine Learning
      1. Machine Learning as a Service
      2. Leveraging full AWS integration
      3. Comparing performances
        1. Engineering data versus model variety
        2. Amazon's expertise and the gradient descent algorithm
      4. Pricing
    2. Understanding predictive analytics
      1. Building the simplest predictive analytics algorithm
      2. Regression versus classification
      3. Expanding regression to classification with logistic regression
      4. Extracting features to predict outcomes
    3. Diving further into linear modeling for prediction
      1. Validating the dataset
      2. Missing from Amazon ML
      3. The statistical approach versus the machine learning approach
    4. Summary
  3. Machine Learning Definitions and Concepts
    1. What's an algorithm? What's a model?
    2. Dealing with messy data
      1. Classic datasets versus real-world datasets
      2. Assumptions for multiclass linear models
      3. Missing values
      4. Normalization
      5. Imbalanced datasets
      6. Addressing multicollinearity
      7. Detecting outliers
      8. Accepting non-linear patterns
      9. Adding features?
      10. Preprocessing recapitulation
    3. The predictive analytics workflow
      1. Training and evaluation in Amazon ML
    4. Identifying and correcting poor performances
      1. Underfitting
      2. Overfitting
      3. Regularization on linear models
        1. L2 regularization and Ridge
        2. L1 regularization and Lasso
    5. Evaluating the performance of your model
    6. Summary
  4. Overview of an Amazon Machine Learning Workflow
    1. Opening an Amazon Web Services Account
      1. Security
    2. Setting up the account
      1. Creating a user
      2. Defining policies
      3. Creating login credentials
        1. Choosing a region
    3. Overview of a standard Amazon Machine Learning workflow
      1. The dataset
        1. Loading the data on S3
        2. Declaring a datasource
        3. Creating the datasource
      2. The model
      3. The evaluation of the model
        1. Comparing with a baseline
      4. Making batch predictions
    4. Summary
  5. Loading and Preparing the Dataset
    1. Working with datasets
      1. Finding open datasets
      2. Introducing the Titanic dataset
    2. Preparing the data
      1. Splitting the data
      2. Loading data on S3
        1. Creating a bucket
        2. Loading the data
        3. Granting permissions
        4. Formatting the data
    3. Creating the datasource
      1. Verifying the data schema
      2. Reusing the schema
    4. Examining data statistics
    5. Feature engineering with Athena
      1. Introducing Athena
        1. A brief tour of AWS Athena
      2. Creating a titanic database
        1. Using the wizard
        2. Creating the database and table directly in SQL
      3. Data munging in SQL
        1. Missing values
        2. Handling outliers in the fare
        3. Extracting the title from the name
        4. Inferring the deck from the cabin
        5. Calculating family size
        6. Wrapping up
      4. Creating an improved datasource
    6. Summary
  6. Model Creation
    1. Transforming data with recipes
      1. Managing variables
        1. Grouping variables
        2. Naming variables with assignments
        3. Specifying outputs
      2. Data processing through seven transformations
        1. Using simple transformations
        2. Text mining
        3. Coupling variables
        4. Binning numeric values
    2. Creating a model
      1. Editing the suggested recipe
        1. Applying recipes to the Titanic dataset
        2. Choosing between recipes and data pre-processing
      2. Parametrizing the model
        1. Setting model memory
        2. Setting the number of data passes
        3. Choosing regularization
    3. Creating an evaluation
      1. Evaluating the model
        1. Evaluating binary classification
        2. Exploring the model performances
        3. Evaluating linear regression
        4. Evaluating multiclass classification
    4. Analyzing the logs
      1. Optimizing the learning rate
        1. Visualizing convergence
        2. Impact of regularization
        3. Comparing different recipes on the Titanic dataset
        4. Keeping variables as numeric or applying quantile binning?
        5. Parsing the model logs
    5. Summary
  7. Predictions and Performances
    1. Making batch predictions
      1. Creating the batch prediction job
      2. Interpreting prediction outputs
        1. Reading the manifest file
        2. Reading the results file
        3. Assessing our predictions
        4. Evaluating the held-out dataset
        5. Finding out who will survive
        6. Multiplying trials
    2. Making real-time predictions
      1. Manually exploring variable influence
      2. Setting up real-time predictions
        1. AWS SDK
        2. Setting up AWS credentials
          1. AWS access keys
          2. Setting up AWS CLI
        3. Python SDK
    3. Summary
  8. Command Line and SDK
    1. Getting started and setting up
      1. Using the CLI versus SDK
      2. Installing AWS CLI
      3. Picking up CLI syntax
      4. Passing parameters using JSON files
      5. Introducing the Ames Housing dataset
      6. Splitting the dataset with shell commands
    2. A simple project using the CLI
      1. An overview of Amazon ML CLI commands
      2. Creating the datasource
      3. Creating the model
      4. Evaluating our model with create-evaluation
      5. What is cross-validation?
      6. Implementing Monte Carlo cross-validation
        1. Generating the shuffled datasets
        2. Generating the datasources template
        3. Generating the models template
        4. Generating the evaluations template
        5. The results
      7. Conclusion
    3. Boto3, the Python SDK
      1. Working with the Python SDK for Amazon Machine Learning
        1. Waiting on operation completion
        2. Wrapping up the Python-based workflow
      2. Implementing recursive feature selection with Boto3
        1. Managing schema and recipe
    4. Summary
  9. Creating Datasources from Redshift
    1. Choosing between RDS and Redshift
      1. Creating a Redshift instance
        1. Connecting through the command line
      2. Executing Redshift queries using Psql
      3. Creating our own non-linear dataset
        1. Uploading the nonlinear data to Redshift
    2. Introducing polynomial regression
      1. Establishing a baseline
    3. Polynomial regression in Amazon ML
      1. Driving the trials in Python
      2. Interpreting the results
    4. Summary
  10. Building a Streaming Data Analysis Pipeline
    1. Streaming Twitter sentiment analysis
      1. Popularity contest on Twitter
      2. The training dataset and the model
      3. Kinesis
        1. Kinesis Stream
        2. Kinesis Analytics
        3. Setting up Kinesis Firehose
      4. Producing tweets
      5. The Redshift database
      6. Adding Redshift to the Kinesis Firehose
        1. Setting up the roles and policies
        2. Dependencies and debugging
          1. Data format synchronization
          2. Debugging
      7. Preprocessing with Lambda
      8. Analyzing the results
        1. Downloading the dataset from Redshift
        2. Sentiment analysis with TextBlob
        3. Removing duplicate tweets
        4. And what is the most popular vegetable?
    2. Going beyond classification and regression
    3. Summary