Machine Learning Solutions

Book description

Practical, hands-on solutions in Python to overcome any problem in Machine Learning

About This Book
  • Master the advanced concepts, methodologies, and use cases of machine learning
  • Build ML applications for analytics, NLP and computer vision domains
  • Solve the most common problems in building machine learning models

Who This Book Is For

This book is for intermediate users, such as machine learning engineers, data engineers, and data scientists, who want to solve simple to complex machine learning problems in their day-to-day work and build powerful, efficient machine learning models. A basic understanding of machine learning concepts and some experience with Python programming are all you need to get started with this book.

What You Will Learn
  • Select the right algorithm to derive the best solution in ML domains
  • Perform predictive analysis efficiently using ML algorithms
  • Predict stock prices using the stock index value
  • Perform customer analytics for an e-commerce platform
  • Build recommendation engines for various domains
  • Build NLP applications for the health domain
  • Build language generation applications using different NLP techniques
  • Build computer vision applications such as facial emotion recognition

In Detail

Machine learning (ML) helps you find hidden insights from your data without the need for explicit programming. This book is your key to solving any kind of ML problem you might come across in your job.

You'll encounter a set of simple to complex problems while building ML models, and you'll not only resolve these problems, but you'll also learn how to build projects based on each problem, with a practical approach and easy-to-follow examples.

The book covers a wide range of applications, from analytics and NLP to computer vision. Some of the applications you will work on include stock price prediction, a recommendation engine, a chatbot, a facial expression recognition system, and many more. The problems we cover include identifying the right algorithm for your dataset and use case, creating and labeling datasets, getting enough clean data to carry out processing, identifying outliers, overfitting, hyperparameter tuning, and more. You'll also learn to make more timely and accurate predictions.

In addition, you'll deal with more advanced use cases, such as building a gaming bot and an extractive summarization tool for medical documents, and you'll tackle the problems faced while building an ML model. By the end of this book, you'll be able to fine-tune your models as per your needs to deliver maximum productivity.

Style and approach

This book is a step-by-step guide to developing machine learning applications for various domains. Each chapter contains a practical guide to building a specific machine learning application, from its baseline approach to the best possible approach. The necessary basic concepts, common mistakes for every approach, and optimization techniques are discussed for each application.

Table of contents

  1. Machine Learning Solutions
    1. Table of Contents
    2. Machine Learning Solutions
      1. Why subscribe?
      2. PacktPub.com
    3. Foreword
    4. Contributors
      1. About the author
      2. About the reviewer
      3. Packt is Searching for Authors Like You
    5. Preface
      1. Who this book is for
      2. What this book covers
      3. To get the most out of this book
        1. Download the example code files
        2. Conventions used
      4. Get in touch
        1. Reviews
    6. 1. Credit Risk Modeling
      1. Introducing the problem statement
      2. Understanding the dataset
        1. Understanding attributes of the dataset
        2. Data analysis
          1. Data preprocessing
            1. First change
            2. Second change
            3. Implementing the changes
          2. Basic data analysis followed by data preprocessing
            1. Listing statistical properties
            2. Finding missing values
            3. Replacing missing values
            4. Correlation
            5. Detecting outliers
            6. Outliers detection techniques
            7. Percentile-based outlier detection
            8. Median Absolute Deviation (MAD)-based outlier detection
            9. Standard Deviation (STD)-based outlier detection
            10. Majority-vote-based outlier detection
            11. Visualization of outliers
            12. Handling outliers
            13. Revolving utilization of unsecured lines
            14. Age
            15. Number of times 30-59 days past due not worse
            16. Debt ratio
            17. Monthly income
            18. Number of open credit lines and loans
            19. Number of times 90 days late
            20. Number of real estate loans or lines
            21. Number of times 60-89 days past due not worse
          3. Number of dependents
      3. Feature engineering for the baseline model
        1. Finding out Feature importance
      4. Selecting machine learning algorithms
        1. K-Nearest Neighbor (KNN)
        2. Logistic regression
        3. AdaBoost
        4. GradientBoosting
        5. RandomForest
      5. Training the baseline model
      6. Understanding the testing matrix
        1. The Mean accuracy of the trained models
        2. The ROC-AUC score
          1. ROC
          2. AUC
      7. Testing the baseline model
      8. Problems with the existing approach
      9. Optimizing the existing approach
        1. Understanding key concepts to optimize the approach
          1. Cross-validation
            1. The approach of using CV
          2. Hyperparameter tuning
            1. Grid search parameter tuning
            2. Random search parameter tuning
      10. Implementing the revised approach
        1. Implementing a cross-validation based approach
        2. Implementing hyperparameter tuning
        3. Implementing and testing the revised approach
        4. Understanding problems with the revised approach
      11. Best approach
        1. Implementing the best approach
          1. Log transformation of features
          2. Voting-based ensemble ML model
          3. Running ML models on real test data
      12. Summary
    7. 2. Stock Market Price Prediction
      1. Introducing the problem statement
      2. Collecting the dataset
        1. Collecting DJIA index prices
        2. Collecting news articles
      3. Understanding the dataset
        1. Understanding the DJIA dataset
        2. Understanding the NYTimes news article dataset
      4. Data preprocessing and data analysis
        1. Preparing the DJIA training dataset
        2. Basic data analysis for a DJIA dataset
        3. Preparing the NYTimes news dataset
          1. Converting publication date into the YYYY-MM-DD format
          2. Filtering news articles by category
          3. Implementing the filter functionality and merging the dataset
          4. Saving the merged dataset in the pickle file format
      5. Feature engineering
        1. Loading the dataset
        2. Minor preprocessing
          1. Converting adj close price into the integer format
          2. Removing the leftmost dot from news headlines
        3. Feature engineering
        4. Sentiment analysis of NYTimes news articles
      6. Selecting the Machine Learning algorithm
      7. Training the baseline model
        1. Splitting the training and testing dataset
        2. Splitting prediction labels for the training and testing datasets
        3. Converting sentiment scores into the numpy array
        4. Training of the ML model
      8. Understanding the testing matrix
        1. The default testing matrix
        2. The visualization approach
      9. Testing the baseline model
        1. Generating and interpreting the output
        2. Generating the accuracy score
        3. Visualizing the output
      10. Exploring problems with the existing approach
        1. Alignment
        2. Smoothing
        3. Trying a different ML algorithm
      11. Understanding the revised approach
        1. Understanding concepts and approaches
          1. Alignment-based approach
          2. Smoothing-based approach
          3. Logistic Regression-based approach
      12. Implementing the revised approach
        1. Implementation
          1. Implementing alignment
          2. Implementing smoothing
          3. Implementing logistic regression
        2. Testing the revised approach
        3. Understanding the problem with the revised approach
      13. The best approach
      14. Summary
    8. 3. Customer Analytics
      1. Introducing customer segmentation
        1. Introducing the problem statement
      2. Understanding the datasets
        1. Description of the dataset
        2. Downloading the dataset
        3. Attributes of the dataset
      3. Building the baseline approach
        1. Implementing the baseline approach
          1. Data preparation
            1. Loading the dataset
          2. Exploratory data analysis (EDA)
            1. Removing null data entries
            2. Removing duplicate data entries
            3. EDA for various data attributes
              1. Country
              2. Customer and products
              3. Product categories
                1. Analyzing the product description
                2. Defining product categories
              4. Characterizing the content of clusters
                1. Silhouette intra-cluster score analysis
              5. Analysis using a word cloud
              6. Principal component analysis (PCA)
          3. Generating customer categories
            1. Formatting data
              1. Grouping products
              2. Splitting the dataset
              3. Grouping orders
            2. Creating customer categories
              1. Data encoding
              2. Generating customer categories
              3. PCA analysis
              4. Analyzing the cluster using silhouette scores
          4. Classifying customers
            1. Defining helper functions
            2. Splitting the data into training and testing
            3. Implementing the Machine Learning (ML) algorithm
        2. Understanding the testing matrix
          1. Confusion matrix
          2. Learning curve
        3. Testing the result of the baseline approach
          1. Generating the accuracy score for classifier
          2. Generating the confusion matrix for the classifier
          3. Generating the learning curve for the classifier
        4. Problems with the baseline approach
        5. Optimizing the baseline approach
      4. Building the revised approach
        1. Implementing the revised approach
        2. Testing the revised approach
        3. Problems with the revised approach
          1. Understanding how to improve the revised approach
      5. The best approach
        1. Implementing the best approach
        2. Testing the best approach
          1. Transforming the hold-out corpus in the form of the training dataset
          2. Converting the transformed dataset into a matrix form
          3. Generating the predictions
      6. Customer segmentation for various domains
      7. Summary
    9. 4. Recommendation Systems for E-Commerce
      1. Introducing the problem statement
      2. Understanding the datasets
        1. e-commerce Item Data
        2. The Book-Crossing dataset
          1. BX-Book-Ratings.csv
          2. BX-Books.csv
          3. BX-Users.csv
      3. Building the baseline approach
        1. Understanding the basic concepts
          1. Understanding the content-based approach
        2. Implementing the baseline approach
          1. Architecture of the recommendation system
          2. Steps for implementing the baseline approach
            1. Loading the dataset
            2. Generating features using TF-IDF
            3. Building the cosine similarity matrix
            4. Generating the prediction
        3. Understanding the testing matrix
        4. Testing the result of the baseline approach
        5. Problems with the baseline approach
        6. Optimizing the baseline approach
      4. Building the revised approach
        1. Implementing the revised approach
          1. Loading dataset
          2. EDA of the book-rating datafile
          3. Exploring the book datafile
          4. EDA of the user datafile
          5. Implementing the logic of correlation for the recommendation engine
            1. Recommendations based on the rating of the books
            2. Recommendations based on correlations
        2. Testing the revised approach
        3. Problems with the revised approach
          1. Understanding how to improve the revised approach
      5. The best approach
        1. Understanding the key concepts
          1. Collaborative filtering
            1. Memory-based CF
              1. User-user collaborative filtering
              2. Item-item collaborative filtering
            2. Model-based CF
              1. Matrix-factorization-based algorithms
              2. Difference between memory-based CF and model-based CF
        2. Implementing the best approach
          1. Loading the dataset
          2. Merging the data frames
          3. EDA for the merged data frames
          4. Filtering data based on geolocation
          5. Applying the KNN algorithm
          6. Recommendation using the KNN algorithm
          7. Applying matrix factorization
          8. Recommendation using matrix factorization
      6. Summary
    10. 5. Sentiment Analysis
      1. Introducing problem statements
      2. Understanding the dataset
        1. Understanding the content of the dataset
          1. Train folder
          2. Test folder
          3. imdb.vocab file
          4. imdbEr.txt file
          5. README
        2. Understanding the contents of the movie review files
      3. Building the training and testing datasets for the baseline model
      4. Feature engineering for the baseline model
      5. Selecting the machine learning algorithm
      6. Training the baseline model
        1. Implementing the baseline model
          1. Multinomial naive Bayes
          2. C-support vector classification with kernel rbf
          3. C-support vector classification with kernel linear
          4. Linear support vector classification
      7. Understanding the testing matrix
        1. Precision
        2. Recall
        3. F1-Score
        4. Support
        5. Training accuracy
      8. Testing the baseline model
        1. Testing of Multinomial naive Bayes
        2. Testing of SVM with rbf kernel
        3. Testing SVM with the linear kernel
        4. Testing SVM with linearSVC
      9. Problem with the existing approach
      10. How to optimize the existing approach
        1. Understanding key concepts for optimizing the approach
      11. Implementing the revised approach
        1. Importing the dependencies
        2. Downloading and loading the IMDb dataset
        3. Choosing the top words and the maximum text length
        4. Implementing word embedding
        5. Building a convolutional neural net (CNN)
        6. Training and obtaining the accuracy
        7. Testing the revised approach
        8. Understanding problems with the revised approach
      12. The best approach
        1. Implementing the best approach
          1. Loading the glove model
          2. Loading the dataset
          3. Preprocessing
          4. Loading precomputed ID matrix
          5. Splitting the train and test datasets
          6. Building a neural network
          7. Training the neural network
          8. Loading the trained model
          9. Testing the trained model
      13. Summary
    11. 6. Job Recommendation Engine
      1. Introducing the problem statement
      2. Understanding the datasets
        1. Scraped dataset
        2. Job recommendation challenge dataset
        3. apps.tsv
        4. users.tsv
        5. Jobs.zip
        6. user_history.tsv
      3. Building the baseline approach
        1. Implementing the baseline approach
          1. Defining constants
          2. Loading the dataset
          3. Defining the helper function
          4. Generating TF-IDF vectors and cosine similarity
            1. Building the training dataset
            2. Generating TF-IDF vectors for the training dataset
            3. Building the testing dataset
            4. Generating the similarity score
        2. Understanding the testing matrix
        3. Problems with the baseline approach
        4. Optimizing the baseline approach
      4. Building the revised approach
        1. Loading the dataset
        2. Splitting the training and testing datasets
        3. Exploratory Data Analysis
        4. Building the recommendation engine using the jobs datafile
        5. Testing the revised approach
        6. Problems with the revised approach
        7. Understanding how to improve the revised approach
      5. The best approach
        1. Implementing the best approach
          1. Filtering the dataset
          2. Preparing the training dataset
          3. Applying the concatenation operation
          4. Generating the TF-IDF and cosine similarity score
          5. Generating recommendations
      6. Summary
    12. 7. Text Summarization
      1. Understanding the basics of summarization
        1. Extractive summarization
        2. Abstractive summarization
      2. Introducing the problem statement
      3. Understanding datasets
        1. Challenges in obtaining the dataset
        2. Understanding the medical transcription dataset
        3. Understanding Amazon's review dataset
      4. Building the baseline approach
        1. Implementing the baseline approach
          1. Installing python dependencies
          2. Writing the code and generating the summary
        2. Problems with the baseline approach
        3. Optimizing the baseline approach
      5. Building the revised approach
        1. Implementing the revised approach
          1. The get_summarized function
          2. The reorder_sentences function
          3. The summarize function
          4. Generating the summary
        2. Problems with the revised approach
        3. Understanding how to improve the revised approach
          1. The LSA algorithm
          2. The idea behind the best approach
      6. The best approach
        1. Implementing the best approach
          1. Understanding the structure of the project
          2. Understanding helper functions
            1. Normalization.py
            2. Utils.py
          3. Generating the summary
        2. Building the summarization application using Amazon reviews
          1. Loading the dataset
          2. Exploring the dataset
          3. Preparing the dataset
          4. Building the DL model
          5. Training the DL model
          6. Testing the DL model
      7. Summary
    13. 8. Developing Chatbots
      1. Introducing the problem statement
        1. Retrieval-based approach
        2. Generative-based approach
        3. Open domain
        4. Closed domain
        5. Short conversation
        6. Long conversation
          1. Open domain and generative-based approach
          2. Open domain and retrieval-based approach
          3. Closed domain and retrieval-based approach
          4. Closed domain and generative-based approach
      2. Understanding datasets
        1. Cornell Movie-Dialogs dataset
          1. Content details of movie_conversations.txt
          2. Content details of movie_lines.txt
        2. The bAbI dataset
          1. The (20) QA bAbI tasks
      3. Building the basic version of a chatbot
        1. Why does the rule-based system work?
        2. Understanding the rule-based system
        3. Understanding the approach
        4. Listing down possible questions and answers
        5. Deciding standard messages
        6. Understanding the architecture
      4. Implementing the rule-based chatbot
        1. Implementing the conversation flow
        2. Implementing RESTful APIs using flask
      5. Testing the rule-based chatbot
        1. Advantages of the rule-based chatbot
      6. Problems with the existing approach
        1. Understanding key concepts for optimizing the approach
          1. Understanding the seq2seq model
      7. Implementing the revised approach
        1. Data preparation
          1. Generating question-answer pairs
          2. Preprocessing the dataset
          3. Splitting the dataset into the training dataset and the testing dataset
          4. Building a vocabulary for the training and testing datasets
        2. Implementing the seq2seq model
          1. Creating the model
          2. Training the model
      8. Testing the revised approach
        1. Understanding the testing metrics
          1. Perplexity
          2. Loss
        2. Testing the revised version of the chatbot
      9. Problems with the revised approach
        1. Understanding key concepts to solve existing problems
          1. Memory networks
            1. Dynamic memory network (DMN)
              1. Input module
              2. Question module
              3. Episodic memory
      10. The best approach
        1. Implementing the best approach
          1. Random testing mode
            1. User interactive testing mode
      11. Discussing the hybrid approach
      12. Summary
    14. 9. Building a Real-Time Object Recognition App
      1. Introducing the problem statement
      2. Understanding the dataset
        1. The COCO dataset
        2. The PASCAL VOC dataset
          1. PASCAL VOC classes
      3. Transfer Learning
        1. What is Transfer Learning?
        2. What is a pre-trained model?
        3. Why should we use a pre-trained model?
        4. How can we use a pre-trained model?
      4. Setting up the coding environment
        1. Setting up and installing OpenCV
      5. Feature engineering for the baseline model
      6. Selecting the machine learning algorithm
        1. Architecture of the MobileNet SSD model
      7. Building the baseline model
      8. Understanding the testing metrics
        1. Intersection over Union (IoU)
        2. mean Average Precision
      9. Testing the baseline model
      10. Problem with existing approach
      11. How to optimize the existing approach
        1. Understanding the process for optimization
      12. Implementing the revised approach
        1. Testing the revised approach
        2. Understanding problems with the revised approach
      13. The best approach
        1. Understanding YOLO
        2. The working of YOLO
        3. The architecture of YOLO
        4. Implementing the best approach using YOLO
          1. Implementation using Darknet
            1. Environment setup for Darknet
            2. Compiling the Darknet
            3. Downloading the pre-trained weight
            4. Running object detection for the image
            5. Running the object detection on the video stream
          2. Implementation using Darkflow
            1. Installing Cython
            2. Building the already provided setup file
            3. Testing the environment
            4. Loading the model and running object detection on images
            5. Loading the model and running object detection on the video stream
      14. Summary
    15. 10. Face Recognition and Face Emotion Recognition
      1. Introducing the problem statement
        1. Face recognition application
        2. Face emotion recognition application
      2. Setting up the coding environment
        1. Installing dlib
        2. Installing face_recognition
      3. Understanding the concepts of face recognition
        1. Understanding the face recognition dataset
          1. CAS-PEAL Face Dataset
          2. Labeled Faces in the Wild
        2. Algorithms for face recognition
          1. Histogram of Oriented Gradients (HOG)
          2. Convolutional Neural Network (CNN) for FR
            1. Simple CNN architecture
            2. Understanding how CNN works for FR
      4. Approaches for implementing face recognition
        1. Implementing the HOG-based approach
        2. Implementing the CNN-based approach
        3. Implementing real-time face recognition
      5. Understanding the dataset for face emotion recognition
      6. Understanding the concepts of face emotion recognition
        1. Understanding the convolutional layer
        2. Understanding the ReLU layer
        3. Understanding the pooling layer
        4. Understanding the fully connected layer
        5. Understanding the SoftMax layer
        6. Updating the weight based on backpropagation
      7. Building the face emotion recognition model
        1. Preparing the data
        2. Loading the data
        3. Training the model
          1. Loading the data using the dataset_loader script
          2. Building the Convolutional Neural Network
          3. Training for the FER application
          4. Predicting and saving the trained model
      8. Understanding the testing matrix
      9. Testing the model
      10. Problems with the existing approach
      11. How to optimize the existing approach
        1. Understanding the process for optimization
      12. The best approach
        1. Implementing the best approach
      13. Summary
    16. 11. Building Gaming Bot
      1. Introducing the problem statement
      2. Setting up the coding environment
      3. Understanding Reinforcement Learning (RL)
        1. Markov Decision Process (MDP)
        2. Discounted Future Reward
      4. Basic Atari gaming bot
        1. Understanding the key concepts
          1. Rules for the game
          2. Understanding the Q-Learning algorithm
      5. Implementing the basic version of the gaming bot
      6. Building the Space Invaders gaming bot
        1. Understanding the key concepts
          1. Understanding a deep Q-network (DQN)
            1. Architecture of DQN
            2. Steps for the DQN algorithm
          2. Understanding Experience Replay
      7. Implementing the Space Invaders gaming bot
      8. Building the Pong gaming bot
        1. Understanding the key concepts
          1. Architecture of the gaming bot
          2. Approach for the gaming bot
      9. Implementing the Pong gaming bot
        1. Initialization of the parameters
        2. Weights stored in the form of matrices
        3. Updating weights
        4. How to move the agent
        5. Understanding the process using NN
      10. Just for fun - implementing the Flappy Bird gaming bot
      11. Summary
    17. A. List of Cheat Sheets
      1. Cheat sheets
      2. Summary
    18. B. Strategy for Winning Hackathons
      1. Strategy for winning hackathons
      2. Keeping up to date
      3. Summary
    19. Index

Product information

  • Title: Machine Learning Solutions
  • Author(s): Jalaj Thanaki
  • Release date: April 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781788390040