Machine Learning Solutions

Name: Machine Learning Solutions
Author: Jalaj Thanaki
ISBN: 9781788390040

by Jalaj Thanaki

April 2018

Beginner to intermediate

566 pages

12h 17m

English

Packt Publishing

Table of Contents
Machine Learning Solutions
Why subscribe?
PacktPub.com
Foreword
Contributors
About the author
About the reviewer
Packt is Searching for Authors Like You
Preface
Who this book is for

What this book covers
To get the most out of this book
    Download the example code files
    Conventions used
Get in touch
Reviews
1. Credit Risk Modeling
Introducing the problem statement
Understanding the dataset
    Understanding attributes of the dataset
    Data analysis
    Data preprocessing
    First change
    Second change
    Implementing the changes
    Basic data analysis followed by data preprocessing
    Listing statistical properties
    Finding missing values
    Replacing missing values
    Correlation
    Detecting outliers
    Outliers detection techniques
    Percentile-based outlier detection
    Median Absolute Deviation (MAD)-based outlier detection
    Standard Deviation (STD)-based outlier detection
    Majority-vote-based outlier detection
    Visualization of outliers
    Handling outliers
    Revolving utilization of unsecured lines
    Age
    Number of time 30-59 days past due not worse
    Debt ratio
    Monthly income
    Number of open credit lines and loans
    Number of times 90 days late
    Number of real estate loans or lines
    Number of times 60-89 days past due not worse
    Number of dependents
Feature engineering for the baseline model
Finding out Feature importance
Selecting machine learning algorithms
    K-Nearest Neighbor (KNN)
    Logistic regression
    AdaBoost
    GradientBoosting
    RandomForest
Training the baseline model
Understanding the testing matrix
    The Mean accuracy of the trained models
    The ROC-AUC score
    ROC
    AUC
Testing the baseline model
Problems with the existing approach
Optimizing the existing approach
    Understanding key concepts to optimize the approach
    Cross-validation
    The approach of using CV
    Hyperparameter tuning
    Grid search parameter tuning
    Random search parameter tuning
Implementing the revised approach
    Implementing a cross-validation based approach
    Implementing hyperparameter tuning
    Implementing and testing the revised approach
    Understanding problems with the revised approach
Best approach
    Implementing the best approach
    Log transformation of features
    Voting-based ensemble ML model
    Running ML models on real test data
Summary
2. Stock Market Price Prediction
Introducing the problem statement
Collecting the dataset
    Collecting DJIA index prices
    Collecting news articles
Understanding the dataset
    Understanding the DJIA dataset
    Understanding the NYTimes news article dataset
Data preprocessing and data analysis
    Preparing the DJIA training dataset
    Basic data analysis for a DJIA dataset
    Preparing the NYTimes news dataset
    Converting publication date into the YYYY-MM-DD format
    Filtering news articles by category
    Implementing the filter functionality and merging the dataset
    Saving the merged dataset in the pickle file format
Feature engineering
    Loading the dataset
    Minor preprocessing
    Converting adj close price into the integer format
    Removing the leftmost dot from news headlines
    Feature engineering
    Sentiment analysis of NYTimes news articles
Selecting the Machine Learning algorithm
Training the baseline model
    Splitting the training and testing dataset
    Splitting prediction labels for the training and testing datasets
    Converting sentiment scores into the numpy array
    Training of the ML model
Understanding the testing matrix
    The default testing matrix
    The visualization approach
Testing the baseline model
    Generating and interpreting the output
    Generating the accuracy score
    Visualizing the output
Exploring problems with the existing approach
    Alignment
    Smoothing
    Trying a different ML algorithm
Understanding the revised approach
    Understanding concepts and approaches
    Alignment-based approach
    Smoothing-based approach
    Logistic Regression-based approach
Implementing the revised approach
    Implementation
    Implementing alignment
    Implementing smoothing
    Implementing logistic regression
    Testing the revised approach
    Understanding the problem with the revised approach
The best approach
Summary
3. Customer Analytics
Introducing customer segmentation
    Introducing the problem statement
Understanding the datasets
    Description of the dataset
    Downloading the dataset
    Attributes of the dataset
Building the baseline approach
    Implementing the baseline approach
    Data preparation
    Loading the dataset
    Exploratory data analysis (EDA)
    Removing null data entries
    Removing duplicate data entries
    EDA for various data attributes
    Country
    Customer and products
    Product categories
    Analyzing the product description
    Defining product categories
    Characterizing the content of clusters
    Silhouette intra-cluster score analysis
    Analysis using a word cloud
    Principal component analysis (PCA)
    Generating customer categories
    Formatting data
    Grouping products
    Splitting the dataset
    Grouping orders
    Creating customer categories
    Data encoding
    Generating customer categories
    PCA analysis
    Analyzing the cluster using silhouette scores
    Classifying customers
    Defining helper functions
    Splitting the data into training and testing
    Implementing the Machine Learning (ML) algorithm
    Understanding the testing matrix
    Confusion matrix
    Learning curve
    Testing the result of the baseline approach
    Generating the accuracy score for classifier
    Generating the confusion matrix for the classifier
    Generating the learning curve for the classifier
    Problems with the baseline approach
    Optimizing the baseline approach
Building the revised approach
    Implementing the revised approach
    Testing the revised approach
    Problems with the revised approach
    Understanding how to improve the revised approach
The best approach
    Implementing the best approach
    Testing the best approach
    Transforming the hold-out corpus in the form of the training dataset
    Converting the transformed dataset into a matrix form
    Generating the predictions
Customer segmentation for various domains
Summary
4. Recommendation Systems for E-Commerce
Introducing the problem statement
Understanding the datasets
    e-commerce Item Data
    The Book-Crossing dataset
    BX-Book-Ratings.csv
    BX-Books.csv
    BX-Users.csv
Building the baseline approach
    Understanding the basic concepts
    Understanding the content-based approach
    Implementing the baseline approach
    Architecture of the recommendation system
    Steps for implementing the baseline approach
    Loading the dataset
    Generating features using TF-IDF
    Building the cosine similarity matrix
    Generating the prediction
    Understanding the testing matrix
    Testing the result of the baseline approach
    Problems with the baseline approach
    Optimizing the baseline approach
Building the revised approach
    Implementing the revised approach
    Loading dataset
    EDA of the book-rating datafile
    Exploring the book datafile
    EDA of the user datafile
    Implementing the logic of correlation for the recommendation engine
    Recommendations based on the rating of the books
    Recommendations based on correlations
    Testing the revised approach
    Problems with the revised approach
    Understanding how to improve the revised approach
The best approach
    Understanding the key concepts
    Collaborative filtering
    Memory-based CF
    User-user collaborative filtering
    Item-item collaborative filtering
    Model-based CF
    Matrix-factorization-based algorithms
    Difference between memory-based CF and model-based CF
    Implementing the best approach
    Loading the dataset
    Merging the data frames
    EDA for the merged data frames
    Filtering data based on geolocation
    Applying the KNN algorithm
    Recommendation using the KNN algorithm
    Applying matrix factorization
    Recommendation using matrix factorization
Summary
5. Sentiment Analysis
Introducing problem statements
Understanding the dataset
    Understanding the content of the dataset
    Train folder
    Test folder
    imdb.vocab file
    imdbEr.txt file
    README
    Understanding the contents of the movie review files
Building the training and testing datasets for the baseline model
Feature engineering for the baseline model
Selecting the machine learning algorithm
Training the baseline model
    Implementing the baseline model
    Multinomial naive Bayes
    C-support vector classification with kernel rbf
    C-support vector classification with kernel linear
    Linear support vector classification
Understanding the testing matrix
    Precision
    Recall
    F1-Score
    Support
    Training accuracy
Testing the baseline model
    Testing of Multinomial naive Bayes
    Testing of SVM with rbf kernel
    Testing SVM with the linear kernel
    Testing SVM with linearSVC
Problem with the existing approach
How to optimize the existing approach
Understanding key concepts for optimizing the approach
Implementing the revised approach
    Importing the dependencies
    Downloading and loading the IMDb dataset
    Choosing the top words and the maximum text length
    Implementing word embedding
    Building a convolutional neural net (CNN)
    Training and obtaining the accuracy
    Testing the revised approach
    Understanding problems with the revised approach
The best approach
    Implementing the best approach
    Loading the glove model
    Loading the dataset
    Preprocessing
    Loading precomputed ID matrix
    Splitting the train and test datasets
    Building a neural network
    Training the neural network
    Loading the trained model
    Testing the trained model
Summary
6. Job Recommendation Engine
Introducing the problem statement
Understanding the datasets
    Scraped dataset
    Job recommendation challenge dataset
    apps.tsv
    users.tsv
    Jobs.zip
    user_history.tsv
Building the baseline approach
    Implementing the baseline approach
    Defining constants
    Loading the dataset
    Defining the helper function
    Generating TF-IDF vectors and cosine similarity
    Building the training dataset
    Generating TF-IDF vectors for the training dataset
    Building the testing dataset
    Generating the similarity score
    Understanding the testing matrix
    Problems with the baseline approach
    Optimizing the baseline approach
Building the revised approach
    Loading the dataset
    Splitting the training and testing datasets
    Exploratory Data Analysis
    Building the recommendation engine using the jobs datafile
    Testing the revised approach
    Problems with the revised approach
    Understanding how to improve the revised approach
The best approach
    Implementing the best approach
    Filtering the dataset
    Preparing the training dataset
    Applying the concatenation operation
    Generating the TF-IDF and cosine similarity score
    Generating recommendations
Summary
7. Text Summarization
Understanding the basics of summarization
    Extractive summarization
    Abstractive summarization
Introducing the problem statement
Understanding datasets
    Challenges in obtaining the dataset
    Understanding the medical transcription dataset
    Understanding Amazon's review dataset
Building the baseline approach
    Implementing the baseline approach
    Installing python dependencies
    Writing the code and generating the summary
    Problems with the baseline approach
    Optimizing the baseline approach
Building the revised approach
    Implementing the revised approach
    The get_summarized function
    The reorder_sentences function
    The summarize function
    Generating the summary
    Problems with the revised approach
    Understanding how to improve the revised approach
    The LSA algorithm
    The idea behind the best approach
The best approach
    Implementing the best approach
    Understanding the structure of the project
    Understanding helper functions
    Normalization.py
    Utils.py
    Generating the summary
    Building the summarization application using Amazon reviews
    Loading the dataset
    Exploring the dataset
    Preparing the dataset
    Building the DL model
    Training the DL model
    Testing the DL model
Summary
8. Developing Chatbots
Introducing the problem statement
    Retrieval-based approach
    Generative-based approach
    Open domain
    Closed domain
    Short conversation
    Long conversation
    Open domain and generative-based approach
    Open domain and retrieval-based approach
    Closed domain and retrieval-based approach
    Closed domain and generative-based approach
Understanding datasets
    Cornell Movie-Dialogs dataset
    Content details of movie_conversations.txt
    Content details of movie_lines.txt
    The bAbI dataset
    The (20) QA bAbI tasks
Building the basic version of a chatbot
    Why does the rule-based system work?
    Understanding the rule-based system
    Understanding the approach
    Listing down possible questions and answers
    Deciding standard messages
    Understanding the architecture
Implementing the rule-based chatbot
    Implementing the conversation flow
    Implementing RESTful APIs using flask
Testing the rule-based chatbot
Advantages of the rule-based chatbot
Problems with the existing approach
    Understanding key concepts for optimizing the approach
    Understanding the seq2seq model
Implementing the revised approach
    Data preparation
    Generating question-answer pairs
    Preprocessing the dataset
    Splitting the dataset into the training dataset and the testing dataset
    Building a vocabulary for the training and testing datasets
    Implementing the seq2seq model
    Creating the model
    Training the model
Testing the revised approach
    Understanding the testing metrics
    Perplexity
    Loss
    Testing the revised version of the chatbot
Problems with the revised approach
    Understanding key concepts to solve existing problems
    Memory networks
    Dynamic memory network (DMN)
    Input module
    Question module
    Episodic memory
The best approach
    Implementing the best approach
    Random testing mode
    User interactive testing mode
Discussing the hybrid approach
Summary
9. Building a Real-Time Object Recognition App
Introducing the problem statement
Understanding the dataset
    The COCO dataset
    The PASCAL VOC dataset
    PASCAL VOC classes
Transfer Learning
    What is Transfer Learning?
    What is a pre-trained model?
    Why should we use a pre-trained model?
    How can we use a pre-trained model?
Setting up the coding environment
Setting up and installing OpenCV
Features engineering for the baseline model
Selecting the machine learning algorithm
Architecture of the MobileNet SSD model
Building the baseline model
Understanding the testing metrics
    Intersection over Union (IoU)
    mean Average Precision
Testing the baseline model
Problem with existing approach
How to optimize the existing approach
Understanding the process for optimization
Implementing the revised approach
    Testing the revised approach
    Understanding problems with the revised approach
The best approach
    Understanding YOLO
    The working of YOLO
    The architecture of YOLO
    Implementing the best approach using YOLO
    Implementation using Darknet
    Environment setup for Darknet
    Compiling the Darknet
    Downloading the pre-trained weight
    Running object detection for the image
    Running the object detection on the video stream
    Implementation using Darkflow
    Installing Cython
    Building the already provided setup file
    Testing the environment
    Loading the model and running object detection on images
    Loading the model and running object detection on the video stream
Summary
10. Face Recognition and Face Emotion Recognition
Introducing the problem statement
    Face recognition application
    Face emotion recognition application
Setting up the coding environment
    Installing dlib
    Installing face_recognition
Understanding the concepts of face recognition
    Understanding the face recognition dataset
    CAS-PEAL Face Dataset
    Labeled Faces in the Wild
    Algorithms for face recognition
    Histogram of Oriented Gradients (HOG)
    Convolutional Neural Network (CNN) for FR
    Simple CNN architecture
    Understanding how CNN works for FR
Approaches for implementing face recognition
    Implementing the HOG-based approach
    Implementing the CNN-based approach
    Implementing real-time face recognition
Understanding the dataset for face emotion recognition
Understanding the concepts of face emotion recognition
    Understanding the convolutional layer
    Understanding the ReLU layer
    Understanding the pooling layer
    Understanding the fully connected layer
    Understanding the SoftMax layer
    Updating the weight based on backpropagation
Building the face emotion recognition model
    Preparing the data
    Loading the data
    Training the model
    Loading the data using the dataset_loader script
    Building the Convolutional Neural Network
    Training for the FER application
    Predicting and saving the trained model
Understanding the testing matrix
Testing the model
Problems with the existing approach
How to optimize the existing approach
Understanding the process for optimization
The best approach
Implementing the best approach
Summary
11. Building Gaming Bot
Introducing the problem statement
Setting up the coding environment
Understanding Reinforcement Learning (RL)
    Markov Decision Process (MDP)
    Discounted Future Reward
Basic Atari gaming bot
    Understanding the key concepts
    Rules for the game
    Understanding the Q-Learning algorithm
Implementing the basic version of the gaming bot
Building the Space Invaders gaming bot
    Understanding the key concepts
    Understanding a deep Q-network (DQN)
    Architecture of DQN
    Steps for the DQN algorithm
    Understanding Experience Replay
Implementing the Space Invaders gaming bot
Building the Pong gaming bot
    Understanding the key concepts
    Architecture of the gaming bot
    Approach for the gaming bot
Implementing the Pong gaming bot
    Initialization of the parameters
    Weights stored in the form of matrices
    Updating weights
    How to move the agent
    Understanding the process using NN
Just for fun - implementing the Flappy Bird gaming bot
Summary
A. List of Cheat Sheets
Cheat sheets
Summary
B. Strategy for Winning Hackathons
Strategy for winning hackathons
Keeping up to date
Summary
Index

Content preview from Machine Learning Solutions

Summary

In this chapter, we built a summarization application for medical transcriptions. We began by listing the challenges of obtaining a good parallel corpus for the summarization task in the medical domain. For our baseline approach, we used existing Python libraries such as PyTeaser and Sumy. In the revised approach, we used word frequencies to generate a summary of the medical document. In the best approach, we combined the word-frequency-based approach with a ranking mechanism to generate summaries of medical notes.
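
The word-frequency idea mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the book's implementation: the tiny stop-word list, the helper names `split_sentences`, `tokenize`, and `summarize_by_frequency`, and the choice to score each sentence by the average corpus frequency of its words are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "for", "with", "on", "no"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize_by_frequency(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order.

    A sentence's score is the average document-wide frequency of its words
    (an assumed scoring rule, chosen so long sentences are not favored).
    """
    sentences = split_sentences(text)
    freqs = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, then restore document order for readability.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    note = (
        "The patient reports chest pain. Chest pain worsens with exertion. "
        "The weather was pleasant today. Patient denies fever."
    )
    print(summarize_by_frequency(note, max_sentences=2))
```

Calling `summarize_by_frequency` on a short note keeps the sentences whose words recur most often across the document. A ranking step such as the one the chapter describes for its best approach would replace or extend the simple `score` function here.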

Finally, we developed a solution using Amazon's review dataset, which serves as a parallel corpus for the summarization task, and we built the deep learning-based ...
