Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

by Tarek Amr

July 2020

Intermediate to advanced

384 pages

8h 38m

English

Packt Publishing


Content preview from Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

Preparing Your Data

In the previous chapter, we dealt with clean data: all the values were available, all the columns were numeric, and when faced with too many features, we could rely on regularization. Real-world data is often not as clean as you would like. And even clean data can sometimes be preprocessed in ways that make things easier for a machine learning algorithm. In this chapter, we will learn about the following data preprocessing techniques: