Large Scale Machine Learning with Python

Name: Large Scale Machine Learning with Python
ISBN: 9781785887215

by Luca Massaron, Alberto Boschetti, Bastiaan Sjardin

August 2016

Intermediate to advanced

420 pages

9h 35m

English

Packt Publishing

Table of Contents
Large Scale Machine Learning with Python
Credits
About the Authors
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for

Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. First Steps to Scalability
Explaining scalability in detail
Making large scale examples
Introducing Python
Scale up with Python
Scale out with Python
Python for large scale machine learning
Choosing between Python 2 and Python 3
Installing Python
Step-by-step installation
The installation of packages
Package upgrades
Scientific distributions
Introducing Jupyter/IPython
Python packages
NumPy
SciPy
Pandas
Scikit-learn
The matplotlib package
Gensim
H2O
XGBoost
Theano
TensorFlow
The sknn library
Theanets
Keras
Other useful packages to install on your system
Summary
2. Scalable Learning in Scikit-learn
Out-of-core learning
Subsampling as a viable option
Optimizing one instance at a time
Building an out-of-core learning system
Streaming data from sources
Datasets to try the real thing yourself
The first example – streaming the bike-sharing dataset
Using pandas I/O tools
Working with databases
Paying attention to the ordering of instances
Stochastic learning
Batch gradient descent
Stochastic gradient descent
The Scikit-learn SGD implementation
Defining SGD learning parameters
Feature management with data streams
Describing the target
The hashing trick
Other basic transformations
Testing and validation in a stream
Trying SGD in action
Summary
3. Fast SVM Implementations
Datasets to experiment with on your own
The bike-sharing dataset
The covertype dataset
Support Vector Machines
Hinge loss and its variants
Understanding the Scikit-learn SVM implementation
Pursuing nonlinear SVMs by subsampling
Achieving SVM at scale with SGD
Feature selection by regularization
Including non-linearity in SGD
Trying explicit high-dimensional mappings
Hyperparameter tuning
Other alternatives for SVM fast learning
Nonlinear and faster with Vowpal Wabbit
Installing VW
Understanding the VW data format
Python integration
A few examples using reductions for SVM and neural nets
Faster bike-sharing
The covertype dataset crunched by VW
Summary
4. Neural Networks and Deep Learning
The neural network architecture
What and how neural networks learn
Choosing the right architecture
The input layer
The hidden layer
The output layer
Neural networks in action
Parallelization for sknn
Neural networks and regularization
Neural networks and hyperparameter optimization
Neural networks and decision boundaries
Deep learning at scale with H2O
Large scale deep learning with H2O
Gridsearch on H2O
Deep learning and unsupervised pretraining
Deep learning with theanets
Autoencoders and unsupervised learning
Autoencoders
Summary
5. Deep Learning with TensorFlow
TensorFlow installation
TensorFlow operations
GPU computing
Linear regression with SGD
A neural network from scratch in TensorFlow
Machine learning on TensorFlow with SkFlow
Deep learning with large files – incremental learning
Keras and TensorFlow installation
Convolutional Neural Networks in TensorFlow through Keras
The convolution layer
The pooling layer
The fully connected layer
CNN's with an incremental approach
GPU Computing
Summary
6. Classification and Regression Trees at Scale
Bootstrap aggregation
Random forest and extremely randomized forest
Fast parameter optimization with randomized search
Extremely randomized trees and large datasets
CART and boosting
Gradient Boosting Machines
max_depth
learning_rate
Subsample
Faster GBM with warm_start
Speeding up GBM with warm_start
Training and storing GBM models
XGBoost
XGBoost regression
XGBoost and variable importance
XGBoost streaming large datasets
XGBoost model persistence
Out-of-core CART with H2O
Random forest and gridsearch on H2O
Stochastic gradient boosting and gridsearch on H2O
Summary
7. Unsupervised Learning at Scale
Unsupervised methods
Feature decomposition – PCA
Randomized PCA
Incremental PCA
Sparse PCA
PCA with H2O
Clustering – K-means
Initialization methods
K-means assumptions
Selection of the best K
Scaling K-means – mini-batch
K-means with H2O
LDA
Scaling LDA – memory, CPUs, and machines
Summary
8. Distributed Environments – Hadoop and Spark
From a standalone machine to a bunch of nodes
Why do we need a distributed framework?
Setting up the VM
VirtualBox
Vagrant
Using the VM
The Hadoop ecosystem
Architecture
HDFS
MapReduce
YARN
Spark
pySpark
Summary
9. Practical Machine Learning with Spark
Setting up the VM for this chapter
Sharing variables across cluster nodes
Broadcast read-only variables
Accumulators write-only variables
Broadcast and accumulators together – an example
Data preprocessing in Spark
JSON files and Spark DataFrames
Dealing with missing data
Grouping and creating tables in-memory
Writing the preprocessed DataFrame or RDD to disk
Working with Spark DataFrames
Machine learning with Spark
Spark on the KDD99 dataset
Reading the dataset
Feature engineering
Training a learner
Evaluating a learner's performance
The power of the ML pipeline
Manual tuning
Cross-validation
Final cleanup
Summary
A. Introduction to GPUs and Theano
GPU computing
Theano – parallel computing on the GPU
Installing Theano
Index

Content preview from Large Scale Machine Learning with Python

Clustering – K-means

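K-means is an unsupervised algorithm that partitions the points into K disjoint clusters, ideally of equal variance, by minimizing the distortion (also called inertia), that is, the within-cluster sum of squared distances from each point to its centroid.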

Given a single parameter K, the number of clusters to create, the K-means algorithm produces K sets of points S₁, S₂, …, S_K, each represented by its centroid: C₁, C₂, …, C_K. The generic centroid C_i is simply the mean of the points assigned to the cluster S_i, which is the choice that minimizes the intra-cluster squared distance. The outputs of the system are as follows:

The composition of the clusters S₁, S₂, …, S_K, that is, the sets of training points assigned to cluster number 1, 2, …, K.
The centroids of each cluster, C₁, C₂
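
The two outputs listed above can be illustrated with a minimal sketch of the standard Lloyd iteration (assign each point to its nearest centroid, then recompute each centroid as the mean of its points). This is not the book's own code: the function names `kmeans` and `sq_dist`, the toy points, and the seed are assumptions made for illustration, and the sketch uses pure Python only so that it stays self-contained.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, n_iter=100, seed=0):
    """Illustrative Lloyd iteration; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    # Assumption: initialize with k distinct training points chosen at random.
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                # The centroid C_i is the coordinate-wise mean of cluster S_i.
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    labels = assign(points, centroids)
    # Inertia: sum of squared distances from each point to its own centroid.
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids, inertia = kmeans(points, k=2)
print(labels, centroids, inertia)
```

Here `labels` plays the role of the cluster composition and `centroids` the role of C₁, C₂, …, C_K. In practice, scikit-learn's `KMeans` estimator exposes the same information through its `labels_`, `cluster_centers_`, and `inertia_` attributes.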
