Scala for Machine Learning - Second Edition

Book Description

Leverage Scala and Machine Learning to study and construct systems that can learn from data

About This Book

  • Explore a broad variety of data processing, machine learning, and genetic algorithms through diagrams, mathematical formulation, and updated source code in Scala
  • Take your expertise in Scala programming to the next level by creating and customizing AI applications
  • Experiment with different techniques and evaluate their benefits and limitations using real-world applications in a tutorial style

Who This Book Is For

If you’re a data scientist or a data analyst with a fundamental knowledge of Scala who wants to learn and implement various machine learning techniques, this book is for you. All you need is a good understanding of the Scala programming language, a basic knowledge of statistics, a keen interest in Big Data processing, and this book!

What You Will Learn

  • Build dynamic workflows for scientific computing
  • Leverage open source libraries to extract patterns from time series
  • Write your own classification, clustering, or evolutionary algorithm
  • Tune and evaluate the relative performance of Spark
  • Master probabilistic models for sequential data
  • Experiment with advanced techniques such as regularization and kernelization
  • Dive into neural networks and some deep learning architectures
  • Apply some basic multiarmed bandit algorithms
  • Solve big data problems with Scala parallel collections, Akka actors, and Apache Spark clusters
  • Apply key learning strategies to a technical analysis of financial markets

In Detail

The discovery of information through data clustering and classification is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, engineering design, logistics, manufacturing, and trading strategies, to detection of genetic anomalies.

This book is your one-stop guide to the functional capabilities of the Scala programming language, such as dependency injection and implicits, that are critical to the creation of machine learning algorithms. You start by learning data preprocessing and filtering techniques. You'll then move on to unsupervised learning techniques such as clustering and dimension reduction, followed by probabilistic graphical models such as Naïve Bayes, hidden Markov models, and Monte Carlo inference. The book then covers discriminative algorithms such as linear and logistic regression with regularization, kernelization, support vector machines, neural networks, and deep learning. From there, you’ll move on to evolutionary computing, multiarmed bandit algorithms, and reinforcement learning.

Finally, the book includes a comprehensive overview of parallel computing in Scala and Akka, followed by a description of Apache Spark and its ML library. With updated code based on the latest version of Scala and comprehensive examples, this book will ensure that you have more than just a solid fundamental knowledge of machine learning with Scala.

Style and approach

This book is designed as a tutorial with hands-on exercises using technical analysis of financial markets and corporate data. Each chapter is structured to make the key concepts easy to understand.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Scala for Machine Learning Second Edition
    1. Table of Contents
    2. Scala for Machine Learning Second Edition
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. eBooks, discount offers, and more
        1. Why subscribe?
    7. Customer Feedback
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    9. 1. Getting Started
      1. Mathematical notations for the curious
      2. Why machine learning?
        1. Classification
        2. Prediction
        3. Optimization
        4. Regression
      3. Why Scala?
        1. Scala as a functional language
          1. Abstraction
          2. Higher kinded types
          3. Functors
          4. Monads
        2. Scala as an object oriented language
        3. Scala as a scalable language
      4. Model categorization
      5. Taxonomy of machine learning algorithms
        1. Unsupervised learning
          1. Clustering
          2. Dimension reduction
        2. Supervised learning
          1. Generative models
        3. Discriminative models
        4. Semi-supervised learning
        5. Reinforcement learning
      6. Leveraging Java libraries
      7. Tools and frameworks
        1. Java
        2. Scala
          1. Eclipse Scala IDE
          2. IntelliJ IDEA Scala plugin
        3. Simple build tool
        4. Apache Commons Math
          1. Description
          2. Licensing
          3. Installation
        5. JFreeChart
          1. Description
          2. Licensing
          3. Installation
        6. Other libraries and frameworks
      8. Source code
        1. Convention
          1. Context bounds
          2. Presentation
          3. Primitives and implicits
          4. Immutability
      9. Let's kick the tires
        1. Writing a simple workflow
          1. Step 1 – scoping the problem
          2. Step 2 – loading data
          3. Step 3 – preprocessing data
            1. Immutable normalization
          4. Step 4 – discovering patterns
            1. Analyzing data
            2. Plotting data
              1. Visualizing model features
              2. Visualizing label
          5. Step 5 – implementing the classifier
            1. Selecting an optimizer
            2. Training the model
            3. Classifying observations
          6. Step 6 – evaluating the model
      10. Summary
    10. 2. Data Pipelines
      1. Modeling
        1. What is a model?
        2. Model versus design
        3. Selecting features
        4. Extracting features
      2. Defining a methodology
      3. Monadic data transformation
        1. Error handling
        2. Monads to the rescue
          1. Implicit models
          2. Explicit models
      4. Workflow computational model
        1. Supporting mathematical abstractions
          1. Step 1 – variable declaration
          2. Step 2 – model definition
          3. Step 3 – instantiation
        2. Composing mixins to build workflow
          1. Understanding the problem
          2. Defining modules
          3. Instantiating the workflow
        3. Modularizing
      5. Profiling data
        1. Immutable statistics
        2. Z-score and Gauss
      6. Assessing a model
        1. Validation
          1. Key quality metrics
          2. F-score for binomial classification
          3. F-score for multinomial classification
        2. Area under the curves
          1. Area under PRC
          2. Area under ROC
        3. Cross-validation
          1. One-fold cross-validation
          2. K-fold cross-validation
        4. Bias-variance decomposition
        5. Overfitting
      7. Summary
    11. 3. Data Preprocessing
      1. Time series in Scala
        1. Context bounds
        2. Types and operations
          1. Transpose operator
          2. Differential operator
        3. Lazy views
      2. Moving averages
        1. Simple moving average
        2. Weighted moving average
        3. Exponential moving average
      3. Fourier analysis
        1. Discrete Fourier transform (DFT)
        2. DFT-based filtering
        3. Detection of market cycles
      4. The discrete Kalman filter
        1. The state space estimation
          1. The transition equation
          2. The measurement equation
        2. The recursive algorithm
          1. Prediction
          2. Correction
          3. Kalman smoothing
          4. Fixed lag smoothing
          5. Experimentation
          6. Benefits and drawbacks
      5. Alternative preprocessing techniques
      6. Summary
    12. 4. Unsupervised Learning
      1. K-mean clustering
        1. K-means
          1. Measuring similarity
          2. Defining the algorithm
          3. Step 1 – Clusters configuration
            1. Defining clusters
            2. Initializing clusters
          4. Step 2 – Clusters assignment
          5. Step 3 – Reconstruction error minimization
            1. Creating K-means components
            2. Tail recursive implementation
            3. Iterative implementation
          6. Step 4 – Classification
          7. Curse of dimensionality
          8. Evaluation
          9. The results
          10. Tuning the number of clusters
          11. Validation
      2. Expectation-Maximization (EM)
        1. Gaussian mixture model
        2. EM overview
        3. Implementation
        4. Classification
        5. Testing
        6. Online EM
      3. Summary
    13. 5. Dimension Reduction
      1. Challenging model complexity
      2. The divergences
        1. The Kullback-Leibler divergence
          1. Overview
          2. Implementation
          3. Testing
        2. The mutual information
      3. Principal components analysis (PCA)
        1. Algorithm
        2. Implementation
        3. Test case
        4. Evaluation
        5. Extending PCA
          1. Validation
          2. Categorical features
          3. Performance
      4. Nonlinear models
        1. Kernel PCA
        2. Manifolds
      5. Summary
    14. 6. Naïve Bayes Classifiers
      1. Probabilistic graphical models
      2. Naïve Bayes classifiers
        1. Introducing the multinomial Naïve Bayes
          1. Formalism
          2. The frequentist perspective
          3. The predictive model
          4. The zero-frequency problem
        2. Implementation
          1. Design
          2. Training
            1. Class likelihood
            2. Binomial model
            3. Multinomial model
            4. Classifier components
          3. Classification
          4. F1 Validation
          5. Features extraction
          6. Testing
      3. Multivariate Bernoulli classification
        1. Model
        2. Implementation
      4. Naïve Bayes and text mining
        1. Basics information retrieval
        2. Implementation
          1. Analyzing documents
          2. Extracting relative terms frequency
          3. Generating the features
        3. Testing
          1. Retrieving textual information
          2. Evaluating text mining classifier
      5. Pros and cons
      6. Summary
    15. 7. Sequential Data Models
      1. Markov decision processes
        1. The Markov property
        2. The first-order discrete Markov chain
      2. The hidden Markov model (HMM)
        1. Notation
        2. The lambda model
        3. Design
        4. Evaluation (CF-1)
          1. Alpha (forward pass)
          2. Beta (backward pass)
        5. Training (CF-2)
          1. Baum-Welch estimator (EM)
        6. Decoding (CF-3)
          1. The Viterbi algorithm
        7. Putting it all together
        8. Test case 1 – Training
        9. Test case 2 – Evaluation
        10. HMM as filtering technique
      3. Conditional random fields
        1. Introduction to CRF
        2. Linear chain CRF
      4. Regularized CRF and text analytics
        1. The feature functions model
        2. Design
        3. Implementation
          1. Configuring the CRF classifier
          2. Training the CRF model
          3. Applying the CRF model
        4. Tests
          1. The training convergence profile
          2. Impact of the size of the training set
          3. Impact of L2 regularization factor
      5. Comparing CRF and HMM
      6. Performance consideration
      7. Summary
    16. 8. Monte Carlo Inference
      1. The purpose of sampling
      2. Gaussian sampling
        1. Box-Muller transform
      3. Monte Carlo approximation
        1. Overview
        2. Implementation
      4. Bootstrapping with replacement
        1. Overview
        2. Resampling
        3. Implementation
        4. Pros and cons of bootstrap
      5. Markov Chain Monte Carlo (MCMC)
        1. Overview
        2. Metropolis-Hastings (MH)
        3. Implementation
        4. Test
      6. Summary
    17. 9. Regression and Regularization
      1. Linear regression
        1. Univariate linear regression
          1. Implementation
          2. Test case
        2. Ordinary least squares (OLS) regression
          1. Design
          2. Implementation
          3. Test case 1 – trending
          4. Test case 2 – features selection
      2. Regularization
        1. Ln roughness penalty
        2. Ridge regression
          1. Design
          2. Implementation
          3. Test case
      3. Numerical optimization
      4. Logistic regression
        1. Logistic function
        2. Design
        3. Training workflow
          1. Step 1 – configuring the optimizer
          2. Step 2 – computing the Jacobian matrix
          3. Step 3 – managing the convergence of optimizer
          4. Step 4 – defining the least squares problem
          5. Step 5 – minimizing the sum of square errors
          6. Test
        4. Classification
      5. Summary
    18. 10. Multilayer Perceptron
      1. Feed-forward neural networks (FFNN)
        1. The biological background
        2. Mathematical background
      2. The multilayer perceptron (MLP)
        1. Activation function
        2. Network topology
        3. Design
        4. Configuration
        5. Network components
          1. Network topology
          2. Input and hidden layers
          3. Output layer
          4. Synapses
          5. Connections
          6. Weights initialization
        6. Model
        7. Problem types (modes)
        8. Online versus batch training
        9. Training epoch
          1. Step 1 – input forward propagation
            1. Computational flow
            2. Error functions
            3. Operating modes
            4. Softmax
          2. Step 2 – error backpropagation
            1. Weights adjustment
            2. Error propagation
            3. The computational model
          3. Step 3 – exit condition
          4. Putting it all together
        10. Training and classification
          1. Regularization
          2. Model generation
          3. Fast Fisher-Yates shuffle
          4. Prediction
          5. Model fitness
      3. Evaluation
        1. Execution profile
        2. Impact of learning rate
        3. Impact of the momentum factor
        4. Impact of the number of hidden layers
        5. Test case
          1. Implementation
          2. Models evaluation
          3. Impact of hidden layers' architecture
      4. Benefits and limitations
      5. Summary
    19. 11. Deep Learning
      1. Sparse autoencoder
        1. Undercomplete autoencoder
        2. Deterministic autoencoder
        3. Categorization
        4. Feed-forward sparse, undercomplete autoencoder
        5. Sparsity updating equations
        6. Implementation
      2. Restricted Boltzmann Machines (RBMs)
        1. Boltzmann machine
        2. Binary restricted Boltzmann machines
          1. Conditional probabilities
          2. Sampling
          3. Log-likelihood gradient
          4. Contrastive divergence
          5. Configuration parameters
          6. Unsupervised learning
      3. Convolution neural networks
        1. Local receptive fields
        2. Weight sharing
        3. Convolution layers
        4. Sub-sampling layers
        5. Putting it all together
        6. Summary
    20. 12. Kernel Models and SVM
      1. Kernel functions
        1. Overview
        2. Common discriminative kernels
        3. Kernel monadic composition
      2. The support vector machine (SVM)
        1. The linear SVM
          1. The separable case (hard margin)
          2. The non-separable case (soft margin)
        2. The nonlinear SVM
          1. Max-margin classification
          2. The kernel trick
        3. Support vector classifier (SVC)
          1. The binary SVC
            1. LIBSVM
            2. Design
            3. Configuration parameters
              1. The SVM formulation
              2. The SVM kernel function
              3. The SVM execution
            4. Interface to LIBSVM
            5. Training
            6. Classification
            7. C-penalty and margin
            8. Kernel evaluation
            9. Application to risk analysis
        4. Anomaly detection with one-class SVC
        5. Support vector regression (SVR)
          1. Overview
          2. SVR versus linear regression
      3. Performance considerations
      4. Summary
    21. 13. Evolutionary Computing
      1. Evolution
        1. The origin
        2. NP problems
        3. Evolutionary computing
      2. Genetic algorithms and machine learning
      3. Genetic algorithm components
        1. Encodings
          1. Value encoding
          2. Predicate encoding
          3. Solution encoding
          4. The encoding scheme
            1. Flat encoding
            2. Hierarchical encoding
        2. Genetic operators
          1. Selection
          2. Crossover
          3. Mutation
        3. Fitness score
      4. Implementation
        1. Software design
        2. Key components
          1. Population
          2. Chromosomes
          3. Genes
        3. Selection
        4. Controlling population growth
        5. GA configuration
        6. Crossover
          1. Population
          2. Chromosomes
          3. Genes
        7. Mutation
          1. Population
          2. Chromosomes
          3. Genes
        8. Reproduction
        9. Solver
      5. GA for trading strategies
        1. Definition of trading strategies
          1. Trading operators
          2. The cost function
          3. Market signals
          4. Trading strategies
          5. Signal encoding
        2. Test case – Fall 2008 market crash
          1. Creating trading strategies
          2. Configuring the optimizer
          3. Finding the best trading strategy
          4. Tests
            1. The weighted score
            2. The unweighted score
      6. Advantages and risks of genetic algorithms
      7. Summary
    22. 14. Multiarmed Bandits
      1. K-armed bandit
        1. Exploration-exploitation trade-offs
        2. Expected cumulative regret
        3. Bayesian Bernoulli bandits
        4. Epsilon-greedy algorithm
      2. Thompson sampling
        1. Bandit context
        2. Prior/posterior beta distribution
        3. Implementation
        4. Simulated exploration and exploitation
      3. Upper bound confidence
        1. Confidence interval
        2. Implementation
      4. Summary
    23. 15. Reinforcement Learning
      1. Reinforcement learning
        1. Understanding the challenge
        2. A solution – Q-learning
          1. Terminology
          2. Concept
          3. Value of policy
          4. Bellman optimality equations
          5. Temporal difference for model-free learning
          6. Action-value iterative update
        3. Implementation
          1. Software design
          2. The states and actions
          3. The search space
          4. The policy and action-value
          5. The Q-learning components
          6. The Q-learning training
          7. Tail recursion to the rescue
          8. Validation
          9. The prediction
        4. Option trading using Q-learning
          1. Option property
          2. Option model
          3. Quantization
        5. Putting it all together
        6. Evaluation
        7. Pros and cons of reinforcement learning
      2. Learning classifier systems
        1. Introduction to LCS
        2. Combining learning and evolution
        3. Terminology
          1. Extended learning classifier systems
          2. XCS components
          3. Application to portfolio management
          4. XCS core data
          5. XCS rules
          6. Covering
          7. Example of implementation
          8. Benefits and limitations of learning classifier systems
      3. Summary
    24. 16. Parallelism in Scala and Akka
      1. Overview
      2. Scala
        1. Object creation
        2. Streams
          1. Memory on demand
          2. Design for reusing Streams memory
        3. Parallel collections
          1. Processing a parallel collection
          2. Benchmark framework
          3. Performance evaluation
      3. Scalability with Actors
        1. The Actor model
        2. Partitioning
        3. Beyond Actors – reactive programming
      4. Akka
        1. Master-workers
          1. Messages exchange
          2. Worker Actors
          3. The workflow controller
          4. The master Actor
          5. Master with routing
          6. Distributed discrete Fourier transform
          7. Limitations
        2. Futures
          1. Blocking on futures
          2. Future callbacks
          3. Putting it all together
      5. Summary
    25. 17. Apache Spark MLlib
      1. Overview
      2. Apache Spark core
        1. Why Spark?
        2. Design principles
          1. In-memory persistency
          2. Laziness
          3. Transforms and actions
          4. Shared variables
        3. Experimenting with Spark
          1. Deploying Spark
          2. Using Spark shell
      3. MLlib library
        1. Overview
        2. Creating RDDs
        3. K-means using MLlib
        4. Tests
      4. Reusable ML pipelines
        1. Reusable ML transforms
          1. Encoding features
          2. Training the model
          3. Predictive model
          4. Training summary statistics
          5. Validating the model
          6. Grid search
        2. Apache Spark and ScalaTest
      5. Extending Spark
        1. Kullback-Leibler divergence
        2. Implementation
        3. Kullback-Leibler evaluator
      6. Streaming engine
        1. Why streaming?
        2. Batch and real-time processing
        3. Architecture overview
        4. Discretized streams
        5. Use case – continuous parsing
        6. Checkpointing
      7. Performance evaluation
        1. Tuning parameters
        2. Performance considerations
      8. Pros and cons
      9. Summary
    26. A. Basic Concepts
      1. Scala programming
        1. List of libraries and tools
        2. Code snippets format
        3. Best practices
          1. Encapsulation
          2. Class constructor template
          3. Companion objects versus case classes
          4. Enumerations versus case classes
          5. Overloading
          6. Design template for immutable classifiers
        4. Utility classes
          1. Data extraction
          2. Financial data sources
          3. Documents extraction
          4. DMatrix class
          5. Counter
          6. Monitor
      2. Mathematics
        1. Linear algebra
          1. QR decomposition
          2. LU factorization
          3. LDL decomposition
          4. Cholesky factorization
          5. Singular Value Decomposition (SVD)
          6. Eigenvalue decomposition
          7. Algebraic and numerical libraries
        2. First order predicate logic
        3. Jacobian and Hessian matrices
        4. Summary of optimization techniques
          1. Gradient descent methods
            1. Steepest descent
            2. Conjugate gradient
            3. Stochastic gradient descent
          2. Quasi-Newton algorithms
            1. BFGS
            2. L-BFGS
          3. Nonlinear least squares minimization
            1. Gauss-Newton
            2. Levenberg-Marquardt
          4. Lagrange multipliers
        5. Overview dynamic programming
      3. Finances 101
        1. Fundamental analysis
        2. Technical analysis
          1. Terminology
          2. Trading data
          3. Trading signal and strategy
          4. Price patterns
        3. Options trading
        4. Financial data sources
      4. Suggested online courses
      5. References
    27. B. References
      1. Chapter 1
      2. Chapter 2
      3. Chapter 3
      4. Chapter 4
      5. Chapter 5
      6. Chapter 6
      7. Chapter 7
      8. Chapter 8
      9. Chapter 9
      10. Chapter 10
      11. Chapter 11
      12. Chapter 12
      13. Chapter 13
      14. Chapter 14
      15. Chapter 15
      16. Chapter 16
      17. Chapter 17
    28. Index