Scala:Applied Machine Learning

Book Description

Leverage the power of Scala to build, improve, and validate scalable machine learning and AI applications using the language's most advanced features.

About This Book

  • Build functional, type-safe routines to interact with relational and NoSQL databases with the help of the tutorials and examples provided
  • Leverage your expertise in Scala programming to create and customize your own scalable machine learning algorithms
  • Experiment with different techniques; evaluate their benefits and limitations using real-world financial applications
  • Get to know the best practices for incorporating Big Data machine learning into your data-driven enterprise while keeping it scalable and maintainable

Who This Book Is For

This Learning Path is for engineers and scientists who are familiar with Scala and want to learn how to create, validate, and apply machine learning algorithms. It will also benefit software developers with a background in Scala programming who want to apply machine learning.

What You Will Learn

  • Create Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations
  • Deploy scalable parallel applications using Apache Spark, loading data from HDFS or Hive
  • Solve big data problems with Scala parallel collections, Akka actors, and Apache Spark clusters
  • Apply key learning strategies to perform technical analysis of financial markets
  • Understand the principles of supervised and unsupervised learning in machine learning
  • Work with unstructured data and serialize it using Kryo, Protobuf, Avro, and AvroParquet
  • Construct reliable and robust data pipelines and manage data in a data-driven enterprise
  • Implement scalable model monitoring and alerts with Scala

In Detail

This Learning Path aims to put the entire world of machine learning with Scala in front of you.

Scala for Data Science, the first module in this course, is a tutorial guide to some of the most common Scala libraries for data science, allowing you to quickly get up to speed building data science and data engineering solutions.

The second module, Scala for Machine Learning, guides you through the process of building AI applications with diagrams, formal mathematical notation, source code snippets, and useful tips. A review of the Akka framework and Apache Spark clusters concludes the tutorial.

The final module, Mastering Scala Machine Learning, takes your knowledge to the next level and helps you apply it to build advanced applications such as social media mining, intelligent news portals, and more. After a quick refresher on functional programming concepts using the REPL, you will see some practical examples of setting up the development environment and tinkering with data. We will then explore working with Spark and MLlib using k-means and decision trees.

By the end of this course, you will have the expertise to build complex machine learning projects using Scala.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Scala for Data Science, Pascal Bugnion
  • Scala for Machine Learning, Patrick Nicolas
  • Mastering Scala Machine Learning, Alex Kozlov

Style and approach

A tutorial with complete examples, this course will give you the tools to start building useful data engineering and data science solutions straightaway. This course provides practical examples from the field on how to correctly tackle data analysis problems, particularly for modern Big Data datasets.

Table of Contents

  1. Scala:Applied Machine Learning
    1. Table of Contents
    2. Scala:Applied Machine Learning
    3. Scala:Applied Machine Learning
    4. Credits
    5. Preface
      1. What this learning path covers
      2. What you need for this learning path
        1. Module 1
          1. Installing the JDK
          2. Installing and using SBT
        2. Module 2
        3. Module 3
      3. Who this learning path is for
      4. Reader feedback
      5. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    6. I. Module 1
      1. 1. Scala and Data Science
        1. Data science
        2. Programming in data science
        3. Why Scala?
          1. Static typing and type inference
          2. Scala encourages immutability
          3. Scala and functional programs
          4. Null pointer uncertainty
          5. Easier parallelism
          6. Interoperability with Java
        4. When not to use Scala
        5. Summary
        6. References
      2. 2. Manipulating Data with Breeze
        1. Code examples
        2. Installing Breeze
        3. Getting help on Breeze
        4. Basic Breeze data types
          1. Vectors
          2. Dense and sparse vectors and the vector trait
          3. Matrices
          4. Building vectors and matrices
          5. Advanced indexing and slicing
          6. Mutating vectors and matrices
          7. Matrix multiplication, transposition, and the orientation of vectors
          8. Data preprocessing and feature engineering
          9. Breeze – function optimization
          10. Numerical derivatives
          11. Regularization
        5. An example – logistic regression
        6. Towards re-usable code
        7. Alternatives to Breeze
        8. Summary
        9. References
      3. 3. Plotting with breeze-viz
        1. Diving into Breeze
        2. Customizing plots
        3. Customizing the line type
        4. More advanced scatter plots
        5. Multi-plot example – scatterplot matrix plots
        6. Managing without documentation
        7. Breeze-viz reference
        8. Data visualization beyond breeze-viz
        9. Summary
      4. 4. Parallel Collections and Futures
        1. Parallel collections
          1. Limitations of parallel collections
          2. Error handling
          3. Setting the parallelism level
          4. An example – cross-validation with parallel collections
        2. Futures
          1. Future composition – using a future's result
          2. Blocking until completion
          3. Controlling parallel execution with execution contexts
          4. Futures example – stock price fetcher
        3. Summary
        4. References
      5. 5. Scala and SQL through JDBC
        1. Interacting with JDBC
        2. First steps with JDBC
          1. Connecting to a database server
          2. Creating tables
          3. Inserting data
          4. Reading data
        3. JDBC summary
        4. Functional wrappers for JDBC
        5. Safer JDBC connections with the loan pattern
        6. Enriching JDBC statements with the "pimp my library" pattern
        7. Wrapping result sets in a stream
        8. Looser coupling with type classes
          1. Type classes
          2. Coding against type classes
          3. When to use type classes
          4. Benefits of type classes
        9. Creating a data access layer
        10. Summary
        11. References
      6. 6. Slick – A Functional Interface for SQL
        1. FEC data
          1. Importing Slick
          2. Defining the schema
          3. Connecting to the database
          4. Creating tables
          5. Inserting data
          6. Querying data
        2. Invokers
        3. Operations on columns
        4. Aggregations with "Group by"
        5. Accessing database metadata
        6. Slick versus JDBC
        7. Summary
        8. References
      7. 7. Web APIs
        1. A whirlwind tour of JSON
        2. Querying web APIs
        3. JSON in Scala – an exercise in pattern matching
          1. JSON4S types
          2. Extracting fields using XPath
        4. Extraction using case classes
        5. Concurrency and exception handling with futures
        6. Authentication – adding HTTP headers
          1. HTTP – a whirlwind overview
          2. Adding headers to HTTP requests in Scala
        7. Summary
        8. References
      8. 8. Scala and MongoDB
        1. MongoDB
        2. Connecting to MongoDB with Casbah
          1. Connecting with authentication
        3. Inserting documents
        4. Extracting objects from the database
        5. Complex queries
        6. Casbah query DSL
        7. Custom type serialization
        8. Beyond Casbah
        9. Summary
        10. References
      9. 9. Concurrency with Akka
        1. GitHub follower graph
        2. Actors as people
        3. Hello world with Akka
        4. Case classes as messages
        5. Actor construction
        6. Anatomy of an actor
        7. Follower network crawler
        8. Fetcher actors
        9. Routing
        10. Message passing between actors
        11. Queue control and the pull pattern
        12. Accessing the sender of a message
        13. Stateful actors
        14. Follower network crawler
        15. Fault tolerance
        16. Custom supervisor strategies
        17. Life-cycle hooks
        18. What we have not talked about
        19. Summary
        20. References
      10. 10. Distributed Batch Processing with Spark
        1. Installing Spark
        2. Acquiring the example data
        3. Resilient distributed datasets
          1. RDDs are immutable
          2. RDDs are lazy
          3. RDDs know their lineage
          4. RDDs are resilient
          5. RDDs are distributed
          6. Transformations and actions on RDDs
          7. Persisting RDDs
          8. Key-value RDDs
          9. Double RDDs
        4. Building and running standalone programs
          1. Running Spark applications locally
          2. Reducing logging output and Spark configuration
          3. Running Spark applications on EC2
        5. Spam filtering
        6. Lifting the hood
        7. Data shuffling and partitions
        8. Summary
        9. Reference
      11. 11. Spark SQL and DataFrames
        1. DataFrames – a whirlwind introduction
        2. Aggregation operations
        3. Joining DataFrames together
        4. Custom functions on DataFrames
        5. DataFrame immutability and persistence
        6. SQL statements on DataFrames
        7. Complex data types – arrays, maps, and structs
          1. Structs
          2. Arrays
          3. Maps
        8. Interacting with data sources
          1. JSON files
          2. Parquet files
        9. Standalone programs
        10. Summary
        11. References
      12. 12. Distributed Machine Learning with MLlib
        1. Introducing MLlib – Spam classification
        2. Pipeline components
          1. Transformers
          2. Estimators
        3. Evaluation
        4. Regularization in logistic regression
        5. Cross-validation and model selection
        6. Beyond logistic regression
        7. Summary
        8. References
      13. 13. Web APIs with Play
        1. Client-server applications
        2. Introduction to web frameworks
        3. Model-View-Controller architecture
        4. Single page applications
        5. Building an application
        6. The Play framework
        7. Dynamic routing
        8. Actions
          1. Composing the response
          2. Understanding and parsing the request
        9. Interacting with JSON
        10. Querying external APIs and consuming JSON
          1. Calling external web services
          2. Parsing JSON
          3. Asynchronous actions
        11. Creating APIs with Play: a summary
        12. Rest APIs: best practice
        13. Summary
        14. References
      14. 14. Visualization with D3 and the Play Framework
        1. GitHub user data
        2. Do I need a backend?
        3. JavaScript dependencies through web-jars
        4. Towards a web application: HTML templates
        5. Modular JavaScript through RequireJS
        6. Bootstrapping the applications
        7. Client-side program architecture
          1. Designing the model
          2. The event bus
          3. AJAX calls through JQuery
          4. Response views
        8. Drawing plots with NVD3
        9. Summary
        10. References
      15. A. Pattern Matching and Extractors
        1. Pattern matching in for comprehensions
        2. Pattern matching internals
        3. Extracting sequences
        4. Summary
        5. Reference
    7. II. Module 2
      1. 1. Getting Started
        1. Mathematical notation for the curious
        2. Why machine learning?
          1. Classification
          2. Prediction
          3. Optimization
          4. Regression
        3. Why Scala?
          1. Abstraction
            1. Higher-kind projection
            2. Covariant functors for vectors
            3. Contravariant functors for co-vectors
            4. Monads
          2. Scalability
          3. Configurability
          4. Maintainability
          5. Computation on demand
        4. Model categorization
        5. Taxonomy of machine learning algorithms
          1. Unsupervised learning
            1. Clustering
            2. Dimension reduction
          2. Supervised learning
            1. Generative models
            2. Discriminative models
          3. Semi-supervised learning
          4. Reinforcement learning
        6. Don't reinvent the wheel!
        7. Tools and frameworks
          1. Java
          2. Scala
          3. Apache Commons Math
            1. Description
            2. Licensing
            3. Installation
          4. JFreeChart
            1. Description
            2. Licensing
            3. Installation
          5. Other libraries and frameworks
        8. Source code
          1. Context versus view bounds
          2. Presentation
          3. Primitives and implicits
            1. Primitive types
            2. Type conversions
          4. Immutability
          5. Performance of Scala iterators
        9. Let's kick the tires
          1. An overview of computational workflows
          2. Writing a simple workflow
            1. Step 1 – scoping the problem
            2. Step 2 – loading data
            3. Step 3 – preprocessing the data
              1. Immutable normalization
            4. Step 4 – discovering patterns
              1. Analyzing data
              2. Plotting data
            5. Step 5 – implementing the classifier
              1. Selecting an optimizer
              2. Training the model
              3. Classifying observations
            6. Step 6 – evaluating the model
        10. Summary
      2. 2. Hello World!
        1. Modeling
          1. A model by any other name
          2. Model versus design
          3. Selecting features
          4. Extracting features
        2. Defining a methodology
        3. Monadic data transformation
          1. Error handling
          2. Explicit models
          3. Implicit models
        4. A workflow computational model
          1. Supporting mathematical abstractions
            1. Step 1 – variable declaration
            2. Step 2 – model definition
            3. Step 3 – instantiation
          2. Composing mixins to build a workflow
            1. Understanding the problem
            2. Defining modules
            3. Instantiating the workflow
          3. Modularization
        5. Profiling data
          1. Immutable statistics
          2. Z-Score and Gauss
        6. Assessing a model
          1. Validation
            1. Key quality metrics
            2. F-score for binomial classification
            3. F-score for multinomial classification
          2. Cross-validation
            1. One-fold cross validation
            2. K-fold cross validation
          3. Bias-variance decomposition
          4. Overfitting
        7. Summary
      3. 3. Data Preprocessing
        1. Time series in Scala
          1. Types and operations
          2. The magnet pattern
            1. The transpose operator
            2. The differential operator
          3. Lazy views
        2. Moving averages
          1. The simple moving average
          2. The weighted moving average
          3. The exponential moving average
        3. Fourier analysis
          1. Discrete Fourier transform
          2. DFT-based filtering
          3. Detection of market cycles
        4. The discrete Kalman filter
          1. The state space estimation
            1. The transition equation
            2. The measurement equation
          2. The recursive algorithm
            1. Prediction
            2. Correction
            3. Kalman smoothing
            4. Fixed lag smoothing
            5. Experimentation
            6. Benefits and drawbacks
        5. Alternative preprocessing techniques
        6. Summary
      4. 4. Unsupervised Learning
        1. Clustering
          1. K-means clustering
            1. Measuring similarity
            2. Defining the algorithm
            3. Step 1 – cluster configuration
              1. Defining clusters
              2. Initializing clusters
            4. Step 2 – cluster assignment
            5. Step 3 – reconstruction/error minimization
              1. Creating K-means components
              2. Tail recursive implementation
              3. Iterative implementation
            6. Step 4 – classification
            7. The curse of dimensionality
            8. Setting up the evaluation
            9. Evaluating the results
            10. Tuning the number of clusters
            11. Validation
          2. The expectation-maximization algorithm
            1. Gaussian mixture models
            2. Overview of EM
            3. Implementation
            4. Classification
            5. Testing
            6. The online EM algorithm
        2. Dimension reduction
          1. Principal components analysis
            1. Algorithm
            2. Implementation
            3. Test case
            4. Evaluation
          2. Non-linear models
            1. Kernel PCA
            2. Manifolds
        3. Performance considerations
          1. K-means
          2. EM
          3. PCA
        4. Summary
      5. 5. Naïve Bayes Classifiers
        1. Probabilistic graphical models
        2. Naïve Bayes classifiers
          1. Introducing the multinomial Naïve Bayes
            1. Formalism
            2. The frequentist perspective
            3. The predictive model
            4. The zero-frequency problem
          2. Implementation
            1. Design
            2. Training
              1. Class likelihood
              2. Binomial model
              3. The multinomial model
              4. Classifier components
            3. Classification
            4. F1 validation
            5. Feature extraction
            6. Testing
        3. The Multivariate Bernoulli classification
          1. Model
          2. Implementation
        4. Naïve Bayes and text mining
          1. Basics of information retrieval
          2. Implementation
            1. Analyzing documents
            2. Extracting the frequency of relative terms
            3. Generating the features
          3. Testing
            1. Retrieving the textual information
            2. Evaluating the text mining classifier
        5. Pros and cons
        6. Summary
      6. 6. Regression and Regularization
        1. Linear regression
          1. One-variate linear regression
            1. Implementation
            2. Test case
          2. Ordinary least squares regression
            1. Design
            2. Implementation
            3. Test case 1 – trending
            4. Test case 2 – feature selection
        2. Regularization
          1. Ln roughness penalty
          2. Ridge regression
            1. Design
            2. Implementation
            3. Test case
        3. Numerical optimization
        4. Logistic regression
          1. Logistic function
          2. Binomial classification
          3. Design
          4. The training workflow
            1. Step 1 – configuring the optimizer
            2. Step 2 – computing the Jacobian matrix
            3. Step 3 – managing the convergence of the optimizer
            4. Step 4 – defining the least squares problem
            5. Step 5 – minimizing the sum of square errors
            6. Test
          5. Classification
        5. Summary
      7. 7. Sequential Data Models
        1. Markov decision processes
          1. The Markov property
          2. The first order discrete Markov chain
        2. The hidden Markov model
          1. Notations
          2. The lambda model
          3. Design
          4. Evaluation – CF-1
            1. Alpha – the forward pass
            2. Beta – the backward pass
          5. Training – CF-2
            1. The Baum-Welch estimator (EM)
          6. Decoding – CF-3
            1. The Viterbi algorithm
          7. Putting it all together
          8. Test case 1 – training
          9. Test case 2 – evaluation
          10. HMM as a filtering technique
        3. Conditional random fields
          1. Introduction to CRF
          2. Linear chain CRF
        4. Regularized CRFs and text analytics
          1. The feature functions model
          2. Design
          3. Implementation
            1. Configuring the CRF classifier
            2. Training the CRF model
            3. Applying the CRF model
          4. Tests
            1. The training convergence profile
            2. Impact of the size of the training set
            3. Impact of the L2 regularization factor
        5. Comparing CRF and HMM
        6. Performance consideration
        7. Summary
      8. 8. Kernel Models and Support Vector Machines
        1. Kernel functions
          1. An overview
          2. Common discriminative kernels
          3. Kernel monadic composition
        2. Support vector machines
          1. The linear SVM
            1. The separable case – the hard margin
            2. The nonseparable case – the soft margin
          2. The nonlinear SVM
            1. Max-margin classification
            2. The kernel trick
        3. Support vector classifiers – SVC
          1. The binary SVC
            1. LIBSVM
            2. Design
            3. Configuration parameters
              1. The SVM formulation
              2. The SVM kernel function
              3. The SVM execution
            4. Interface to LIBSVM
            5. Training
            6. Classification
            7. C-penalty and margin
            8. Kernel evaluation
            9. Applications in risk analysis
        4. Anomaly detection with one-class SVC
        5. Support vector regression
          1. An overview
          2. SVR versus linear regression
        6. Performance considerations
        7. Summary
      9. 9. Artificial Neural Networks
        1. Feed-forward neural networks
          1. The biological background
          2. Mathematical background
        2. The multilayer perceptron
          1. The activation function
          2. The network topology
          3. Design
          4. Configuration
          5. Network components
            1. The network topology
            2. Input and hidden layers
            3. The output layer
            4. Synapses
            5. Connections
            6. The initialization weights
          6. The model
          7. Problem types (modes)
          8. Online training versus batch training
          9. The training epoch
            1. Step 1 – input forward propagation
              1. The computational flow
              2. Error functions
              3. Operating modes
              4. Softmax
            2. Step 2 – error backpropagation
              1. Weights' adjustment
              2. The error propagation
              3. The computational model
            3. Step 3 – exit condition
            4. Putting it all together
          10. Training and classification
            1. Regularization
            2. The model generation
            3. The Fast Fisher-Yates shuffle
            4. Prediction
            5. Model fitness
        3. Evaluation
          1. The execution profile
          2. Impact of the learning rate
          3. The impact of the momentum factor
          4. The impact of the number of hidden layers
          5. Test case
            1. Implementation
            2. Evaluation of models
            3. Impact of the hidden layers' architecture
        4. Convolution neural networks
          1. Local receptive fields
          2. Sharing of weights
          3. Convolution layers
          4. Subsampling layers
          5. Putting it all together
        5. Benefits and limitations
        6. Summary
      10. 10. Genetic Algorithms
        1. Evolution
          1. The origin
          2. NP problems
          3. Evolutionary computing
        2. Genetic algorithms and machine learning
        3. Genetic algorithm components
          1. Encoding
            1. Value encoding
            2. Predicate encoding
            3. Solution encoding
            4. The encoding scheme
              1. Flat encoding
              2. Hierarchical encoding
          2. Genetic operators
            1. Selection
            2. Crossover
            3. Mutation
          3. The fitness score
        4. Implementation
          1. Software design
          2. Key components
            1. Population
            2. Chromosomes
            3. Genes
          3. Selection
          4. Controlling the population growth
          5. The GA configuration
          6. Crossover
            1. Population
            2. Chromosomes
            3. Genes
          7. Mutation
            1. Population
            2. Chromosomes
            3. Genes
          8. Reproduction
          9. Solver
        5. GA for trading strategies
          1. Definition of trading strategies
            1. Trading operators
            2. The cost function
            3. Trading signals
            4. Trading strategies
            5. Trading signal encoding
          2. A test case
            1. Creating trading strategies
            2. Configuring the optimizer
            3. Finding the best trading strategy
            4. Tests
              1. The weighted score
              2. The unweighted score
        6. Advantages and risks of genetic algorithms
        7. Summary
      11. 11. Reinforcement Learning
        1. Reinforcement learning
          1. The problem
          2. A solution – Q-learning
            1. Terminology
            2. Concepts
            3. Value of a policy
            4. The Bellman optimality equations
            5. Temporal difference for model-free learning
            6. Action-value iterative update
          3. Implementation
            1. Software design
            2. The states and actions
            3. The search space
            4. The policy and action-value
            5. The Q-learning components
            6. The Q-learning training
            7. Tail recursion to the rescue
            8. The validation
            9. The prediction
          4. Option trading using Q-learning
            1. The OptionProperty class
            2. The OptionModel class
            3. Quantization
          5. Putting it all together
          6. Evaluation
          7. Pros and cons of reinforcement learning
        2. Learning classifier systems
          1. Introduction to LCS
          2. Why LCS?
          3. Terminology
          4. Extended learning classifier systems
          5. XCS components
            1. Application to portfolio management
            2. The XCS core data
            3. XCS rules
            4. Covering
            5. An implementation example
          6. Benefits and limitations of learning classifier systems
        3. Summary
      12. 12. Scalable Frameworks
        1. An overview
        2. Scala
          1. Object creation
          2. Streams
          3. Parallel collections
            1. Processing a parallel collection
            2. The benchmark framework
            3. Performance evaluation
        3. Scalability with Actors
          1. The Actor model
          2. Partitioning
          3. Beyond actors – reactive programming
        4. Akka
          1. Master-workers
            1. Exchange of messages
            2. Worker actors
            3. The workflow controller
            4. The master actor
            5. Master with routing
            6. Distributed discrete Fourier transform
            7. Limitations
          2. Futures
            1. The Actor life cycle
            2. Blocking on futures
            3. Handling future callbacks
            4. Putting it all together
        5. Apache Spark
          1. Why Spark?
          2. Design principles
            1. In-memory persistency
            2. Laziness
            3. Transforms and actions
            4. Shared variables
          3. Experimenting with Spark
            1. Deploying Spark
            2. Using Spark shell
            3. MLlib
            4. RDD generation
            5. K-means using Spark
          4. Performance evaluation
            1. Tuning parameters
            2. Tests
            3. Performance considerations
          5. Pros and cons
          6. 0xdata Sparkling Water
        6. Summary
      13. A. Basic Concepts
        1. Scala programming
          1. List of libraries and tools
          2. Code snippets format
          3. Best practices
            1. Encapsulation
            2. Class constructor template
            3. Companion objects versus case classes
            4. Enumerations versus case classes
            5. Overloading
            6. Design template for immutable classifiers
          4. Utility classes
            1. Data extraction
            2. Data sources
            3. Extraction of documents
            4. DMatrix class
            5. Counter
            6. Monitor
        2. Mathematics
          1. Linear algebra
            1. QR decomposition
            2. LU factorization
            3. LDL decomposition
            4. Cholesky factorization
            5. Singular Value Decomposition
            6. Eigenvalue decomposition
            7. Algebraic and numerical libraries
          2. First order predicate logic
          3. Jacobian and Hessian matrices
          4. Summary of optimization techniques
            1. Gradient descent methods
              1. Steepest descent
              2. Conjugate gradient
              3. Stochastic gradient descent
            2. Quasi-Newton algorithms
              1. BFGS
              2. L-BFGS
            3. Nonlinear least squares minimization
              1. Gauss-Newton
              2. Levenberg-Marquardt
            4. Lagrange multipliers
          5. Overview of dynamic programming
        3. Finances 101
          1. Fundamental analysis
          2. Technical analysis
            1. Terminology
            2. Trading data
            3. Trading signals and strategy
            4. Price patterns
          3. Options trading
          4. Financial data sources
        4. Suggested online courses
        5. References
    8. III. Module 3
      1. 1. Exploratory Data Analysis
        1. Getting started with Scala
        2. Distinct values of a categorical field
        3. Summarization of a numeric field
          1. Grepping across multiple fields
        4. Basic, stratified, and consistent sampling
        5. Working with Scala and Spark Notebooks
        6. Basic correlations
        7. Summary
      2. 2. Data Pipelines and Modeling
        1. Influence diagrams
        2. Sequential trials and dealing with risk
        3. Exploration and exploitation
        4. Unknown unknowns
        5. Basic components of a data-driven system
          1. Data ingest
          2. Data transformation layer
          3. Data analytics and machine learning
          4. UI component
          5. Actions engine
          6. Correlation engine
          7. Monitoring
        6. Optimization and interactivity
          1. Feedback loops
        7. Summary
      3. 3. Working with Spark and MLlib
        1. Setting up Spark
        2. Understanding Spark architecture
          1. Task scheduling
          2. Spark components
          3. MQTT, ZeroMQ, Flume, and Kafka
          4. HDFS, Cassandra, S3, and Tachyon
          5. Mesos, YARN, and Standalone
        3. Applications
          1. Word count
          2. Streaming word count
          3. Spark SQL and DataFrame
        4. ML libraries
          1. SparkR
          2. Graph algorithms – GraphX and GraphFrames
        5. Spark performance tuning
        6. Running Hadoop HDFS
        7. Summary
      4. 4. Supervised and Unsupervised Learning
        1. Records and supervised learning
          1. Iris dataset
          2. Labeled point
          3. SVMWithSGD
          4. Logistic regression
          5. Decision tree
          6. Bagging and boosting – ensemble learning methods
        2. Unsupervised learning
        3. Problem dimensionality
        4. Summary
      5. 5. Regression and Classification
        1. What regression stands for?
        2. Continuous space and metrics
        3. Linear regression
        4. Logistic regression
        5. Regularization
        6. Multivariate regression
        7. Heteroscedasticity
        8. Regression trees
        9. Classification metrics
        10. Multiclass problems
        11. Perceptron
        12. Generalization error and overfitting
        13. Summary
      6. 6. Working with Unstructured Data
        1. Nested data
        2. Other serialization formats
        3. Hive and Impala
        4. Sessionization
        5. Working with traits
        6. Working with pattern matching
        7. Other uses of unstructured data
        8. Probabilistic structures
        9. Projections
        10. Summary
      7. 7. Working with Graph Algorithms
        1. A quick introduction to graphs
        2. SBT
        3. Graph for Scala
          1. Adding nodes and edges
          2. Graph constraints
          3. JSON
        4. GraphX
          1. Who is getting e-mails?
          2. Connected components
          3. Triangle counting
          4. Strongly connected components
          5. PageRank
          6. SVD++
        5. Summary
      8. 8. Integrating Scala with R and Python
        1. Integrating with R
          1. Setting up R and SparkR
            1. Linux
            2. Mac OS
            3. Windows
            4. Running SparkR via scripts
            5. Running Spark via R's command line
          2. DataFrames
          3. Linear models
          4. Generalized linear model
          5. Reading JSON files in SparkR
          6. Writing Parquet files in SparkR
          7. Invoking Scala from R
            1. Using Rserve
        2. Integrating with Python
          1. Setting up Python
          2. PySpark
          3. Calling Python from Java/Scala
            1. Using sys.process._
            2. Spark pipe
            3. Jython and JSR 223
        3. Summary
      9. 9. NLP in Scala
        1. Text analysis pipeline
          1. Simple text analysis
        2. MLlib algorithms in Spark
          1. TF-IDF
          2. LDA
        3. Segmentation, annotation, and chunking
        4. POS tagging
        5. Using word2vec to find word relationships
          1. A Porter Stemmer implementation of the code
        6. Summary
      10. 10. Advanced Model Monitoring
        1. System monitoring
        2. Process monitoring
        3. Model monitoring
          1. Performance over time
          2. Criteria for model retiring
          3. A/B testing
        4. Summary
    9. A. Bibliography
    10. Index