Mastering Spark for Data Science

Book Description

Master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products

About This Book

  • Develop and apply advanced analytical techniques with Spark
  • Learn how to tell a compelling story with data science using Spark's ecosystem
  • Explore data at scale and work with cutting-edge data science methods

Who This Book Is For

This book is for those who have beginner-level familiarity with the Spark architecture and data science applications, especially those who are looking for a challenge and want to learn cutting-edge techniques. It assumes a working knowledge of data science, common machine learning methods, and popular data science tools, and that you have previously run proof-of-concept studies and built prototypes.

What You Will Learn

  • Learn the design patterns that integrate Spark into industrialized data science pipelines
  • See how commercial data scientists design scalable, reusable code for data science services
  • Explore cutting-edge data science methods so that you can study trends and causality
  • Discover advanced programming techniques using RDD and the DataFrame and Dataset APIs
  • Find out how Spark can be used as a universal ingestion engine and as a web scraper
  • Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining
  • Get to know the best practices for extended exploratory data analysis, as commonly used in commercial data science teams
  • Study advanced Spark concepts, solution design patterns, and integration architectures
  • Demonstrate powerful data science pipelines

In Detail

Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. In order to operate at this level you need to build data science solutions of substance: solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs.

This book takes a deep dive into using Spark to deliver production-grade data science solutions. This process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more.

You will be introduced to advanced techniques and methods that will help you to construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.

Style and approach

This is an advanced guide for those with beginner-level familiarity with the Spark architecture and data science applications. Mastering Spark for Data Science is a practical tutorial that uses core Spark APIs and takes a deep dive into advanced libraries including Spark SQL, Spark Streaming, and MLlib. This book builds on titles such as Machine Learning with Spark and Learning Spark. It is the next learning curve for those comfortable with Spark and looking to improve their skills.

Table of Contents

  1. Mastering Spark for Data Science
    1. Mastering Spark for Data Science
    2. Credits
    3. Foreword
    4. About the Authors
    5. About the Reviewer
    6. www.PacktPub.com
      1. Why subscribe?
    7. Customer Feedback
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    9. 1. The Big Data Science Ecosystem
      1. Introducing the Big Data ecosystem
        1. Data management
        2. Data management responsibilities
        3. The right tool for the job
      2. Overall architecture
        1. Data Ingestion
        2. Data Lake
          1. Reliable storage
          2. Scalable data processing capability
        3. Data science platform
        4. Data Access
      3. Data technologies
        1. The role of Apache Spark
      4. Companion tools
        1. Apache HDFS
          1. Advantages
          2. Disadvantages
          3. Installation
        2. Amazon S3
          1. Advantages
          2. Disadvantages
          3. Installation
        3. Apache Kafka
          1. Advantages
          2. Disadvantages
          3. Installation
        4. Apache Parquet
          1. Advantages
          2. Disadvantages
          3. Installation
        5. Apache Avro
          1. Advantages
          2. Disadvantages
          3. Installation
        6. Apache NiFi
          1. Advantages
          2. Disadvantages
          3. Installation
        7. Apache YARN
          1. Advantages
          2. Disadvantages
          3. Installation
        8. Apache Lucene
          1. Advantages
          2. Disadvantages
          3. Installation
        9. Kibana
          1. Advantages
          2. Disadvantages
          3. Installation
        10. Elasticsearch
          1. Advantages
          2. Disadvantages
          3. Installation
        11. Accumulo
          1. Advantages
          2. Disadvantages
          3. Installation
      5. Summary
    10. 2. Data Acquisition
      1. Data pipelines
        1. Universal ingestion framework
        2. Introducing the GDELT news stream
          1. Discovering GDELT in real-time
          2. Our first GDELT feed
          3. Improving with publish and subscribe
      2. Content registry
        1. Choices and more choices
        2. Going with the flow
        3. Metadata model
        4. Kibana dashboard
      3. Quality assurance
        1. Example 1 - Basic quality checking, no contending users
        2. Example 2 - Advanced quality checking, no contending users
        3. Example 3 - Basic quality checking, 50% utility due to contending users
      4. Summary
    11. 3. Input Formats and Schema
      1. A structured life is a good life
      2. GDELT dimensional modeling
        1. GDELT model
          1. First look at the data
          2. Core global knowledge graph model
          3. Hidden complexity
          4. Denormalized models
          5. Challenges with flattened data
            1. Issue 1 - Loss of contextual information
            2. Issue 2 - Re-establishing dimensions
            3. Issue 3 - Including reference data
      3. Loading your data
        1. Schema agility
          1. Reality check
        2. GKG ELT
          1. Position matters
      4. Avro
        1. Spark-Avro method
        2. Pedagogical method
        3. When to perform Avro transformation
      5. Parquet
      6. Summary
    12. 4. Exploratory Data Analysis
      1. The problem, principles and planning
        1. Understanding the EDA problem
        2. Design principles
        3. General plan of exploration
      2. Preparation
        1. Introducing mask based data profiling
        2. Introducing character class masks
        3. Building a mask based profiler
          1. Setting up Apache Zeppelin
          2. Constructing a reusable notebook
      3. Exploring GDELT
        1. GDELT GKG datasets
          1. The files
          2. Special collections
          3. Reference data
        2. Exploring the GKG v2.1
          1. The Translingual files
          2. A configurable GCAM time series EDA
          3. Plot.ly charting on Apache Zeppelin
          4. Exploring translation sourced GCAM sentiment with plot.ly
          5. Concluding remarks
          6. A configurable GCAM Spatio-Temporal EDA
          7. Introducing GeoGCAM
          8. Does our spatial pivot work?
      4. Summary
    13. 5. Spark for Geographic Analysis
      1. GDELT and oil
        1. GDELT events
        2. GDELT GKG
      2. Formulating a plan of action
      3. GeoMesa
        1. Installing
        2. GDELT Ingest
        3. GeoMesa Ingest
          1. MapReduce to Spark
        4. Geohash
        5. GeoServer
          1. Map layers
          2. CQL
      4. Gauging oil prices
        1. Using the GeoMesa query API
        2. Data preparation
        3. Machine learning
        4. Naive Bayes
        5. Results
        6. Analysis
      5. Summary
    14. 6. Scraping Link-Based External Data
      1. Building a web scale news scanner
        1. Accessing the web content
          1. The Goose library
        2. Integration with Spark
          1. Scala compatibility
          2. Serialization issues
        3. Creating a scalable, production-ready library
          1. Build once, read many
          2. Exception handling
          3. Performance tuning
      2. Named entity recognition
        1. Scala libraries
        2. NLP walkthrough
          1. Extracting entities
          2. Abstracting methods
        3. Building a scalable code
          1. Build once, read many
          2. Scalability is also a state of mind
          3. Performance tuning
      3. GIS lookup
        1. GeoNames dataset
        2. Building an efficient join
          1. Offline strategy - Bloom filtering
          2. Online strategy - Hash partitioning
        3. Content deduplication
          1. Context learning
          2. Location scoring
      4. Names de-duplication
        1. Functional programming with Scalaz
          1. Our de-duplication strategy
          2. Using the mappend operator
        2. Simple clean
        3. DoubleMetaphone
      5. News index dashboard
      6. Summary
    15. 7. Building Communities
      1. Building a graph of persons
        1. Contact chaining
        2. Extracting data from Elasticsearch
      2. Using the Accumulo database
        1. Setup Accumulo
        2. Cell security
        3. Iterators
        4. Elasticsearch to Accumulo
          1. A graph data model in Accumulo
          2. Hadoop input and output formats
        5. Reading from Accumulo
        6. AccumuloGraphxInputFormat and EdgeWritable
        7. Building a graph
      3. Community detection algorithm
        1. Louvain algorithm
        2. Weighted Community Clustering (WCC)
          1. Description
          2. Preprocessing stage
          3. Initial communities
            1. Message passing
            2. Community back propagation
          4. WCC iteration
            1. Gathering community statistics
            2. WCC Computation
            3. WCC iteration
      4. GDELT dataset
        1. The Bowie effect
        2. Smaller communities
        3. Using Accumulo cell level security
      5. Summary
    16. 8. Building a Recommendation System
      1. Different approaches
        1. Collaborative filtering
        2. Content-based filtering
        3. Custom approach
      2. Uninformed data
        1. Processing bytes
        2. Creating a scalable code
        3. From time to frequency domain
          1. Fast Fourier transform
          2. Sampling by time window
          3. Extracting audio signatures
      3. Building a song analyzer
        1. Selling data science is all about selling cupcakes
          1. Using Cassandra
          2. Using the Play framework
      4. Building a recommender
        1. The PageRank algorithm
          1. Building a Graph of Frequency Co-occurrence
          2. Running PageRank
        2. Building personalized playlists
        3. Expanding our cupcake factory
          1. Building a playlist service
          2. Leveraging the Spark job server
          3. User interface
      5. Summary
    17. 9. News Dictionary and Real-Time Tagging System
      1. The mechanical Turk
        1. Human intelligence tasks
        2. Bootstrapping a classification model
          1. Learning from Stack Exchange
          2. Building text features
          3. Training a Naive Bayes model
        3. Laziness, impatience, and hubris
      2. Designing a Spark Streaming application
        1. A tale of two architectures
          1. The CAP theorem
          2. The Greeks are here to help
        2. Importance of the Lambda architecture
        3. Importance of the Kappa architecture
      3. Consuming data streams
        1. Creating a GDELT data stream
          1. Creating a Kafka topic
          2. Publishing content to a Kafka topic
          3. Consuming Kafka from Spark Streaming
        2. Creating a Twitter data stream
      4. Processing Twitter data
        1. Extracting URLs and hashtags
        2. Keeping popular hashtags
        3. Expanding shortened URLs
      5. Fetching HTML content
      6. Using Elasticsearch as a caching layer
      7. Classifying data
        1. Training a Naive Bayes model
        2. Thread safety
        3. Predict the GDELT data
      8. Our Twitter mechanical Turk
      9. Summary
    18. 10. Story De-duplication and Mutation
      1. Detecting near duplicates
        1. First steps with hashing
        2. Standing on the shoulders of the Internet giants
          1. Simhashing
          2. The Hamming weight
        3. Detecting near duplicates in GDELT
        4. Indexing the GDELT database
          1. Persisting our RDDs
          2. Building a REST API
          3. Area of improvement
      2. Building stories
        1. Building term frequency vectors
        2. The curse of dimensionality, the data science plague
        3. Optimizing KMeans
      3. Story mutation
        1. The Equilibrium state
        2. Tracking stories over time
          1. Building a streaming application
          2. Streaming KMeans
          3. Visualization
        3. Building story connections
      4. Summary
    19. 11. Anomaly Detection on Sentiment Analysis
      1. Following the US elections on Twitter
        1. Acquiring data in stream
        2. Acquiring data in batch
          1. The search API
          2. Rate limit
      2. Analysing sentiment
        1. Massaging Twitter data
        2. Using the Stanford NLP
        3. Building the Pipeline
      3. Using Timely as a time series database
        1. Storing data
        2. Using Grafana to visualize sentiment
          1. Number of processed tweets
          2. Give me my Twitter account back
          3. Identifying the swing states
      4. Twitter and the Godwin point
        1. Learning context
        2. Visualizing our model
        3. Word2Graph and Godwin point
          1. Building a Word2Graph
          2. Random walks
      5. A Small Step into sarcasm detection
        1. Building features
          1. #LoveTrumpsHates
          2. Scoring Emojis
          3. Training a KMeans model
        2. Detecting anomalies
      6. Summary
    20. 12. TrendCalculus
      1. Studying trends
      2. The TrendCalculus algorithm
        1. Trend windows
        2. Simple trend
        3. User Defined Aggregate Functions
        4. Simple trend calculation
        5. Reversal rule
        6. Introducing the FHLS bar structure
        7. Visualize the data
          1. FHLS with reversals
          2. Edge cases
            1. Zero values
            2. Completing the gaps
          3. Stackable processing
      3. Practical applications
        1. Algorithm characteristics
          1. Advantages
          2. Disadvantages
        2. Possible use cases
          1. Chart annotation
          2. Co-trending
          3. Data reduction
          4. Indexing
          5. Fractal dimension
          6. Streaming proxy for piecewise linear regression
      4. Summary
    21. 13. Secure Data
      1. Data security
        1. The problem
        2. The basics
      2. Authentication and authorization
        1. Access control lists (ACL)
        2. Role-based access control (RBAC)
      3. Access
      4. Encryption
        1. Data at rest
          1. Java KeyStore
          2. S3 encryption
        2. Data in transit
        3. Obfuscation/Anonymizing
        4. Masking
        5. Tokenization
          1. Using a Hybrid approach
      5. Data disposal
      6. Kerberos authentication
        1. Use case 1: Apache Spark accessing data in secure HDFS
        2. Use case 2: extending to automated authentication
        3. Use case 3: connecting to secure databases from Spark
      7. Security ecosystem
        1. Apache Sentry
        2. RecordService
        3. Apache Ranger
        4. Apache Knox
      8. Your Secure Responsibility
      9. Summary
    22. 14. Scalable Algorithms
      1. General principles
      2. Spark architecture
        1. History of Spark
        2. Moving parts
          1. Driver
          2. SparkSession
          3. Resilient distributed datasets (RDDs)
          4. Executor
          5. Shuffle operation
          6. Cluster Manager
          7. Task
          8. DAG
          9. DAG scheduler
          10. Transformations
          11. Stages
          12. Actions
          13. Task scheduler
      3. Challenges
        1. Algorithmic complexity
        2. Numerical anomalies
        3. Shuffle
        4. Data schemes
      4. Plotting your course
        1. Be iterative
          1. Data preparation
          2. Scale up slowly
          3. Estimate performance
          4. Step through carefully
          5. Tune your analytic
      5. Design patterns and techniques
        1. Spark APIs
          1. Problem
          2. Solution
            1. Example
        2. Summary pattern
          1. Problem
          2. Solution
            1. Example
        3. Expand and Conquer Pattern
          1. Problem
          2. Solution
        4. Lightweight Shuffle
          1. Problem
          2. Solution
        5. Wide Table pattern
          1. Problem
          2. Solution
            1. Example
        6. Broadcast variables pattern
          1. Problem
          2. Solution
            1. Creating a broadcast variable
            2. Accessing a broadcast variable
            3. Removing a broadcast variable
            4. Example
        7. Combiner pattern
          1. Problem
          2. Solution
            1. Example
        8. Optimized cluster
          1. Problem
          2. Solution
        9. Redistribution pattern
          1. Problem
          2. Solution
            1. Example
        10. Salting key pattern
          1. Problem
          2. Solution
        11. Secondary sort pattern
          1. Problem
          2. Solution
            1. Example
        12. Filter overkill pattern
          1. Problem
          2. Solution
        13. Probabilistic algorithms
          1. Problem
          2. Solution
            1. Example
        14. Selective caching
          1. Problem
          2. Solution
        15. Garbage collection
          1. Problem
          2. Solution
        16. Graph traversal
          1. Problem
          2. Solution
          3. Example
      6. Summary