Scala for Data Science

Name: Scala for Data Science
Author: Pascal Bugnion
ISBN: 9781785281372

Released: January 2016
Level: Intermediate to advanced
Length: 416 pages
Time to complete: 8h 54m
Language: English
Publisher: Packt Publishing

Scala for Data Science
Table of Contents
Scala for Data Science
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Installing the JDK
Installing and using SBT
Who this book is for

Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
eBooks, discount offers, and more
Questions
1. Scala and Data Science
Data science
Programming in data science
Why Scala?
Static typing and type inference
Scala encourages immutability
Scala and functional programs
Null pointer uncertainty
Easier parallelism
Interoperability with Java
When not to use Scala
Summary
References
2. Manipulating Data with Breeze
Code examples
Installing Breeze
Getting help on Breeze
Basic Breeze data types
Vectors
Dense and sparse vectors and the vector trait
Matrices
Building vectors and matrices
Advanced indexing and slicing
Mutating vectors and matrices
Matrix multiplication, transposition, and the orientation of vectors
Data preprocessing and feature engineering
Breeze – function optimization
Numerical derivatives
Regularization
An example – logistic regression
Towards re-usable code
Alternatives to Breeze
Summary
References
3. Plotting with breeze-viz
Diving into Breeze
Customizing plots
Customizing the line type
More advanced scatter plots
Multi-plot example – scatterplot matrix plots
Managing without documentation
Breeze-viz reference
Data visualization beyond breeze-viz
Summary
4. Parallel Collections and Futures
Parallel collections
Limitations of parallel collections
Error handling
Setting the parallelism level
An example – cross-validation with parallel collections
Futures
Future composition – using a future's result
Blocking until completion
Controlling parallel execution with execution contexts
Futures example – stock price fetcher
Summary
References
5. Scala and SQL through JDBC
Interacting with JDBC
First steps with JDBC
Connecting to a database server
Creating tables
Inserting data
Reading data
JDBC summary
Functional wrappers for JDBC
Safer JDBC connections with the loan pattern
Enriching JDBC statements with the "pimp my library" pattern
Wrapping result sets in a stream
Looser coupling with type classes
Type classes
Coding against type classes
When to use type classes
Benefits of type classes
Creating a data access layer
Summary
References
6. Slick – A Functional Interface for SQL
FEC data
Importing Slick
Defining the schema
Connecting to the database
Creating tables
Inserting data
Querying data
Invokers
Operations on columns
Aggregations with "Group by"
Accessing database metadata
Slick versus JDBC
Summary
References
7. Web APIs
A whirlwind tour of JSON
Querying web APIs
JSON in Scala – an exercise in pattern matching
JSON4S types
Extracting fields using XPath
Extraction using case classes
Concurrency and exception handling with futures
Authentication – adding HTTP headers
HTTP – a whirlwind overview
Adding headers to HTTP requests in Scala
Summary
References
8. Scala and MongoDB
MongoDB
Connecting to MongoDB with Casbah
Connecting with authentication
Inserting documents
Extracting objects from the database
Complex queries
Casbah query DSL
Custom type serialization
Beyond Casbah
Summary
References
9. Concurrency with Akka
GitHub follower graph
Actors as people
Hello world with Akka
Case classes as messages
Actor construction
Anatomy of an actor
Follower network crawler
Fetcher actors
Routing
Message passing between actors
Queue control and the pull pattern
Accessing the sender of a message
Stateful actors
Follower network crawler
Fault tolerance
Custom supervisor strategies
Life-cycle hooks
What we have not talked about
Summary
References
10. Distributed Batch Processing with Spark
Installing Spark
Acquiring the example data
Resilient distributed datasets
RDDs are immutable
RDDs are lazy
RDDs know their lineage
RDDs are resilient
RDDs are distributed
Transformations and actions on RDDs
Persisting RDDs
Key-value RDDs
Double RDDs
Building and running standalone programs
Running Spark applications locally
Reducing logging output and Spark configuration
Running Spark applications on EC2
Spam filtering
Lifting the hood
Data shuffling and partitions
Summary
Reference
11. Spark SQL and DataFrames
DataFrames – a whirlwind introduction
Aggregation operations
Joining DataFrames together
Custom functions on DataFrames
DataFrame immutability and persistence
SQL statements on DataFrames
Complex data types – arrays, maps, and structs
Structs
Arrays
Maps
Interacting with data sources
JSON files
Parquet files
Standalone programs
Summary
References
12. Distributed Machine Learning with MLlib
Introducing MLlib – Spam classification
Pipeline components
Transformers
Estimators
Evaluation
Regularization in logistic regression
Cross-validation and model selection
Beyond logistic regression
Summary
References
13. Web APIs with Play
Client-server applications
Introduction to web frameworks
Model-View-Controller architecture
Single page applications
Building an application
The Play framework
Dynamic routing
Actions
Composing the response
Understanding and parsing the request
Interacting with JSON
Querying external APIs and consuming JSON
Calling external web services
Parsing JSON
Asynchronous actions
Creating APIs with Play: a summary
Rest APIs: best practice
Summary
References
14. Visualization with D3 and the Play Framework
GitHub user data
Do I need a backend?
JavaScript dependencies through web-jars
Towards a web application: HTML templates
Modular JavaScript through RequireJS
Bootstrapping the applications
Client-side program architecture
Designing the model
The event bus
AJAX calls through JQuery
Response views
Drawing plots with NVD3
Summary
References
A. Pattern Matching and Extractors
Pattern matching in for comprehensions
Pattern matching internals
Extracting sequences
Summary
Reference
Index

Content preview from Scala for Data Science

Cross-validation and model selection

In the previous example, we validated our approach by withholding 30% of the data when training, and testing on this subset. This approach is not particularly rigorous: the exact result changes depending on the random train-test split. Furthermore, if we wanted to test several different hyperparameters (or different models) to choose the best one, we would, unwittingly, choose the model that best reflects the specific rows in our test set, rather than the population as a whole.

This can be overcome with cross-validation. We have already encountered cross-validation in Chapter 4, Parallel Collections and Futures. In that chapter, we used random subsample cross-validation, where we created the train-test split ...
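
The general idea behind random subsample cross-validation (repeating the random train-test split several times and averaging the resulting scores) can be sketched in plain Scala. This is a minimal illustration under stated assumptions, not the book's code: the names `RandomSubsampleCV`, `trainTestSplit`, `crossValidate`, and the `fitAndScore` callback are hypothetical, and no Spark or MLlib API is used.

```scala
import scala.util.Random

// Hypothetical helper object, for illustration only.
object RandomSubsampleCV {

  /** Shuffle `data` and withhold `testFraction` of it for testing.
    * Returns the pair (train, test). */
  def trainTestSplit[A](
      data: Vector[A],
      testFraction: Double,
      rng: Random): (Vector[A], Vector[A]) = {
    val shuffled = rng.shuffle(data)
    val nTest = math.round(data.size * testFraction).toInt
    val (test, train) = shuffled.splitAt(nTest)
    (train, test)
  }

  /** Repeat the random split `nSplits` times and average the score that
    * the caller-supplied `fitAndScore(train, test)` returns for each split. */
  def crossValidate[A](
      data: Vector[A],
      testFraction: Double,
      nSplits: Int,
      seed: Long)(
      fitAndScore: (Vector[A], Vector[A]) => Double): Double = {
    require(nSplits > 0, "nSplits must be positive")
    val rng = new Random(seed) // fixed seed makes the run reproducible
    val scores = Vector.fill(nSplits) {
      val (train, test) = trainTestSplit(data, testFraction, rng)
      fitAndScore(train, test)
    }
    scores.sum / scores.size
  }
}
```

A call such as `RandomSubsampleCV.crossValidate(data, 0.3, 10, 42L)(fitAndScore)` would average the score over ten different 70/30 splits, so the result depends less on any single random split than one train-test evaluation does.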
