Book description
Leverage the power of Scala with different tools to build scalable, robust data science applications
About This Book
- A complete guide for scalable data science solutions, from data ingestion to data visualization
- Deploy horizontally scalable data processing pipelines and take advantage of web frameworks to build engaging visualizations
- Build functional, type-safe routines to interact with relational and NoSQL databases with the help of tutorials and examples provided
Who This Book Is For
If you are a Scala developer or data scientist, or if you want to enter the field of data science, then this book will give you all the tools you need to implement data science solutions.
What You Will Learn
- Transform and filter tabular data to extract features for machine learning
- Implement your own algorithms or take advantage of MLlib's extensive suite of models to build distributed machine learning pipelines
- Read, transform, and write data to both SQL and NoSQL databases in a functional manner
- Write robust routines to query web APIs
- Read data from web APIs such as the GitHub or Twitter API
- Use Scala to interact with MongoDB, which offers high performance and is well suited to storing large data sets with uncertain query requirements
- Create Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations
- Deploy scalable parallel applications using Apache Spark, loading data from HDFS or Hive
In Detail
Scala is a multi-paradigm programming language (it supports both object-oriented and functional programming) and scripting language used to build applications for the JVM. Languages such as R, Python, and Java are the ones most commonly used for data science. Scala, however, is particularly good at analyzing large data sets without a significant impact on performance, and so it is being adopted by many developers and data scientists. Data scientists may be aware that building truly scalable applications is hard. Scala, with its powerful functional libraries for interacting with databases and building scalable frameworks, will give you the tools to construct robust data pipelines.
This book will introduce you to the libraries for ingesting, storing, manipulating, processing, and visualizing data in Scala.
Packed with real-world examples and interesting data sets, this book will teach you to ingest data from flat files and web APIs and store it in a SQL or NoSQL database. It will show you how to design scalable architectures to process and model your data, starting from simple concurrency constructs such as parallel collections and futures, through to actor systems and Apache Spark. As well as learning about Scala's emphasis on functional structures and immutability, you will learn how to use the right parallel construct for the job at hand, minimizing development time without compromising scalability. Finally, you will learn how to build beautiful interactive visualizations using web frameworks.
This book gives tutorials on some of the most common Scala libraries for data science, allowing you to quickly get up to speed with building data science and data engineering solutions.
Style and approach
A tutorial with complete examples, this book will give you the tools to start building useful data engineering and data science solutions straight away.
Table of contents
- Scala for Data Science
- Table of Contents
- Scala for Data Science
- Credits
- About the Author
- About the Reviewers
- www.PacktPub.com
- Preface
- 1. Scala and Data Science
- 2. Manipulating Data with Breeze
- Code examples
- Installing Breeze
- Getting help on Breeze
- Basic Breeze data types
- Vectors
- Dense and sparse vectors and the vector trait
- Matrices
- Building vectors and matrices
- Advanced indexing and slicing
- Mutating vectors and matrices
- Matrix multiplication, transposition, and the orientation of vectors
- Data preprocessing and feature engineering
- Breeze – function optimization
- Numerical derivatives
- Regularization
- An example – logistic regression
- Towards re-usable code
- Alternatives to Breeze
- Summary
- References
- 3. Plotting with breeze-viz
- 4. Parallel Collections and Futures
- 5. Scala and SQL through JDBC
- Interacting with JDBC
- First steps with JDBC
- JDBC summary
- Functional wrappers for JDBC
- Safer JDBC connections with the loan pattern
- Enriching JDBC statements with the "pimp my library" pattern
- Wrapping result sets in a stream
- Looser coupling with type classes
- Creating a data access layer
- Summary
- References
- 6. Slick – A Functional Interface for SQL
- 7. Web APIs
- 8. Scala and MongoDB
- 9. Concurrency with Akka
- GitHub follower graph
- Actors as people
- Hello world with Akka
- Case classes as messages
- Actor construction
- Anatomy of an actor
- Follower network crawler
- Fetcher actors
- Routing
- Message passing between actors
- Queue control and the pull pattern
- Accessing the sender of a message
- Stateful actors
- Follower network crawler
- Fault tolerance
- Custom supervisor strategies
- Life-cycle hooks
- What we have not talked about
- Summary
- References
- 10. Distributed Batch Processing with Spark
- 11. Spark SQL and DataFrames
- DataFrames – a whirlwind introduction
- Aggregation operations
- Joining DataFrames together
- Custom functions on DataFrames
- DataFrame immutability and persistence
- SQL statements on DataFrames
- Complex data types – arrays, maps, and structs
- Interacting with data sources
- Standalone programs
- Summary
- References
- 12. Distributed Machine Learning with MLlib
- 13. Web APIs with Play
- Client-server applications
- Introduction to web frameworks
- Model-View-Controller architecture
- Single page applications
- Building an application
- The Play framework
- Dynamic routing
- Actions
- Interacting with JSON
- Querying external APIs and consuming JSON
- Creating APIs with Play: a summary
- Rest APIs: best practice
- Summary
- References
- 14. Visualization with D3 and the Play Framework
- A. Pattern Matching and Extractors
- Index
Product information
- Title: Scala for Data Science
- Author(s):
- Release date: January 2016
- Publisher(s): Packt Publishing
- ISBN: 9781785281372