Mastering Java for Data Science

Book Description

Use Java to create a diverse range of Data Science applications and bring Data Science into production

About This Book

  • An overview of modern Data Science and Machine Learning libraries available in Java
  • Coverage of a broad set of topics, from the basics of Machine Learning to Deep Learning and Big Data frameworks
  • Easy-to-follow illustrations and the running example of building a search engine

Who This Book Is For

This book is intended for software engineers who are comfortable with developing Java applications and are familiar with the basic concepts of data science. It will also be useful for data scientists who do not yet know Java but want or need to learn it.

If you want to build efficient data science applications and bring them into the enterprise environment without changing the existing stack, this book is for you!

What You Will Learn

  • Get a solid understanding of the data processing toolbox available in Java
  • Explore the data science ecosystem available in Java
  • Find out how to approach different machine learning problems with Java
  • Process unstructured information such as natural language text or images
  • Create your own search engine
  • Get state-of-the-art performance with XGBoost
  • Learn how to build deep neural networks with DeepLearning4j
  • Build applications that scale and process large amounts of data
  • Deploy data science models to production and evaluate their performance

In Detail

Java is the most popular programming language, according to the TIOBE index, and it is a typical choice for running production systems in many companies, both in the startup world and among large enterprises.

Not surprisingly, it is also a common choice for creating data science applications: it is fast and has a great set of data processing tools, both built-in and external. What is more, choosing Java for data science allows you to easily integrate solutions with existing software, and bring data science into production with less effort.

This book will teach you how to create data science applications with Java. First, we review the most important considerations when starting a data science application, and then brush up on the basics of Java and machine learning before diving into more advanced topics. We start by going over the existing libraries for data processing and the libraries that provide machine learning algorithms. After that, we cover topics such as classification and regression, dimensionality reduction and clustering, information retrieval and natural language processing, and deep learning and big data.

Finally, we finish the book by discussing ways to deploy a model and evaluate it in production settings.

Style and approach

This is a practical guide in which all the important concepts, such as classification, regression, and dimensionality reduction, are explained with the help of examples.

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code file.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Downloading the color images of this book
      3. Errata
      4. Piracy
      5. Questions
  2. Data Science Using Java
    1. Data science
      1. Machine learning
        1. Supervised learning
        2. Unsupervised learning
          1. Clustering
          2. Dimensionality reduction
        3. Natural Language Processing
    2. Data science process models
      1. CRISP-DM
      2. A running example
    3. Data science in Java
      1. Data science libraries
        1. Data processing libraries
        2. Math and stats libraries
        3. Machine learning and data mining libraries
        4. Text processing
    4. Summary
  3. Data Processing Toolbox
    1. Standard Java library
      1. Collections
      2. Input/Output
        1. Reading input data
        2. Writing output data
      3. Streaming API
    2. Extensions to the standard library
      1. Apache Commons
        1. Commons Lang
        2. Commons IO
        3. Commons Collections
        4. Other commons modules
      2. Google Guava
      3. AOL Cyclops React
    3. Accessing data
      1. Text data and CSV
      2. Web and HTML
      3. JSON
      4. Databases
      5. DataFrames
    4. Search engine - preparing data
    5. Summary
  4. Exploratory Data Analysis
    1. Exploratory data analysis in Java
      1. Search engine datasets
      2. Apache Commons Math
      3. Joinery
    2. Interactive Exploratory Data Analysis in Java
      1. JVM languages
        1. Interactive Java
      2. Joinery shell
    3. Summary
  5. Supervised Learning - Classification and Regression
    1. Classification
      1. Binary classification models
        1. Smile
        2. JSAT
        3. LIBSVM and LIBLINEAR
        4. Encog
      2. Evaluation
        1. Accuracy
        2. Precision, recall, and F1
        3. ROC and AU ROC (AUC)
        4. Result validation
        5. K-fold cross-validation
        6. Training, validation, and testing
    2. Case study - page prediction
    3. Regression
      1. Machine learning libraries for regression
        1. Smile
        2. JSAT
        3. Other libraries
      2. Evaluation
        1. MSE
        2. MAE
    4. Case study - hardware performance
    5. Summary
  6. Unsupervised Learning - Clustering and Dimensionality Reduction
    1. Dimensionality reduction
      1. Unsupervised dimensionality reduction
      2. Principal Component Analysis
      3. Truncated SVD
      4. Truncated SVD for categorical and sparse data
        1. Random projection
    2. Cluster analysis
      1. Hierarchical methods
      2. K-means
        1. Choosing K in K-means
        2. DBSCAN
      3. Clustering for supervised learning
        1. Clusters as features
        2. Clustering as dimensionality reduction
        3. Supervised learning via clustering
      4. Evaluation
        1. Manual evaluation
        2. Supervised evaluation
        3. Unsupervised evaluation
    3. Summary
  7. Working with Text - Natural Language Processing and Information Retrieval
    1. Natural Language Processing and information retrieval
      1. Vector Space Model - Bag of Words and TF-IDF
        1. Vector space model implementation
      2. Indexing and Apache Lucene
      3. Natural Language Processing tools
        1. Stanford CoreNLP
      4. Customizing Apache Lucene
    2. Machine learning for texts
      1. Unsupervised learning for texts
        1. Latent Semantic Analysis
        2. Text clustering
        3. Word embeddings
      2. Supervised learning for texts
      3. Text classification
      4. Learning to rank for information retrieval
        1. Reranking with Lucene
    3. Summary
  8. Extreme Gradient Boosting
    1. Gradient Boosting Machines and XGBoost
      1. Installing XGBoost
    2. XGBoost in practice
      1. XGBoost for classification
        1. Parameter tuning
        2. Text features
        3. Feature importance
      2. XGBoost for regression
      3. XGBoost for learning to rank
    3. Summary
  9. Deep Learning with DeepLearning4J
    1. Neural Networks and DeepLearning4J
      1. ND4J - N-dimensional arrays for Java
      2. Neural networks in DeepLearning4J
      3. Convolutional Neural Networks
    2. Deep learning for cats versus dogs
      1. Reading the data
      2. Creating the model
      3. Monitoring the performance
      4. Data augmentation
      5. Running DeepLearning4J on GPU
    3. Summary
  10. Scaling Data Science
    1. Apache Hadoop
      1. Hadoop MapReduce
      2. Common Crawl
    2. Apache Spark
    3. Link prediction
      1. Reading the DBLP graph
      2. Extracting features from the graph
      3. Node features
      4. Negative sampling
      5. Edge features
      6. Link Prediction with MLlib and XGBoost
      7. Link suggestion
    4. Summary
  11. Deploying Data Science Models
    1. Microservices
      1. Spring Boot
      2. Search engine service
    2. Online evaluation
      1. A/B testing
      2. Multi-armed bandits
    3. Summary