Java: Data Science Made Easy

Book Description

Data collection, processing, analysis, and more

About This Book

  • Your entry ticket to the world of data science with the stability and power of Java
  • Explore, analyse, and visualize your data effectively using easy-to-follow examples
  • A highly practical course covering a broad set of topics, from the basics of Machine Learning to Deep Learning and Big Data frameworks

Who This Book Is For

This course is meant for Java developers who are comfortable developing applications in Java and now want to enter the world of data science or build intelligent applications. Aspiring data scientists with some understanding of the Java programming language will also find this book helpful. If you want to build efficient data science applications and bring them into an enterprise environment without changing your existing Java stack, this book is for you!

What You Will Learn

  • Understand the key concepts of data science
  • Explore the data science ecosystem available in Java
  • Work with the Java APIs and techniques used to perform efficient data analysis
  • Find out how to approach different machine learning problems with Java
  • Process unstructured information such as natural language text or images, and create your own search
  • Learn how to build deep neural networks with DeepLearning4j
  • Build data science applications that scale and process large amounts of data
  • Deploy data science models to production and evaluate their performance

In Detail

Data science is concerned with extracting knowledge and insights from a wide variety of data sources to analyse patterns or predict future behaviour. It draws from a wide array of disciplines including statistics, computer science, mathematics, machine learning, and data mining. In this course, we cover basic as well as advanced data science concepts and how they are implemented using popular Java tools and libraries.

The course starts with an introduction to data science, followed by the basic data science tasks of data collection, data cleaning, data analysis, and data visualization. This is followed by a discussion of statistical techniques and more advanced topics including machine learning, neural networks, and deep learning. You will examine the major categories of data analysis including text, visual, and audio data, followed by a discussion of resources that support parallel implementation. Throughout this course, each chapter illustrates a challenging data science problem and then presents a comprehensive, Java-based solution to tackle that problem. You will cover a wide range of topics, from classification and regression to dimensionality reduction and clustering, deep learning, and working with Big Data. Finally, you will see the different ways to deploy the model and evaluate it in production settings.

By the end of this course, you will be up and running with various facets of data science using Java.

This course contains premium content from two of our recently published popular titles:

  • Java for Data Science
  • Mastering Java for Data Science

Style and approach

This course follows a tutorial approach, providing examples of each of the concepts covered. With a step-by-step instructional style, this book covers various facets of data science and will get you up and running quickly.

Table of Contents

  1. Preface
    1. What this learning path covers
    2. What you need for this learning path
    3. Who this learning path is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Downloading the color images of this book
      3. Errata
      4. Piracy
      5. Questions
  2. Module 1
  3. Getting Started with Data Science
    1. Problems solved using data science
    2. Understanding the data science problem-solving approach
      1. Using Java to support data science
    3. Acquiring data for an application
    4. The importance and process of cleaning data
    5. Visualizing data to enhance understanding
    6. The use of statistical methods in data science
    7. Machine learning applied to data science
    8. Using neural networks in data science
    9. Deep learning approaches
    10. Performing text analysis
    11. Visual and audio analysis
    12. Improving application performance using parallel techniques
    13. Assembling the pieces
    14. Summary
  4. Data Acquisition
    1. Understanding the data formats used in data science applications
      1. Overview of CSV data
      2. Overview of spreadsheets
      3. Overview of databases
      4. Overview of PDF files
      5. Overview of JSON
      6. Overview of XML
      7. Overview of streaming data
      8. Overview of audio/video/images in Java
    2. Data acquisition techniques
      1. Using the HttpURLConnection class
      2. Web crawlers in Java
        1. Creating your own web crawler
        2. Using the crawler4j web crawler
      3. Web scraping in Java
      4. Using API calls to access common social media sites
        1. Using OAuth to authenticate users
        2. Handling Twitter
        3. Handling Wikipedia
        4. Handling Flickr
        5. Handling YouTube
          1. Searching by keyword
    3. Summary
  5. Data Cleaning
    1. Handling data formats
      1. Handling CSV data
      2. Handling spreadsheets
        1. Handling Excel spreadsheets
      3. Handling PDF files
      4. Handling JSON
        1. Using JSON streaming API
        2. Using the JSON tree API
    2. The nitty gritty of cleaning text
      1. Using Java tokenizers to extract words
        1. Java core tokenizers
        2. Third-party tokenizers and libraries
      2. Transforming data into a usable form
        1. Simple text cleaning
        2. Removing stop words
      3. Finding words in text
        1. Finding and replacing text
      4. Data imputation
      5. Subsetting data
      6. Sorting text
      7. Data validation
        1. Validating data types
        2. Validating dates
        3. Validating e-mail addresses
        4. Validating ZIP codes
        5. Validating names
    3. Cleaning images
      1. Changing the contrast of an image
      2. Smoothing an image
      3. Brightening an image
      4. Resizing an image
      5. Converting images to different formats
    4. Summary
  6. Data Visualization
    1. Understanding plots and graphs
      1. Visual analysis goals
    2. Creating index charts
    3. Creating bar charts
      1. Using country as the category
      2. Using decade as the category
    4. Creating stacked graphs
    5. Creating pie charts
    6. Creating scatter charts
    7. Creating histograms
    8. Creating donut charts
    9. Creating bubble charts
    10. Summary
  7. Statistical Data Analysis Techniques
    1. Working with mean, mode, and median
      1. Calculating the mean
        1. Using simple Java techniques to find mean
        2. Using Java 8 techniques to find mean
        3. Using Google Guava to find mean
        4. Using Apache Commons to find mean
      2. Calculating the median
        1. Using simple Java techniques to find median
        2. Using Apache Commons to find the median
      3. Calculating the mode
        1. Using ArrayLists to find multiple modes
        2. Using a HashMap to find multiple modes
        3. Using Apache Commons to find multiple modes
    2. Standard deviation
    3. Sample size determination
    4. Hypothesis testing
    5. Regression analysis
      1. Using simple linear regression
      2. Using multiple regression
    6. Summary
  8. Machine Learning
    1. Supervised learning techniques
      1. Decision trees
        1. Decision tree types
        2. Decision tree libraries
        3. Using a decision tree with a book dataset
        4. Testing the book decision tree
      2. Support vector machines
        1. Using an SVM for camping data
        2. Testing individual instances
      3. Bayesian networks
        1. Using a Bayesian network
    2. Unsupervised machine learning
      1. Association rule learning
        1. Using association rule learning to find buying relationships
    3. Reinforcement learning
    4. Summary
  9. Neural Networks
    1. Training a neural network
      1. Getting started with neural network architectures
    2. Understanding static neural networks
      1. A basic Java example
    3. Understanding dynamic neural networks
      1. Multilayer perceptron networks
        1. Building the model
        2. Evaluating the model
        3. Predicting other values
        4. Saving and retrieving the model
      2. Learning vector quantization
      3. Self-Organizing Maps
        1. Using a SOM
        2. Displaying the SOM results
    4. Additional network architectures and algorithms
      1. The k-Nearest Neighbors algorithm
      2. Instantaneously trained networks
      3. Spiking neural networks
      4. Cascading neural networks
      5. Holographic associative memory
      6. Backpropagation and neural networks
    5. Summary
  10. Deep Learning
    1. Deeplearning4j architecture
      1. Acquiring and manipulating data
        1. Reading in a CSV file
      2. Configuring and building a model
        1. Using hyperparameters in ND4J
        2. Instantiating the network model
      3. Training a model
      4. Testing a model
    2. Deep learning and regression analysis
      1. Preparing the data
      2. Setting up the class
      3. Reading and preparing the data
      4. Building the model
      5. Evaluating the model
    3. Restricted Boltzmann Machines
      1. Reconstruction in an RBM
      2. Configuring an RBM
    4. Deep autoencoders
      1. Building an autoencoder in DL4J
        1. Configuring the network
        2. Building and training the network
        3. Saving and retrieving a network
        4. Specialized autoencoders
    5. Convolutional networks
      1. Building the model
      2. Evaluating the model
    6. Recurrent Neural Networks
    7. Summary
  11. Text Analysis
    1. Implementing named entity recognition
      1. Using OpenNLP to perform NER
      2. Identifying location entities
    2. Classifying text
      1. Word2Vec and Doc2Vec
      2. Classifying text by labels
      3. Classifying text by similarity
    3. Understanding tagging and POS
      1. Using OpenNLP to identify POS
      2. Understanding POS tags
    4. Extracting relationships from sentences
      1. Using OpenNLP to extract relationships
    5. Sentiment analysis
      1. Downloading and extracting the Word2Vec model
      2. Building our model and classifying text
    6. Summary
  12. Visual and Audio Analysis
    1. Text-to-speech
      1. Using FreeTTS
      2. Getting information about voices
      3. Gathering voice information
    2. Understanding speech recognition
      1. Using CMUSphinx to convert speech to text
      2. Obtaining more detail about the words
    3. Extracting text from an image
      1. Using Tess4j to extract text
    4. Identifying faces
      1. Using OpenCV to detect faces
    5. Classifying visual data
      1. Creating a Neuroph Studio project for classifying visual images
      2. Training the model
    6. Summary
  13. Mathematical and Parallel Techniques for Data Analysis
    1. Implementing basic matrix operations
      1. Using GPUs with DeepLearning4j
    2. Using map-reduce
      1. Using Apache's Hadoop to perform map-reduce
      2. Writing the map method
      3. Writing the reduce method
      4. Creating and executing a new Hadoop job
    3. Various mathematical libraries
      1. Using the jblas API
      2. Using the Apache Commons math API
      3. Using the ND4J API
    4. Using OpenCL
    5. Using Aparapi
      1. Creating an Aparapi application
      2. Using Aparapi for matrix multiplication
    6. Using Java 8 streams
      1. Understanding Java 8 lambda expressions and streams
      2. Using Java 8 to perform matrix multiplication
      3. Using Java 8 to perform map-reduce
    7. Summary
  14. Bringing It All Together
    1. Defining the purpose and scope of our application
    2. Understanding the application's architecture
    3. Data acquisition using Twitter
    4. Understanding the TweetHandler class
      1. Extracting data for a sentiment analysis model
      2. Building the sentiment model
      3. Processing the JSON input
      4. Cleaning data to improve our results
      5. Removing stop words
      6. Performing sentiment analysis
      7. Analysing the results
    5. Other optional enhancements
    6. Summary
  15. Module 2
  16. Data Science Using Java
    1. Data science
      1. Machine learning
        1. Supervised learning
        2. Unsupervised learning
          1. Clustering
          2. Dimensionality reduction
        3. Natural Language Processing
    2. Data science process models
      1. CRISP-DM
      2. A running example
    3. Data science in Java
      1. Data science libraries
        1. Data processing libraries
        2. Math and stats libraries
        3. Machine learning and data mining libraries
        4. Text processing
    4. Summary
  17. Data Processing Toolbox
    1. Standard Java library
      1. Collections
      2. Input/Output
        1. Reading input data
        2. Writing output data
      3. Streaming API
    2. Extensions to the standard library
      1. Apache Commons
        1. Commons Lang
        2. Commons IO
        3. Commons Collections
        4. Other commons modules
      2. Google Guava
      3. AOL Cyclops React
    3. Accessing data
      1. Text data and CSV
      2. Web and HTML
      3. JSON
      4. Databases
      5. DataFrames
    4. Search engine - preparing data
    5. Summary
  18. Exploratory Data Analysis
    1. Exploratory data analysis in Java
      1. Search engine datasets
      2. Apache Commons Math
      3. Joinery
    2. Interactive Exploratory Data Analysis in Java
      1. JVM languages
        1. Interactive Java
      2. Joinery shell
    3. Summary
  19. Supervised Learning - Classification and Regression
    1. Classification
      1. Binary classification models
        1. Smile
        2. JSAT
        3. LIBSVM and LIBLINEAR
        4. Encog
      2. Evaluation
        1. Accuracy
        2. Precision, recall, and F1
        3. ROC and AU ROC (AUC)
        4. Result validation
        5. K-fold cross-validation
        6. Training, validation, and testing
    2. Case study - page prediction
    3. Regression
      1. Machine learning libraries for regression
        1. Smile
        2. JSAT
        3. Other libraries
      2. Evaluation
        1. MSE
        2. MAE
    4. Case study - hardware performance
    5. Summary
  20. Unsupervised Learning - Clustering and Dimensionality Reduction
    1. Dimensionality reduction
      1. Unsupervised dimensionality reduction
      2. Principal Component Analysis
      3. Truncated SVD
      4. Truncated SVD for categorical and sparse data
        1. Random projection
    2. Cluster analysis
      1. Hierarchical methods
      2. K-means
        1. Choosing K in K-means
        2. DBSCAN
      3. Clustering for supervised learning
        1. Clusters as features
        2. Clustering as dimensionality reduction
        3. Supervised learning via clustering
      4. Evaluation
        1. Manual evaluation
        2. Supervised evaluation
        3. Unsupervised evaluation
    3. Summary
  21. Working with Text - Natural Language Processing and Information Retrieval
    1. Natural Language Processing and information retrieval
      1. Vector Space Model - Bag of Words and TF-IDF
        1. Vector space model implementation
      2. Indexing and Apache Lucene
      3. Natural Language Processing tools
        1. Stanford CoreNLP
      4. Customizing Apache Lucene
    2. Machine learning for texts
      1. Unsupervised learning for texts
        1. Latent Semantic Analysis
        2. Text clustering
        3. Word embeddings
      2. Supervised learning for texts 
      3. Text classification
      4. Learning to rank for information retrieval
        1. Reranking with Lucene
    3. Summary
  22. Extreme Gradient Boosting
    1. Gradient Boosting Machines and XGBoost
      1. Installing XGBoost
    2. XGBoost in practice
      1. XGBoost for classification
        1. Parameter tuning
        2. Text features
        3. Feature importance
      2. XGBoost for regression
      3. XGBoost for learning to rank
    3. Summary
  23. Deep Learning with DeepLearning4J
    1. Neural Networks and DeepLearning4J
      1. ND4J - N-dimensional arrays for Java
      2. Neural networks in DeepLearning4J
      3. Convolutional Neural Networks
    2. Deep learning for cats versus dogs
      1. Reading the data
      2. Creating the model
      3. Monitoring the performance
      4. Data augmentation
      5. Running DeepLearning4J on GPU
    3. Summary
  24. Scaling Data Science
    1. Apache Hadoop
      1. Hadoop MapReduce
      2. Common Crawl
    2. Apache Spark
    3. Link prediction
      1. Reading the DBLP graph
      2. Extracting features from the graph
      3. Node features
      4. Negative sampling
      5. Edge features
      6. Link Prediction with MLlib and XGBoost
      7. Link suggestion
    4. Summary
  25. Deploying Data Science Models
    1. Microservices
      1. Spring Boot
      2. Search engine service
    2. Online evaluation
      1. A/B testing
      2. Multi-armed bandits
    3. Summary
  26. Bibliography