Java for Data Science

Book Description

Examine the techniques and Java tools supporting the growing field of data science

About This Book

  • Your entry ticket to the world of data science with the stability and power of Java
  • Explore, analyse, and visualize your data effectively using easy-to-follow examples
  • Make your Java applications more capable using machine learning

Who This Book Is For

This book is for Java developers who are comfortable building applications in Java and who now want to enter the world of data science or build intelligent applications. Aspiring data scientists will also find it helpful.

What You Will Learn

  • Understand the nature and key concepts used in the field of data science
  • Grasp how data is collected, cleaned, and processed
  • Become comfortable with key data analysis techniques
  • See specialized analysis techniques centered on machine learning
  • Master the effective visualization of your data
  • Work with the Java APIs and techniques used to perform data analysis

In Detail

Data science is concerned with extracting knowledge and insights from a wide variety of data sources to analyse patterns or predict future behaviour. It draws from a wide array of disciplines including statistics, computer science, mathematics, machine learning, and data mining. In this book, we cover the important data science concepts and how they are supported by Java, as well as the often statistically challenging techniques, to provide you with an understanding of their purpose and application.

The book starts with an introduction to data science, followed by the basic data science tasks of data collection, data cleaning, data analysis, and data visualization. This is followed by a discussion of statistical techniques and more advanced topics including machine learning, neural networks, and deep learning. The next section examines the major categories of data analysis including text, visual, and audio data, followed by a discussion of resources that support parallel implementation.

The final chapter illustrates an in-depth data science problem and provides a comprehensive, Java-based solution. Due to the nature of the topic, simple examples of techniques are presented early, followed by a more detailed treatment later in the book. This permits a more natural introduction to the techniques and concepts presented in the book.

Style and approach

This book follows a tutorial approach, providing examples of each of the major concepts covered.

With a step-by-step instructional style, this book covers various facets of data science and will get you up and running quickly.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Java for Data Science
    1. Java for Data Science
    2. Credits
    3. About the Authors
    4. About the Reviewers
    5. www.PacktPub.com
      1. eBooks, discount offers, and more
        1. Why subscribe?
    6. Customer Feedback
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for 
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Getting Started with Data Science
      1. Problems solved using data science
      2. Understanding the data science problem-solving approach
        1. Using Java to support data science
      3. Acquiring data for an application
      4. The importance and process of cleaning data
      5. Visualizing data to enhance understanding
      6. The use of statistical methods in data science
      7. Machine learning applied to data science
      8. Using neural networks in data science
      9. Deep learning approaches
      10. Performing text analysis
      11. Visual and audio analysis
      12. Improving application performance using parallel techniques
      13. Assembling the pieces
      14. Summary
    9. 2. Data Acquisition
      1. Understanding the data formats used in data science applications
        1. Overview of CSV data
        2. Overview of spreadsheets
        3. Overview of databases
        4. Overview of PDF files
        5. Overview of JSON
        6. Overview of XML
        7. Overview of streaming data
        8. Overview of audio/video/images in Java
      2. Data acquisition techniques
        1. Using the HttpURLConnection class
        2. Web crawlers in Java
          1. Creating your own web crawler
          2. Using the crawler4j web crawler
        3. Web scraping in Java
        4. Using API calls to access common social media sites
          1. Using OAuth to authenticate users
          2. Handling Twitter
          3. Handling Wikipedia
          4. Handling Flickr
          5. Handling YouTube
            1. Searching by keyword
      3. Summary
    10. 3. Data Cleaning
      1. Handling data formats
        1. Handling CSV data
        2. Handling spreadsheets
          1. Handling Excel spreadsheets
        3. Handling PDF files
        4. Handling JSON
          1. Using JSON streaming API
          2. Using the JSON tree API
      2. The nitty-gritty of cleaning text
        1. Using Java tokenizers to extract words
          1. Java core tokenizers
          2. Third-party tokenizers and libraries
        2. Transforming data into a usable form
          1. Simple text cleaning
          2. Removing stop words
        3. Finding words in text
          1. Finding and replacing text
        4. Data imputation
        5. Subsetting data
        6. Sorting text
        7. Data validation
          1. Validating data types
          2. Validating dates
          3. Validating e-mail addresses
          4. Validating ZIP codes
          5. Validating names
      3. Cleaning images
        1. Changing the contrast of an image
        2. Smoothing an image
        3. Brightening an image
        4. Resizing an image
        5. Converting images to different formats
      4. Summary
    11. 4. Data Visualization
      1. Understanding plots and graphs
        1. Visual analysis goals
      2. Creating index charts
      3. Creating bar charts
        1. Using country as the category
        2. Using decade as the category
      4. Creating stacked graphs
      5. Creating pie charts
      6. Creating scatter charts
      7. Creating histograms
      8. Creating donut charts
      9. Creating bubble charts
      10. Summary
    12. 5. Statistical Data Analysis Techniques
      1. Working with mean, mode, and median
        1. Calculating the mean
          1. Using simple Java techniques to find mean
          2. Using Java 8 techniques to find mean
          3. Using Google Guava to find mean
          4. Using Apache Commons to find mean
        2. Calculating the median
          1. Using simple Java techniques to find median
          2. Using Apache Commons to find the median
        3. Calculating the mode
          1. Using ArrayLists to find multiple modes
          2. Using a HashMap to find multiple modes
          3. Using Apache Commons to find multiple modes
      2. Standard deviation
      3. Sample size determination
      4. Hypothesis testing
      5. Regression analysis
        1. Using simple linear regression
        2. Using multiple regression
      6. Summary
    13. 6. Machine Learning
      1. Supervised learning techniques
        1. Decision trees
          1. Decision tree types
          2. Decision tree libraries
          3. Using a decision tree with a book dataset
          4. Testing the book decision tree
        2. Support vector machines
          1. Using an SVM for camping data
          2. Testing individual instances
        3. Bayesian networks
          1. Using a Bayesian network
      2. Unsupervised machine learning
        1. Association rule learning
          1. Using association rule learning to find buying relationships
      3. Reinforcement learning
      4. Summary
    14. 7. Neural Networks
      1. Training a neural network
        1. Getting started with neural network architectures
      2. Understanding static neural networks
        1. A basic Java example
      3. Understanding dynamic neural networks
        1. Multilayer perceptron networks
          1. Building the model
          2. Evaluating the model
          3. Predicting other values
          4. Saving and retrieving the model
        2. Learning vector quantization
        3. Self-Organizing Maps
          1. Using a SOM
          2. Displaying the SOM results
      4. Additional network architectures and algorithms
        1. The k-Nearest Neighbors algorithm
        2. Instantaneously trained networks
        3. Spiking neural networks
        4. Cascading neural networks
        5. Holographic associative memory
        6. Backpropagation and neural networks
      5. Summary
    15. 8. Deep Learning
      1. Deeplearning4j architecture
        1. Acquiring and manipulating data
          1. Reading in a CSV file
        2. Configuring and building a model
          1. Using hyperparameters in ND4J
          2. Instantiating the network model
        3. Training a model
        4. Testing a model
      2. Deep learning and regression analysis
        1. Preparing the data
        2. Setting up the class
        3. Reading and preparing the data
        4. Building the model
        5. Evaluating the model
      3. Restricted Boltzmann Machines
        1. Reconstruction in an RBM
        2. Configuring an RBM
      4. Deep autoencoders
        1. Building an autoencoder in DL4J
          1. Configuring the network
          2. Building and training the network
          3. Saving and retrieving a network
          4. Specialized autoencoders
      5. Convolutional networks
        1. Building the model
        2. Evaluating the model
      6. Recurrent Neural Networks
      7. Summary
    16. 9. Text Analysis
      1. Implementing named entity recognition
        1. Using OpenNLP to perform NER
        2. Identifying location entities
      2. Classifying text
        1. Word2Vec and Doc2Vec
        2. Classifying text by labels
        3. Classifying text by similarity
      3. Understanding tagging and POS
        1. Using OpenNLP to identify POS
        2. Understanding POS tags
      4. Extracting relationships from sentences
        1. Using OpenNLP to extract relationships
      5. Sentiment analysis
        1. Downloading and extracting the Word2Vec model
        2. Building our model and classifying text
      6. Summary
    17. 10. Visual and Audio Analysis
      1. Text-to-speech
        1. Using FreeTTS
        2. Getting information about voices
        3. Gathering voice information
      2. Understanding speech recognition
        1. Using CMUSphinx to convert speech to text
        2. Obtaining more detail about the words
      3. Extracting text from an image
        1. Using Tess4j to extract text
      4. Identifying faces
        1. Using OpenCV to detect faces
      5. Classifying visual data
        1. Creating a Neuroph Studio project for classifying visual images
        2. Training the model
      6. Summary
    18. 11. Mathematical and Parallel Techniques for Data Analysis
      1. Implementing basic matrix operations
        1. Using GPUs with Deeplearning4j
      2. Using map-reduce
        1. Using Apache's Hadoop to perform map-reduce
        2. Writing the map method
        3. Writing the reduce method
        4. Creating and executing a new Hadoop job
      3. Various mathematical libraries
        1. Using the jblas API
        2. Using the Apache Commons Math API
        3. Using the ND4J API
      4. Using OpenCL
      5. Using Aparapi
        1. Creating an Aparapi application
        2. Using Aparapi for matrix multiplication
      6. Using Java 8 streams
        1. Understanding Java 8 lambda expressions and streams
        2. Using Java 8 to perform matrix multiplication
        3. Using Java 8 to perform map-reduce
      7. Summary
    19. 12. Bringing It All Together
      1. Defining the purpose and scope of our application
      2. Understanding the application's architecture
      3. Data acquisition using Twitter
      4. Understanding the TweetHandler class
        1. Extracting data for a sentiment analysis model
        2. Building the sentiment model
        3. Processing the JSON input
        4. Cleaning data to improve our results
        5. Removing stop words
        6. Performing sentiment analysis
        7. Analysing the results
      5. Other optional enhancements
      6. Summary