Big Data Analytics with Java

Book description

Learn the basics of analytics on big data using Java, machine learning and other big data tools

About This Book

  • Acquire a real-world set of tools for building enterprise-level data science applications
  • Move past the barrier of other languages in data science and learn to create useful object-oriented code
  • Make extensive use of Java-compatible big data tools such as Apache Spark, Hadoop, and others

Who This Book Is For

This book is for Java developers who want to perform data analysis in a production environment. Those who wish to implement data analysis in their big data applications will also find this book helpful.

What You Will Learn

  • Start from simple analytic tasks on big data
  • Get into more complex tasks with predictive analytics on big data using machine learning
  • Learn real-time analytic tasks
  • Understand the concepts with examples and case studies
  • Prepare and refine data for analysis
  • Create charts in order to understand the data
  • See various real-world datasets

In Detail

This book covers case studies such as sentiment analysis on a tweet dataset, recommendations on a MovieLens dataset, customer segmentation on an e-commerce dataset, and graph analysis on an actual flights dataset.

This book is an end-to-end guide to implementing analytics on big data with Java. Java is the de facto language for major big data environments, including Hadoop. This book will teach you how to perform analytics on big data with production-friendly Java. The book is divided into two parts. The first part is an introduction that will help readers get acquainted with big data environments, whereas the second part contains an in-depth discussion of the concepts in analytics on big data. It takes you from data analysis and data visualization to the core concepts and advantages of machine learning, real-life usage of regression and classification using Naïve Bayes, a deep discussion of the concepts of clustering, and a review of simple neural networks on big data using Deeplearning4j or plain Java Spark code. It is a must-have for Java developers who want to start learning big data analytics and use it in the real world.

Style and approach

The approach of the book is to deliver practical learning modules in manageable portions. Each chapter is a self-contained unit covering one concept in big data analytics. The book builds competency in big data analytics step by step. Examples use real-world case studies to illustrate real applications and show how to apply the techniques covered. The examples and case studies are presented using both theory and code.

Table of contents

  1. Big Data Analytics with Java
    1. Table of Contents
    2. Big Data Analytics with Java
    3. Credits
    4. About the Author
    5. About the Reviewers
      1. eBooks, discount offers, and more
        1. Why subscribe?
    7. Customer Feedback
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    9. 1. Big Data Analytics with Java
      1. Why data analytics on big data?
        1. Big data for analytics
          1. Big data – a bigger pay package for Java developers
          2. Basics of Hadoop – a Java sub-project
        2. Distributed computing on Hadoop
        3. HDFS concepts
          1. Design and architecture of HDFS
          2. Main components of HDFS
          3. HDFS simple commands
        4. Apache Spark
          1. Concepts
          2. Transformations
          3. Actions
          4. Spark Java API
          5. Spark samples using Java 8
          6. Loading data
          7. Data operations – cleansing and munging
          8. Analyzing data – count, projection, grouping, aggregation, and max/min
          9. Actions on RDDs
          10. Paired RDDs
            1. Transformations on paired RDDs
          11. Saving data
          12. Collecting and printing results
          13. Executing Spark programs on Hadoop
          14. Apache Spark sub-projects
          15. Spark machine learning modules
            1. MLlib Java API
            2. Other machine learning libraries
          16. Mahout – a popular Java ML library
          17. Deeplearning4j – a deep learning library
            1. Compressing data
            2. Avro and Parquet
      2. Summary
    10. 2. First Steps in Data Analysis
      1. Datasets
      2. Data cleaning and munging
      3. Basic analysis of data with Spark SQL
        1. Building SparkConf and context
        2. Dataframe and datasets
        3. Load and parse data
          1. Analyzing data – the Spark-SQL way
          2. Spark SQL for data exploration and analytics
          3. Market basket analysis – Apriori algorithm
            1. Full Apriori algorithm
      4. Implementation of the Apriori algorithm in Apache Spark
        1. Efficient market basket analysis using FP-Growth algorithm
          1. Running FP-Growth on Apache Spark
      5. Summary
    11. 3. Data Visualization
      1. Data visualization with Java JFreeChart
        1. Using charts in big data analytics
      2. Time Series chart
        1. All India seasonal and annual average temperature series dataset
        2. Simple single Time Series chart
        3. Multiple Time Series on a single chart window
      3. Bar charts
      4. Histograms
        1. When would you use a histogram?
        2. How to make histograms using JFreeChart?
      5. Line charts
      6. Scatter plots
      7. Box plots
      8. Advanced visualization technique
        1. Prefuse
        2. IVTK Graph toolkit
          1. Other libraries
      9. Summary
    12. 4. Basics of Machine Learning
      1. What is machine learning?
        1. Real-life examples of machine learning
        2. Types of machine learning
          1. A small sample case study of supervised and unsupervised learning
        3. Steps for machine learning problems
        4. Choosing the machine learning model
          1. What are the feature types that can be extracted from the datasets?
          2. How do you select the best features to train your models?
          3. How do you run machine learning analytics on big data?
          4. Getting and preparing data in Hadoop
            1. Preparing the data
            2. Formatting the data
            3. Storing the data
          5. Training and storing models on big data
          6. Apache Spark machine learning API
            1. The new Spark ML API
      2. Summary
    13. 5. Regression on Big Data
      1. Linear regression
        1. What is simple linear regression?
          1. Where is linear regression used?
          2. Predicting house prices using linear regression
            1. Dataset
              1. Data cleaning and munging
              2. Exploring the dataset
              3. Running and testing the linear regression model
      2. Logistic regression
        1. Which mathematical functions does logistic regression use?
          1. Where is logistic regression used?
          2. Predicting heart disease using logistic regression
            1. Dataset
              1. Data cleaning and munging
              2. Data exploration
              3. Running and testing the logistic regression model
      3. Summary
    14. 6. Naive Bayes and Sentiment Analysis
      1. Conditional probability
      2. Bayes theorem
      3. Naive Bayes algorithm
        1. Advantages of Naive Bayes
        2. Disadvantages of Naive Bayes
      4. Sentiment analysis
        1. Concepts for sentiment analysis
          1. Tokenization
          2. Stop words removal
          3. Stemming
          4. N-grams
          5. Term presence and Term Frequency
          6. TF-IDF
          7. Bag of words
          8. Dataset
          9. Data exploration of text data
        2. Sentiment analysis on this dataset
      5. SVM or Support Vector Machine
      6. Summary
    15. 7. Decision Trees
      1. What is a decision tree?
        1. Building a decision tree
          1. Choosing the best features for splitting the datasets
            1. Advantages of using decision trees
            2. Disadvantages of using decision trees
          2. Dataset
          3. Data exploration
          4. Cleaning and munging the data
          5. Training and testing the model
      2. Summary
    16. 8. Ensembling on Big Data
      1. Ensembling
        1. Types of ensembling
          1. Bagging
          2. Boosting
          3. Advantages and disadvantages of ensembling
        2. Random forests
        3. Gradient boosted trees (GBTs)
          1. Classification problem and dataset used
          2. Data exploration
          3. Training and testing our random forest model
          4. Training and testing our gradient boosted tree model
      2. Summary
    17. 9. Recommendation Systems
      1. Recommendation systems and their types
      2. Content-based recommendation systems
        1. Dataset
        2. Content-based recommender on MovieLens dataset
        3. Collaborative recommendation systems
          1. Advantages
          2. Disadvantages
          3. Alternating least squares – collaborative filtering
      3. Summary
    18. 10. Clustering and Customer Segmentation on Big Data
      1. Clustering
        1. Types of clustering
          1. Hierarchical clustering
          2. K-means clustering
          3. Bisecting k-means clustering
      2. Customer segmentation
      3. Dataset
      4. Data exploration
      5. Clustering for customer segmentation
        1. Changing the clustering algorithm
      6. Summary
    19. 11. Massive Graphs on Big Data
      1. Refresher on graphs
        1. Representing graphs
          1. Common terminology on graphs
          2. Common algorithms on graphs
          3. Plotting graphs
      2. Massive graphs on big data
        1. Graph analytics
          1. GraphFrames
          2. Building a graph using GraphFrames
        2. Graph analytics on airports and their flights
          1. Datasets
          2. Graph analytics on flights data
      3. Summary
    20. 12. Real-Time Analytics on Big Data
      1. Real-time analytics
        1. Big data stack for real-time analytics
        2. Real-time SQL queries on big data
        3. Real-time data ingestion and storage
        4. Real-time data processing
        5. Real-time SQL queries using Impala
          1. Flight delay analysis using Impala
          2. Apache Kafka
          3. Spark Streaming
            1. Typical uses of Spark Streaming
            2. Base project setup
          4. Trending videos
            1. Sentiment analysis in real time
      2. Summary
    21. 13. Deep Learning Using Big Data
      1. Introduction to neural networks
      2. Perceptron
        1. Problems with perceptrons
        2. Sigmoid neuron
        3. Multi-layer perceptrons
          1. Accuracy of multi-layer perceptrons
      3. Deep learning
        1. Advantages and use cases of deep learning
      4. Flower species classification using multi-layer perceptrons
      5. Deeplearning4j
      6. Handwritten digit recognition using CNN
        1. Diving into the code
          1. More information on deep learning
      7. Summary
    22. Index

Product information

  • Title: Big Data Analytics with Java
  • Author(s): Rajat Mehta
  • Release date: July 2017
  • Publisher(s): Packt Publishing
  • ISBN: 9781787288980