Big Data Analytics with Java

by RAJAT MEHTA

July 2017

Beginner to intermediate

418 pages

9h 46m

English

Packt Publishing

Content preview from Big Data Analytics with Java

Data exploration

In this section, we will explore the dataset and run some simple but useful analytics on it.

First, we will create the boilerplate code for Spark configuration and the Spark session:

SparkConf conf = ...
SparkSession session = ...

Next, we will load the dataset and find the number of rows in it:

Dataset<Row> rawData = session.read().csv("data/retail/Online_Retail.csv");
System.out.println("Number of rows --> " + rawData.count());

This will print the number of rows in the dataset as:

Number of rows --> 541909
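As a quick sanity check, the row count can also be cross-checked outside Spark. The following is a minimal plain-Java sketch, not the book's code: the class name CsvRowCount and its methods are hypothetical, the ISO-8859-1 charset is an assumption about how the file is encoded, and whether the first line is a header is left as a parameter because the excerpt does not say. It counts physical lines, so it will overcount if any quoted field contains a line break.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper for cross-checking a CSV row count without Spark.
class CsvRowCount {

    // Counts non-blank lines, optionally skipping the first line as a header.
    static long countDataRows(Stream<String> lines, boolean hasHeader) {
        return lines
                .skip(hasHeader ? 1 : 0)
                .filter(line -> !line.trim().isEmpty())
                .count();
    }

    // Streams the file line by line so it is never fully loaded into memory.
    // ISO-8859-1 is an assumption; it decodes any byte sequence without errors.
    static long countDataRows(Path csv, boolean hasHeader) {
        try (Stream<String> lines = Files.lines(csv, StandardCharsets.ISO_8859_1)) {
            return countDataRows(lines, hasHeader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java CsvRowCount <path-to-csv> [hasHeader]");
            return;
        }
        boolean hasHeader = args.length > 1 && Boolean.parseBoolean(args[1]);
        long rows = countDataRows(Paths.get(args[0]), hasHeader);
        System.out.println("Number of rows --> " + rows);
    }
}
```

If the two counts differ by exactly one, the likely cause is a header line that one side counted as data.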

As you can see, this dataset is not small, but it is not big data either; big data can run into terabytes. Now that we know the number of rows, let's look at the first few of them.