Big Data Analytics with Java

by RAJAT MEHTA

July 2017

Beginner to intermediate

418 pages

9h 46m

English

Packt Publishing

eBooks, discount offers, and more
Why subscribe?
What this book covers

Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Why data analytics on big data?
Big data for analytics
Big data – a bigger pay package for Java developers
Basics of Hadoop – a Java sub-project
Distributed computing on Hadoop
HDFS concepts
Design and architecture of HDFS
Main components of HDFS
HDFS simple commands
Apache Spark
Concepts
Transformations
Actions
Spark Java API
Spark samples using Java 8
Loading data
Data operations – cleansing and munging
Analyzing data – count, projection, grouping, aggregation, and max/min
Actions on RDDs
Paired RDDs
Transformations on paired RDDs
Saving data
Collecting and printing results
Executing Spark programs on Hadoop
Apache Spark sub-projects
Spark machine learning modules
MLlib Java API
Other machine learning libraries
Mahout – a popular Java ML library
Deeplearning4j – a deep learning library
Compressing data
Avro and Parquet
Datasets
Building SparkConf and context
Dataframe and datasets
Load and parse data
Analyzing data – the Spark-SQL way
Spark SQL for data exploration and analytics
Market basket analysis – Apriori algorithm
Full Apriori algorithm
Efficient market basket analysis using FP-Growth algorithm
Running FP-Growth on Apache Spark
Data visualization with Java JFreeChart
Using charts in big data analytics
All India seasonal and annual average temperature series dataset
Simple single Time Series chart
Multiple Time Series on a single chart window
When would you use a histogram?
How to make histograms using JFreeChart?
Prefuse
IVTK Graph toolkit
Other libraries
What is machine learning?
Real-life examples of machine learning
Type of machine learning
A small sample case study of supervised and unsupervised learning
Steps for machine learning problems
Choosing the machine learning model
What are the feature types that can be extracted from the datasets?
How do you select the best features to train your models?
How do you run machine learning analytics on big data?
Getting and preparing data in Hadoop
Preparing the data
Formatting the data
Storing the data
Training and storing models on big data
Apache Spark machine learning API
The new Spark ML API
Linear regression
What is simple linear regression?
Where is linear regression used?
Predicting house prices using linear regression
Dataset
Data cleaning and munging
Exploring the dataset
Running and testing the linear regression model
Which mathematical functions does logistic regression use?
Where is logistic regression used?
Predicting heart disease using logistic regression
Dataset
Data cleaning and munging
Data exploration
Running and testing the logistic regression model
Conditional probability
Advantages of Naive Bayes
Disadvantages of Naive Bayes
Concepts for sentimental analysis
Tokenization
Stop words removal
Stemming
N-grams
Term presence and Term Frequency
TF-IDF
Bag of words
Dataset
Data exploration of text data
Sentimental analysis on this dataset
What is a decision tree?
Building a decision tree
Choosing the best features for splitting the datasets
Advantages of using decision trees
Disadvantages of using decision trees
Dataset
Data exploration
Cleaning and munging the data
Training and testing the model
Ensembling
Types of ensembling
Bagging
Boosting
Advantages and disadvantages of ensembling
Random forests
Gradient boosted trees (GBTs)
Classification problem and dataset used
Data exploration
Training and testing our random forest model
Training and testing our gradient boosted tree model
Recommendation systems and their types
Dataset
Content-based recommender on MovieLens dataset
Collaborative recommendation systems
Advantages
Disadvantages
Alternating least square – collaborative filtering
Clustering
Types of clustering
Hierarchical clustering
K-means clustering
Bisecting k-means clustering
Changing the clustering algorithm
Refresher on graphs
Representing graphs
Common terminology on graphs
Common algorithms on graphs
Plotting graphs
Graph analytics
GraphFrames
Building a graph using GraphFrames
Graph analytics on airports and their flights
Datasets
Graph analytics on flights data
Real-time analytics
Big data stack for real-time analytics
Real-time SQL queries on big data
Real-time data ingestion and storage
Real-time data processing
Real-time SQL queries using Impala
Flight delay analysis using Impala
Apache Kafka
Spark Streaming
Typical uses of Spark Streaming
Base project setup
Trending videos
Sentiment analysis in real time
Introduction to neural networks
Problems with perceptrons
Sigmoid neuron
Multi-layer perceptrons
Accuracy of multi-layer perceptrons
Advantages and use cases of deep learning
Diving into the code:
More information on deep learning

Content preview from Big Data Analytics with Java

Implementation of the Apriori algorithm in Apache Spark

Having gone through the preceding algorithm, we will now write all of it in Spark. Spark does not ship an implementation of the Apriori algorithm, so we have to write our own, as shown next (refer to the comments in the code as well).

First, we need the usual boilerplate code to initialize the Spark configuration and context:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);

Now, we will load the dataset file using the Spark context and store the result in a JavaRDD instance. We will then create an instance of the AprioriUtil class. This class contains the methods for calculating ...
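The preview breaks off before the AprioriUtil class is shown, so its methods are not known here. As a rough illustration of what a level-wise Apriori pass computes, the following plain-Java sketch (no Spark, Java 9 or later) counts support for candidate itemsets and grows them one item at a time. The class name AprioriSketch, its method names, and the sample transactions are all hypothetical and are not taken from the book's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java sketch of level-wise Apriori; hypothetical, not the book's AprioriUtil. */
final class AprioriSketch {

    /** Returns every itemset contained in at least minSupport transactions, with its count. */
    static Map<Set<String>, Integer> frequentItemsets(List<Set<String>> transactions,
                                                      int minSupport) {
        Map<Set<String>, Integer> result = new HashMap<>();

        // Level 1 candidates: every distinct single item.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> transaction : transactions) {
            for (String item : transaction) {
                candidates.add(new TreeSet<>(List.of(item)));
            }
        }

        while (!candidates.isEmpty()) {
            // Count the transactions that contain each candidate; keep the frequent ones.
            Map<Set<String>, Integer> frequent = new HashMap<>();
            for (Set<String> candidate : candidates) {
                int count = 0;
                for (Set<String> transaction : transactions) {
                    if (transaction.containsAll(candidate)) {
                        count++;
                    }
                }
                if (count >= minSupport) {
                    frequent.put(candidate, count);
                }
            }
            result.putAll(frequent);
            candidates = nextCandidates(frequent.keySet());
        }
        return result;
    }

    /** Joins frequent k-itemsets into (k+1)-candidates, pruning any with an infrequent k-subset. */
    static Set<Set<String>> nextCandidates(Set<Set<String>> frequent) {
        Set<Set<String>> next = new HashSet<>();
        List<Set<String>> list = new ArrayList<>(frequent);
        for (int i = 0; i < list.size(); i++) {
            for (int j = i + 1; j < list.size(); j++) {
                Set<String> union = new TreeSet<>(list.get(i));
                union.addAll(list.get(j));
                // Keep only unions that grow the itemset by exactly one item.
                if (union.size() == list.get(i).size() + 1
                        && allSubsetsFrequent(union, frequent)) {
                    next.add(union);
                }
            }
        }
        return next;
    }

    /** Apriori property: every k-subset of a frequent (k+1)-itemset must itself be frequent. */
    static boolean allSubsetsFrequent(Set<String> candidate, Set<Set<String>> frequent) {
        for (String item : candidate) {
            Set<String> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!frequent.contains(subset)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up baskets, for illustration only.
        List<Set<String>> transactions = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "butter", "milk"),
                Set.of("butter", "milk"),
                Set.of("bread", "butter"));
        System.out.println(frequentItemsets(transactions, 2));
    }
}
```

In a Spark version, the inner counting loop is the part that would typically be distributed over the JavaRDD of transactions, while candidate generation stays on the driver; that split is an assumption about a reasonable design, not a description of the book's implementation.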

Publisher Resources

ISBN: 9781787288980