Java Data Analysis

Name: Java Data Analysis
Author: John R. Hubbard
ISBN: 9781787285651

by John R. Hubbard

September 2017

Beginner to intermediate

412 pages

8h 55m

English

Packt Publishing

Java Data Analysis
Table of Contents
Java Data Analysis
Credits
About the Author
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book

Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction to Data Analysis
Origins of data analysis
The scientific method
Actuarial science
Calculated by steam
A spectacular example
Herman Hollerith
ENIAC
VisiCalc
Data, information, and knowledge
Why Java?
Java Integrated Development Environments
Summary
2. Data Preprocessing
Data types
Variables
Data points and datasets
Null values
Relational database tables
Key fields
Key-value pairs
Hash tables
File formats
Microsoft Excel data
XML and JSON data
Generating test datasets
Metadata
Data cleaning
Data scaling
Data filtering
Sorting
Merging
Hashing
Summary
3. Data Visualization
Tables and graphs
Scatter plots
Line graphs
Bar charts
Histograms
Time series
Java implementation
Moving average
Data ranking
Frequency distributions
The normal distribution
A thought experiment
The exponential distribution
Java example
Summary
4. Statistics
Descriptive statistics
Random sampling
Random variables
Probability distributions
Cumulative distributions
The binomial distribution
Multivariate distributions
Conditional probability
The independence of probabilistic events
Contingency tables
Bayes' theorem
Covariance and correlation
The standard normal distribution
The central limit theorem
Confidence intervals
Hypothesis testing
Summary
5. Relational Databases
The relation data model
Relational databases
Foreign keys
Relational database design
Creating a database
SQL commands
Inserting data into the database
Database queries
SQL data types
JDBC
Using a JDBC PreparedStatement
Batch processing
Database views
Subqueries
Table indexes
Summary
6. Regression Analysis
Linear regression
Linear regression in Excel
Computing the regression coefficients
Variation statistics
Java implementation of linear regression
Anscombe's quartet
Polynomial regression
Multiple linear regression
The Apache Commons implementation
Curve fitting
Summary
7. Classification Analysis
Decision trees
What does entropy have to do with it?
The ID3 algorithm
Java Implementation of the ID3 algorithm
The Weka platform
The ARFF filetype for data
Java implementation with Weka
Bayesian classifiers
Java implementation with Weka
Support vector machine algorithms
Logistic regression
K-Nearest Neighbors
Fuzzy classification algorithms
Summary
8. Cluster Analysis
Measuring distances
The curse of dimensionality
Hierarchical clustering
Weka implementation
K-means clustering
K-medoids clustering
Affinity propagation clustering
Summary
9. Recommender Systems
Utility matrices
Similarity measures
Cosine similarity
A simple recommender system
Amazon's item-to-item collaborative filtering recommender
Implementing user ratings
Large sparse matrices
Using random access files
The Netflix prize
Summary
10. NoSQL Databases
The Map data structure
SQL versus NoSQL
The Mongo database system
The Library database
Java development with MongoDB
The MongoDB extension for geospatial databases
Indexing in MongoDB
Why NoSQL and why MongoDB?
Other NoSQL database systems
Summary
11. Big Data Analysis with Java
Scaling, data striping, and sharding
Google's PageRank algorithm
Google's MapReduce framework
Some examples of MapReduce applications
The WordCount example
Scalability
Matrix multiplication with MapReduce
MapReduce in MongoDB
Apache Hadoop
Hadoop MapReduce
Summary
A. Java Tools
The command line
Java
NetBeans
MySQL
MySQL Workbench
Accessing the MySQL database from NetBeans
The Apache Commons Math Library
The javax JSON Library
The Weka libraries
MongoDB
Index

Content preview from Java Data Analysis

Chapter 8. Cluster Analysis

A clustering algorithm is one that identifies groups of data points according to their proximity to each other. These algorithms are similar to classification algorithms in that they also partition a dataset into subsets of similar points. But in classification, we already have data whose classes have been identified, such as sweet fruit. In clustering, we seek to discover the unknown groups themselves.

Measuring distances

A metric on a set S of points is a function d: S × S → ℝ that satisfies these conditions for all p, q, r ∈ S:

d(p,q) = 0 ⇔ p = q
d(p,q) = d(q,p)
d(p,q) ≤ d(p,r) + d(r,q)

Normally, we think of the number d(p,q) as the distance ...
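
To make the metric conditions concrete, here is a minimal Java sketch that uses ordinary Euclidean distance as one example of a metric. It checks three properties for a sample triple of points: a distance is zero exactly when the points coincide, distance is symmetric, and the triangle inequality holds. The class and method names (`MetricSketch`, `euclidean`, `satisfiesMetricConditions`) are illustrative assumptions, not code from the book, and checking one triple only illustrates the conditions; it does not prove them.

```java
import java.util.Arrays;

/** Sketch only: Euclidean distance as one example of a metric (names are hypothetical). */
final class MetricSketch {

    /** Euclidean distance between two points of equal dimension. */
    static double euclidean(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("Points must have the same dimension");
        }
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double diff = p[i] - q[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Checks the three metric conditions for one triple of points, within a small tolerance. */
    static boolean satisfiesMetricConditions(double[] p, double[] q, double[] r) {
        final double eps = 1e-12;
        // Zero distance exactly when the two points are the same.
        boolean identity = (euclidean(p, q) == 0.0) == Arrays.equals(p, q);
        // Same distance in both directions.
        boolean symmetry = Math.abs(euclidean(p, q) - euclidean(q, p)) <= eps;
        // Going through r is never shorter than going directly.
        boolean triangle = euclidean(p, q) <= euclidean(p, r) + euclidean(r, q) + eps;
        return identity && symmetry && triangle;
    }

    public static void main(String[] args) {
        double[] p = {0.0, 0.0};
        double[] q = {3.0, 4.0};
        double[] r = {3.0, 0.0};
        System.out.println(euclidean(p, q));                    // 5.0
        System.out.println(satisfiesMetricConditions(p, q, r)); // true
    }
}
```

For the sample points, d(p,q) = 5, d(p,r) = 3, and d(r,q) = 4, so the triangle inequality reads 5 ≤ 3 + 4.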
