Clojure for Data Science

Name: Clojure for Data Science
Author: Henry Garner
ISBN: 9781784397180

by Henry Garner

September 2015

Beginner to intermediate

608 pages

13h 43m

English

Packt Publishing

Clojure for Data Science
Table of Contents
Clojure for Data Science
Credits
About the Author
Acknowledgments
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book

Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Statistics
Downloading the sample code
Running the examples
Downloading the data
Inspecting the data
Data scrubbing
Descriptive statistics
The mean
Interpreting mathematical notation
The median
Variance
Quantiles
Binning data
Histograms
The normal distribution
The central limit theorem
Poincaré's baker
Generating distributions
Skewness
Quantile-quantile plots
Comparative visualizations
Box plots
Cumulative distribution functions
The importance of visualizations
Visualizing electorate data
Adding columns
Adding derived columns
Comparative visualizations of electorate data
Visualizing the Russian election data
Comparative visualizations
Probability mass functions
Scatter plots
Scatter transparency
Summary
2. Inference
Introducing AcmeContent
Download the sample code
Load and inspect the data
Visualizing the dwell times
The exponential distribution
The distribution of daily means
The central limit theorem
Standard error
Samples and populations
Confidence intervals
Sample comparisons
Bias
Visualizing different populations
Hypothesis testing
Significance
Testing a new site design
Performing a z-test
Student's t-distribution
Degrees of freedom
The t-statistic
Performing the t-test
Two-tailed tests
One-sample t-test
Resampling
Testing multiple designs
Calculating sample means
Multiple comparisons
Introducing the simulation
Compile the simulation
The browser simulation
jStat
B1
Scalable Vector Graphics
Plotting probability densities
State and Reagent
Updating state
Binding the interface
Simulating multiple tests
The Bonferroni correction
Analysis of variance
The F-distribution
The F-statistic
The F-test
Effect size
Cohen's d
Summary
3. Correlation
About the data
Inspecting the data
Visualizing the data
The log-normal distribution
Visualizing correlation
Jittering
Covariance
Pearson's correlation
Sample r and population rho
Hypothesis testing
Confidence intervals
Regression
Linear equations
Residuals
Ordinary least squares
Slope and intercept
Interpretation
Visualization
Assumptions
Goodness-of-fit and R-square
Multiple linear regression
Matrices
Dimensions
Vectors
Construction
Addition and scalar multiplication
Matrix-vector multiplication
Matrix-matrix multiplication
Transposition
The identity matrix
Inversion
The normal equation
More features
Multiple R-squared
Adjusted R-squared
Incanter's linear model
The F-test of model significance
Categorical and dummy variables
Relative power
Collinearity
Multicollinearity
Prediction
The confidence interval of a prediction
Model scope
The final model
Summary
4. Classification
About the data
Inspecting the data
Comparisons with relative risk and odds
The standard error of a proportion
Estimation using bootstrapping
The binomial distribution
The standard error of a proportion formula
Significance testing proportions
Adjusting standard errors for large samples
Chi-squared multiple significance testing
Visualizing the categories
The chi-squared test
The chi-squared statistic
The chi-squared test
Classification with logistic regression
The sigmoid function
The logistic regression cost function
Parameter optimization with gradient descent
Gradient descent with Incanter
Convexity
Implementing logistic regression with Incanter
Creating a feature matrix
Evaluating the logistic regression classifier
The confusion matrix
The kappa statistic
Probability
Bayes theorem
Bayes theorem with multiple predictors
Naive Bayes classification
Implementing a naive Bayes classifier
Evaluating the naive Bayes classifier
Comparing the logistic regression and naive Bayes approaches
Decision trees
Information
Entropy
Information gain
Using information gain to identify the best predictor
Recursively building a decision tree
Using the decision tree for classification
Evaluating the decision tree classifier
Classification with clj-ml
Loading data with clj-ml
Building a decision tree in clj-ml
Bias and variance
Overfitting
Cross-validation
Addressing high bias
Ensemble learning and random forests
Bagging and boosting
Saving the classifier to a file
Summary
5. Big Data
Downloading the code and data
Inspecting the data
Counting the records
The reducers library
Parallel folds with reducers
Loading large files with iota
Creating a reducers processing pipeline
Curried reductions with reducers
Statistical folds with reducers
Associativity
Calculating the mean using fold
Calculating the variance using fold
Mathematical folds with Tesser
Calculating covariance with Tesser
Commutativity
Simple linear regression with Tesser
Calculating a correlation matrix
Multiple regression with gradient descent
The gradient descent update rule
The gradient descent learning rate
Feature scaling
Feature extraction
Creating a custom Tesser fold
Creating a matrix-sum fold
Calculating the total model error
Creating a matrix-mean fold
Applying a single step of gradient descent
Running iterative gradient descent
Scaling gradient descent with Hadoop
Gradient descent on Hadoop with Tesser and Parkour
Parkour distributed sources and sinks
Running a feature scale fold with Hadoop
Running gradient descent with Hadoop
Preparing our code for a Hadoop cluster
Building an uberjar
Submitting the uberjar to Hadoop
Stochastic gradient descent
Stochastic gradient descent with Parkour
Defining a mapper
Parkour shaping functions
Defining a reducer
Specifying Hadoop jobs with Parkour graph
Chaining mappers and reducers with Parkour graph
Summary
6. Clustering
Downloading the data
Extracting the data
Inspecting the data
Clustering text
Set-of-words and the Jaccard index
Tokenizing the Reuters files
Applying the Jaccard index to documents
The bag-of-words and Euclidean distance
Representing text as vectors
Creating a dictionary
Creating term frequency vectors
The vector space model and cosine distance
Removing stop words
Stemming
Clustering with k-means and Incanter
Clustering the Reuters documents
Better clustering with TF-IDF
Zipf's law
Calculating the TF-IDF weight
k-means clustering with TF-IDF
Better clustering with n-grams
Large-scale clustering with Mahout
Converting text documents to a sequence file
Using Parkour to create Mahout vectors
Creating distributed unique IDs
Distributed unique IDs with Hadoop
Sharing data with the distributed cache
Building Mahout vectors from input documents
Running k-means clustering with Mahout
Viewing k-means clustering results
Interpreting the clustered output
Cluster evaluation measures
Inter-cluster density
Intra-cluster density
Calculating the root mean square error with Parkour
Loading clustered points and centroids
Calculating the cluster RMSE
Determining optimal k with the elbow method
Determining optimal k with the Dunn index
Determining optimal k with the Davies-Bouldin index
The drawbacks of k-means
The Mahalanobis distance measure
The curse of dimensionality
Summary
7. Recommender Systems
Download the code and data
Inspect the data
Parse the data
Types of recommender systems
Collaborative filtering
Item-based and user-based recommenders
Slope One recommenders
Calculating the item differences
Making recommendations
Practical considerations for user and item recommenders
Building a user-based recommender with Mahout
k-nearest neighbors
Recommender evaluation with Mahout
Evaluating distance measures
The Pearson correlation similarity
Spearman's rank similarity
Determining optimum neighborhood size
Information retrieval statistics
Precision
Recall
Mahout's information retrieval evaluator
F-measure and the harmonic mean
Fall-out
Normalized discounted cumulative gain
Plotting the information retrieval results
Recommendation with Boolean preferences
Implicit versus explicit feedback
Probabilistic methods for large sets
Testing set membership with Bloom filters
Jaccard similarity for large sets with MinHash
Reducing pair comparisons with locality-sensitive hashing
Bucketing signatures
Dimensionality reduction
Plotting the Iris dataset
Principal component analysis
Singular value decomposition
Large-scale machine learning with Apache Spark and MLlib
Loading data with Sparkling
Mapping data
Distributed datasets and tuples
Filtering data
Persistence and caching
Machine learning on Spark with MLlib
Movie recommendations with alternating least squares
ALS with Spark and MLlib
Making predictions with ALS
Evaluating ALS
Calculating the sum of squared errors
Summary
8. Network Analysis
Download the data
Inspecting the data
Visualizing graphs with Loom
Graph traversal with Loom
The seven bridges of Königsberg
Breadth-first and depth-first search
Finding the shortest path
Minimum spanning trees
Subgraphs and connected components
SCC and the bow-tie structure of the web
Whole-graph analysis
Scale-free networks
Distributed graph computation with GraphX
Creating RDGs with Glittering
Measuring graph density with triangle counting
GraphX partitioning strategies
Running the built-in triangle counting algorithm
Implement triangle counting with Glittering
Step one – collecting neighbor IDs
Steps two, three, and four – aggregate messages
Step five – dividing the counts
Running the custom triangle counting algorithm
The Pregel API
Connected components with the Pregel API
Step one – map vertices
Steps two and three – the message function
Step four – update the attributes
Step five – iterate to convergence
Running connected components
Calculating the size of the largest connected component
Detecting communities with label propagation
Step one – map vertices
Step two – send the vertex attribute
Step three – aggregate value
Step four – vertex function
Step five – set the maximum iterations count
Running label propagation
Measuring community influence using PageRank
The flow formulation
Implementing PageRank with Glittering
Sort by highest influence
Running PageRank to determine community influencers
Summary
9. Time Series
About the data
Loading the Longley data
Fitting curves with a linear model
Time series decomposition
Inspecting the airline data
Visualizing the airline data
Stationarity
De-trending and differencing
Discrete time models
Random walks
Autoregressive models
Determining autocorrelation in AR models
Moving-average models
Determining autocorrelation in MA models
Combining the AR and MA models
Calculating partial autocorrelation
Autocovariance
PACF with Durbin-Levinson recursion
Plotting partial autocorrelation
Determining ARMA model order with ACF and PACF
ACF and PACF of airline data
Removing seasonality with differencing
Maximum likelihood estimation
Calculating the likelihood
Estimating the maximum likelihood
Nelder-Mead optimization with Apache Commons Math
Identifying better models with Akaike Information Criterion
Time series forecasting
Forecasting with Monte Carlo simulation
Summary
10. Visualization
Download the code and data
Exploratory data visualization
Representing a two-dimensional histogram
Using Quil for visualization
Drawing to the sketch window
Quil's coordinate system
Plotting the grid
Specifying the fill color
Color and fill
Outputting an image file
Visualization for communication
Visualizing wealth distribution
Bringing data to life with Quil
Drawing bars of differing widths
Adding a title and axis labels
Improving the clarity with illustrations
Adding text to the bars
Incorporating additional data
Drawing complex shapes
Drawing curves
Plotting compound charts
Output to PDF
Summary
Index

Content preview from Clojure for Data Science

Chapter 5. Big Data

	"More is different."
	--Philip Warren Anderson

In the previous chapters, we've used regression techniques to fit models to the data. In Chapter 3, Correlation, for example, we built a linear model that used ordinary least squares and the normal equation to fit a straight line through the athletes' heights and log weights. In Chapter 4, Classification, we used Incanter's optimize namespace to minimize the logistic cost function and build a classifier for the Titanic's passengers. In this chapter, we'll apply similar analyses in a way that's suitable for much larger quantities of data.

We'll be working with a relatively modest dataset of only 100,000 records. This isn't big data (at 100 MB, it will fit comfortably in the memory of ...
