R: Data Analysis and Visualization

Name: R: Data Analysis and Visualization
ISBN: 9781786463500

by Tony Fischetti, Brett Lantz, Jaynal Abedin, Hrishi V. Mittal, Bater Makhabel, Edina Berlinger, Ferenc Illés, Milán Badics, Ádám Banai, Gergely Daróczi, Barbara Dömötör, Gergely Gabler, Dániel Havran, Péter Juhász, István Margitai, Balázs Márkus, Péter Medvegyev, Julia Molnár, Balázs Árpád Szucs, Ágnes Tuza, Tamás Vadász, Kata Váradi, Ágnes Vidovics-Dancs

June 2016

Beginner to intermediate

1783 pages

71h 22m

English

Packt Publishing

Table of Contents
Meet Your Course Guide
Course Structure
Course journey
The Course Roadmap and Timeline
I. Module 1: Data Analysis with R
1. RefresheR
Navigating the basics
Arithmetic and assignment
Logicals and characters
Flow of control
Getting help in R
Vectors
Subsetting
Vectorized functions
Advanced subsetting
Recycling
Functions
Matrices
Loading data into R
Working with packages
2. The Shape of Data
Univariate data
Frequency distributions
Central tendency
Spread
Populations, samples, and estimation
Probability distributions
Visualization methods
3. Describing Relationships
Multivariate data
Relationships between a categorical and a continuous variable
Relationships between two categorical variables
The relationship between two continuous variables
Covariance
Correlation coefficients
Comparing multiple correlations
Visualization methods
Categorical and continuous variables
Two categorical variables
Two continuous variables
More than two continuous variables
4. Probability
Basic probability
A tale of two interpretations
Sampling from distributions
Parameters
The binomial distribution
The normal distribution
The three-sigma rule and using z-tables
5. Using Data to Reason About the World
Estimating means
The sampling distribution
Interval estimation
How did we get 1.96?
Smaller samples
6. Testing Hypotheses
Null Hypothesis Significance Testing
One and two-tailed tests
When things go wrong
A warning about significance
A warning about p-values
Testing the mean of one sample
Assumptions of the one sample t-test
Testing two means
Don't be fooled!
Assumptions of the independent samples t-test
Testing more than two means
Assumptions of ANOVA
Testing independence of proportions
What if my assumptions are unfounded?
7. Bayesian Methods
The big idea behind Bayesian analysis
Choosing a prior
Who cares about coin flips
Enter MCMC – stage left
Using JAGS and runjags
Fitting distributions the Bayesian way
The Bayesian independent samples t-test
8. Predicting Continuous Variables
Linear models
Simple linear regression
Simple linear regression with a binary predictor
A word of warning
Multiple regression
Regression with a non-binary predictor
Kitchen sink regression
The bias-variance trade-off
Cross-validation
Striking a balance
Linear regression diagnostics
Second Anscombe relationship
Third Anscombe relationship
Fourth Anscombe relationship
Advanced topics
9. Predicting Categorical Variables
k-Nearest Neighbors
Using k-NN in R
Confusion matrices
Limitations of k-NN
Logistic regression
Using logistic regression in R
Decision trees
Random forests
Choosing a classifier
The vertical decision boundary
The diagonal decision boundary
The crescent decision boundary
The circular decision boundary
10. Sources of Data
Relational Databases
Why didn't we just do that in SQL?
Using JSON
XML
Other data formats
Online repositories
11. Dealing with Messy Data
Analysis with missing data
Visualizing missing data
Types of missing data
So which one is it?
Unsophisticated methods for dealing with missing data
Complete case analysis
Pairwise deletion
Mean substitution
Hot deck imputation
Regression imputation
Stochastic regression imputation
Multiple imputation
So how does mice come up with the imputed values?
Methods of imputation
Multiple imputation in practice
Analysis with unsanitized data
Checking for out-of-bounds data
Checking the data type of a column
Checking for unexpected categories
Checking for outliers, entry errors, or unlikely data points
Chaining assertions
Other messiness
OpenRefine
Regular expressions
tidyr
12. Dealing with Large Data
Wait to optimize
Using a bigger and faster machine
Be smart about your code
Allocation of memory
Vectorization
Using optimized packages
Using another R implementation
Use parallelization
Getting started with parallel R
An example of (some) substance
Using Rcpp
Be smarter about your code
13. Reproducibility and Best Practices
R Scripting
RStudio
Running R scripts
An example script
Scripting and reproducibility
R projects
Version control
Communicating results
II. Module 2: R Graphs
1. R Graphics
Base graphics using the default package
Trellis graphs using lattice
Graphs inspired by Grammar of Graphics
2. Basic Graph Functions
Introduction
Creating basic scatter plots
Getting ready
How to do it...
How it works...
There's more...
A note on R's built-in datasets
See also
Creating line graphs
Getting ready
How to do it...
How it works...
There's more...
See also
Creating bar charts
Getting ready
How to do it...
How it works...
There's more...
See also
Creating histograms and density plots
How to do it...
How it works...
There's more...
See also
Creating box plots
Getting ready
How to do it...
How it works...
There's more...
See also
Adjusting x and y axes' limits
How to do it...
How it works...
There's more...
See also
Creating heat maps
How to do it...
How it works...
There's more...
See also
Creating pairs plots
How to do it...
How it works...
There's more...
See also
Creating multiple plot matrix layouts
How to do it...
How it works...
There's more...
See also
Adding and formatting legends
Getting ready
How to do it...
How it works...
There's more...
See also
Creating graphs with maps
Getting ready
How to do it...
How it works...
There's more...
See also
Saving and exporting graphs
How to do it...
How it works...
There's more...
See also
3. Beyond the Basics – Adjusting Key Parameters
Introduction
Setting colors of points, lines, and bars
Getting ready
How to do it...
How it works...
There's more...
See also
Setting plot background colors
Getting ready
How to do it...
How it works...
There's more...
Setting colors for text elements – axis annotations, labels, plot titles, and legends
Getting ready
How to do it...
How it works...
There's more...
Choosing color combinations and palettes
Getting ready
How to do it...
How it works...
There's more...
See also
Setting fonts for annotations and titles
Getting ready
How to do it...
How it works...
There's more...
See also
Choosing plotting point symbol styles and sizes
Getting ready
How to do it...
How it works...
There's more...
See also
Choosing line styles and width
Getting ready
How to do it...
How it works...
See also
Choosing box styles
Getting ready
How to do it...
How it works...
There's more...
Adjusting axis annotations and tick marks
Getting ready
How to do it...
How it works...
There's more...
See also
Formatting log axes
Getting ready
How to do it...
How it works...
There's more...
Setting graph margins and dimensions
Getting ready
How to do it...
How it works...
See also
4. Creating Scatter Plots
Introduction
Grouping data points within a scatter plot
Getting ready
How to do it...
How it works...
There's more...
See also
Highlighting grouped data points by size and symbol type
Getting ready
How to do it...
How it works...
Labeling data points
Getting ready
How to do it...
How it works...
There's more...
Correlation matrix using pairs plots
Getting ready
How to do it...
How it works...
Adding error bars
Getting ready
How to do it...
How it works...
There's more...
Using jitter to distinguish closely packed data points
Getting ready
How to do it...
How it works...
Adding linear model lines
Getting ready
How to do it...
How it works...
Adding nonlinear model curves
Getting ready
How to do it...
How it works...
Adding nonparametric model curves with lowess
Getting ready
How to do it...
How it works...
Creating three-dimensional scatter plots
Getting ready
How to do it...
How it works...
There's more...
Creating Quantile-Quantile plots
Getting ready
How to do it...
How it works...
There's more...
Displaying the data density on axes
Getting ready
How to do it...
How it works...
There's more...
Creating scatter plots with a smoothed density representation
Getting ready
How to do it...
How it works...
There's more...
5. Creating Line Graphs and Time Series Charts
Introduction
Adding customized legends for multiple-line graphs
Getting ready
How to do it...
How it works...
There's more...
See also
Using margin labels instead of legends for multiple-line graphs
Getting ready
How to do it...
How it works...
There's more...
Adding horizontal and vertical grid lines
Getting ready
How to do it...
How it works...
There's more...
See also
Adding marker lines at specific x and y values using abline
Getting ready
How to do it...
How it works...
There's more...
Creating sparklines
Getting ready
How to do it...
How it works...
Plotting functions of a variable in a dataset
Getting ready
How to do it...
How it works...
There's more...
Formatting time series data for plotting
Getting ready
How to do it...
How it works...
There's more...
Plotting the date or time variable on the x axis
Getting ready
How to do it...
How it works...
There's more...
Annotating axis labels in different human-readable time formats
Getting ready
How to do it...
How it works...
There's more...
Adding vertical markers to indicate specific time events
Getting ready
How to do it...
How it works...
There's more...
Plotting data with varying time-averaging periods
Getting ready
How to do it...
How it works...
Creating stock charts
Getting ready
How to do it...
How it works...
There's more...
6. Creating Bar, Dot, and Pie Charts
Introduction
Creating bar charts with more than one factor variable
Getting ready
How to do it...
How it works...
See also
Creating stacked bar charts
Getting ready
How to do it...
How it works...
There's more...
Adjusting the orientation of bars – horizontal and vertical
Getting ready
How to do it...
How it works...
There's more...
Adjusting bar widths, spacing, colors, and borders
Getting ready
How to do it...
How it works...
There's more...
Displaying values on top of or next to the bars
Getting ready
How to do it...
How it works...
There's more...
See also
Placing labels inside bars
Getting ready
How to do it...
How it works...
There's more...
Creating bar charts with vertical error bars
Getting ready
How to do it...
How it works...
There's more...
Modifying dot charts by grouping variables
Getting ready
How to do it...
How it works...
Making better, readable pie charts with clockwise-ordered slices
Getting ready
How to do it...
How it works...
See also
Labeling a pie chart with percentage values for each slice
Getting ready
How it works...
There's more...
See also
Adding a legend to a pie chart
Getting ready
How to do it...
How it works...
There's more...
7. Creating Histograms
Introduction
Visualizing distributions as count frequencies or probability densities
Getting ready
How to do it...
How it works...
There's more
Setting the bin size and the number of breaks
Getting ready
How to do it...
How it works...
There's more
Adjusting histogram styles – bar colors, borders, and axes
Getting ready
How to do it...
How it works...
There's more
Overlaying a density line over a histogram
Getting ready
How to do it...
How it works...
Multiple histograms along the diagonal of a pairs plot
Getting ready
How to do it...
How it works...
Histograms in the margins of line and scatter plots
Getting ready
How to do it...
How it works...
8. Box and Whisker Plots
Introduction
Creating box plots with narrow boxes for a small number of variables
Getting ready
How to do it...
How it works...
There's more
See also
Grouping over a variable
Getting ready
How to do it...
How it works...
There's more
See also
Varying box widths by the number of observations
Getting ready
How to do it...
How it works...
Creating box plots with notches
Getting ready
How to do it...
How it works...
There's more
Including or excluding outliers
Getting ready
How to do it...
How it works...
See also
Creating horizontal box plots
Getting ready
How to do it...
How it works...
Changing the box styling
Getting ready
How to do it...
How it works...
There's more
Adjusting the extent of plot whiskers outside the box
Getting ready
How to do it...
How it works...
There's more
Showing the number of observations
Getting ready
How to do it...
How it works...
There's more
Splitting a variable at arbitrary values into subsets
Getting ready
How to do it...
How it works...
There's more
9. Creating Heat Maps and Contour Plots
Introduction
Creating heat maps of a single Z variable with a scale
Getting ready
How to do it...
How it works...
There's more
See also
Creating correlation heat maps
Getting ready
How to do it...
How it works...
There's more
Summarizing multivariate data in a single heat map
Getting ready
How to do it...
How it works...
There's more
Creating contour plots
Getting ready
How to do it...
How it works...
There's more
See also
Creating filled contour plots
Getting ready
How to do it...
How it works...
There's more
See also
Creating three-dimensional surface plots
Getting ready
How to do it...
How it works...
There's more
Visualizing time series as calendar heat maps
Getting ready
How to do it...
How it works...
There's more
10. Creating Maps
Introduction
Plotting global data by countries on a world map
Getting ready
How to do it...
How it works...
There's more
See also
Creating graphs with regional maps
Getting ready
How to do it...
How it works...
There's more
Plotting data on Google maps
Getting ready
How to do it...
How it works...
There's more
See also
Creating and reading KML data
Getting ready
How to do it...
How it works...
See also
Working with ESRI shapefiles
Getting ready
How to do it...
How it works...
There's more
11. Data Visualization Using Lattice
Introduction
Creating bar charts
Getting ready
How to do it…
How it works…
There's more…
See also
Creating stacked bar charts
Getting ready
How to do it…
How it works…
There's more…
See also
Creating bar charts to visualize cross-tabulation
Getting ready
How to do it…
How it works…
There's more…
Creating a conditional histogram
Getting ready
How to do it…
How it works…
There's more…
See also
Visualizing distributions through a kernel-density plot
Getting ready
How to do it…
How it works…
There's more…
Creating a normal Q-Q plot
Getting ready
How to do it…
How it works…
There's more…
Visualizing an empirical Cumulative Distribution Function
Getting ready
How to do it…
How it works…
There's more…
Creating a boxplot
Getting ready
How to do it…
How it works…
There's more…
Creating a conditional scatter plot
Getting ready
How to do it…
How it works…
There's more…
12. Data Visualization Using ggplot2
Introduction
Creating bar charts
Getting ready
How to do it…
How it works…
There's more…
See also
Creating multiple bar charts
Getting ready
How to do it…
How it works…
There's more…
See also
Creating a bar chart with error bars
Getting ready
How to do it…
How it works…
There's more…
Visualizing the density of a numeric variable
Getting ready
How to do it…
How it works…
There's more…
Creating a box plot
Getting ready
How to do it…
How it works…
Creating a layered plot with a scatter plot and fitted line
Getting ready
How to do it…
How it works…
There's more…
Creating a line chart
Getting ready
How to do it…
How it works…
There's more…
Graph annotation with ggplot
Getting ready
How to do it…
How it works…
13. Inspecting Large Datasets
Introduction
Multivariate continuous data visualization
Getting ready
How to do it…
How it works…
There's more…
See also
Multivariate categorical data visualization
Getting ready
How to do it…
How it works…
There's more…
Visualizing mixed data
Getting ready
How to do it…
Zooming and filtering
Getting ready
How to do it…
How it works…
There's more…
14. Three-dimensional Visualizations
Introduction
Three-dimensional scatter plots
Getting ready
How to do it…
How it works…
There's more…
See also
Three-dimensional scatter plots with a regression plane
Getting ready
How to do it…
How it works…
There's more…
Three-dimensional bar charts
Getting ready
How to do it…
How it works…
Three-dimensional density plots
Getting ready
How to do it…
How it works…
15. Finalizing Graphs for Publications and Presentations
Introduction
Exporting graphs in high-resolution image formats – PNG, JPEG, BMP, and TIFF
Getting ready
How to do it...
How it works...
There's more
See also
Exporting graphs in vector formats – SVG, PDF, and PS
Getting ready
How to do it...
How it works...
There's more
Adding mathematical and scientific notations (typesetting)
Getting ready
How to do it...
How it works...
There's more
Adding text descriptions to graphs
Getting ready
How to do it...
How it works...
There's more
Using graph templates
Getting ready
How to do it...
How it works...
There's more
Choosing font families and styles under Windows, Mac OS X, and Linux
Getting ready
How to do it...
How it works...
There's more
See also
Choosing fonts for PostScripts and PDFs
Getting ready
How to do it...
How it works...
There's more
III. Module 3: Learning Data Mining with R
1. Warming Up
Big data
Scalability and efficiency
Data source
Data mining
Feature extraction
Summarization
The data mining process
CRISP-DM
SEMMA
Social network mining
Social network
Text mining
Information retrieval and text mining
Mining text for prediction
Web data mining
Why R?
What are the disadvantages of R?
Statistics
Statistics and data mining
Statistics and machine learning
Statistics and R
The limitations of statistics on data mining
Machine learning
Approaches to machine learning
Machine learning architecture
Data attributes and description
Numeric attributes
Categorical attributes
Data description
Data measuring
Data cleaning
Missing values
Junk, noisy data, or outlier
Data integration
Data dimension reduction
Eigenvalues and Eigenvectors
Principal-Component Analysis
Singular-value decomposition
CUR decomposition
Data transformation and discretization
Data transformation
Normalization data transformation methods
Data discretization
Visualization of results
Visualization with R
2. Mining Frequent Patterns, Associations, and Correlations
An overview of associations and patterns
Patterns and pattern discovery
The frequent itemset
The frequent subsequence
The frequent substructures
Relationship or rules discovery
Association rules
Correlation rules
Market basket analysis
The market basket model
A-Priori algorithms
Input data characteristics and data structure
The A-Priori algorithm
The R implementation
A-Priori algorithm variants
The Eclat algorithm
The R implementation
The FP-growth algorithm
Input data characteristics and data structure
The FP-growth algorithm
The R implementation
The GenMax algorithm with maximal frequent itemsets
The R implementation
The Charm algorithm with closed frequent itemsets
The R implementation
The algorithm to generate association rules
The R implementation
Hybrid association rules mining
Mining multilevel and multidimensional association rules
Constraint-based frequent pattern mining
Mining sequence dataset
Sequence dataset
The GSP algorithm
The R implementation
The SPADE algorithm
The R implementation
Rule generation from sequential patterns
High-performance algorithms
3. Classification
Classification
Generic decision tree induction
Attribute selection measures
Tree pruning
General algorithm for the decision tree generation
The R implementation
High-value credit card customers classification using ID3
The ID3 algorithm
The R implementation
Web attack detection
High-value credit card customers classification
Web spam detection using C4.5
The C4.5 algorithm
The R implementation
A parallel version with MapReduce
Web spam detection
Web key resource page judgment using CART
The CART algorithm
The R implementation
Web key resource page judgment
Trojan traffic identification method and Bayes classification
Estimating
Prior probability estimation
Likelihood estimation
The Bayes classification
The R implementation
Trojan traffic identification method
Identify spam e-mail and Naïve Bayes classification
The Naïve Bayes classification
The R implementation
Identify spam e-mail
Rule-based classification of player types in computer games and rule-based classification
Transformation from decision tree to decision rules
Rule-based classification
Sequential covering algorithm
The RIPPER algorithm
The R implementation
Rule-based classification of player types in computer games
4. Advanced Classification
Ensemble (EM) methods
The bagging algorithm
The boosting and AdaBoost algorithms
The Random forests algorithm
The R implementation
Parallel version with MapReduce
Biological traits and the Bayesian belief network
The Bayesian belief network (BBN) algorithm
The R implementation
Biological traits
Protein classification and the k-Nearest Neighbors algorithm
The kNN algorithm
The R implementation
Document retrieval and Support Vector Machine
The SVM algorithm
The R implementation
Parallel version with MapReduce
Document retrieval
Classification using frequent patterns
The associative classification
CBA
Discriminative frequent pattern-based classification
The R implementation
Text classification using sentential frequent itemsets
Classification using the backpropagation algorithm
The BP algorithm
The R implementation
Parallel version with MapReduce
5. Cluster Analysis
Search engines and the k-means algorithm
The k-means clustering algorithm
The kernel k-means algorithm
The k-modes algorithm
The R implementation
Parallel version with MapReduce
Search engine and web page clustering
Automatic abstraction of document texts and the k-medoids algorithm
The PAM algorithm
The R implementation
Automatic abstraction and summarization of document text
The CLARA algorithm
The CLARA algorithm
The R implementation
CLARANS
The CLARANS algorithm
The R implementation
Unsupervised image categorization and affinity propagation clustering
Affinity propagation clustering
The R implementation
Unsupervised image categorization
The spectral clustering algorithm
The R implementation
News categorization and hierarchical clustering
Agglomerative hierarchical clustering
The BIRCH algorithm
The chameleon algorithm
The Bayesian hierarchical clustering algorithm
The probabilistic hierarchical clustering algorithm
The R implementation
News categorization
6. Advanced Cluster Analysis
Customer categorization analysis of e-commerce and DBSCAN
The DBSCAN algorithm
Customer categorization analysis of e-commerce
Clustering web pages and OPTICS
The OPTICS algorithm
The R implementation
Clustering web pages
Visitor analysis in the browser cache and DENCLUE
The DENCLUE algorithm
The R implementation
Visitor analysis in the browser cache
Recommendation system and STING
The STING algorithm
The R implementation
Recommendation systems
Web sentiment analysis and CLIQUE
The CLIQUE algorithm
The R implementation
Web sentiment analysis
Opinion mining and WAVE clustering
The WAVE cluster algorithm
The R implementation
Opinion mining
User search intent and the EM algorithm
The EM algorithm
The R implementation
The user search intent
Customer purchase data analysis and clustering high-dimensional data
The MAFIA algorithm
The SURFING algorithm
The R implementation
Customer purchase data analysis
SNS and clustering graph and network data
The SCAN algorithm
The R implementation
Social networking service (SNS)
7. Outlier Detection
Credit card fraud detection and statistical methods
The likelihood-based outlier detection algorithm
The R implementation
Credit card fraud detection
Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods
The NL algorithm
The FindAllOutsM algorithm
The FindAllOutsD algorithm
The distance-based algorithm
The Dolphin algorithm
The R implementation
Activity monitoring and the detection of mobile fraud
Intrusion detection and density-based methods
The OPTICS-OF algorithm
The High Contrast Subspace algorithm
The R implementation
Intrusion detection
Intrusion detection and clustering-based methods
Hierarchical clustering to detect outliers
The k-means-based algorithm
The ODIN algorithm
The R implementation
Monitoring the performance of the web server and classification-based methods
The OCSVM algorithm
The one-class nearest neighbor algorithm
The R implementation
Monitoring the performance of the web server
Detecting novelty in text, topic detection, and mining contextual outliers
The conditional anomaly detection (CAD) algorithm
The R implementation
Detecting novelty in text and topic detection
Collective outliers on spatial data
The route outlier detection (ROD) algorithm
The R implementation
Characteristics of collective outliers
Outlier detection in high-dimensional data
The brute-force algorithm
The HilOut algorithm
The R implementation
8. Mining Stream, Time-series, and Sequence Data
The credit card transaction flow and STREAM algorithm
The STREAM algorithm
The single-pass-any-time clustering algorithm
The R implementation
The credit card transaction flow
Predicting future prices and time-series analysis
The ARIMA algorithm
Predicting future prices
Stock market data and time-series clustering and classification
The hError algorithm
Time-series classification with the 1NN classifier
The R implementation
Stock market data
Web click streams and mining symbolic sequences
The TECNO-STREAMS algorithm
The R implementation
Web click streams
Mining sequence patterns in transactional databases
The PrefixSpan algorithm
The R implementation
9. Graph Mining and Network Analysis
Graph mining
Graph
Graph mining algorithms
Mining frequent subgraph patterns
The gPLS algorithm
The GraphSig algorithm
The gSpan algorithm
Rightmost path extensions and their supports
The subgraph isomorphism enumeration algorithm
The canonical checking algorithm
The R implementation
Social network mining
Community detection and the shingling algorithm
The node classification and iterative classification algorithms
The R implementation
10. Mining Text and Web Data
Text mining and TM packages
Text summarization
Topic representation
The multidocument summarization algorithm
The Maximal Marginal Relevance algorithm
The R implementation
The question answering system
Genre categorization of web pages
Categorizing newspaper articles and newswires into topics
The N-gram-based text categorization
The R implementation
Web usage mining with web logs
The FCA-based association rule mining algorithm
The R implementation
IV. Module 4: Mastering R for Quantitative Finance
1. Time Series Analysis
Multivariate time series analysis
Cointegration
Vector autoregressive models
VAR implementation example
Cointegrated VAR and VECM
Volatility modeling
GARCH modeling with the rugarch package
The standard GARCH model
The Exponential GARCH model (EGARCH)
The Threshold GARCH model (TGARCH)
Simulation and forecasting
References and reading list
2. Factor Models
Arbitrage pricing theory
Implementation of APT
Fama-French three-factor model
Modeling in R
Data selection
Estimation of APT with principal component analysis
Estimation of the Fama-French model
References
3. Forecasting Volume
Motivation
The intensity of trading
The volume forecasting model
Implementation in R
The data
Loading the data
The seasonal component
AR(1) estimation and forecasting
SETAR estimation and forecasting
Interpreting the results
References
4. Big Data – Advanced Analytics
Getting data from open sources
Introduction to big data analysis in R
K-means clustering on big data
Loading big matrices
Big data K-means clustering analysis
Big data linear regression analysis
Loading big data
Fitting a linear regression model on large datasets
References
5. FX Derivatives
Terminology and notations
Currency options
Exchange options
Two-dimensional Wiener processes
The Margrabe formula
Application in R
Quanto options
Pricing formula for a call quanto
Pricing a call quanto in R
References
6. Interest Rate Derivatives and Models
The Black model
Pricing a cap with Black's model
The Vasicek model
The Cox-Ingersoll-Ross model
Parameter estimation of interest rate models
Using the SMFI5 package
References
7. Exotic Options
A general pricing approach
The role of dynamic hedging
How R can help a lot
A glance beyond vanillas
Greeks – the link back to the vanilla world
Pricing the Double-no-touch option
Another way to price the Double-no-touch option
The life of a Double-no-touch option – a simulation
Exotic options embedded in structured products
References
8. Optimal Hedging
Hedging of derivatives
Market risk of derivatives
Static delta hedge
Dynamic delta hedge
Comparing the performance of delta hedging
Hedging in the presence of transaction costs
Optimization of the hedge
Optimal hedging in the case of absolute transaction costs
Optimal hedging in the case of relative transaction costs
Further extensions
References
9. Fundamental Analysis
The basics of fundamental analysis
Collecting data
Revealing connections
Including multiple variables
Separating investment targets
Setting classification rules
Backtesting
Industry-specific investment
References
10. Technical Analysis, Neural Networks, and Logoptimal Portfolios
Market efficiency
Technical analysis
The TA toolkit
Markets
Plotting charts - bitcoin
Built-in indicators
SMA and EMA
RSI
MACD
Candle patterns: key reversal
Evaluating the signals and managing the position
A word on money management
Wrapping up
Neural networks
Forecasting bitcoin prices
Evaluation of the strategy
Logoptimal portfolios
A universally consistent, non-parametric investment strategy
Evaluation of the strategy
References
11. Asset and Liability Management
Data preparation
Data source at first glance
Cash-flow generator functions
Preparing the cash-flow
Interest rate risk measurement
Liquidity risk measurement
Modeling non-maturity deposits
A Model of deposit interest rate development
Static replication of non-maturity deposits
References
12. Capital Adequacy
Principles of the Basel Accords
Basel I
Basel II
Minimum capital requirements
Supervisory review
Transparency
Basel III
Risk measures
Analytical VaR
Historical VaR
Monte-Carlo simulation
Risk categories
Market risk
Credit risk
Operational risk
References
13. Systemic Risks
Systemic risk in a nutshell
The dataset used in our examples
Core-periphery decomposition
Implementation in R
Results
The simulation method
The simulation
Implementation in R
Results
Possible interpretations and suggestions
References
V. Module 5: Machine Learning with R
1. Introducing Machine Learning
The origins of machine learning
Uses and abuses of machine learning
Machine learning successes
The limits of machine learning
Machine learning ethics
How machines learn
Data storage
Abstraction
Generalization
Evaluation
Machine learning in practice
Types of input data
Types of machine learning algorithms
Matching input data to algorithms
Machine learning with R
Installing R packages
Loading and unloading R packages
2. Managing and Understanding Data
R data structures
Vectors
Factors
Lists
Data frames
Matrixes and arrays
Managing data with R
Saving, loading, and removing R data structures
Importing and saving data from CSV files
Exploring and understanding data
Exploring the structure of data
Exploring numeric variables
Measuring the central tendency – mean and median
Measuring spread – quartiles and the five-number summary
Visualizing numeric variables – boxplots
Visualizing numeric variables – histograms
Understanding numeric data – uniform and normal distributions
Measuring spread – variance and standard deviation
Exploring categorical variables
Measuring the central tendency – the mode
Exploring relationships between variables
Visualizing relationships – scatterplots
Examining relationships – two-way cross-tabulations
3. Lazy Learning – Classification Using Nearest Neighbors
Understanding nearest neighbor classification
The k-NN algorithm
Measuring similarity with distance
Choosing an appropriate k
Preparing data for use with k-NN
Why is the k-NN algorithm lazy?
Example – diagnosing breast cancer with the k-NN algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Transformation – normalizing numeric data
Data preparation – creating training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Transformation – z-score standardization
Testing alternative values of k
4. Probabilistic Learning – Classification Using Naive Bayes
Understanding Naive Bayes
Basic concepts of Bayesian methods
Understanding probability
Understanding joint probability
Computing conditional probability with Bayes' theorem
The Naive Bayes algorithm
Classification with Naive Bayes
The Laplace estimator
Using numeric features with Naive Bayes
Example – filtering mobile phone spam with the Naive Bayes algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – cleaning and standardizing text data
Data preparation – splitting text documents into words
Data preparation – creating training and test datasets
Visualizing text data – word clouds
Data preparation – creating indicator features for frequent words
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
5. Divide and Conquer – Classification Using Decision Trees and Rules
Understanding decision trees
Divide and conquer
The C5.0 decision tree algorithm
Choosing the best split
Pruning the decision tree
Example – identifying risky bank loans using C5.0 decision trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating random training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Boosting the accuracy of decision trees
Making some mistakes more costly than others
Understanding classification rules
Separate and conquer
The 1R algorithm
The RIPPER algorithm
Rules from decision trees
What makes trees and rules greedy?
Example – identifying poisonous mushrooms with rule learners
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
6. Forecasting Numeric Data – Regression Methods
Understanding regression
Simple linear regression
Ordinary least squares estimation
Correlations
Multiple linear regression
Example – predicting medical expenses using linear regression
Step 1 – collecting data
Step 2 – exploring and preparing the data
Exploring relationships among features – the correlation matrix
Visualizing relationships among features – the scatterplot matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Model specification – adding non-linear relationships
Transformation – converting a numeric variable to a binary indicator
Model specification – adding interaction effects
Putting it all together – an improved regression model
Understanding regression trees and model trees
Adding regression to trees
Example – estimating the quality of wines with regression trees and model trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Visualizing decision trees
Step 4 – evaluating model performance
Measuring performance with the mean absolute error
Step 5 – improving model performance
7. Black Box Methods – Neural Networks and Support Vector Machines
Understanding neural networks
From biological to artificial neurons
Activation functions
Network topology
The number of layers
The direction of information travel
The number of nodes in each layer
Training neural networks with backpropagation
Example – Modeling the strength of concrete with ANNs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Understanding Support Vector Machines
Classification with hyperplanes
The case of linearly separable data
The case of nonlinearly separable data
Using kernels for non-linear spaces
Example – performing OCR with SVMs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
8. Finding Patterns – Market Basket Analysis Using Association Rules
Understanding association rules
The Apriori algorithm for association rule learning
Measuring rule interest – support and confidence
Building a set of rules with the Apriori principle
Example – identifying frequently purchased groceries with association rules
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating a sparse matrix for transaction data
Visualizing item support – item frequency plots
Visualizing the transaction data – plotting the sparse matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Sorting the set of association rules
Taking subsets of association rules
Saving association rules to a file or data frame
9. Finding Groups of Data – Clustering with k-means
Understanding clustering
Clustering as a machine learning task
The k-means clustering algorithm
Using distance to assign and update clusters
Choosing the appropriate number of clusters
Example – finding teen market segments using k-means clustering
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – dummy coding missing values
Data preparation – imputing the missing values
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
10. Evaluating Model Performance
Measuring performance for classification
Working with classification prediction data in R
A closer look at confusion matrices
Using confusion matrices to measure performance
Beyond accuracy – other measures of performance
The kappa statistic
Sensitivity and specificity
Precision and recall
The F-measure
Visualizing performance trade-offs
ROC curves
Estimating future performance
The holdout method
Cross-validation
Bootstrap sampling
11. Improving Model Performance
Tuning stock models for better performance
Using caret for automated parameter tuning
Creating a simple tuned model
Customizing the tuning process
Improving model performance with meta-learning
Understanding ensembles
Bagging
Boosting
Random forests
Training random forests
Evaluating random forest performance
12. Specialized Machine Learning Topics
Working with proprietary files and databases
Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
Querying data in SQL databases
Working with online data and services
Downloading the complete text of web pages
Scraping data from web pages
Parsing XML documents
Parsing JSON from web APIs
Working with domain-specific data
Analyzing bioinformatics data
Analyzing and visualizing network data
Improving the performance of R
Managing very large datasets
Generalizing tabular data structures with dplyr
Making data frames faster with data.table
Creating disk-based data frames with ff
Using massive matrices with bigmemory
Learning faster with parallel computing
Measuring execution time
Working in parallel with multicore and snow
Taking advantage of parallel with foreach and doParallel
Parallel cloud computing with MapReduce and Hadoop
GPU computing
Deploying optimized learning algorithms
Building bigger regression models with biglm
Growing bigger and faster random forests with bigrf
Training and evaluating models in parallel with caret
A. Reflect and Test Yourself Answers
Module 1: Data Analysis with R
Chapter 1: RefresheR
Chapter 2: The Shape of Data
Chapter 3: Describing Relationships
Chapter 4: Probability
Chapter 5: Using Data to Reason About the World
Chapter 6: Testing Hypotheses
Chapter 7: Bayesian Methods
Chapter 8: Predicting Continuous Variables
Chapter 9: Predicting Categorical Variables
Chapter 10: Sources of Data
Chapter 11: Dealing with Messy Data
Chapter 12: Dealing with Large Data
Module 2: R Graphs
Chapter 1: R Graphics
Chapter 2: Basic Graph Functions
Chapter 3: Beyond the Basics – Adjusting Key Parameters
Chapter 4: Creating Scatter Plots
Chapter 5: Creating Line Graphs and Time Series Charts
Chapter 6: Creating Bar, Dot, and Pie Charts
Chapter 7: Creating Histograms
Chapter 8: Box and Whisker Plots
Chapter 9: Creating Heat Maps and Contour Plots
Module 4: Mastering R for Quantitative Finance
Chapter 1: Time Series Analysis
Chapter 3: Forecasting Volume
Chapter 4: Big Data – Advanced Analytics
Chapter 5: FX Derivatives
Chapter 6: Interest Rate Derivatives and Models
Chapter 7: Exotic Options
Chapter 8: Optimal Hedging
Chapter 9: Fundamental Analysis
Module 5: Machine Learning with R
Chapter 1: Introducing Machine Learning
Chapter 2: Managing and Understanding Data
Chapter 3: Lazy Learning – Classification Using Nearest Neighbors
Chapter 4: Probabilistic Learning – Classification Using Naive Bayes
Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules
Chapter 6: Forecasting Numeric Data – Regression Methods
Chapter 7: Black Box Methods – Neural Networks and Support Vector Machines
Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules
B. Bibliography
Index

Content preview from R: Data Analysis and Visualization

Chapter 11. Dealing with Messy Data

As mentioned in the last chapter, analyzing data in the real world often requires some know-how outside of the typical introductory data analysis curriculum. For example, rarely do we get a neatly formatted, tidy dataset with no errors, junk, or missing values. Rather, we often get messy, unwieldy datasets.

What makes a dataset messy? Different people in different roles have different ideas about what constitutes messiness. Some regard any data that violates the assumptions of a parametric model as messy. Others see messiness in datasets with a grievously imbalanced number of observations across the categories of a categorical variable. Some examples of things that I would consider messy are:

Many missing values ...
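
As a rough illustration of two kinds of messiness mentioned above (missing values and imbalanced categories), here is a minimal base R sketch. The `survey` data frame, its column names, and its values are all invented for this example; they are not from the book.

```r
# Hypothetical toy data; every name and value here is made up for illustration.
survey <- data.frame(
  age    = c(23, NA, 41, 35, NA, 52, 29, 60),
  income = c(48000, 52000, NA, 61000, 45000, NA, NA, 75000),
  group  = factor(c("a", "a", "a", "a", "a", "a", "a", "b"))
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(survey))

# Proportion of rows that contain at least one missing value
incomplete_rows <- mean(!complete.cases(survey))

# Share of observations falling in each category of the factor
group_shares <- prop.table(table(survey$group))

print(missing_per_column)
print(incomplete_rows)
print(group_shares)
```

Checks like these only flag potential problems; what counts as too many missing values or too imbalanced a variable depends on the analysis at hand.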


Publisher Resources

ISBN: 9781786463500
