R Data Mining

Book Description

Mine valuable insights from your data using popular tools and techniques in R

About This Book

  • Understand the basics of data mining and why R is a perfect tool for it.
  • Manipulate your data using popular R packages such as ggplot2 and dplyr to gather valuable business insights.
  • Apply effective data mining models to perform regression and classification tasks.

Who This Book Is For

If you are a budding data scientist, or a data analyst with a basic knowledge of R, and want to get into the intricacies of data mining in a practical manner, this is the book for you. No previous experience of data mining is required.

What You Will Learn

  • Master relevant data mining packages such as dplyr and ggplot2
  • Learn how to effectively organize a data mining project through the CRISP-DM methodology
  • Implement data cleaning and validation tasks to get your data ready for data mining activities
  • Perform exploratory data analysis both numerically and graphically
  • Develop simple and multiple regression models along with logistic regression
  • Apply basic ensemble learning techniques to combine results from different data mining models
  • Perform text mining analysis on unstructured PDF files and textual data
  • Produce reports to effectively communicate objectives, methods, and insights of your analyses

In Detail

R is widely used to leverage data mining techniques across many different industries, including finance, medicine, scientific research, and more. This book will empower you to produce and present impressive analyses from data, by selecting and implementing the appropriate data mining techniques in R.

You will gain these skills while immersed in a one-of-a-kind data mining crime case, in which you are asked to help resolve a real fraud case affecting a commercial company by means of both basic and advanced data mining techniques.

As you follow the plot of the story, you will learn and practice, on real data, the various R packages commonly employed for this kind of task. You will also get the chance to apply some of the most popular and effective data mining models and algorithms, from basic multiple linear regression to the more advanced support vector machines. Unlike other data mining learning resources, this book also explains the theory behind these models, their relevant assumptions, and when they can be applied to the data you are facing. By the end of the book you will hold a new and powerful toolbox of instruments, knowing exactly when and how to employ each of them to solve your data mining problems and get the most out of your data.

Finally, to help you get the most out of the concepts described and the learning process, the book comes packed with a reproducible bundle of commented R scripts and a practical set of data mining model cheat sheets.

Style and approach

This book takes a practical, step-by-step approach to explaining the concepts of data mining. Practical use cases involving real-world datasets are used throughout the book to clearly explain theoretical concepts.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Downloading the color images of this book
      3. Errata
      4. Piracy
      5. Questions
  2. Why to Choose R for Your Data Mining and Where to Start
    1. What is R?
    2. A bit of history
    3. R's points of strength
      1. Open source inside
      2. Plugin ready
      3. Data visualization friendly
    4. Installing R and writing R code
      1. Downloading R
        1. R installation for Windows and macOS
        2. R installation for Linux OS
      2. Main components of a base R installation
    5. Possible alternatives to write and run R code
      1. RStudio (all OSs)
      2. The Jupyter Notebook (all OSs)
      3. Visual Studio (Windows users only)
    6. R foundational notions
      1. A preliminary R session
        1. Executing R interactively through the R console
        2. Creating an R script
        3. Executing an R script
      2. Vectors
      3. Lists
        1. Creating lists
        2. Subsetting lists
      4. Data frames
      5. Functions
    7. R's weaknesses and how to overcome them
      1. Learning R effectively and minimizing the effort
        1. The tidyverse
        2. Leveraging the R community to learn R
          1. Where to find the R community
          2. Engaging with the community to learn R
      2. Handling large datasets with R
    8. Further references
    9. Summary
  3. A First Primer on Data Mining - Analysing Your Bank Account Data
    1. Acquiring and preparing your banking data
      1. Data model
    2. Summarizing your data with pivot-like tables
      1. A gentle introduction to the pipe operator
      2. An even more gentle introduction to the dplyr package
      3. Installing the necessary packages and loading your data into R
        1. Installing and loading the necessary packages
        2. Importing your data into R
      4. Defining the monthly and daily sum of expenses
    3. Visualizing your data with ggplot2
      1. Basic data visualization principles
        1. Less but better 
        2. Not every chart is good for your message
          1. Scatter plot
          2. Line chart
          3. Bar plot
          4. Other advanced charts
        3. Colors have to be chosen carefully
          1. A bit of theory - chromatic circle, hue, and luminosity
      2. Visualizing your data with ggplot
        1. One more gentle introduction – the grammar of graphics
        2. A layered grammar of graphics – ggplot2
        3. Visualizing your banking movements with ggplot2
          1. Visualizing the number of movements per day of the week
    4. Further references
    5. Summary
  4. The Data Mining Process - CRISP-DM Methodology
    1. The CRISP-DM methodology data mining cycle
    2. Business understanding
    3. Data understanding
      1. Data collection
        1. How to perform data collection with R
          1. Data import from TXT and CSV files
          2. Data import from different types of format already structured as tables
          3. Data import from unstructured sources
      2. Data description
        1. How to perform data description with R
      3. Data exploration
        1. What to use in R to perform this task
          1. The summary() function
          2. Box plot
          3. Histograms
    4. Data preparation
    5. Modelling
      1. Defining a data modeling strategy
        1. How similar problems were solved in the past
          1. Emerging techniques
        2. Classification of modeling problems 
        3. How to perform data modeling with R
    6. Evaluation
      1. Clustering evaluation
      2. Classification evaluation
      3. Regression evaluation
      4. How to judge the adequacy of a model's performance
        1. What to use in R to perform this task
    7. Deployment
      1. Deployment plan development
      2. Maintenance plan development
    8. Summary
  5. Keeping the House Clean – The Data Mining Architecture
    1. A general overview
    2. Data sources
      1. Types of data sources
        1. Unstructured data sources
        2. Structured data sources
        3. Key issues of data sources
    3. Databases and data warehouses
      1. The third wheel – the data mart
      2. One-level database
      3. Two-level database
      4. Three-level database
      5. Technologies
        1. SQL
        2. MongoDB
        3. Hadoop
    4. The data mining engine
      1. The interpreter
      2. The interface between the engine and the data warehouse
      3. The data mining algorithms
    5. User interface
      1. Clarity
        1. Clarity and mystery
        2. Clarity and simplicity
        3. Efficiency
        4. Consistency
          1. Syntax highlight
          2. Auto-completion  
    6. How to build a data mining architecture in R
      1. Data sources
      2. The data warehouse
      3. The data mining engine
        1. The interface between the engine and the data warehouse
        2. The data mining algorithms
      4. The user interface
    7. Further references
    8. Summary
  6. How to Address a Data Mining Problem – Data Cleaning and Validation
    1. On a quiet day
    2. Data cleaning
      1. Tidy data
      2. Analysing the structure of our data
        1. The str function
        2. The describe function
        3. head, tail, and View functions
        4. Evaluating your data tidiness
          1. Every row is a record
          2. Every column shows an attribute
          3. Every table represents an observational unit
      3. Tidying our data
        1. The tidyr package
          1. Long versus wide data
          2. The spread function
          3. The gather function
          4. The separate function
        2. Applying tidyr to our dataset
      4. Validating our data
        1. Fitness for use
        2. Conformance to standards
        3. Data quality controls
          1. Consistency checks
          2. Data type checks
          3. Logical checks
          4. Domain checks
          5. Uniqueness checks
        4. Performing data validation on our data
          1. Data type checks with str()
          2. Domain checks
      5. The final touch – data merging
        1. left_join function
        2. Moving beyond left_join
    3. Further references
    4. Summary
  7. Looking into Your Data Eyes – Exploratory Data Analysis
    1. Introducing summary EDA
      1. Describing the population distribution
        1. Quartiles and Median
        2. Mean
          1. The mean and phenomenon going on within sub populations
          2. The mean being biased by outlier values
          3. Computing the mean of our population
        3. Variance
        4. Standard deviation
        5. Skewness
      2. Measuring the relationship between variables
        1. Correlation
          1. The Pearson correlation coefficient
          2. Distance correlation
        2. Weaknesses of summary EDA - the Anscombe quartet
    2. Graphical EDA
      1. Visualizing a variable distribution
        1. Histogram
          1. Reporting date histogram
          2. Geographical area histogram
          3. Cash flow histogram
        2. Boxplot
        3. Checking for outliers
      2. Visualizing relationships between variables
        1. Scatterplots
          1. Adding title, subtitle, and caption to the plot
          2. Setting axis and legend
          3. Adding explicative text to the plot
          4. Final touches on colors
    3. Further references
    4. Summary
  8. Our First Guess – a Linear Regression
    1. Defining a data modelling strategy
      1. Data modelling notions
        1. Supervised learning
        2. Unsupervised learning
        3. The modeling strategy
    2. Applying linear regression to our data
      1. The intuition behind linear regression
      2. The math behind the linear regression
        1. Ordinary least squares technique
        2. Model requirements – what to look for before applying the model
          1. Residuals' uncorrelation
          2. Residuals' homoscedasticity
      3. How to apply linear regression in R
        1. Fitting the linear regression model
        2. Validating model assumption
        3. Visualizing fitted values
          1. Preparing the data for visualization
          2. Developing the data visualization
    3. Further references
    4. Summary
  9. A Gentle Introduction to Model Performance Evaluation
    1. Defining model performance
      1. Fitting versus interpretability
      2. Making predictions with models
    2. Measuring performance in regression models
      1. Mean squared error
      2. R-squared
        1. R-squared meaning and interpretation
        2. R-squared computation in R
        3. Adjusted R-squared
        4. R-squared misconceptions
          1. The R-squared doesn't measure the goodness of fit
          2. A low R-squared doesn't mean your model is not statistically significant
    3. Measuring the performance in classification problems
      1. The confusion matrix
        1. Confusion matrix in R
      2. Accuracy
        1. How to compute accuracy in R
      3. Sensitivity
        1. How to compute sensitivity in R
      4. Specificity
        1. How to compute specificity in R
      5. How to choose the right performance statistics
    4. A final general warning – training versus test datasets
    5. Further references
    6. Summary
  10. Don't Give up – Power up Your Regression Including Multiple Variables
    1. Moving from simple to multiple linear regression
      1. Notation
      2. Assumptions
        1. Variables' collinearity
          1. Tolerance 
          2. Variance inflation factors
          3. Addressing collinearity
    2. Dimensionality reduction
      1. Stepwise regression
        1. Backward stepwise regression
          1. From the full model to the n-1 model
        2. Forward stepwise regression
        3. Double direction stepwise regression
      2. Principal component regression
    3. Fitting a multiple linear model with R
      1. Model fitting
      2. Variable assumptions validation
      3. Residual assumptions validation
      4. Dimensionality reduction
        1. Principal component regression
        2. Stepwise regression
          1. Linear model cheat sheet
    4. Further references
    5. Summary
  11. A Different Outlook to Problems with Classification Models
    1. What is classification and why do we need it?
      1. Linear regression limitations for categorical variables
      2. Common classification algorithms and models
    2. Logistic regression
      1. The intuition behind logistic regression
        1. The logistic function estimates a response variable enclosed within an upper and lower bound
        2. The logistic function estimates the probability of an observation pertaining to one of the two available categories
      2. The math behind logistic regression
        1. Maximum likelihood estimator
        2. Model assumptions
          1. Absence of multicollinearity between variables
          2. Linear relationship between explanatory variables and log odds
          3. Large enough sample size
      3. How to apply logistic regression in R
        1. Fitting the model
          1. Reading the glm() estimation output
          2. The level of statistical significance of the association between the explanatory variable and the response variable
          3. The AIC performance metric
        2. Validating model assumptions
          1. Fitting quadratic and cubic models to test for linearity of log odds
      4. Visualizing and interpreting logistic regression results 
        1. Visualizing results
        2. Interpreting results
          1. Logistic regression cheat sheet
    3. Support vector machines
      1. The intuition behind support vector machines
        1. The hyperplane
        2. Maximal margin classifier
        3. Support vector and support vector machines
        4. Model assumptions
        5. Independent and identically distributed random variables
          1. Independent variables
          2. Identically distributed
      2. Applying support vector machines in R
        1. The svm() function
        2. Applying the svm function to our data
      3. Interpreting support vector machine results
        1. Understanding the meaning of hyperplane weights
          1. Support Vector Machine cheat sheet
    4. References
    5. Summary
  12. The Final Clash – Random Forests and Ensemble Learning
    1. Random forest
      1. Random forest building blocks – decision trees introduction
      2. The intuition behind random forests
      3. How to apply random forests in R
      4. Evaluating the results of the model
        1. Performance of the model
          1. OOB estimate error rate
          2. Confusion matrix
        2. Importance of predictors
          1. Mean decrease in accuracy
          2. Gini index
          3. Plotting relative importance of predictors
          4. Random forest cheat sheet
    2. Ensemble learning
      1. Basic ensemble learning techniques 
      2. Applying ensemble learning to our data in R
        1. The R caret package
        2. Computing a confusion matrix with the caret package
        3. Interpreting confusion matrix results
        4. Applying a weighted majority vote to our data
    3. Applying estimated models on new data
      1. predict.glm() for prediction from the logistic model
      2. predict.randomForest() for prediction from random forests
      3. predict.svm() for prediction from support vector machines
    4. A more structured approach to predictive analytics
    5. Applying the majority vote ensemble technique on predicted data
    6. Further references
    7. Summary
  13. Looking for the Culprit – Text Data Mining with R
    1. Extracting data from a PDF file in R
      1. Getting a list of documents in a folder
      2. Reading PDF files into R via pdf_text()
      3. Iteratively extracting text from a set of documents with a for loop
    2. Sentiment analysis
    3. Developing wordclouds from text
    4. Looking for context in text – analyzing document n-grams
    5. Performing network analysis on textual data
      1. Obtaining an edge list from a data frame
      2. Visualizing a network with the ggraph package
        1. Tuning the appearance of nodes and edges
        2. Computing the degree parameter in a network to highlight relevant nodes
    6. Further references
    7. Summary
  14. Sharing Your Stories with Your Stakeholders through R Markdown
    1. Principles of a good data mining report
      1. Clearly state the objectives
      2. Clearly state assumptions 
      3. Make the data treatments clear
      4. Show consistent data
      5. Provide data lineage
    2. Set up an rmarkdown report
    3. Develop an R markdown report in RStudio
      1. A brief introduction to markdown 
      2. Inserting a chunk of code
        1. How to show readable tables in rmarkdown reports
      3. Reproducing R code output within text through inline code
      4. Introduction to Shiny and the reactivity framework
        1. Employing input and output to deal with changes in Shiny app parameters
      5. Adding an interactive data lineage module
        1. Adding an input panel to an R markdown report
        2. Adding a data table to your report
        3. Expanding Shiny beyond the basics
    4. Rendering and sharing an R markdown report 
      1. Rendering an R markdown report
      2. Sharing an R Markdown report
        1. Render a static markdown report into different file formats
        2. Render interactive Shiny apps on dedicated servers
          1. Sharing a Shiny app through shinyapps.io
    5. Further references
    6. Summary
  15. Epilogue
  16. Dealing with Dates, Relative Paths and Functions
    1. Dealing with dates in R
    2. Working directories and relative paths in R
    3. Conditional statements