R: Mining Spatial, Text, Web, and Social Media Data

Book Description

Create data mining algorithms

About This Book
  • Develop a strong strategy to solve predictive modeling problems using the most popular data mining algorithms
  • Work through real-world case studies that take you from novice to intermediate in applying data mining techniques
  • Apply cutting-edge sentiment analysis techniques to real-world social media data using R
Who This Book Is For

This Learning Path is for R developers who are looking to make a career in data analysis or data mining. Those who come across data mining problems of varying complexity in the web, text, numerical, political, and social media domains will find all the information they need in this single Learning Path.

What You Will Learn
  • Discover how to manipulate data in R
  • Get to know top classification algorithms written in R
  • Explore solutions written in R based on R Hadoop projects
  • Apply data management skills in handling large data sets
  • Acquire knowledge about neural network concepts and their applications in data mining
  • Create predictive models for classification, prediction, and recommendation
  • Use various R libraries from CRAN for data mining
  • Discover more about the potential of data, its pitfalls, and inferential gotchas
  • Gain an insight into the concepts of supervised and unsupervised learning
  • Delve into exploratory data analysis
  • Understand the minute details of sentiment analysis
In Detail

Data mining is the first step toward understanding data and making sense of large volumes of it. Properly mined data forms the basis of all data analysis and computing performed on it. This Learning Path takes you from the very basics of data mining to advanced data mining techniques, and ends with a specialized branch of data mining: social media mining.

You will learn how to manipulate data with R using code snippets and how to mine frequent patterns, associations, and correlations while working with R programs. You will discover how to write code for various prediction models, stream data, and time-series data. You will also be introduced to solutions written in R based on R Hadoop projects.

Once you are comfortable with data mining in R, you will put your knowledge into practice through end-to-end data mining projects. You will learn how to apply different mining concepts to various statistical and data applications in a wide range of fields. By the end of this stage, you will be able to complete complex data mining cases and handle the issues you might encounter during projects.

After this, you will gain hands-on experience in generating insights from social media data. You will get detailed instructions on how to obtain, process, and analyze a variety of socially generated data, along with the theoretical background needed to interpret your findings accurately. You will be shown R code and examples of data that can serve as a springboard for your own analyses of business, social, or political data.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Learning Data Mining with R by Bater Makhabel
  • R Data Mining Blueprints by Pradeepta Mishra
  • Social Media Mining with R by Nathan Danneman and Richard Heimann
Style and approach

A complete package that takes you from the basics of data mining to advanced data mining techniques, and ends with a specialized branch of data mining: social media mining.

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code file.

Table of Contents

  1. R: Mining Spatial, Text, Web, and Social Media Data
    1. Table of Contents
    2. R: Mining Spatial, Text, Web, and Social Media Data
    3. R: Mining Spatial, Text, Web, and Social Media Data
    4. Credits
    5. Preface
      1. What this learning path covers
      2. What you need for this learning path
      3. Who this learning path is for
      4. Reader feedback
      5. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    6. 1. Module 1
      1. 1. Warming Up
        1. Big data
          1. Scalability and efficiency
        2. Data source
        3. Data mining
          1. Feature extraction
          2. Summarization
          3. The data mining process
            1. CRISP-DM
            2. SEMMA
        4. Social network mining
          1. Social network
        5. Text mining
          1. Information retrieval and text mining
          2. Mining text for prediction
        6. Web data mining
        7. Why R?
          1. What are the disadvantages of R?
        8. Statistics
          1. Statistics and data mining
          2. Statistics and machine learning
          3. Statistics and R
          4. The limitations of statistics on data mining
        9. Machine learning
          1. Approaches to machine learning
          2. Machine learning architecture
        10. Data attributes and description
          1. Numeric attributes
          2. Categorical attributes
          3. Data description
          4. Data measuring
        11. Data cleaning
          1. Missing values
          2. Junk, noisy data, or outlier
        12. Data integration
        13. Data dimension reduction
          1. Eigenvalues and Eigenvectors
          2. Principal-Component Analysis
          3. Singular-value decomposition
          4. CUR decomposition
        14. Data transformation and discretization
          1. Data transformation
          2. Normalization data transformation methods
          3. Data discretization
        15. Visualization of results
          1. Visualization with R
        16. Time for action
        17. Summary
      2. 2. Mining Frequent Patterns, Associations, and Correlations
        1. An overview of associations and patterns
          1. Patterns and pattern discovery
            1. The frequent itemset
            2. The frequent subsequence
            3. The frequent substructures
          2. Relationship or rules discovery
            1. Association rules
            2. Correlation rules
        2. Market basket analysis
          1. The market basket model
          2. A-Priori algorithms
            1. Input data characteristics and data structure
            2. The A-Priori algorithm
            3. The R implementation
            4. A-Priori algorithm variants
          3. The Eclat algorithm
            1. The R implementation
          4. The FP-growth algorithm
            1. Input data characteristics and data structure
            2. The FP-growth algorithm
            3. The R implementation
          5. The GenMax algorithm with maximal frequent itemsets
            1. The R implementation
          6. The Charm algorithm with closed frequent itemsets
            1. The R implementation
          7. The algorithm to generate association rules
            1. The R implementation
        3. Hybrid association rules mining
          1. Mining multilevel and multidimensional association rules
          2. Constraint-based frequent pattern mining
        4. Mining sequence dataset
          1. Sequence dataset
          2. The GSP algorithm
        5. The R implementation
          1. The SPADE algorithm
            1. The R implementation
          2. Rule generation from sequential patterns
        6. High-performance algorithms
        7. Time for action
        8. Summary
      3. 3. Classification
        1. Classification
        2. Generic decision tree induction
          1. Attribute selection measures
          2. Tree pruning
          3. General algorithm for the decision tree generation
          4. The R implementation
        3. High-value credit card customers classification using ID3
          1. The ID3 algorithm
          2. The R implementation
          3. Web attack detection
          4. High-value credit card customers classification
        4. Web spam detection using C4.5
          1. The C4.5 algorithm
          2. The R implementation
          3. A parallel version with MapReduce
          4. Web spam detection
        5. Web key resource page judgment using CART
          1. The CART algorithm
          2. The R implementation
          3. Web key resource page judgment
        6. Trojan traffic identification method and Bayes classification
          1. Estimating
            1. Prior probability estimation
            2. Likelihood estimation
          2. The Bayes classification
          3. The R implementation
          4. Trojan traffic identification method
        7. Identify spam e-mail and Naïve Bayes classification
          1. The Naïve Bayes classification
          2. The R implementation
          3. Identify spam e-mail
        8. Rule-based classification of player types in computer games and rule-based classification
          1. Transformation from decision tree to decision rules
          2. Rule-based classification
          3. Sequential covering algorithm
          4. The RIPPER algorithm
            1. The R implementation
          5. Rule-based classification of player types in computer games
        9. Time for action
        10. Summary
      4. 4. Advanced Classification
        1. Ensemble (EM) methods
          1. The bagging algorithm
          2. The boosting and AdaBoost algorithms
          3. The Random forests algorithm
          4. The R implementation
          5. Parallel version with MapReduce
        2. Biological traits and the Bayesian belief network
          1. The Bayesian belief network (BBN) algorithm
          2. The R implementation
          3. Biological traits
        3. Protein classification and the k-Nearest Neighbors algorithm
          1. The kNN algorithm
          2. The R implementation
        4. Document retrieval and Support Vector Machine
          1. The SVM algorithm
          2. The R implementation
          3. Parallel version with MapReduce
          4. Document retrieval
        5. Classification using frequent patterns
          1. The associative classification
            1. CBA
          2. Discriminative frequent pattern-based classification
          3. The R implementation
          4. Text classification using sentential frequent itemsets
        6. Classification using the backpropagation algorithm
          1. The BP algorithm
          2. The R implementation
          3. Parallel version with MapReduce
        7. Time for action
        8. Summary
      5. 5. Cluster Analysis
        1. Search engines and the k-means algorithm
          1. The k-means clustering algorithm
          2. The kernel k-means algorithm
          3. The k-modes algorithm
          4. The R implementation
          5. Parallel version with MapReduce
          6. Search engine and web page clustering
        2. Automatic abstraction of document texts and the k-medoids algorithm
          1. The PAM algorithm
          2. The R implementation
          3. Automatic abstraction and summarization of document text
        3. The CLARA algorithm
          1. The CLARA algorithm
          2. The R implementation
        4. CLARANS
          1. The CLARANS algorithm
          2. The R implementation
        5. Unsupervised image categorization and affinity propagation clustering
          1. Affinity propagation clustering
          2. The R implementation
          3. Unsupervised image categorization
          4. The spectral clustering algorithm
          5. The R implementation
        6. News categorization and hierarchical clustering
          1. Agglomerative hierarchical clustering
          2. The BIRCH algorithm
          3. The chameleon algorithm
          4. The Bayesian hierarchical clustering algorithm
          5. The probabilistic hierarchical clustering algorithm
          6. The R implementation
          7. News categorization
        7. Time for action
        8. Summary
      6. 6. Advanced Cluster Analysis
        1. Customer categorization analysis of e-commerce and DBSCAN
          1. The DBSCAN algorithm
          2. Customer categorization analysis of e-commerce
        2. Clustering web pages and OPTICS
          1. The OPTICS algorithm
          2. The R implementation
          3. Clustering web pages
        3. Visitor analysis in the browser cache and DENCLUE
          1. The DENCLUE algorithm
          2. The R implementation
          3. Visitor analysis in the browser cache
        4. Recommendation system and STING
          1. The STING algorithm
          2. The R implementation
          3. Recommendation systems
        5. Web sentiment analysis and CLIQUE
          1. The CLIQUE algorithm
          2. The R implementation
          3. Web sentiment analysis
        6. Opinion mining and WAVE clustering
          1. The WAVE cluster algorithm
          2. The R implementation
          3. Opinion mining
        7. User search intent and the EM algorithm
          1. The EM algorithm
          2. The R implementation
          3. The user search intent
        8. Customer purchase data analysis and clustering high-dimensional data
          1. The MAFIA algorithm
          2. The SURFING algorithm
          3. The R implementation
          4. Customer purchase data analysis
        9. SNS and clustering graph and network data
          1. The SCAN algorithm
          2. The R implementation
          3. Social networking service (SNS)
        10. Time for action
        11. Summary
      7. 7. Outlier Detection
        1. Credit card fraud detection and statistical methods
          1. The likelihood-based outlier detection algorithm
          2. The R implementation
          3. Credit card fraud detection
        2. Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods
          1. The NL algorithm
          2. The FindAllOutsM algorithm
          3. The FindAllOutsD algorithm
          4. The distance-based algorithm
          5. The Dolphin algorithm
          6. The R implementation
          7. Activity monitoring and the detection of mobile fraud
        3. Intrusion detection and density-based methods
          1. The OPTICS-OF algorithm
          2. The High Contrast Subspace algorithm
          3. The R implementation
          4. Intrusion detection
        4. Intrusion detection and clustering-based methods
          1. Hierarchical clustering to detect outliers
          2. The k-means-based algorithm
          3. The ODIN algorithm
          4. The R implementation
        5. Monitoring the performance of the web server and classification-based methods
          1. The OCSVM algorithm
          2. The one-class nearest neighbor algorithm
          3. The R implementation
          4. Monitoring the performance of the web server
        6. Detecting novelty in text, topic detection, and mining contextual outliers
          1. The conditional anomaly detection (CAD) algorithm
          2. The R implementation
          3. Detecting novelty in text and topic detection
        7. Collective outliers on spatial data
          1. The route outlier detection (ROD) algorithm
          2. The R implementation
          3. Characteristics of collective outliers
        8. Outlier detection in high-dimensional data
          1. The brute-force algorithm
          2. The HilOut algorithm
          3. The R implementation
        9. Time for action
        10. Summary
      8. 8. Mining Stream, Time-series, and Sequence Data
        1. The credit card transaction flow and STREAM algorithm
          1. The STREAM algorithm
          2. The single-pass-any-time clustering algorithm
          3. The R implementation
          4. The credit card transaction flow
        2. Predicting future prices and time-series analysis
          1. The ARIMA algorithm
          2. Predicting future prices
        3. Stock market data and time-series clustering and classification
          1. The hError algorithm
          2. Time-series classification with the 1NN classifier
          3. The R implementation
          4. Stock market data
        4. Web click streams and mining symbolic sequences
          1. The TECNO-STREAMS algorithm
          2. The R implementation
          3. Web click streams
        5. Mining sequence patterns in transactional databases
          1. The PrefixSpan algorithm
          2. The R implementation
        6. Time for action
        7. Summary
      9. 9. Graph Mining and Network Analysis
        1. Graph mining
          1. Graph
          2. Graph mining algorithms
        2. Mining frequent subgraph patterns
          1. The gPLS algorithm
          2. The GraphSig algorithm
          3. The gSpan algorithm
          4. Rightmost path extensions and their supports
          5. The subgraph isomorphism enumeration algorithm
          6. The canonical checking algorithm
          7. The R implementation
        3. Social network mining
          1. Community detection and the shingling algorithm
          2. The node classification and iterative classification algorithms
          3. The R implementation
        4. Time for action
        5. Summary
      10. 10. Mining Text and Web Data
        1. Text mining and TM packages
        2. Text summarization
          1. Topic representation
          2. The multidocument summarization algorithm
          3. The Maximal Marginal Relevance algorithm
          4. The R implementation
        3. The question answering system
        4. Genre categorization of web pages
        5. Categorizing newspaper articles and newswires into topics
          1. The N-gram-based text categorization
          2. The R implementation
        6. Web usage mining with web logs
          1. The FCA-based association rule mining algorithm
          2. The R implementation
        7. Time for action
        8. Summary
      11. A. Algorithms and Data Structures
    7. 2. Module 2
      1. 1. Data Manipulation Using In-built R Data
        1. What is data mining?
          1. How is it related to data science, analytics, and statistical modeling?
        2. Introduction to the R programming language
          1. Getting started with R
          2. Data types, vectors, arrays, and matrices
          3. List management, factors, and sequences
          4. Import and export of data types
        3. Data type conversion
        4. Sorting and merging dataframes
        5. Indexing or subsetting dataframes
        6. Date and time formatting
        7. Creating new functions
          1. User-defined functions
          2. Built-in functions
        8. Loop concepts - the for loop
        9. Loop concepts - the repeat loop
        10. Loop concepts - while conditions
        11. Apply concepts
        12. String manipulation
        13. NA and missing value management
        14. Missing value imputation techniques
        15. Summary
      2. 2. Exploratory Data Analysis with Automobile Data
        1. Univariate data analysis
        2. Bivariate analysis
        3. Multivariate analysis
        4. Understanding distributions and transformation
          1. Normal probability distribution
          2. Binomial probability distribution
          3. Poisson probability distribution
        5. Interpreting distributions
          1. Interpreting continuous data
        6. Variable binning or discretizing continuous data
        7. Contingency tables, bivariate statistics, and checking for data normality
        8. Hypothesis testing
          1. Test of the population mean
            1. One tail test of mean with known variance
            2. One tail and two tail test of proportions
          2. Two sample variance test
        9. Non-parametric methods
          1. Wilcoxon signed-rank test
          2. Mann-Whitney-Wilcoxon test
          3. Kruskal-Wallis test
        10. Summary
      3. 3. Visualize Diamond Dataset
        1. Data visualization using ggplot2
          1. Bar chart
          2. Boxplot
          3. Bubble chart
          4. Donut chart
          5. Geo mapping
          6. Histogram
          7. Line chart
          8. Pie chart
          9. Scatterplot
          10. Stacked bar chart
          11. Stem and leaf plot
          12. Word cloud
          13. Coxcomb plot
        2. Using plotly
          1. Bubble plot
          2. Bar charts using plotly
          3. Scatterplot using plotly
          4. Boxplots using plotly
          5. Polar charts using plotly
          6. Polar scatterplot using plotly
          7. Polar area chart
        3. Creating geo mapping
        4. Summary
      4. 4. Regression with Automobile Data
        1. Regression introduction
          1. Formulation of regression problem
          2. Case study
        2. Linear regression
        3. Stepwise regression method for variable selection
        4. Logistic regression
        5. Cubic regression
        6. Penalized regression
        7. Summary
      5. 5. Market Basket Analysis with Groceries Data
        1. Introduction to Market Basket Analysis
          1. What is MBA?
          2. Where to apply MBA?
          3. Data requirement
          4. Assumptions/prerequisites
          5. Modeling techniques
          6. Limitations
        2. Practical project
          1. Apriori algorithm
          2. Eclat algorithm
          3. Visualizing association rules
          4. Implementation of arules
        3. Summary
      6. 6. Clustering with E-commerce Data
        1. Understanding customer segmentation
          1. Why understanding customer segmentation is important
          2. How to perform customer segmentation?
        2. Various clustering methods available
          1. K-means clustering
          2. Hierarchical clustering
          3. Model-based clustering
          4. Other cluster algorithms
          5. Comparing clustering methods
        3. References
        4. Summary
      7. 7. Building a Retail Recommendation Engine
        1. What is recommendation?
          1. Types of product recommendation
          2. Techniques to perform recommendation
        2. Assumptions
        3. What method to apply when
        4. Limitations of collaborative filtering
        5. Practical project
        6. Summary
      8. 8. Dimensionality Reduction
        1. Why dimensionality reduction?
          1. Techniques available for dimensionality reduction
            1. Which technique to apply where?
              1. Principal component analysis
        2. Practical project around dimensionality reduction
          1. Attribute description
        3. Parametric approach to dimension reduction
        4. References
        5. Summary
      9. 9. Applying Neural Network to Healthcare Data
        1. Introduction to neural networks
        2. Understanding the math behind the neural network
        3. Neural network implementation in R
        4. Neural networks for prediction
        5. Neural networks for classification
        6. Neural networks for forecasting
        7. Merits and demerits of neural networks
        8. References
        9. Summary
    8. 3. Module 3
      1. 1. Going Viral
        1. Social media mining using sentiment analysis
        2. The state of communication
        3. What is Big Data?
        4. Human sensors and honest signals
        5. Quantitative approaches
        6. Summary
      2. 2. Getting Started with R
        1. Why R?
        2. Quick start
          1. The basics – assignment and arithmetic
          2. Functions, arguments, and help
        3. Vectors, sequences, and combining vectors
        4. A quick example – creating data frames and importing files
        5. Visualization in R
        6. Style and workflow
        7. Additional resources
        8. Summary
      3. 3. Mining Twitter with R
        1. Why Twitter data?
        2. Obtaining Twitter data
        3. Preliminary analyses
        4. Summary
      4. 4. Potentials and Pitfalls of Social Media Data
        1. Opinion mining made difficult
        2. Sentiment and its measurement
        3. The nature of social media data
        4. Traditional versus nontraditional social data
        5. Measurement and inferential challenges
        6. Summary
      5. 5. Social Media Mining – Fundamentals
        1. Key concepts of social media mining
        2. Good data versus bad data
        3. Understanding sentiments
          1. Scherer's typology of emotions
        4. Sentiment polarity – data and classification
        5. Supervised social media mining – lexicon-based sentiment
        6. Supervised social media mining – Naive Bayes classifiers
        7. Unsupervised social media mining – Item Response Theory for text scaling
        8. Summary
      6. 6. Social Media Mining – Case Studies
        1. Introductory considerations
        2. Case study 1 – supervised social media mining – lexicon-based sentiment
        3. Case study 2 – Naive Bayes classifier
        4. Case study 3 – IRT models for unsupervised sentiment scaling
        5. Summary
      7. A. Conclusions and Next Steps
        1. Final thoughts
        2. An expanding field
        3. Further reading
        4. Bibliography
    9. Bibliography
    10. Index