Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining

Book description

A hands-on guide to web scraping and text mining for both beginners and experienced users of R.

  • Introduces fundamental concepts of the main architecture of the web and databases, and covers HTTP, HTML, XML, JSON, and SQL.

  • Provides basic techniques to query web documents and data sets (XPath and regular expressions).

  • An extensive set of exercises guides the reader through each technique.

  • Explores both supervised and unsupervised learning techniques, as well as advanced topics such as data scraping and text management.

  • Case studies are featured throughout along with examples for each technique presented.

  • R code and solutions to exercises featured in the book are provided on a supporting website.

Table of contents

    1. Preface
      1. What you won't learn from reading this book
      2. Why R?
      3. Recommended reading to get started with R
      4. Typographic conventions
      5. The book's website
      6. Disclaimer
      7. Acknowledgments
      8. Note
    2. Chapter 1: Introduction
      1. 1.1 Case study: World Heritage Sites in Danger
      2. 1.2 Some remarks on web data quality
      3. 1.3 Technologies for disseminating, extracting, and storing web data
      4. 1.4 Structure of the book
      5. Notes
    3. Part One: A Primer on Web and Data Technologies
      1. Chapter 2: HTML
        1. 2.1 Browser presentation and source code
        2. 2.2 Syntax rules
        3. 2.3 Tags and attributes
        4. 2.4 Parsing
        5. Summary
        6. Further reading
        7. Problems
        8. Notes
      2. Chapter 3: XML and JSON
        1. 3.1 A short example XML document
        2. 3.2 XML syntax rules
        3. 3.3 When is an XML document well formed or valid?
        4. 3.4 XML extensions and technologies
        5. 3.5 XML and R in practice
        6. 3.6 A short example JSON document
        7. 3.7 JSON syntax rules
        8. 3.8 JSON and R in practice
        9. Summary
        10. Further reading
        11. Problems
        12. Notes
      3. Chapter 4: XPath
        1. 4.1 XPath—a query language for web documents
        2. 4.2 Identifying node sets with XPath
        3. 4.3 Extracting node elements
        4. Summary
        5. Further reading
        6. Problems
        7. Notes
      4. Chapter 5: HTTP
        1. 5.1 HTTP fundamentals
        2. 5.2 Advanced features of HTTP
        3. 5.3 Protocols beyond HTTP
        4. 5.4 HTTP in action
        5. Summary
        6. Further reading
        7. Problems
        8. Notes
      5. Chapter 6: AJAX
        1. 6.1 JavaScript
        2. 6.2 XHR
        3. 6.3 Exploring AJAX with Web Developer Tools
        4. Summary
        5. Further reading
        6. Problems
      6. Chapter 7: SQL and relational databases
        1. 7.1 Overview and terminology
        2. 7.2 Relational Databases
        3. 7.3 SQL: a language to communicate with Databases
        4. 7.4 Databases in action
        5. Summary
        6. Further reading
        7. Problems
        8. Pokemon problems
        9. ParlGov problems
        10. Notes
      7. Chapter 8: Regular expressions and essential string functions
        1. 8.1 Regular expressions
        2. 8.2 String processing
        3. 8.3 A word on character encodings
        4. Summary
        5. Further reading
        6. Problems
        7. Notes
    4. Part Two: A Practical Toolbox for Web Scraping and Text Mining
      1. Chapter 9: Scraping the Web
        1. 9.1 Retrieval scenarios
        2. 9.2 Extraction strategies
        3. 9.3 Web scraping: Good practice
        4. 9.4 Valuable sources of inspiration
        5. Summary
        6. Further reading
        7. Problems
        8. Notes
      2. Chapter 10: Statistical text processing
        1. 10.1 The running example: Classifying press releases of the British government
        2. 10.2 Processing textual data
        3. 10.3 Supervised learning techniques
        4. 10.4 Unsupervised learning techniques
        5. Summary
        6. Further reading
        7. Notes
      3. Chapter 11: Managing data projects
        1. 11.1 Interacting with the file system
        2. 11.2 Processing multiple documents/links
        3. 11.3 Organizing scraping procedures
        4. 11.4 Executing R scripts on a regular basis
        5. Notes
    5. Part Three: A Bag of Case Studies
      1. Chapter 12: Collaboration networks in the US Senate
        1. 12.1 Information on the bills
        2. 12.2 Information on the senators
        3. 12.3 Analyzing the network structure
        4. 12.4 Conclusion
        5. Notes
      2. Chapter 13: Parsing information from semistructured documents
        1. 13.1 Downloading data from the FTP server
        2. 13.2 Parsing semistructured text data
        3. 13.3 Visualizing station and temperature data
        4. Notes
      3. Chapter 14: Predicting the 2014 Academy Awards using Twitter
        1. 14.1 Twitter APIs: Overview
        2. 14.2 Twitter-based forecast of the 2014 Academy Awards
        3. 14.3 Conclusion
        4. Notes
      4. Chapter 15: Mapping the geographic distribution of names
        1. 15.1 Developing a data collection strategy
        2. 15.2 Website inspection
        3. 15.3 Data retrieval and information extraction
        4. 15.4 Mapping names
        5. 15.5 Automating the process
        6. Summary
        7. Notes
      5. Chapter 16: Gathering data on mobile phones
        1. 16.1 Page exploration
        2. 16.2 Scraping procedure
        3. 16.3 Graphical analysis
        4. 16.4 Data storage
        5. Note
      6. Chapter 17: Analyzing sentiments of product reviews
        1. 17.1 Introduction
        2. 17.2 Collecting the data
        3. 17.3 Analyzing the data
        4. 17.4 Conclusion
        5. Notes
    6. References
    7. General index
    8. Package index
    9. Function index
    10. End User License Agreement

Product information

    • Title: Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
    • Author(s): Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis
    • Release date: January 2015
    • Publisher(s): Wiley
    • ISBN: 9781118834817