O'Reilly logo

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Clean Data

Book Description

Save time by discovering effortless strategies for cleaning, organizing, and manipulating your data

In Detail

Is much of your time spent doing tedious tasks such as cleaning dirty data, accounting for lost data, and preparing data to be used by others? If so, then having the right tools makes a critical difference, and will be a great investment as you grow your data science expertise.

The book starts by highlighting the importance of data cleaning in data science, and will show you how to reap rewards from reforming your cleaning process. Next, you will cement your knowledge of the basic concepts that the rest of the book relies on: file formats, data types, and character encodings. You will also learn how to extract and clean data stored in RDBMS, web files, and PDF documents, through practical examples.

At the end of the book, you will be given a chance to tackle a couple of real-world projects.

What You Will Learn

  • Understand the role of data cleaning in the overall data science process
  • Learn the basics of file formats, data types, and character encodings to clean data properly
  • Master critical features of the spreadsheet and text editor for organizing and manipulating data
  • Convert data from one common format to another, including JSON, CSV, and some special-purpose formats
  • Implement three different strategies for parsing and cleaning data found in HTML files on the Web
  • Reveal the mysteries of PDF documents and learn how to pull out just the data you want
  • Develop a range of solutions for detecting and cleaning bad data stored in an RDBMS
  • Create your own clean data sets that can be packaged, licensed, and shared with others
  • Use the tools from this book to complete two real-world projects using data from Twitter and Stack Overflow

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Clean Data
    1. Table of Contents
    2. Clean Data
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the color images of this book
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Why Do You Need Clean Data?
      1. A fresh perspective
      2. The data science process
      3. Communicating about data cleaning
      4. Our data cleaning environment
      5. An introductory example
      6. Summary
    9. 2. Fundamentals – Formats, Types, and Encodings
      1. File formats
        1. Text files versus binary files
          1. Opening and reading files
          2. Peeking inside files
            1. On OSX or Linux
            2. On Windows
        2. Common formats for text files
        3. The delimited format
          1. Seeing invisible characters
          2. Enclosing values to trap errant characters
          3. Escaping characters
          4. The JSON format
            1. Experimenting with JSON
          5. The HTML format
      2. Archiving and compression
        1. Archive files
          1. tar
        2. Compressed files
          1. How to compress files
          2. How to uncompress files
            1. Compression with zip, gzip, and bzip2
            2. Compression options
          3. Which compression program should I use?
            1. Rules of thumb
      3. Data types, nulls, and encodings
        1. Data types
          1. Numeric data
            1. Integers
            2. Numbers with decimals
            3. When numbers are not numeric
          2. Dates and time
          3. Strings
          4. Other data types
        2. Converting between data types
          1. Data loss
        3. Strategies for conversion
          1. Type conversion at the SQL level
            1. Example one – parsing MySQL date into a formatted string
            2. Example two – converting a string into MySQL's date type
            3. Example three – casting or converting MySQL string data to a decimal number
          2. Type conversion at the file level
            1. Example one – type detection and converting in Excel
            2. Example two – type converting in JSON
        4. If a null falls in a forest…
          1. Zero
          2. Empties
            1. Blanks
          3. Null
            1. Why is the middle name example "empty" and not NULL?
            2. Is it ever useful to clean data using a zero instead of an empty or null?
        5. Character encodings
          1. Example one – finding multibyte characters in MySQL data
          2. Example two – finding the UTF-8 and Latin-1 equivalents of Unicode characters stored in MySQL
          3. Example three – handling UTF-8 encoding at the file level
            1. Option one – strip out the UTF-8 characters
            2. Option two – write the file in a way that can handle UTF-8 characters
      4. Summary
    10. 3. Workhorses of Clean Data – Spreadsheets and Text Editors
      1. Spreadsheet data cleaning
        1. Text to columns in Excel
        2. Splitting strings
        3. Concatenating strings
          1. Conditional formatting to find unusual values
          2. Sorting to find unusual values
          3. Importing spreadsheet data into MySQL
            1. Creating CSV from a spreadsheet
            2. Generating SQL using a spreadsheet
      2. Text editor data cleaning
        1. Text tweaking
        2. The column mode
        3. Heavy duty find and replace
          1. A word of caution
        4. Text sorting and processing duplicates
        5. Process Lines Containing
      3. An example project
        1. Step one – state the problem
        2. Step two – data collection
          1. Download the data
          2. Get familiar with the data
        3. Step three – data cleaning
          1. Extracting relevant lines
            1. Using a spreadsheet
            2. Using a text editor
          2. Transform the lines
        4. Step four – data analysis
      4. Summary
    11. 4. Speaking the Lingua Franca – Data Conversions
      1. Quick tool-based conversions
        1. Spreadsheet to CSV
        2. Spreadsheet to JSON
          1. Step one – publish Google spreadsheet to the Web
          2. Step two – create the correct URL
        3. SQL to CSV or JSON using phpMyAdmin
      2. Converting with PHP
        1. SQL to JSON using PHP
        2. SQL to CSV using PHP
        3. JSON to CSV using PHP
        4. CSV to JSON using PHP
      3. Converting with Python
        1. CSV to JSON using Python
        2. CSV to JSON using csvkit
        3. Python JSON to CSV
      4. The example project
        1. Step one – download Facebook data as GDF
        2. Step two – look at the GDF file format in a text editor
        3. Step three – convert the GDF file into JSON
        4. Step four – build a D3 diagram
        5. Step five – convert data to the Pajek file format
        6. Step six – calculate simple network metrics
      5. Summary
    12. 5. Collecting and Cleaning Data from the Web
      1. Understanding the HTML page structure
        1. The line-by-line delimiter model
        2. The tree structure model
      2. Method one – Python and regular expressions
        1. Step one – find and save a Web file for experimenting
        2. Step two – look into the file and decide what is worth extracting
        3. Step three – write a Python program to pull out the interesting pieces and save them to a CSV file
        4. Step four – view the file and make sure it is clean
        5. The limitations of parsing HTML using regular expressions
      3. Method two – Python and BeautifulSoup
        1. Step one – find and save a file for experimenting
        2. Step two – install BeautifulSoup
        3. Step three – write a Python program to extract the data
        4. Step four – view the file and make sure it is clean
      4. Method three – Chrome Scraper
        1. Step one – install the Scraper Chrome extension
        2. Step two – collect data from the website
        3. Step three – final cleaning on the data columns
      5. Example project – Extracting data from e-mail and web forums
        1. The background of the project
        2. Part one – cleaning data from Google Groups e-mail
          1. Step one – collect the Google Groups messages
          2. Step two – extract data from the Google Groups messages
            1. Extraction code
            2. Program output
        3. Part two – cleaning data from web forums
          1. Step one – collect some RSS that points us to HTML files
          2. Step two – Extract URLs from RSS; collect and parse HTML
            1. Program status
            2. Program output
            3. Extraction code
      6. Summary
    13. 6. Cleaning Data in PDF Files
      1. Why is cleaning PDF files difficult?
      2. Try simple solutions first – copying
        1. Our experimental file
        2. Step one – try copying out the data we want
        3. Step two – try pasting the copied data into a text editor
        4. Step three – make a smaller version of the file
      3. Another technique to try – pdfMiner
        1. Step one – install pdfMiner
        2. Step two – pull text from the PDF file
      4. Third choice – Tabula
        1. Step one – download Tabula
        2. Step two – run Tabula
        3. Step three – direct Tabula to extract the data
        4. Step four – copy the data out
        5. Step five – more cleaning
      5. When all else fails – the fourth technique
      6. Summary
    14. 7. RDBMS Cleaning Techniques
      1. Getting ready
      2. Step one – download and examine Sentiment140
      3. Step two – clean for database import
      4. Step three – import the data into MySQL in a single table
        1. Detecting and cleaning abnormalities
        2. Creating our table
      5. Step four – clean the & character
      6. Step five – clean other mystery characters
        1. Step six – clean the dates
      7. Step seven – separate user mentions, hashtags, and URLs
        1. Create some new tables
        2. Extract user mentions
        3. Extract hashtags
        4. Extract URLs
      8. Step eight – cleaning for lookup tables
        1. Step nine – document what you did
      9. Summary
    15. 8. Best Practices for Sharing Your Clean Data
      1. Preparing a clean data package
        1. A word of caution – Using GitHub to distribute data
      2. Documenting your data
        1. README files
        2. File headers
        3. Data models and diagrams
        4. Documentation wiki or CMS
      3. Setting terms and licenses for your data
        1. Common terms of use
          1. Creative Commons
          2. ODbL and Open Data Commons
      4. Publicizing your data
        1. Lists of datasets
        2. Open Data on Stack Exchange
        3. Hackathons
      5. Summary
    16. 9. Stack Overflow Project
      1. Step one – posing a question about Stack Overflow
      2. Step two – collecting and storing the Stack Overflow data
        1. Downloading the Stack Overflow data dump
        2. Unarchiving the files
        3. Creating MySQL tables and loading data
        4. Building test tables
      3. Step three – cleaning the data
        1. Creating the new tables
        2. Extracting URLs and populating the new tables
        3. Extracting code and populating new tables
      4. Step four – analyzing the data
        1. Which paste sites are most popular?
        2. Which paste sites are popular in questions and which are popular in answers?
        3. Do posts contain both URLs to paste sites and source code?
      5. Step five – visualizing the data
      6. Step six – problem resolution
      7. Moving from test tables to full tables
      8. Summary
    17. 10. Twitter Project
      1. Step one – posing a question about an archive of tweets
      2. Step two – collecting the data
        1. Download and extract the Ferguson file
        2. Create a test version of the file
        3. Hydrate the tweet IDs
          1. Setting up a Twitter developer account
          2. Installing twarc
          3. Running twarc
      3. Step three – data cleaning
        1. Creating database tables
        2. Populating the new tables in Python
      4. Step four – simple data analysis
      5. Step five – visualizing the data
      6. Step six – problem resolution
      7. Moving this process into full (non-test) tables
      8. Summary
    18. Index