Chapter 4 Data Structure

4.1 Introduction

An important step in data cleaning is getting raw data into a convenient structure for analysis. Data values must be coded in the right format, but the correct technical representation is not sufficient for making the data processable for statistical purposes. The collection of values must also be structured in a convenient manner. Raw data can have different structures, depending on the collection process or experimental setup. In this chapter, we will describe the commonly encountered structures. Most statistical analyses are done on tabular and matrix data, so a common step in analysis is restructuring and transforming raw data into tabular or matrix data that can be processed. In R, this is typically a data.frame or matrix.
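As a minimal sketch of this restructuring step, the following R code turns a small list of records into a data.frame and then into a numeric matrix. The records, the field names (id, height, weight), and the object names raw, dat, and m are hypothetical and chosen only for illustration; real raw data will usually need more work before it fits this shape.

```r
# Hypothetical raw records, for example as parsed from a hierarchical source.
raw <- list(
  list(id = 1, height = 1.82, weight = 79),
  list(id = 2, height = 1.65, weight = 58),
  list(id = 3, height = 1.74, weight = 70)
)

# Bind the records row-wise into a data.frame:
# one row per record, one column per field.
dat <- do.call(rbind, lapply(raw, as.data.frame))

# A purely numeric selection of columns can also be stored as a matrix.
m <- as.matrix(dat[, c("height", "weight")])

str(dat)
dim(m)
```

A data.frame can hold columns of different types, whereas a matrix holds a single type, which is why only the numeric columns are selected before the conversion.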

Table 4.1 shows various data structures that are encountered when analyzing data.

Table 4.1 Often encountered data types and their structure

Data type         Data structure
Various           Tabular
Numeric           Matrix
Numeric + time    Time series
JSON/XML          Hierarchical
HTML              Text/hierarchical
Free text         Text

Most data can be transformed into tabular format, but it does not follow that it can be analyzed directly: the columns of a table should denote the variables to be analyzed, and the rows should describe the members of the statistical population of interest.
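To illustrate the distinction, consider a table that is rectangular but whose columns are not variables. In the sketch below, the hypothetical data.frame wide stores one column per year, so the variable "year" is hidden in the column names. Assuming the units of interest are country-year observations, base R's reshape() can rearrange it so that each row describes one such unit. All names and values are invented for the example.

```r
# Hypothetical table: rectangular, but the columns "2019" and "2020"
# are values of a variable (year), not variables themselves.
wide <- data.frame(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(11, 22),
  check.names = FALSE
)

# Rearrange so that each row is one country-year observation and
# each column is a variable: country, year, value.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("2019", "2020"),
  v.names   = "value",
  timevar   = "year",
  times     = c(2019, 2020),
  idvar     = "country"
)

print(long)
```

Which layout is the right one depends on the analysis: if countries rather than country-year pairs were the population of interest, the wide layout could already be suitable.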

4.2 Tabular Data

A table or dataset is a rectangular collection of values in which each column denotes a (measured) value of a variable. A row describes an ...
