3 Data Organization and First Data Frame Operations
Tabular data can be organized in different forms, with rows, columns, and values carrying information of different kinds and meanings. Often, a specific organization is chosen to enhance readability; in other cases, it merely reflects characteristics of the data source or the data ingestion process (e.g., an automatic measurement process, an online data stream, or manual data entry), or it serves a particular transformation, computation, or visualization.
One particular organization of data, called tidy, is typically considered the reference model because it is rational and well suited to further manipulation with computational or analytical tools. It has three main characteristics:
- Each row represents a single observation of the phenomenon.
- Each column represents a specific property (also called variable) of the phenomenon.
- Each value represents a single piece of information rather than an aggregate.
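As a minimal sketch of these three characteristics, the snippet below reshapes a small table into tidy form. It assumes Python with the pandas library. The table, its column names (`city`, `year`, `mean_temp_c`), and its values are invented purely for illustration. In the starting table the years appear as column headers, so each row mixes several observations; after reshaping, each row holds one observation.

```python
import pandas as pd

# Hypothetical "wide" table with invented values: one row per city,
# one column per year. The year is a variable, yet it is stored in the headers.
wide = pd.DataFrame({
    "city": ["Oslo", "Rome"],
    "2021": [6.5, 16.1],
    "2022": [7.0, 16.8],
})

# Reshape so that each row is a single observation (one city in one year)
# and each column is a single variable.
tidy = wide.melt(id_vars="city", var_name="year", value_name="mean_temp_c")

# The years come from column headers, so they are strings; convert them to integers.
tidy["year"] = tidy["year"].astype(int)

# Sort for readability and rebuild a clean 0..n-1 index.
tidy = tidy.sort_values(["city", "year"]).reset_index(drop=True)

print(tidy)
```

The reshaped table has four rows, one for each combination of city and year, and three columns, one for each variable.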
For example, consider datasets with personal information on students enrolled in courses or employees working at a certain office. Each row would likely correspond to a single individual (observation), with columns representing the information (variables) relevant to the context for which the data were produced. Each value would carry a single piece of information, such as the first name, the middle name, the surname, the place of birth, the birth date, and so on, each one associated with ...