Chapter 3 Technical Representation of Data

Ideally, data analysts should not have to worry too much about how exactly data is technically stored in computer memory. Indeed, much effort in computer engineering has gone into trying to abstract away technical details, such as how a real number is represented as a sequence of bits.

In practice, however, the consequences of technical choices eventually surface. For example, nearly every computer user has at some point seen unreadable placeholder symbols appearing in their text editor. Such symbols indicate that the program displaying the text was not able to translate the string of bytes it read into readable characters. As a second example, consider the output of the following calculation in R.

  if ( 1 - 0.9 == 0.1 ) print("ok") else print("oh no!")
  ## [1] "oh no!"

It seems reasonable to expect "ok", but apparently 1 - 0.9 is not precisely equal to 0.1 for a computer, although the difference is admittedly small.

  (1-0.9) - 0.1
  ## [1] -2.775558e-17
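
The discrepancy becomes easier to see when more digits are printed than R shows by default, and it can be worked around by comparing up to a tolerance instead of testing exact equality. The following is a minimal sketch of that idea, not a method taken from the text; the helper nearly_equal and its default tolerance of 1e-9 are hypothetical names and values chosen for illustration only.

```r
# Print 20 decimal places to reveal the values actually stored
# as double-precision numbers.
sprintf("%.20f", 0.1)
sprintf("%.20f", 1 - 0.9)

# Exact comparison fails because the two doubles differ in their last bits.
(1 - 0.9) == 0.1
## [1] FALSE

# all.equal() compares up to a tolerance (about 1.5e-8 by default).
# It returns a character description rather than FALSE on a mismatch,
# so it is wrapped in isTRUE() to obtain a plain logical.
isTRUE(all.equal(1 - 0.9, 0.1))
## [1] TRUE

# Hypothetical helper (an assumption, not from the text):
# comparison with an explicit absolute tolerance.
nearly_equal <- function(x, y, tol = 1e-9) abs(x - y) < tol
nearly_equal(1 - 0.9, 0.1)
## [1] TRUE
```

Which tolerance is appropriate depends on the scale of the data. An absolute tolerance such as the one in nearly_equal is only sensible when the compared values are of roughly known magnitude.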

These examples are forms of what is commonly termed abstraction leakage: issues in which the underlying technical representation of data is exposed to the user. In the case ...
