11 Description of Datasets

Here we describe the datasets used in the book.


The solubility of alcohols in water is important for understanding alcohol transport in living organisms. This dataset, from Romanelli et al. (2001), contains physicochemical characteristics of 44 aliphatic alcohols. The aim of the experiment was to predict solubility on the basis of molecular descriptors. The columns are:

  1. SAG: solvent accessible surface‐bounded molecular volume
  2. V: volume
  3. Log PC: octanol-water partition coefficient
  4. P: polarizability
  5. RM: molar refractivity
  6. Mass
  7. ln(solubility) (response)


This dataset is part of a larger one (http://kdd.ics.uci.edu/databases/coil/coil.html), which comes from a water quality study where samples were taken from sites on different European rivers over a period of approximately one year. These samples were analyzed for various chemical substances. In parallel, algae samples were collected to determine the algal population distributions. The columns are:

  1. Season (1,2,3,4 for winter, spring, summer and autumn)
  2. River size (1,2,3 for small, medium and large)
  3. Fluid velocity (1,2,3 for low, medium and high)
  4-11. Content of nitrogen in the form of nitrates, nitrites and ammonia, and other chemical compounds

The response is the abundance of a type of algae (type 6 in the complete file). For simplicity we deleted the rows with missing values, or with null response values, and took the logarithm of the response.
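The preprocessing just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: missing values are represented as `None`, a "null" response is read as a zero abundance, the natural logarithm is used, and the response is taken to be the last entry of each row. The function name `clean_rows` and the toy rows are hypothetical and are not values from the actual dataset.

```python
import math


def clean_rows(rows, response_index=-1):
    """Drop incomplete rows and zero-response rows, then log the response."""
    cleaned = []
    for row in rows:
        # Skip rows containing a missing value (represented here as None).
        if any(value is None for value in row):
            continue
        response = row[response_index]
        # Skip "null" responses, read here as zero abundance (an assumption);
        # the logarithm is undefined at zero.
        if response <= 0:
            continue
        new_row = list(row)  # copy so the input rows are left unchanged
        new_row[response_index] = math.log(response)  # natural log (assumption)
        cleaned.append(new_row)
    return cleaned


# Made-up toy rows: season, river size, fluid velocity, one chemical, response.
toy_rows = [
    [1, 2, 3, 0.5, 4.0],
    [2, 1, 1, None, 2.0],  # missing covariate: dropped
    [3, 3, 2, 1.2, 0.0],   # zero response: dropped
    [4, 1, 2, 0.8, 1.0],
]
cleaned = clean_rows(toy_rows)
```

Only the first and last toy rows survive the filtering. Copying each row before transforming it keeps the raw values available for checking against the cleaned ones.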


There ...

