Providing the Raw Data Back to the Notebook

As part of a wider program of drug discovery research (Bradley 2007) led by Professor Jean-Claude Bradley, we wished to predict the solubility of a wide range of chemicals in nonaqueous solvents such as ethanol and methanol. Of greatest interest was the solubility of aldehydes, carboxylic acids, isonitriles, and primary amines, the components required for the Ugi reaction that the Bradley group uses to synthesize potential antimalarial targets (Bradley et al. 2008). The solubility of a specific compound is the quantity of that compound that can be dissolved in a specific solvent. Building and validating a model that could predict solubility would require a large data set of such solubility values. Surprisingly, there was no readily available database of nonaqueous solubilities. We therefore elected to crowdsource the data, opening up the measurements to anyone who wanted to be involved (http://onschallenge.wikispaces.com/). This approach poses problems of its own, however. Because anyone can contribute measurements, we have no upfront way of checking their quality.

The first stage in creating our data set was therefore to keep a detailed record of how each and every measurement was made. The measurement techniques, precision, and accuracy vary from one contribution to the next, but all the background information is provided in human-readable form. This "radical sharing" approach of making the complete research record available ...
