Representing the Data Online

Our aim is to make all of the experimental record and processed data available online. This raises a number of issues for how to represent the data in a useful form on the Web, including the choice of standardized identifiers, visualization tools, and approaches to data integration.

Unique Identifiers for Chemical Entities

To make our data useful, it is important that the chemical entities be described using a recognized standard. Without this, integration with other data sets will be difficult or impossible. In chemistry, some would argue that CAS Registry Numbers (http://en.wikipedia.org/wiki/CAS_registry_number) would be ideal for identifying chemical entities. However, CAS numbers are proprietary, cannot be converted to a chemical structure (they serve only as lookup keys), and depend on an external organization to issue them. We would prefer identifiers that are open, freely available for exchange, and convertible to and from a chemical connection table.

The IUPAC International Chemical Identifier (InChI, pronounced "INchee") provides a nonproprietary standard and algorithms along with supporting open source software (http://en.wikipedia.org/wiki/Inchi) that enable the generation of identity strings that can be converted back to structures (see http://www.qsarworld.com/INCHI1.php for a recent review). InChI is gaining significant support as a standard across software vendors, publishers, and developers. The problems with the algorithm—which ...
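To illustrate how an InChI string carries structural information rather than acting as an opaque lookup key, here is a minimal Python sketch that splits an identifier into its slash-separated layers. The function name split_inchi is hypothetical, and the example string is the standard InChI for ethanol. This is only a string-level illustration: it assumes each prefixed layer appears once, and it does not rebuild a connection table, which requires the InChI software itself.

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into its version, formula, and prefixed layers.

    Illustrative only: repeated layer prefixes (as can occur in
    non-standard InChIs) would overwrite each other in this sketch.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")

    # The first segment is the version; the rest are layers.
    version, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version}

    # The formula layer, when present, has no lowercase prefix letter.
    if rest and rest[0] and not rest[0][0].islower():
        layers["formula"] = rest.pop(0)

    # Remaining layers start with a one-letter prefix, e.g. "c" for
    # connectivity and "h" for hydrogen atoms.
    for layer in rest:
        if layer:
            layers[layer[0]] = layer[1:]
    return layers


if __name__ == "__main__":
    ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
    for name, value in split_inchi(ethanol).items():
        print(name, value)
```

For ethanol, the "c" layer (1-2-3) records that atom 1 is bonded to atom 2 and atom 2 to atom 3, and the "h" layer records how many hydrogens each atom carries. A CAS number contains no comparable information.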
