Data Structure and Software Engineering

An Open-Source Representation for 2-DE-Centric Proteomics and Support Infrastructure for Data Storage and Analysis

Romesh Stanislaus, John M. Arthur, Balaji Rajagopalan, Rick Moerschell, Brian McGlothlen and Jonas S. Almeida

ABSTRACT

Background

In spite of two-dimensional gel electrophoresis (2-DE) being an effective and widely used method to screen the proteome, its data standardization has still not matured to the level of microarray genomics data or mass spectrometry approaches. The trend toward identifying encompassing data standards has been expanding from genomics to transcriptomics, and more recently to proteomics. The relative success of genomic and transcriptomic data standardization has enabled the development of central repositories such as GenBank and Gene Expression Omnibus. An equivalent 2-DE-centric data structure would similarly have to include a balance among raw data, basic feature detection results, sufficiency in the description of the experimental context and methods, and an overall structure that facilitates a diversity of usages, from central reposition to local data representation in LIMS systems.

Results & Conclusion

Achieving such a balance can only be accomplished through several iterations involving bioinformaticians, bench molecular biologists, and the manufacturers of the equipment and commercial software from which the data is primarily generated. Such an encompassing data structure is described here, developed as the mature successor to the well established and broadly used earlier version. A public repository, AGML Central, is configured with a suite of tools for the conversion from a variety of popular formats, web-based visualization, and interoperation with other tools and repositories, and is particularly mass-spectrometry oriented with I/O for annotation and data analysis.

Background

The post genomic era has seen an increasing effort put into systematic surveys of various proteomes. Consequently, proteomics is rapidly evolving into a high throughput experimental approach that enables the identification, for example, of differentially expressed proteins as biomarkers for disease and pathogenesis. Similarly, there is a critical need for central repositories and common data formats to make the most of the copious amounts of data generated by the different screening initiatives. The higher methodological complexity of proteomics makes data integration a challenge, greatly complicated by the fact that there are no comprehensive data structures in many proteomic fields. In spite of the fact that separation by 2-dimensional gel electrophoresis (2-DE) followed by spot identification by mass spectrometry has been a major workhorse and a versatile tool in discovery proteomics [1,2], it remains under-supported by stable data formats and repositories. High resolution 2-DE provides a powerful tool for the reproducible separation, visualization, and quantification of thousands of proteins in a single gel. The increasing variety and amount of proteins being separated and the number of researchers using the 2-DE method has generated an immense diversity of datasets produced by different laboratories and using different instruments.

The lack of common formats has had an even more pernicious effect at the level of centralized data reposition, as well as in the development of incipient


Data Structure and Software Engineering by James L. Antonakos
