An Open-Source
Representation for 2-DE-
Centric Proteomics and
Support Infrastructure for
Data Storage and Analysis
Romesh Stanislaus, John M. Arthur, Balaji Rajagopalan,
Rick Moerschell, Brian McGlothlen and Jonas S. Almeida
ABSTRACT
Background
In spite of two-dimensional gel electrophoresis (2-DE) being an effective and
widely used method to screen the proteome, its data standardization has still
not matured to the level of microarray genomics data or mass spectrometry
approaches. The trend toward identifying encompassing data standards has been
expanding from genomics to transcriptomics, and more recently to proteomics.
272 Data Structure and Software Engineering: Challenges and Improvements
The relative success of genomic and transcriptomic data standardization has
enabled the development of central repositories such as GenBank and the Gene
Expression Omnibus. An equivalent 2-DE-centric data structure would
similarly have to strike a balance among raw data, basic feature detection
results, sufficiency in the description of the experimental context and methods,
and an overall structure that facilitates a diversity of usages, from central
repositories to local data representation in LIMS.
Results & Conclusion
Achieving such a balance can only be accomplished through several
iterations involving bioinformaticians, bench molecular biologists, and the
manufacturers of the equipment and commercial software from which the data is
primarily generated. Such an encompassing data structure is described here,
developed as the mature successor to the well-established and broadly used
earlier version. A public repository, AGML Central, is configured with a suite of
tools for the conversion from a variety of popular formats, web-based
visualization, and interoperation with other tools and repositories, and is
particularly mass-spectrometry oriented with I/O for annotation and data analysis.
Background
The post-genomic era has seen an increasing effort put into systematic surveys
of various proteomes. Consequently, proteomics is rapidly evolving into a
high-throughput experimental approach that enables the identification, for example, of
differentially expressed proteins as biomarkers for disease and pathogenesis.
Similarly, there is a critical need for central repositories and common data formats to
make the most of the copious amounts of data generated by the different
screening initiatives. The higher methodological complexity of proteomics makes data
integration a challenge, greatly complicated by the fact that there are no
comprehensive data structures in many proteomic fields. In spite of the fact that
separation by 2-dimensional gel electrophoresis (2-DE) followed by spot identification
by mass spectrometry has been a major workhorse and a versatile tool in discovery
proteomics [1,2], it remains under-supported by stable data formats and
repositories. High-resolution 2-DE provides a powerful tool for the reproducible
separation, visualization, and quantification of thousands of proteins in a single gel. The
increasing variety and amount of proteins being separated and the number of
researchers using the 2-DE method have generated an immense diversity of datasets
produced by different laboratories and using different instruments.
The lack of common formats has had an even more pernicious effect at the
level of centralized data reposition, as well as in the development of incipient
