Building and using analytical workflows in Discovery Net
Moustafa Ghanem, Vasa Curcin, Patrick Wendel and Yike Guo
The Discovery Net platform is built around a workflow model for integrating distributed data sources and analytical tools. The platform was originally designed to support the design and execution of distributed data mining tasks within a grid-based environment. Over the years, however, it has evolved into a generic data analysis platform with applications in areas as diverse as bioinformatics, cheminformatics, text mining and business intelligence. In this work we present our experiences in designing the platform and trace how its workflow server architecture evolved to meet the demands of these different applications.
The recent interest of the scientific and business communities in data mining has been primarily driven by our increasing ability to generate, capture and share more data than could be analysed using traditional methods.
Motivating examples abound in bioinformatics, where routine scientific experiments can generate thousands of gene measurements from a single biological sample. The samples themselves are typically collected from large numbers of subjects under different conditions. With so many data points generated, the analysis can proceed only with the use of high-performance computing resources. Furthermore, the interpretation of the experimental data and analysis results typically requires ...