10

Specification of distributed data mining workflows with DataMiningGrid

Dennis Wegener and Michael May

ABSTRACT

This chapter evaluates the benefits of grid-based technology from a data miner's perspective. It focuses on the DataMiningGrid, a standards-based and extensible environment for grid-enabling data mining applications. Three generic and very common data mining tasks were analysed: enhancing scalability by data partitioning, comparing classifier performance, and parameter optimization. Grid-based data mining, and the DataMiningGrid in particular, emerges as a generic tool for enhancing the scalability of a large number of data mining applications. The basis for this broad applicability is the DataMiningGrid's extensibility mechanism. To support the three scenarios listed above, we have extended the original DataMiningGrid system with a set of new components.

10.1 Introduction

An important benefit of embedding data mining into a grid environment is scalability. Data mining is computationally intensive: when searching for patterns, most algorithms perform a costly search or an optimization routine that scales between O(n) and O(n^3) with the input data, where n, the number of data points, can range from thousands to millions.

Improving scalability can be based on the fact that a number of algorithms are able to compute results on a subset of data in such a way that the results on the subset can be merged or aggregated to give an overall result. A well-known ...
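The merge property described in this paragraph can be illustrated with a minimal Python sketch. It is not DataMiningGrid code and is not necessarily the example the authors go on to name; all names (`PartialStats`, `partition`, `local_stats`, `merge`, `mean_and_variance`) are hypothetical. It assumes a statistic, here mean and variance, that can be computed from per-partition summaries, so each partition could in principle be processed on a separate grid node and only the small summaries need to be combined.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class PartialStats:
    """Summary of one data partition (hypothetical helper type)."""
    count: int
    total: float
    total_sq: float


def partition(data: Sequence[float], k: int) -> List[Sequence[float]]:
    """Split the data into k disjoint subsets (round-robin)."""
    return [data[i::k] for i in range(k)]


def local_stats(chunk: Sequence[float]) -> PartialStats:
    """Work done independently on one partition, e.g. on one node."""
    return PartialStats(len(chunk), sum(chunk), sum(x * x for x in chunk))


def merge(parts: Iterable[PartialStats]) -> PartialStats:
    """Aggregate the per-partition summaries into an overall summary."""
    parts = list(parts)
    return PartialStats(
        sum(p.count for p in parts),
        sum(p.total for p in parts),
        sum(p.total_sq for p in parts),
    )


def mean_and_variance(stats: PartialStats) -> Tuple[float, float]:
    """Derive the population mean and variance from a merged summary."""
    mean = stats.total / stats.count
    variance = stats.total_sq / stats.count - mean * mean
    return mean, variance


data = [float(x) for x in range(1, 101)]

# Partitioned computation: summarise each subset, then merge the summaries.
merged = merge(local_stats(chunk) for chunk in partition(data, 4))

# Reference computation over the whole data set in one pass.
whole = local_stats(data)

mean, variance = mean_and_variance(merged)
```

In this sketch the merged summary equals the one computed over the full data set, which is the property that lets the work be spread across nodes. Algorithms whose partial results cannot be combined exactly would need a different aggregation step.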
