11

Anteater: Service-oriented data mining

Renato A. Ferreira, Dorgival O. Guedes and Wagner Meira Jr.

ABSTRACT

Data mining focuses on extracting useful information from large volumes of data, and thus has been the centre of great attention in recent years. Building scalable, extensible and easy-to-use data mining systems, however, has proved to be a hard task. This chapter discusses Anteater, a service-oriented architecture for data mining, which relies on Web services to achieve extensibility, offers simple abstractions for users and supports computationally intensive processing on large amounts of data. Anteater relies on Anthill, a run-time system for irregular, data-intensive, iterative distributed applications, to achieve high performance. Data mining algorithms are irregular because the computation is irregularly distributed over the input data, which greatly complicates parallelization. They are data intensive because they deal with potentially enormous data sets. And they are iterative because many of them make multiple passes over the input, refining intermediate results until they produce the output. The combination of a Web service architecture and a parallel programming environment provides a rich environment for exploring different levels of distributed processing with good scalability. Anteater is operational and is being used by the Brazilian government to analyse government expenditure, public health and public safety policies. Feedback from Anteater users has been very ...

Get Data Mining Techniques in Grid Computing Environments now with the O’Reilly learning platform.
