Chapter 11

Big Data Clustering

Hanghang Tong

IBM T. J. Watson Research CenterYorktown Heights, NYhtong@us.ibm.com

U Kang

KAISTSeoul, Koreaukang@cs.cmu.edu

11.1 Introduction

With the advance of Web2.0, the data size is increasing explosively. For example, Twitter data spans several terabytes; Wikipedia data (e.g., articles and authors) is of similar size; web click-through data is reported to reach petabyte scale [36]; Yahoo! web graph in 2002 has more than 1 billion nodes and almost 7 billion edges [27].

On the other hand, many data clustering algorithms have a high intrinsic time complexity. For example, the classic k-means clustering is NP-hard even when k = 2. The normalized cut (NCut), a representative spectral clustering algorithm, is also ...

Get Data Clustering now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.