Chapter 11
Big Data Clustering
Hanghang Tong
IBM T. J. Watson Research CenterYorktown Heights, NYhtong@us.ibm.com
U Kang
KAISTSeoul, Koreaukang@cs.cmu.edu
11.1 Introduction
With the advance of Web2.0, the data size is increasing explosively. For example, Twitter data spans several terabytes; Wikipedia data (e.g., articles and authors) is of similar size; web click-through data is reported to reach petabyte scale [36]; Yahoo! web graph in 2002 has more than 1 billion nodes and almost 7 billion edges [27].
On the other hand, many data clustering algorithms have a high intrinsic time complexity. For example, the classic k-means clustering is NP-hard even when k = 2. The normalized cut (NCut), a representative spectral clustering algorithm, is also ...
Get Data Clustering now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.