Chapter 6

Identifying Similarities in Data

IN THIS CHAPTER

Clustering data

Identifying hidden groups of similar information in your data

Finding associations among data items

Organizing data with biologically inspired clustering algorithms

There is so much data around us that it can feel overwhelming. Large amounts of information are constantly being generated, organized, analyzed, and stored. Data clustering can help you make sense of this flood by discovering hidden groupings of similar data items. In essence, a clustering result describes your data by saying that it contains some number of groups (call it x) of similar data objects.

Clustering — in the form of grouping similar things — is part of our daily activities. You use clustering any time you group similar items together. For example, when you store groceries in your fridge, you group the vegetables by themselves in the crisper, put frozen foods in their own section (the freezer), and so on. When you organize currency in your wallet, you arrange the bills by denomination — larger with larger, smaller with smaller. Clustering algorithms achieve this kind of order on a large scale for businesses or organizations — where datasets can comprise thousands or millions of data records associated with thousands of customers, suppliers, business partners, products, services, and so on.
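To make the idea of grouping similar items concrete, here is a minimal sketch in Python. It uses a basic k-means loop on one-dimensional numbers, which is only one of many clustering techniques and is an assumption of this example rather than something the text above prescribes. The function name kmeans_1d, the evenly spaced starting centers, and the purchase amounts are all hypothetical and chosen purely for illustration.

```python
def kmeans_1d(values, k, iterations=20):
    """Group numbers into k clusters with a basic k-means loop (illustrative sketch)."""
    ordered = sorted(values)
    # Assumption: start the centers at evenly spaced positions in the sorted data.
    step = max(k - 1, 1)
    centers = [float(ordered[i * (len(ordered) - 1) // step]) for i in range(k)]

    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)

        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # Assignments can no longer change, so stop early.
        centers = new_centers

    return clusters, centers


# Hypothetical purchase amounts with three visibly different spending levels.
amounts = [12, 15, 14, 210, 198, 205, 990, 1010, 1005]
clusters, centers = kmeans_1d(amounts, k=3)

for center, members in zip(centers, clusters):
    print(f"center {center:.1f}: {sorted(members)}")
```

Much like sorting bills by denomination, the loop ends up placing the small, medium, and large amounts into separate groups. Real datasets usually have many attributes per record and less obvious boundaries, so a library implementation and careful choice of the number of groups would normally replace this sketch.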

In short, data clustering is an intelligent separation of data into groups of similar data items. The algorithms that ...
