Chapter 8. Representing data

This chapter covers

  • Representing data as a Vector
  • Converting text documents into Vector form
  • Normalizing data representations

To get good clustering, you need to understand vectorization: the process of representing objects as Vectors. A Vector is a simplified numeric representation of an object. It gives a clustering algorithm something it can process and lets it compute the object's similarity to other objects. This chapter explores ways of converting different kinds of objects into Vectors.
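As a rough illustration of the idea, the sketch below turns an object into a list of numbers and compares two such lists by Euclidean distance. It uses plain Java arrays rather than Mahout's own Vector classes, and the class name `VectorSketch`, the method names, and the three features (weight, color score, size) are all hypothetical choices made for this example only.

```java
// A minimal sketch, not Mahout's API: objects become plain double[] arrays.
final class VectorSketch {

    // Hypothetical features for one object: weight in grams,
    // a color score between 0 and 1, and size in centimeters.
    static double[] toVector(double weightGrams, double colorScore, double sizeCm) {
        return new double[] {weightGrams, colorScore, sizeCm};
    }

    // Euclidean distance between two vectors of equal length.
    // A smaller distance means the two objects are more similar.
    static double euclideanDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Three made-up objects described by the same three features.
        double[] first = toVector(100.0, 0.9, 6.0);
        double[] second = toVector(105.0, 0.8, 6.5);
        double[] third = toVector(250.0, 0.1, 11.0);

        System.out.println("first vs second: " + euclideanDistance(first, second));
        System.out.println("first vs third:  " + euclideanDistance(first, third));
    }
}
```

In this sketch the weight feature has a much larger numeric range than the color score, so it dominates the distance. That imbalance is one reason normalizing data representations matters.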

In the last chapter, you got a taste of clustering. Books were clustered together based on the similarity of their words, and points in a two-dimensional plane were clustered together based on the distances between ...
