Chapter 8. Representing data

This chapter covers

  • Representing data as a Vector
  • Converting text documents into Vector form
  • Normalizing data representations

Good clustering depends on vectorization: the process of representing objects as Vectors. A Vector is a simplified, numeric representation of an object that a clustering algorithm can work with directly and use to compute the object's similarity to other objects. This chapter explores ways of converting different kinds of objects into Vectors.
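As a rough illustration of the idea, the sketch below turns an object into a vector and compares two such vectors. It is a minimal sketch under stated assumptions, not the book's implementation: plain double[] arrays stand in for a Vector type, the Apple class and its three features are hypothetical, and Euclidean distance is just one possible way to measure how far apart two objects are.

```java
public class VectorizationSketch {

    /** Hypothetical object to be clustered; the three fields are illustrative features. */
    public static final class Apple {
        final double weightGrams;
        final double redness;     // assumed scale: 0.0 (green) to 1.0 (red)
        final double diameterCm;

        public Apple(double weightGrams, double redness, double diameterCm) {
            this.weightGrams = weightGrams;
            this.redness = redness;
            this.diameterCm = diameterCm;
        }
    }

    /** Vectorization: each feature is assigned one fixed dimension of a numeric array. */
    public static double[] toVector(Apple apple) {
        return new double[] { apple.weightGrams, apple.redness, apple.diameterCm };
    }

    /** Euclidean distance between two vectors of equal length (smaller means more similar). */
    public static double euclideanDistance(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Vectors must have the same number of dimensions");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        double[] first = toVector(new Apple(100.0, 0.9, 6.0));
        double[] second = toVector(new Apple(103.0, 0.9, 10.0));
        // Differences per dimension are 3, 0, and 4, so the distance is 5.0.
        System.out.println(euclideanDistance(first, second));
    }
}
```

Note that the weight dimension, measured in grams, can easily dominate the distance simply because its numbers are larger than those of the other features. That scale problem is one reason normalizing data representations matters.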

In the last chapter, you got a taste of clustering. Books were clustered together based on the similarity of their words, and points in a two-dimensional plane were clustered together based on the distances between ...
