Clustering Contacts by Job Title

Now that you can (hopefully) better appreciate the nature of the record-matching problem, let’s collect some real-world data from LinkedIn and start hacking out clusters. A few reasons you might be interested in mining your LinkedIn data are because you want to take an honest look at whether your networking skills have been helping you to meet the “right kinds of people,” or because you want to approach contacts who will most likely fit into a certain socioeconomic bracket with a particular kind of business enquiry or proposition. The remainder of this section systematically works through some exercises in clustering your contacts by job title.

Standardizing and Counting Job Titles

An obvious starting point when working with a new data set is to count things, and this situation is no different. This section introduces a pattern for transforming common job titles and then performs a basic frequency analysis.

Assuming you have a reasonable number of exported contacts, the minor nuances among job titles that you’ll encounter may actually be surprising, but before we get into that, let’s put together some code that establishes some patterns for normalizing record data and takes a basic inventory sorted by frequency. Example 6-2 inspects job titles and prints out frequency information for the titles themselves and for individual tokens that occur in them. If you haven’t already, you’ll need to easy_install pretty table , a package that you can use to produce ...

Get Mining the Social Web now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.