In principle, fetching Twitter data is dirt simple: make a request, store the response, and repeat as needed. But all sorts of real-world stuff gets in the way, such as network I/O, the infamous fail whale,[25] and those pesky API rate limits. Fortunately, it’s not too difficult to handle such issues, so long as you do a bit of forward planning and anticipate the things that could (and will) go wrong.
When executing a long-running program that’s eating away at your rate limit, writing robust code is especially important; you want to handle any exceptional conditions that could occur, do your best to remedy the situation, and—in the event that your best just isn’t good enough—save state and leave an indication of how to pick things back up where they left off. In other words, when you write data-harvesting code for a platform like Twitter, you must assume that it will throw curve balls at you. There will be atypical conditions you’ll have to handle, and they’re often more the norm than the exception.
The code we’ll develop is semi-rugged in that it deals with the most common things that can go wrong and is patterned so that you can easily extend it to handle new circumstances if they arise. That said, there are two specific HTTP errors you are highly likely to encounter when harvesting even modest amounts of Twitter data: a 401 Error (Not Authorized) and a 503 Error (Over Capacity). The former occurs when you attempt to access data that a user has protected, while the latter is basically unpredictable.
Whenever Twitter returns an HTTP error, the twitter module throws a TwitterHTTPError exception, which can be handled like any other Python exception with a try/except block. Example 4-2 illustrates a minimal code block that harvests some friend IDs and handles some of the more common exceptional conditions.
Note
You’ll need to create a Twitter app in order to get a consumer key and secret that can be used with the Twitter examples in this book. It’s painless and only takes a moment.
Example 4-2. Using OAuth to authenticate and grab some friend data (friends_followers__get_friends.py)
# -*- coding: utf-8 -*-

import sys
import time
import cPickle
import twitter
from twitter.oauth_dance import oauth_dance

# Go to http://twitter.com/apps/new to create an app and get these items
consumer_key = ''
consumer_secret = ''

SCREEN_NAME = sys.argv[1]
friends_limit = 10000

(oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
        consumer_key, consumer_secret)

t = twitter.Twitter(domain='api.twitter.com', api_version='1',
                    auth=twitter.oauth.OAuth(oauth_token, oauth_token_secret,
                    consumer_key, consumer_secret))

ids = []
wait_period = 2  # secs
cursor = -1

while cursor != 0:
    if wait_period > 3600:  # 1 hour
        print 'Too many retries. Saving partial data to disk and exiting'
        f = file('%s.friend_ids' % str(cursor), 'wb')
        cPickle.dump(ids, f)
        f.close()
        exit()

    try:
        response = t.friends.ids(screen_name=SCREEN_NAME, cursor=cursor)
        ids.extend(response['ids'])
        wait_period = 2
    except twitter.api.TwitterHTTPError, e:
        if e.e.code == 401:
            print 'Encountered 401 Error (Not Authorized)'
            print 'User %s is protecting their tweets' % (SCREEN_NAME, )
            break  # nothing more we can do; exit the loop with whatever we have
        elif e.e.code in (502, 503):
            print 'Encountered %i Error. Trying again in %i seconds' % \
                (e.e.code, wait_period)
            time.sleep(wait_period)
            wait_period *= 1.5
            continue
        elif t.account.rate_limit_status()['remaining_hits'] == 0:
            status = t.account.rate_limit_status()
            now = time.time()  # UTC
            when_rate_limit_resets = status['reset_time_in_seconds']  # UTC
            sleep_time = when_rate_limit_resets - now
            print 'Rate limit reached. Trying again in %i seconds' % (sleep_time, )
            time.sleep(sleep_time)
            continue

    cursor = response['next_cursor']
    print 'Fetched %i ids for %s' % (len(ids), SCREEN_NAME)
    if len(ids) >= friends_limit:
        break

# do something interesting with the IDs

print ids
The twitter.oauth module provides read_token_file and write_token_file convenience functions that can be used to store and retrieve your OAuth token and OAuth token secret, so you don’t have to manually enter a PIN to authenticate each time.
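For instance, here’s a minimal sketch of how you might wire those functions together so that the OAuth dance only has to happen once; the token file path is an arbitrary choice:

# -*- coding: utf-8 -*-

import os
from twitter.oauth import read_token_file, write_token_file
from twitter.oauth_dance import oauth_dance

# From http://dev.twitter.com/apps/new
consumer_key, consumer_secret = '', ''

TOKEN_FILE = 'out/twitter.oauth'  # an arbitrary location of your choosing

if os.path.exists(TOKEN_FILE):
    # Reuse the cached credentials
    (oauth_token, oauth_token_secret) = read_token_file(TOKEN_FILE)
else:
    # Do the OAuth dance once and cache the results for next time
    (oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
            consumer_key, consumer_secret)
    write_token_file(TOKEN_FILE, oauth_token, oauth_token_secret)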
Note

In OAuth 2.0 parlance, “client” describes the same role as a “consumer” in OAuth 1.0, hence the use of the variable names consumer_key and consumer_secret in the preceding listing.
There are several noteworthy items about the listing:

- You can obtain your own consumer_key and consumer_secret by registering an application with Twitter at http://dev.twitter.com/apps/new. These two items, along with the credentials returned through the “OAuth dance,” are what enable you to provide an application with access to your account data (your friends list, in this particular example).
- The online documentation for Twitter’s social graph APIs states that requests for friend/follower data will return up to 5,000 IDs per call. In the event that there are more than 5,000 IDs to be returned, a nonzero cursor value is returned that can be used to navigate forward to the next batch. This particular example “stops short” at a maximum of 10,000 ID values, but friends_limit could be an arbitrarily larger number.
- Given that the /friends/ids resource returns up to 5,000 IDs at a time, a regular user account could retrieve up to 1,750,000 IDs before rate limiting kicks in, based on the 350 requests/hour limit. While it might be an anomaly for a user to have that many friends on Twitter, it’s not at all uncommon for popular users to have many times that many followers.
- It’s not clear from any official documentation or the example code itself, but ID values in the results appear to be in reverse chronological order: the first value is the person you most recently followed, and the last value is the first person you followed. Requests for followers via t.followers.ids appear to return results in the same order.
At this point, you’ve only been introduced to a few Twitter APIs. These are sufficiently powerful to answer a number of interesting questions about your account or any other nonprotected account, but there are numerous other APIs out there. We’ll look at some more of them shortly, but first, let’s segue into a brief interlude to refactor Example 4-2.
Given that virtually all interesting code listings involving Twitter data will repeatedly involve performing the OAuth dance and making robust requests that can stand up to the litany of things that you have to assume might go wrong, it’s very worthwhile to establish a pattern for performing these tasks. The approach we’ll take is to isolate the OAuth logic in a login() function and the robust request logic in a makeTwitterRequest function, so that Example 4-2 can be rewritten as the refactored version shown in Example 4-3:
Example 4-3. Example 4-2 refactored to use two common utilities for OAuth and making API requests (friends_followers__get_friends_refactored.py)
# -*- coding: utf-8 -*-

import sys
import time
import cPickle
import twitter
from twitter__login import login
from twitter__util import makeTwitterRequest

friends_limit = 10000

# You may need to setup your OAuth settings in twitter__login.py
t = login()

def getFriendIds(screen_name=None, user_id=None, friends_limit=10000):
    assert screen_name is not None or user_id is not None

    ids = []
    cursor = -1
    while cursor != 0:
        params = dict(cursor=cursor)
        if screen_name is not None:
            params['screen_name'] = screen_name
        else:
            params['user_id'] = user_id

        response = makeTwitterRequest(t, t.friends.ids, **params)
        ids.extend(response['ids'])
        cursor = response['next_cursor']

        print >> sys.stderr, \
            'Fetched %i ids for %s' % (len(ids), screen_name or user_id)

        if len(ids) >= friends_limit:
            break

    return ids

if __name__ == '__main__':
    ids = getFriendIds(sys.argv[1], friends_limit=10000)

    # do something interesting with the ids

    print ids
From here on out, we’ll continue to use twitter__login and twitter__util to keep the examples as crisp and simple as possible. It’s worthwhile to take a moment and peruse the source for these modules online before reading further. They’ll appear again and again, and twitter__util will soon come to have a number of commonly used convenience functions in it.
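To give you an idea of what to expect, here’s a minimal sketch of the retry pattern that makeTwitterRequest embodies, distilled from the error handling in Example 4-2; the actual module available online may differ in its details:

# A sketch of makeTwitterRequest: back off on 502/503 errors, sleep
# through rate limiting, and re-raise anything else
import sys
import time
import twitter

def makeTwitterRequest(t, twitterFunction, max_errors=3, *args, **kwArgs):
    wait_period = 2
    error_count = 0
    while True:
        try:
            return twitterFunction(*args, **kwArgs)
        except twitter.api.TwitterHTTPError, e:
            error_count += 1
            if error_count > max_errors:
                raise
            if e.e.code in (502, 503):
                print >> sys.stderr, \
                    'Encountered %i Error. Retrying in %i seconds' % \
                    (e.e.code, wait_period)
                time.sleep(wait_period)
                wait_period *= 1.5
            elif t.account.rate_limit_status()['remaining_hits'] == 0:
                status = t.account.rate_limit_status()
                sleep_time = status['reset_time_in_seconds'] - time.time()
                print >> sys.stderr, \
                    'Rate limit reached. Retrying in %i seconds' % (sleep_time, )
                time.sleep(sleep_time)
            else:
                raise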
The next section introduces Redis, a powerful data structures server that has quickly gained a fierce following due to its performance and simplicity.
As we’ve already observed, planning ahead is important when you want to execute a potentially long-running program to scarf down data from the Web, because lots of things can go wrong. But what do you do with all of that data once you get it? You may initially be tempted to just store it to disk. In the situation we’ve just been looking at, that might result in a directory structure similar to the following:
./
    screen_name1/
        friend_ids.json
        follower_ids.json
        user_info.json
    screen_name2/
        ...
    ...
This looks pretty reasonable until you harvest all of the friends/followers for a very popular user—then, depending on your platform, you may be faced with a directory containing millions of subdirectories that’s relatively unusable because you can’t browse it very easily (if at all) in a terminal. Saving all this info to disk might also require that you maintain a registry of some sort that keeps track of all screen names, because the time required to generate a directory listing (in the event that you need one) for millions of files might not yield a desirable performance profile. And if the app that uses the data becomes threaded, you may end up with multiple writers needing to access the same file at the same time, so you’ll have to start dealing with file locking and the like. That’s probably not a place you want to go.

All we really need in this case is a system that makes it trivially easy to store basic key/value pairs, along with a simple key-encoding scheme—something like a disk-backed dictionary would be a good start. The next snippet demonstrates the construction of a key by concatenating a screen name, a delimiter, and a data structure name:
s = {}
s["screen_name1$friend_ids"] = [1, 2, 3, ...]
s["screen_name1$friend_ids"]  # returns [1, 2, 3, ...]
But wouldn’t it be cool if the map could automatically compute set operations so that we could just tell it to do something like:
s.intersection("screen_name1$friend_ids", "screen_name1$follower_ids")
to automatically compute “mutual friends” for a Twitterer (i.e., to figure out which of their friends are following them back)? Well, there’s an open source project called Redis that provides exactly that kind of capability.
Redis is trivial to install, blazingly fast (it’s written in C), scales well, is actively maintained, and has a great Python client with accompanying documentation available. Taking Redis for a test drive is as simple as installing it and starting up the server. (Windows users can save themselves some headaches by grabbing a binary that’s maintained by servicestack.net.) Then, just run easy_install redis to obtain a nice Python client that provides trivial access to everything it has to offer. For example, the previous snippet translates to the following Redis code:
import redis

r = redis.Redis(host='localhost', port=6379, db=0)  # Default params
[ r.sadd("screen_name1$friend_ids", i) for i in [1, 2, 3, ...] ]
r.smembers("screen_name1$friend_ids")  # Returns [1, 2, 3, ...]
Note that while sadd and smembers are set-based operations, Redis also includes operations specific to other types of data structures, such as lists, hashes, and sorted sets. The set operations turn out to be of particular interest because they provide the answers to many of the questions posed at the beginning of this chapter. It’s worthwhile to take a moment to review the documentation for the Redis Python client to get a better appreciation of all it can do. Recall that you can simply execute a command like pydoc redis.Redis to quickly browse documentation from a terminal.
Note
See “Redis: under the hood” for an awesome technical deep dive into how Redis works internally.
The most common set operations you’ll likely encounter are union, intersection, and difference. Recall that the difference between a set and a list is that a set is unordered and contains only unique members, while a list is ordered and may contain duplicate members. As of version 2.4, Python provides built-in support for sets via the set data structure.
Table 4-1 illustrates some examples of common set operations for a trivially small universe of discourse involving friends and followers:
Friends = {Abe, Bob}, Followers = {Bob, Carol}
Table 4-1. Sample set operations for Friends and Followers

| Operation | Result | Comment |
|---|---|---|
| Friends ∪ Followers | Abe, Bob, Carol | Someone’s overall network |
| Friends ∩ Followers | Bob | Someone’s mutual friends |
| Friends − Followers | Abe | People a person is following, but who are not following that person back |
| Followers − Friends | Carol | People who are following someone but are not being followed back |
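For reference, the operations in Table 4-1 expressed with Python’s built-in set type look like this:

friends = set(['Abe', 'Bob'])
followers = set(['Bob', 'Carol'])

print friends | followers  # union: set(['Abe', 'Bob', 'Carol'])
print friends & followers  # intersection: set(['Bob'])
print friends - followers  # difference: set(['Abe'])
print followers - friends  # difference: set(['Carol'])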
As previously mentioned, Redis provides native operations for computing common set operations. A few of the most relevant ones for the upcoming work at hand include:
smembers
Returns all of the members of a set
scard
Returns the cardinality of a set (the number of members in the set)
sinter
Computes the intersection for a list of sets
sdiff
Computes the difference for a list of sets
mget
Returns a list of string values for a list of keys
mset
Stores a list of string values against a list of keys
sadd
Adds an item to a set (and creates the set if it doesn’t already exist)
keys
Returns a list of keys matching a glob-style pattern
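As a quick illustration of how these operations map onto the questions we care about, here’s how mutual friends and one-way friendships fall out of sinter and sdiff, assuming the hypothetical screen_name1$... key scheme from the earlier snippet:

import redis

r = redis.Redis()

# Mutual friends are just an intersection of the two ID sets...
mutual_friends = r.sinter(['screen_name1$friend_ids',
                           'screen_name1$follower_ids'])

# ...and friends who aren't following back are a difference
not_following_back = r.sdiff(['screen_name1$friend_ids',
                              'screen_name1$follower_ids'])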
Skimming the pydoc for Python’s built-in set data type should convince you of the close mapping between it and the Redis APIs.
Redis should serve you well on your quest to efficiently process and analyze vast amounts of Twitter data for certain kinds of queries. Adapting Example 4-2 with some additional logic to house data in Redis requires only a simple change, and Example 4-4 is an update that computes some basic friend/follower statistics. Native functions in Redis are used to compute the set operations.
Example 4-4. Harvesting, storing, and computing statistics about friends and followers (friends_followers__friend_follower_symmetry.py)
# -*- coding: utf-8 -*-

import sys
import locale
import time
import functools
import twitter
import redis
from twitter__login import login

# A template-like function for maximizing code reuse,
# which is essentially a wrapper around makeTwitterRequest
# with some additional logic in place for interfacing with
# Redis
from twitter__util import _getFriendsOrFollowersUsingFunc

# Creates a consistent key value for a user given a screen name
from twitter__util import getRedisIdByScreenName

SCREEN_NAME = sys.argv[1]
MAXINT = sys.maxint

# For nice number formatting
locale.setlocale(locale.LC_ALL, '')

# You may need to setup your OAuth settings in twitter__login.py
t = login()

# Connect using default settings for localhost
r = redis.Redis()

# Some wrappers around _getFriendsOrFollowersUsingFunc
# that bind the first two arguments

getFriends = functools.partial(_getFriendsOrFollowersUsingFunc,
                               t.friends.ids, 'friend_ids', t, r)
getFollowers = functools.partial(_getFriendsOrFollowersUsingFunc,
                                 t.followers.ids, 'follower_ids', t, r)

screen_name = SCREEN_NAME

# get the data

print >> sys.stderr, 'Getting friends for %s...' % (screen_name, )
getFriends(screen_name, limit=MAXINT)
print >> sys.stderr, 'Getting followers for %s...' % (screen_name, )
getFollowers(screen_name, limit=MAXINT)

# use redis to compute the numbers

n_friends = r.scard(getRedisIdByScreenName(screen_name, 'friend_ids'))

n_followers = r.scard(getRedisIdByScreenName(screen_name, 'follower_ids'))

n_friends_diff_followers = r.sdiffstore('temp',
        [getRedisIdByScreenName(screen_name, 'friend_ids'),
         getRedisIdByScreenName(screen_name, 'follower_ids')])
r.delete('temp')

n_followers_diff_friends = r.sdiffstore('temp',
        [getRedisIdByScreenName(screen_name, 'follower_ids'),
         getRedisIdByScreenName(screen_name, 'friend_ids')])
r.delete('temp')

n_friends_inter_followers = r.sinterstore('temp',
        [getRedisIdByScreenName(screen_name, 'follower_ids'),
         getRedisIdByScreenName(screen_name, 'friend_ids')])
r.delete('temp')

print '%s is following %s' % (screen_name, locale.format('%d', n_friends,
                              True))
print '%s is being followed by %s' % (screen_name,
                                      locale.format('%d', n_followers, True))
print '%s of %s are not following %s back' % \
    (locale.format('%d', n_friends_diff_followers, True),
     locale.format('%d', n_friends, True), screen_name)
print '%s of %s are not being followed back by %s' % \
    (locale.format('%d', n_followers_diff_friends, True),
     locale.format('%d', n_followers, True), screen_name)
print '%s has %s mutual friends' \
    % (screen_name, locale.format('%d', n_friends_inter_followers, True))
Aside from the use of functools.partial (http://docs.python.org/library/functools.html) to create getFriends and getFollowers from a common piece of parameter-bound code, Example 4-4 should be pretty straightforward. There’s one other very subtle thing to notice: there isn’t a call to r.save in Example 4-4, which means that the settings in redis.conf dictate when data is persisted to disk. By default, Redis stores data in memory and asynchronously snapshots it to disk on a schedule determined by whether a certain number of changes have occurred within a specified time interval. The risk with asynchronous writes is that you might lose data if unexpected conditions, such as a system crash or power outage, were to occur. Redis provides an “append only” option that you can enable in redis.conf to hedge against this possibility.
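If you do want to force a snapshot at a strategic checkpoint in a long-running harvest, the Python client exposes the underlying persistence commands. A minimal sketch:

import redis

r = redis.Redis()

r.save()    # force a synchronous snapshot; blocks until the dump completes
# ...or, alternatively:
r.bgsave()  # snapshot asynchronously in a forked background process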
Note

It is highly recommended that you enable the appendonly option in redis.conf to protect against data loss; see the “Append Only File HOWTO” for helpful details.
Consider the following output, relating to Tim O’Reilly’s network of followers. Keeping in mind that there’s a rate limit of 350 OAuth requests per hour, you could expect this code to take a little less than an hour to run, because approximately 300 API calls would need to be made to collect all the follower ID values:
timoreilly is following 663
timoreilly is being followed by 1,423,704
131 of 663 are not following timoreilly back
1,423,172 of 1,423,704 are not being followed back by timoreilly
timoreilly has 532 mutual friends
Note that while you could settle for harvesting a smaller number of followers to avoid the rate-limit-imposed wait, the API documentation does not state that taking the first N pages’ worth of data yields a truly random sample, and data appears to be returned in reverse chronological order—so you should not extrapolate from such a sample in any predictable way. For example, if the first 10,000 followers returned just so happened to contain the 532 mutual friends, extrapolation from those data points would result in a skewed analysis, because those results would not be at all representative of the larger population. For a very popular Twitterer such as Britney Spears, with well over 5,000,000 followers, somewhere in the neighborhood of 1,000 API calls would be required to fetch all of the followers, over approximately a four-hour period. In general, the wait is probably worth it for this kind of data, and you could use the Twitter streaming APIs to keep your data up to date so that you never have to go through the entire ordeal again.
Warning
One common source of error for some kinds of analysis is to forget about the overall size of a population relative to your sample. For example, randomly sampling 10,000 of Tim O’Reilly’s friends and followers would actually give you the full population of his friends, yet only a tiny fraction of his followers. Depending on the sophistication of your analysis, the sample size relative to the overall size of a population can make a difference in determining whether the outcome of an experiment is statistically significant, and the level of confidence you can have about it.
Given even these basic friend/follower stats, a couple of questions that lead us toward other interesting analyses naturally follow. For example, who are the 131 people who are not following Tim O’Reilly back? Of the various questions that could be asked about friends and followers, “Who isn’t following me back?” is one of the more interesting, and it can arguably provide a lot of insight into a person’s interests. So, how can we answer it?
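As a sketch of one way to get at the answer with the tools from this chapter (this assumes getUserInfo accepts a user_ids keyword parameter, as in the updated version used later in Example 4-8):

import redis
from twitter__login import login
from twitter__util import getUserInfo, getRedisIdByScreenName

t = login()
r = redis.Redis()

# The IDs of people timoreilly follows who don't follow him back
not_following_back = r.sdiff([
    getRedisIdByScreenName('timoreilly', 'friend_ids'),
    getRedisIdByScreenName('timoreilly', 'follower_ids'),
    ])

# Resolve the IDs to user objects and print out screen names
info = getUserInfo(t, r, user_ids=list(not_following_back))
print [u['screen_name'] for u in info]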
Staring at a list of user IDs isn’t very riveting, so resolving those user IDs to actual user objects is the first obvious step. Example 4-5 extends Example 4-4 by encapsulating common error-handling code in reusable form. It also provides a function that demonstrates how to resolve those ID values to screen names using the /users/lookup API, which accepts a list of up to 100 user IDs or screen names and returns the same basic user information that you saw earlier with /users/show.
Example 4-5. Resolving basic user information such as screen names from IDs (friends_followers__get_user_info.py)
# -*- coding: utf-8 -*-

import sys
import json
import redis
from twitter__login import login

# A makeTwitterRequest call through to the /users/lookup
# resource, which accepts a comma separated list of up
# to 100 screen names. Details are fairly uninteresting.
# See also http://dev.twitter.com/doc/get/users/lookup
from twitter__util import getUserInfo

if __name__ == "__main__":
    screen_names = sys.argv[1:]

    t = login()
    r = redis.Redis()

    print json.dumps(
        getUserInfo(t, r, screen_names=screen_names),
        indent=4
    )
Although not reproduced in its entirety, the getUserInfo function that’s imported from twitter__util is essentially just a makeTwitterRequest to the /users/lookup resource using a list of screen names. The following snippet demonstrates:
def getUserInfo(t, r, screen_names):
    info = []
    response = makeTwitterRequest(t, t.users.lookup,
                                  screen_name=','.join(screen_names))

    for user_info in response:
        r.set(getRedisIdByScreenName(user_info['screen_name'], 'info.json'),
              json.dumps(user_info))
        r.set(getRedisIdByUserId(user_info['id'], 'info.json'),
              json.dumps(user_info))

    info.extend(response)
    return info
It’s worthwhile to note that getUserInfo stores the same user information under two different keys: one based on the user ID and one based on the screen name. Storing both keys allows us to easily look up a screen name given a user ID value, and a user ID value given a screen name. Translating a user ID value to a screen name is a particularly useful operation, since the social graph APIs for getting friends and followers return only ID values, which have no intuitive value until they are resolved against screen names and other basic user information. While there is redundant storage involved in this scheme compared to other approaches, the convenience is arguably worth it. Feel free to take a leaner approach if storage is a concern.
An example user information object for Tim O’Reilly follows in Example 4-6, illustrating the kind of information available about Twitterers. The sky is the limit with what you can do with data that’s this rich. We won’t mine the user descriptions and tweets of the folks who aren’t following Tim back and put them in print, but you should have enough to work with should you wish to conduct that kind of analysis.
Example 4-6. Example user object represented as JSON data for Tim O’Reilly
{
    "id": 2384071,
    "verified": true,
    "profile_sidebar_fill_color": "e0ff92",
    "profile_text_color": "000000",
    "followers_count": 1423326,
    "protected": false,
    "location": "Sebastopol, CA",
    "profile_background_color": "9ae4e8",
    "status": {
        "favorited": false,
        "contributors": null,
        "truncated": false,
        "text": "AWESOME!! RT @adafruit: a little girl asks after seeing adafruit ...",
        "created_at": "Sun May 30 00:56:33 +0000 2010",
        "coordinates": null,
        "source": "<a href=\"http://www.seesmic.com/\" rel=\"nofollow\">Seesmic</a>",
        "in_reply_to_status_id": null,
        "in_reply_to_screen_name": null,
        "in_reply_to_user_id": null,
        "place": null,
        "geo": null,
        "id": 15008936780
    },
    "utc_offset": -28800,
    "statuses_count": 11220,
    "description": "Founder and CEO, O'Reilly Media. Watching the alpha geeks...",
    "friends_count": 662,
    "profile_link_color": "0000ff",
    "profile_image_url": "http://a1.twimg.com/profile_images/941827802/IMG_...jpg",
    "notifications": false,
    "geo_enabled": true,
    "profile_background_image_url": "http://a1.twimg.com/profile_background_...gif",
    "name": "Tim O'Reilly",
    "lang": "en",
    "profile_background_tile": false,
    "favourites_count": 10,
    "screen_name": "timoreilly",
    "url": "http://radar.oreilly.com",
    "created_at": "Tue Mar 27 01:14:05 +0000 2007",
    "contributors_enabled": false,
    "time_zone": "Pacific Time (US & Canada)",
    "profile_sidebar_border_color": "87bc44",
    "following": false
}
The refactored logic for handling HTTP errors and obtaining user information in batches is provided in the following sections. Note that the handleTwitterHTTPError function intentionally doesn’t include error handling for every conceivable error case, because the action you may want to take will vary from situation to situation. For example, in the event of a urllib2.URLError (operation timed out) that is triggered because someone unplugged your network cable, you might want to prompt the user for a specific course of action.
Example 4-5 brings to light some good news and some not-so-good news. The good news is that resolving the user IDs to user objects containing a byline, location information, the latest tweet, etc. is a treasure trove of information. The not-so-good news is that it’s quite expensive to do this in terms of rate limiting, given that you can only get data back in batches of 100. For Tim O’Reilly’s friends, that’s only seven API calls. For his followers, however, it’s over 14,000, which would take nearly two days to collect, given a rate limit of 350 calls per hour (and no glitches in harvesting).
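The batching itself is mechanical. Here’s a sketch of the chunking that a function like getUserInfo presumably performs internally; the chunk and lookupUsers helpers are hypothetical names:

from twitter__login import login
from twitter__util import makeTwitterRequest

t = login()

def chunk(ids, size=100):
    # /users/lookup accepts at most 100 IDs or screen names per request
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def lookupUsers(user_ids):
    info = []
    for batch in chunk(user_ids):
        response = makeTwitterRequest(t, t.users.lookup,
                user_id=','.join([str(i) for i in batch]))
        info.extend(response)
    return info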
However, given a full collection of anyone’s friend and follower ID values, you can randomly sample and calculate measures of statistical significance to your heart’s content. Redis provides the srandmember function, which fits the bill perfectly: you pass it the name of a set, such as timoreilly$follower_ids, and it returns a random member of that set.
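For example, the following sketch draws a sample of 100 follower IDs; note that repeated calls sample with replacement, and the exact key depends on the naming scheme that getRedisIdByScreenName implements:

import redis

r = redis.Redis()

# Draw 100 random follower IDs (with replacement)
sample = [r.srandmember('timoreilly$follower_ids') for _ in xrange(100)]
print sample[:10]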
Another piece of low-hanging fruit that we can go after is computing the friends and followers that two or more Twitterers have in common. Within a given universe, these folks might be interesting for a couple of reasons. One reason is that they’re the “common thread” connecting various disparate networks; you might interpret this to be a type of similarity metric. For example, if two users were both following a large number of the same people, you might conclude that those two users had very similar interests. From there, you might start to analyze the information embedded in the tweets of the common friends to gain more insight into what those people have in common, if anything, or make other conclusions. It turns out that computing common friends and followers is just a set operation away.
Example 4-7 illustrates the use of Redis’s sinterstore function, which stores the result of a set intersection, and introduces locale.format for pretty-printing so that the output is easier to read.
Example 4-7. Finding common friends/followers for multiple Twitterers, with output that’s easier on the eyes (friends_followers__friends_followers_in_common.py)
# -*- coding: utf-8 -*-

import sys
import redis

from twitter__util import getRedisIdByScreenName

# A pretty-print function for numbers
from twitter__util import pp

r = redis.Redis()

def friendsFollowersInCommon(screen_names):
    r.sinterstore('temp$friends_in_common',
                  [getRedisIdByScreenName(screen_name, 'friend_ids')
                   for screen_name in screen_names])

    r.sinterstore('temp$followers_in_common',
                  [getRedisIdByScreenName(screen_name, 'follower_ids')
                   for screen_name in screen_names])

    print 'Friends in common for %s: %s' % (', '.join(screen_names),
            pp(r.scard('temp$friends_in_common')))

    print 'Followers in common for %s: %s' % (', '.join(screen_names),
            pp(r.scard('temp$followers_in_common')))

    # Clean up scratch workspace
    r.delete('temp$friends_in_common')
    r.delete('temp$followers_in_common')

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print >> sys.stderr, "Please supply at least two screen names."
        sys.exit(1)

    # Note:
    # The assumption is that the screen names you are
    # supplying have already been added to Redis.
    # See friends_followers__get_friends__refactored.py

    friendsFollowersInCommon(sys.argv[1:])
Note that although the values in the working sets are ID values, you could easily use Redis’s srandmember function to sample friends and followers, and then use the getUserInfo function from Example 4-5 to resolve useful information such as screen names, most recent tweets, locations, etc.
When someone shares information via a service such as Twitter, it’s only natural to wonder how far the information penetrates into the overall network by means of being retweeted. It should be fair to assume that the more followers a person has, the greater the potential is for that person’s tweets to be retweeted. Users who have a relatively high overall percentage of their originally authored tweets retweeted can be said to be more influential than users who are retweeted infrequently. Users who have a relatively high percentage of their tweets retweeted, even if they are not originally authored, might be said to be mavens—people who are exceptionally well connected and like to share information.[26] One trivial way to measure the relative influence of two or more users is to simply compare their number of followers, since every follower will have a direct view of their tweets. We already know from Example 4-6 that we can get the number of followers (and friends) for a user via the /users/lookup and /users/show APIs. Extracting that information from these APIs is trivial enough:
for screen_name in screen_names:
    _json = json.loads(r.get(getRedisIdByScreenName(screen_name, "info.json")))
    n_friends, n_followers = _json['friends_count'], _json['followers_count']
Counting numbers of followers is interesting, but there’s so much more that can be done. For example, a given user may not have the popularity of an information maven like Tim O’Reilly, but if you have him as a follower and he retweets you, you’ve suddenly tapped into a vast network of people who might just start to follow you once they’ve determined that you’re also interesting. Thus, a much better approach that you might take in calculating users’ potential influence is to not only compare their numbers of followers, but to spider out into the network a couple of levels. In fact, we can use the very same breadth-first approach that was introduced in Example 2-4.
Example 4-8 illustrates a generalized crawl function that accepts a list of screen names, a crawl depth, and parameters that control how many friends and followers to retrieve. The friends_limit and followers_limit parameters control how many items to fetch from the social graph APIs (in batches of 5,000), while friends_sample and followers_sample control how many user objects to retrieve (in batches of 100). An updated version of getUserInfo is also included to reflect the pass-through of the sampling parameters.
Example 4-8. Crawling friends/followers connections (friends_followers__crawl.py)
# -*- coding: utf-8 -*-

import sys
import redis
import functools
from twitter__login import login
from twitter__util import getUserInfo
from twitter__util import _getFriendsOrFollowersUsingFunc

SCREEN_NAME = sys.argv[1]

t = login()
r = redis.Redis()

# Some wrappers around _getFriendsOrFollowersUsingFunc that
# create convenience functions

getFriends = functools.partial(_getFriendsOrFollowersUsingFunc,
                               t.friends.ids, 'friend_ids', t, r)
getFollowers = functools.partial(_getFriendsOrFollowersUsingFunc,
                                 t.followers.ids, 'follower_ids', t, r)

def crawl(
    screen_names,
    friends_limit=10000,
    followers_limit=10000,
    depth=1,
    friends_sample=0.2,  #XXX
    followers_sample=0.0,
    ):

    getUserInfo(t, r, screen_names=screen_names)
    for screen_name in screen_names:
        friend_ids = getFriends(screen_name, limit=friends_limit)
        follower_ids = getFollowers(screen_name, limit=followers_limit)

        friends_info = getUserInfo(t, r, user_ids=friend_ids,
                                   sample=friends_sample)
        followers_info = getUserInfo(t, r, user_ids=follower_ids,
                                     sample=followers_sample)

        next_queue = [u['screen_name'] for u in friends_info + followers_info]

        d = 1
        while d < depth:
            d += 1
            (queue, next_queue) = (next_queue, [])
            for _screen_name in queue:
                friend_ids = getFriends(_screen_name, limit=friends_limit)
                follower_ids = getFollowers(_screen_name, limit=followers_limit)
                next_queue.extend(friend_ids + follower_ids)

            # Note that getUserInfo takes a kw between 0.0 and 1.0 called
            # sample that allows you to crawl only a random sample of nodes
            # at any given level of the graph
            getUserInfo(t, r, user_ids=next_queue)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "Please supply at least one screen name."
    else:
        crawl([SCREEN_NAME])

    # The data is now in the system. Do something interesting. For example,
    # find someone's most popular followers as an indicator of potential
    # influence. See friends_followers__calculate_avg_influence_of_followers.py
Assuming you’ve run crawl with high enough values for friends_limit and followers_limit to get all of a user’s friend IDs and follower IDs, all that remains is to take a large enough random sample and calculate interesting metrics, such as the average number of followers one level out. It could also be fun to look at a user’s top N followers to get an idea of whom he might be influencing. Example 4-9 demonstrates one possible approach that pulls the data out of Redis and calculates Tim O’Reilly’s most popular followers.
Example 4-9. Calculating a Twitterer’s most popular followers (friends_followers__calculate_avg_influence_of_followers.py)
# -*- coding: utf-8 -*-

import sys
import json
import locale
import redis
from prettytable import PrettyTable

# Pretty printing numbers
from twitter__util import pp

# These functions create consistent keys from
# screen names and user id values
from twitter__util import getRedisIdByScreenName
from twitter__util import getRedisIdByUserId

SCREEN_NAME = sys.argv[1]

locale.setlocale(locale.LC_ALL, '')

def calculate():
    r = redis.Redis()  # Default connection settings on localhost

    follower_ids = list(r.smembers(getRedisIdByScreenName(SCREEN_NAME,
                        'follower_ids')))

    followers = r.mget([getRedisIdByUserId(follower_id, 'info.json')
                       for follower_id in follower_ids])
    followers = [json.loads(f) for f in followers if f is not None]

    freqs = {}
    for f in followers:
        cnt = f['followers_count']
        if not freqs.has_key(cnt):
            freqs[cnt] = []

        freqs[cnt].append({'screen_name': f['screen_name'], 'user_id': f['id']})

    # It could take a few minutes to calculate freqs, so store a snapshot
    # for later use

    r.set(getRedisIdByScreenName(SCREEN_NAME, 'follower_freqs'),
          json.dumps(freqs))

    keys = freqs.keys()
    keys.sort()

    print 'The top 10 followers from the sample:'

    fields = ['Follower', 'Followers Count']
    pt = PrettyTable(fields=fields)
    [pt.set_field_align(f, 'l') for f in fields]

    for (user, freq) in reversed([(user['screen_name'], k) for k in keys[-10:]
                                 for user in freqs[k]]):
        pt.add_row([user, pp(freq)])

    pt.printt()

    all_freqs = [k for k in keys for user in freqs[k]]
    avg = reduce(lambda x, y: x + y, all_freqs) / len(all_freqs)

    print "\nThe average number of followers for %s's followers: %s" \
        % (SCREEN_NAME, pp(avg))

# psyco can only compile functions, so wrap code in a function

try:
    import psyco
    psyco.bind(calculate)
except ImportError, e:
    pass  # psyco not installed

calculate()
Note

In many common number-crunching situations, the psyco module can dynamically compile code and produce dramatic speed improvements. It’s totally optional, but definitely worth a hard look if you’re performing calculations that take more than a few seconds.
Output follows for a sample size of about 150,000 (approximately 10%) of Tim O’Reilly’s followers. For statistical analysis, a sample this large relative to the population ensures a tiny margin of error and a very high confidence level.[27] That is, the results can be considered very representative, though not quite the same thing as the absolute truth about the population:
The top 10 followers from the sample:

aplusk            4,993,072
BarackObama       4,114,901
mashable          2,014,615
MarthaStewart     1,932,321
Schwarzenegger    1,705,177
zappos            1,689,289
Veronica          1,612,827
jack              1,592,004
stephenfry        1,531,813
davos             1,522,621

The average number of followers for timoreilly's followers: 445
Interestingly, a few familiar names show up on the list, including some of the most popular Twitterers of all time: Ashton Kutcher (@aplusk), Barack Obama, Martha Stewart, and Arnold Schwarzenegger, among others. Removing these top 10 followers and recalculating lowers the average number of followers of Tim’s followers to approximately 284. Removing any follower with fewer than 10 followers of her own, however, dramatically increases the number to more than 1,000! Noting that there are tens of thousands of followers in this range and briefly perusing their profiles, however, does bring some reality into the situation: many of these users are spam accounts, users who are protecting their tweets, etc. Culling out both the top 10 followers and all followers having fewer than 10 followers of their own might be a reasonable metric to work with; doing both of these things results in a number around 800, which is still quite high. There must be something to be said for the idea of getting retweeted by a popular Twitterer who has lots of connections to other popular Twitterers.
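A sketch of that culling, building on the all_freqs list computed in Example 4-9, might look like this:

# Drop the 10 largest follower counts and any follower with fewer than
# 10 followers of their own, then recompute the average
trimmed = [f for f in sorted(all_freqs)[:-10] if f >= 10]
avg = reduce(lambda x, y: x + y, trimmed) / len(trimmed)
print 'Trimmed average:', avg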
[25] Whenever Twitter goes over capacity, an HTTP 503 error is issued. In a browser, the error page displays an image of the now infamous “fail whale.” See http://twitter.com/503.
[26] See The Tipping Point by Malcolm Gladwell (Back Bay Books) for a great discourse on mavens.
[27] It’s about a 0.14 margin of error for a 99% confidence level.