In principle, fetching Twitter data is dirt simple: make a request, store the response, and repeat as needed. But all sorts of real-world stuff gets in the way, such as network I/O, the infamous fail whale,[25] and those pesky API rate limits. Fortunately, it’s not too difficult to handle such issues, so long as you do a bit of forward planning and anticipate the things that could (and will) go wrong.
When executing a long-running program that’s eating away at your rate limit, writing robust code is especially important; you want to handle any exceptional conditions that could occur, do your best to remedy the situation, and—in the event that your best just isn’t good enough—save state and leave an indication of how to pick things back up where they left off. In other words, when you write data-harvesting code for a platform like Twitter, you must assume that it will throw curve balls at you. There will be atypical conditions you’ll have to handle, and they’re often more the norm than the exception.
The code we’ll develop is semi-rugged in that it deals with the most common things that can go wrong and is patterned so that you can easily extend it to handle new circumstances if they arise. That said, there are two specific HTTP errors you are highly likely to encounter when harvesting even modest amounts of Twitter data: a 401 Error (Not Authorized) and a 503 Error (Over Capacity). The former occurs when you attempt to access data that a user has protected, while the latter is basically unpredictable.
Whenever Twitter returns an HTTP error, the twitter module throws a TwitterHTTPError exception, which can be handled like any other Python exception with a try/except block. Example 4-2 illustrates a minimal code block that harvests some friend IDs and handles some of the more common exceptional conditions.
Note
You’ll need to create a Twitter app in order to get a consumer key and secret that can be used with the Twitter examples in this book. It’s painless and only takes a moment.
Example 4-2. Using OAuth to authenticate and grab some friend data (friends_followers__get_friends.py)
# -*- coding: utf-8 -*-

import sys
import time
import cPickle
import twitter
from twitter.oauth_dance import oauth_dance

# Go to http://twitter.com/apps/new to create an app and get these items
consumer_key = ''
consumer_secret = ''

SCREEN_NAME = sys.argv[1]
friends_limit = 10000

(oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
        consumer_key, consumer_secret)

t = twitter.Twitter(domain='api.twitter.com', api_version='1',
                    auth=twitter.oauth.OAuth(oauth_token, oauth_token_secret,
                    consumer_key, consumer_secret))

ids = []
wait_period = 2  # secs
cursor = -1

while cursor != 0:
    if wait_period > 3600:  # 1 hour
        print 'Too many retries. Saving partial data to disk and exiting'
        f = file('%s.friend_ids' % str(cursor), 'wb')
        cPickle.dump(ids, f)
        f.close()
        exit()

    try:
        response = t.friends.ids(screen_name=SCREEN_NAME, cursor=cursor)
        ids.extend(response['ids'])
        wait_period = 2
    except twitter.api.TwitterHTTPError, e:
        if e.e.code == 401:
            print 'Encountered 401 Error (Not Authorized)'
            print 'User %s is protecting their tweets' % (SCREEN_NAME, )
            break  # nothing more we can do; exit the loop with whatever we have
        elif e.e.code in (502, 503):
            print 'Encountered %i Error. Trying again in %i seconds' % \
                (e.e.code, wait_period)
            time.sleep(wait_period)
            wait_period *= 1.5
            continue
        elif t.account.rate_limit_status()['remaining_hits'] == 0:
            status = t.account.rate_limit_status()
            now = time.time()  # UTC
            when_rate_limit_resets = status['reset_time_in_seconds']  # UTC
            sleep_time = when_rate_limit_resets - now
            print 'Rate limit reached. Trying again in %i seconds' % (sleep_time, )
            time.sleep(sleep_time)
            continue

    cursor = response['next_cursor']
    print 'Fetched %i ids for %s' % (len(ids), SCREEN_NAME)
    if len(ids) >= friends_limit:
        break

# do something interesting with the IDs

print ids
The twitter.oauth module provides read_token_file and write_token_file convenience functions that can be used to store and retrieve your OAuth token and OAuth token secret, so you don’t have to manually enter a PIN to authenticate each time.
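For instance, here’s a minimal sketch of how you might wire those functions together so that the OAuth dance only has to happen once; the token file path is an arbitrary choice:

# -*- coding: utf-8 -*-

import os
from twitter.oauth import read_token_file, write_token_file
from twitter.oauth_dance import oauth_dance

# From http://dev.twitter.com/apps/new
consumer_key, consumer_secret = '', ''

TOKEN_FILE = 'out/twitter.oauth'  # an arbitrary location of your choosing

if os.path.exists(TOKEN_FILE):
    # Reuse the cached credentials
    (oauth_token, oauth_token_secret) = read_token_file(TOKEN_FILE)
else:
    # Do the OAuth dance once and cache the results for next time
    (oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
            consumer_key, consumer_secret)
    write_token_file(TOKEN_FILE, oauth_token, oauth_token_secret)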
Note

In OAuth 2.0 parlance, “client” describes the same role as a “consumer” in OAuth 1.0, hence the use of the variable names consumer_key and consumer_secret in the preceding listing.
There are several noteworthy items about the listing:

- You can obtain your own consumer_key and consumer_secret by registering an application with Twitter at http://dev.twitter.com/apps/new. These two items, along with the credentials returned through the “OAuth dance,” are what enable you to provide an application with access to your account data (your friends list, in this particular example).
- The online documentation for Twitter’s social graph APIs states that requests for friend/follower data will return up to 5,000 IDs per call. In the event that there are more than 5,000 IDs to be returned, a nonzero cursor value is returned that can be used to navigate forward to the next batch. This particular example “stops short” at a maximum of 10,000 ID values, but friends_limit could be an arbitrarily larger number.
- Given that the /friends/ids resource returns up to 5,000 IDs at a time, a regular user account could retrieve up to 1,750,000 IDs before rate limiting kicks in, based on the 350 requests/hour limit. While it might be an anomaly for a user to have that many friends on Twitter, it’s not at all uncommon for popular users to have many times that many followers.
- It’s not clear from any official documentation or the example code itself, but ID values in the results appear to be in reverse chronological order: the first value is the person you most recently followed, and the last value is the first person you followed. Requests for followers via t.followers.ids appear to return results in the same order.
At this point, you’ve only been introduced to a few Twitter APIs. These are sufficiently powerful to answer a number of interesting questions about your account or any other nonprotected account, but there are numerous other APIs out there. We’ll look at some more of them shortly, but first, let’s segue into a brief interlude to refactor Example 4-2.
Given that virtually all interesting code listings involving Twitter data will repeatedly involve performing the OAuth dance and making robust requests that can stand up to the litany of things that you have to assume might go wrong, it’s very worthwhile to establish a pattern for performing these tasks. The approach we’ll take is to isolate the OAuth logic in a login() function and the robust request logic in a makeTwitterRequest function, so that Example 4-2 can be rewritten as the refactored version shown in Example 4-3:
Example 4-3. Example 4-2 refactored to use two common utilities for OAuth and making API requests (friends_followers__get_friends_refactored.py)
# -*- coding: utf-8 -*-

import sys
import time
import cPickle
import twitter
from twitter__login import login
from twitter__util import makeTwitterRequest

friends_limit = 10000

# You may need to setup your OAuth settings in twitter__login.py
t = login()

def getFriendIds(screen_name=None, user_id=None, friends_limit=10000):
    assert screen_name is not None or user_id is not None

    ids = []
    cursor = -1
    while cursor != 0:
        params = dict(cursor=cursor)
        if screen_name is not None:
            params['screen_name'] = screen_name
        else:
            params['user_id'] = user_id

        response = makeTwitterRequest(t, t.friends.ids, **params)
        ids.extend(response['ids'])
        cursor = response['next_cursor']

        print >> sys.stderr, \
            'Fetched %i ids for %s' % (len(ids), screen_name or user_id)

        if len(ids) >= friends_limit:
            break

    return ids

if __name__ == '__main__':
    ids = getFriendIds(sys.argv[1], friends_limit=10000)

    # do something interesting with the ids

    print ids
From here on out, we’ll continue to use twitter__login and twitter__util to keep the examples as crisp and simple as possible. It’s worthwhile to take a moment and peruse the source for these modules online before reading further. They’ll appear again and again, and twitter__util will soon come to have a number of commonly used convenience functions in it.
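To give you an idea of what to expect, here’s a minimal sketch of the retry pattern that makeTwitterRequest embodies, distilled from the error handling in Example 4-2; the actual module available online may differ in its details:

# A sketch of makeTwitterRequest: back off on 502/503 errors, sleep
# through rate limiting, and re-raise anything else
import sys
import time
import twitter

def makeTwitterRequest(t, twitterFunction, max_errors=3, *args, **kwArgs):
    wait_period = 2
    error_count = 0
    while True:
        try:
            return twitterFunction(*args, **kwArgs)
        except twitter.api.TwitterHTTPError, e:
            error_count += 1
            if error_count > max_errors:
                raise
            if e.e.code in (502, 503):
                print >> sys.stderr, \
                    'Encountered %i Error. Retrying in %i seconds' % \
                    (e.e.code, wait_period)
                time.sleep(wait_period)
                wait_period *= 1.5
            elif t.account.rate_limit_status()['remaining_hits'] == 0:
                status = t.account.rate_limit_status()
                sleep_time = status['reset_time_in_seconds'] - time.time()
                print >> sys.stderr, \
                    'Rate limit reached. Retrying in %i seconds' % (sleep_time, )
                time.sleep(sleep_time)
            else:
                raise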
The next section introduces Redis, a powerful data structures server that has quickly gained a fierce following due to its performance and simplicity.
As we’ve already observed, planning ahead is important when you want to execute a potentially long-running program to scarf down data from the Web, because lots of things can go wrong. But what do you do with all of that data once you get it? You may initially be tempted to just store it to disk. In the situation we’ve just been looking at, that might result in a directory structure similar to the following:
./
    screen_name1/
        friend_ids.json
        follower_ids.json
        user_info.json
    screen_name2/
        ...
    ...
This looks pretty reasonable until you harvest all of the friends/followers for a very popular user—then, depending on your platform, you may be faced with a directory containing millions of subdirectories that’s relatively unusable because you can’t browse it very easily (if at all) in a terminal. Saving all this info to disk might also require that you maintain a registry of some sort that keeps track of all screen names, because the time required to generate a directory listing (in the event that you need one) for millions of files might not yield a desirable performance profile. And if the app that uses the data becomes threaded, you may end up with multiple writers needing to access the same file at the same time, so you’ll have to start dealing with file locking and the like. That’s probably not a place you want to go.

All we really need in this case is a system that makes it trivially easy to store basic key/value pairs, along with a simple key-encoding scheme—something like a disk-backed dictionary would be a good start. The next snippet demonstrates the construction of a key by concatenating a screen name, a delimiter, and a data structure name:
s = {}
s["screen_name1$friend_ids"] = [1, 2, 3, ...]
s["screen_name1$friend_ids"]  # returns [1, 2, 3, ...]
But wouldn’t it be cool if the map could automatically compute set operations so that we could just tell it to do something like:
s.intersection("screen_name1$friend_ids", "screen_name1$follower_ids")
to automatically compute “mutual friends” for a Twitterer (i.e., to figure out which of their friends are following them back)? Well, there’s an open source project called Redis that provides exactly that kind of capability.
Redis is trivial to install, blazingly fast (it’s written in C), scales well, is actively maintained, and has a great Python client with accompanying documentation available. Taking Redis for a test drive is as simple as installing it and starting up the server. (Windows users can save themselves some headaches by grabbing a binary that’s maintained by servicestack.net.) Then, just run easy_install redis to obtain a nice Python client that provides trivial access to everything it has to offer. For example, the previous snippet translates to the following Redis code:
import redis

r = redis.Redis(host='localhost', port=6379, db=0)  # Default params
[ r.sadd("screen_name1$friend_ids", i) for i in [1, 2, 3, ...] ]
r.smembers("screen_name1$friend_ids")  # Returns [1, 2, 3, ...]
Note that while sadd and smembers are set-based operations, Redis also includes operations specific to other types of data structures, such as lists, hashes, and sorted sets. The set operations turn out to be of particular interest because they provide the answers to many of the questions posed at the beginning of this chapter. It’s worthwhile to take a moment to review the documentation for the Redis Python client to get a better appreciation of all it can do. Recall that you can simply execute a command like pydoc redis.Redis to quickly browse documentation from a terminal.
Note
See “Redis: under the hood” for an awesome technical deep dive into how Redis works internally.
The most common set operations you’ll likely encounter are union, intersection, and difference. Recall that the difference between a set and a list is that a set is unordered and contains only unique members, while a list is ordered and may contain duplicate members. As of version 2.4, Python provides built-in support for sets via the set data structure.
Table 4-1 illustrates some examples of common set operations for a trivially small universe of discourse involving friends and followers:
Friends = {Abe, Bob}, Followers = {Bob, Carol}
Table 4-1. Sample set operations for Friends and Followers

| Operation | Result | Comment |
|---|---|---|
| Friends ∪ Followers | Abe, Bob, Carol | Someone’s overall network |
| Friends ∩ Followers | Bob | Someone’s mutual friends |
| Friends − Followers | Abe | People a person is following, but who are not following that person back |
| Followers − Friends | Carol | People who are following someone but are not being followed back |
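For reference, the operations in Table 4-1 expressed with Python’s built-in set type look like this:

friends = set(['Abe', 'Bob'])
followers = set(['Bob', 'Carol'])

print friends | followers  # union: set(['Abe', 'Bob', 'Carol'])
print friends & followers  # intersection: set(['Bob'])
print friends - followers  # difference: set(['Abe'])
print followers - friends  # difference: set(['Carol'])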
As previously mentioned, Redis provides native operations for computing common set operations. A few of the most relevant ones for the upcoming work at hand include:
smembers
Returns all of the members of a set
scard
Returns the cardinality of a set (the number of members in the set)
sinter
Computes the intersection for a list of sets
sdiff
Computes the difference for a list of sets
mget
Returns a list of string values for a list of keys
mset
Stores a list of string values against a list of keys
sadd
Adds an item to a set (and creates the set if it doesn’t already exist)
keys
Returns a list of keys matching a glob-style pattern
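As a quick illustration of how these operations map onto the questions we care about, here’s how mutual friends and one-way friendships fall out of sinter and sdiff, assuming the hypothetical screen_name1$... key scheme from the earlier snippet:

import redis

r = redis.Redis()

# Mutual friends are just an intersection of the two ID sets...
mutual_friends = r.sinter(['screen_name1$friend_ids',
                           'screen_name1$follower_ids'])

# ...and friends who aren't following back are a difference
not_following_back = r.sdiff(['screen_name1$friend_ids',
                              'screen_name1$follower_ids'])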
Skimming the pydoc for Python’s built-in set data type should convince you of the close mapping between it and the Redis APIs.
Redis should serve you well on your quest to efficiently process and analyze vast amounts of Twitter data for certain kinds of queries. Adapting Example 4-2 with some additional logic to house data in Redis requires only a simple change, and Example 4-4 is an update that computes some basic friend/follower statistics. Native functions in Redis are used to compute the set operations.
Example 4-4. Harvesting, storing, and computing statistics about friends and followers (friends_followers__friend_follower_symmetry.py)
# -*- coding: utf-8 -*-

import sys
import locale
import time
import functools
import twitter
import redis
from twitter__login import login

# A template-like function for maximizing code reuse,
# which is essentially a wrapper around makeTwitterRequest
# with some additional logic in place for interfacing with
# Redis
from twitter__util import _getFriendsOrFollowersUsingFunc

# Creates a consistent key value for a user given a screen name
from twitter__util import getRedisIdByScreenName

SCREEN_NAME = sys.argv[1]
MAXINT = sys.maxint

# For nice number formatting
locale.setlocale(locale.LC_ALL, '')

# You may need to setup your OAuth settings in twitter__login.py
t = login()

# Connect using default settings for localhost
r = redis.Redis()

# Some wrappers around _getFriendsOrFollowersUsingFunc
# that bind the first two arguments

getFriends = functools.partial(_getFriendsOrFollowersUsingFunc,
                               t.friends.ids, 'friend_ids', t, r)
getFollowers = functools.partial(_getFriendsOrFollowersUsingFunc,
                                 t.followers.ids, 'follower_ids', t, r)

screen_name = SCREEN_NAME

# get the data

print >> sys.stderr, 'Getting friends for %s...' % (screen_name, )
getFriends(screen_name, limit=MAXINT)
print >> sys.stderr, 'Getting followers for %s...' % (screen_name, )
getFollowers(screen_name, limit=MAXINT)

# use redis to compute the numbers

n_friends = r.scard(getRedisIdByScreenName(screen_name, 'friend_ids'))

n_followers = r.scard(getRedisIdByScreenName(screen_name, 'follower_ids'))

n_friends_diff_followers = r.sdiffstore('temp',
        [getRedisIdByScreenName(screen_name, 'friend_ids'),
         getRedisIdByScreenName(screen_name, 'follower_ids')])
r.delete('temp')

n_followers_diff_friends = r.sdiffstore('temp',
        [getRedisIdByScreenName(screen_name, 'follower_ids'),
         getRedisIdByScreenName(screen_name, 'friend_ids')])
r.delete('temp')

n_friends_inter_followers = r.sinterstore('temp',
        [getRedisIdByScreenName(screen_name, 'follower_ids'),
         getRedisIdByScreenName(screen_name, 'friend_ids')])
r.delete('temp')

print '%s is following %s' % (screen_name, locale.format('%d', n_friends,
                              True))
print '%s is being followed by %s' % (screen_name,
                                      locale.format('%d', n_followers, True))
print '%s of %s are not following %s back' % \
    (locale.format('%d', n_friends_diff_followers, True),
     locale.format('%d', n_friends, True), screen_name)
print '%s of %s are not being followed back by %s' % \
    (locale.format('%d', n_followers_diff_friends, True),
     locale.format('%d', n_followers, True), screen_name)
print '%s has %s mutual friends' \
    % (screen_name, locale.format('%d', n_friends_inter_followers, True))
Aside from the use of functools.partial (http://docs.python.org/library/functools.html) to create getFriends and getFollowers from a common piece of parameter-bound code, Example 4-4 should be pretty straightforward. There’s one other very subtle thing to notice: there isn’t a call to r.save in Example 4-4, which means that the settings in redis.conf dictate when data is persisted to disk. By default, Redis stores data in memory and asynchronously snapshots it to disk on a schedule determined by whether a certain number of changes have occurred within a specified time interval. The risk with asynchronous writes is that you might lose data if unexpected conditions, such as a system crash or power outage, were to occur. Redis provides an “append only” option that you can enable in redis.conf to hedge against this possibility.
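If you do want to force a snapshot at a strategic checkpoint in a long-running harvest, the Python client exposes the underlying persistence commands. A minimal sketch:

import redis

r = redis.Redis()

r.save()    # force a synchronous snapshot; blocks until the dump completes
# ...or, alternatively:
r.bgsave()  # snapshot asynchronously in a forked background process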
Note

It is highly recommended that you enable the appendonly option in redis.conf to protect against data loss; see the “Append Only File HOWTO” for helpful details.
Consider the following output, relating to Tim O’Reilly’s network of followers. Keeping in mind that there’s a rate limit of 350 OAuth requests per hour, you could expect this code to take a little less than an hour to run, because approximately 300 API calls would need to be made to collect all the follower ID values:
timoreilly is following 663
timoreilly is being followed by 1,423,704
131 of 663 are not following timoreilly back
1,423,172 of 1,423,704 are not being followed back by timoreilly
timoreilly has 532 mutual friends
Note that while you could settle for harvesting a smaller number of followers to avoid the rate-limit-imposed wait, the API documentation does not state that taking the first N pages’ worth of data yields a truly random sample, and data appears to be returned in reverse chronological order—so you should not extrapolate from such a sample in any predictable way. For example, if the first 10,000 followers returned just so happened to contain the 532 mutual friends, extrapolation from those data points would result in a skewed analysis, because those results would not be at all representative of the larger population. For a very popular Twitterer such as Britney Spears, with well over 5,000,000 followers, somewhere in the neighborhood of 1,000 API calls would be required to fetch all of the followers, over approximately a four-hour period. In general, the wait is probably worth it for this kind of data, and you could use the Twitter streaming APIs to keep your data up to date so that you never have to go through the entire ordeal again.
Warning
One common source of error for some kinds of analysis is to forget about the overall size of a population relative to your sample. For example, randomly sampling 10,000 of Tim O’Reilly’s friends and followers would actually give you the full population of his friends, yet only a tiny fraction of his followers. Depending on the sophistication of your analysis, the sample size relative to the overall size of a population can make a difference in determining whether the outcome of an experiment is statistically significant, and the level of confidence you can have about it.
Given even these basic friend/follower stats, a couple of questions that lead us toward other interesting analyses naturally follow. For example, who are the 131 people who are not following Tim O’Reilly back? Of the various questions that could be asked about friends and followers, “Who isn’t following me back?” is one of the more interesting, and it can arguably provide a lot of insight into a person’s interests. So, how can we answer it?
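As a sketch of one way to get at the answer with the tools from this chapter (this assumes getUserInfo accepts a user_ids keyword parameter, as in the updated version used later in Example 4-8):

import redis
from twitter__login import login
from twitter__util import getUserInfo, getRedisIdByScreenName

t = login()
r = redis.Redis()

# The IDs of people timoreilly follows who don't follow him back
not_following_back = r.sdiff([
    getRedisIdByScreenName('timoreilly', 'friend_ids'),
    getRedisIdByScreenName('timoreilly', 'follower_ids'),
    ])

# Resolve the IDs to user objects and print out screen names
info = getUserInfo(t, r, user_ids=list(not_following_back))
print [u['screen_name'] for u in info]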
Staring at a list of user IDs isn’t very riveting, so resolving those user IDs to actual user objects is the first obvious step. Example 4-5 extends Example 4-4 by encapsulating common error-handling code in reusable form. It also provides a function that demonstrates how to resolve those ID values to screen names using the /users/lookup API, which accepts a list of up to 100 user IDs or screen names and returns the same basic user information that you saw earlier with /users/show.
Example 4-5. Resolving basic user information such as screen names from IDs (friends_followers__get_user_info.py)
# -*- coding: utf-8 -*-

import sys
import json
import redis
from twitter__login import login

# A makeTwitterRequest call through to the /users/lookup
# resource, which accepts a comma separated list of up
# to 100 screen names. Details are fairly uninteresting.
# See also http://dev.twitter.com/doc/get/users/lookup
from twitter__util import getUserInfo

if __name__ == "__main__":
    screen_names = sys.argv[1:]

    t = login()
    r = redis.Redis()

    print json.dumps(
        getUserInfo(t, r, screen_names=screen_names),
        indent=4
    )
Although not reproduced in its entirety, the getUserInfo function that’s imported from twitter__util is essentially just a makeTwitterRequest to the /users/lookup resource using a list of screen names. The following snippet demonstrates:
def getUserInfo(t, r, screen_names):
    info = []
    response = makeTwitterRequest(t, t.users.lookup,
                                  screen_name=','.join(screen_names))

    for user_info in response:
        r.set(getRedisIdByScreenName(user_info['screen_name'], 'info.json'),
              json.dumps(user_info))
        r.set(getRedisIdByUserId(user_info['id'], 'info.json'),
              json.dumps(user_info))

    info.extend(response)
    return info
It’s worthwhile to note that getUserInfo stores the same user information under two different keys: one based on the user ID and one based on the screen name. Storing both keys allows us to easily look up a screen name given a user ID value, and a user ID value given a screen name. Translating a user ID value to a screen name is a particularly useful operation, since the social graph APIs for getting friends and followers return only ID values, which have no intuitive value until they are resolved against screen names and other basic user information. While there is redundant storage involved in this scheme compared to other approaches, the convenience is arguably worth it. Feel free to take a leaner approach if storage is a concern.
An example user information object for Tim O’Reilly follows in Example 4-6, illustrating the kind of information available about Twitterers. The sky is the limit with what you can do with data that’s this rich. We won’t mine the user descriptions and tweets of the folks who aren’t following Tim back and put them in print, but you should have enough to work with should you wish to conduct that kind of analysis.
Example 4-6. Example user object represented as JSON data for Tim O’Reilly
{
    "id": 2384071,
    "verified": true,
    "profile_sidebar_fill_color": "e0ff92",
    "profile_text_color": "000000",
    "followers_count": 1423326,
    "protected": false,
    "location": "Sebastopol, CA",
    "profile_background_color": "9ae4e8",
    "status": {
        "favorited": false,
        "contributors": null,
        "truncated": false,
        "text": "AWESOME!! RT @adafruit: a little girl asks after seeing adafruit ...",
        "created_at": "Sun May 30 00:56:33 +0000 2010",
        "coordinates": null,
        "source": "<a href=\"http://www.seesmic.com/\" rel=\"nofollow\">Seesmic</a>",
        "in_reply_to_status_id": null,
        "in_reply_to_screen_name": null,
        "in_reply_to_user_id": null,
        "place": null,
        "geo": null,
        "id": 15008936780
    },
    "utc_offset": -28800,
    "statuses_count": 11220,
    "description": "Founder and CEO, O'Reilly Media. Watching the alpha geeks...",
    "friends_count": 662,
    "profile_link_color": "0000ff",
    "profile_image_url": "http://a1.twimg.com/profile_images/941827802/IMG_...jpg",
    "notifications": false,
    "geo_enabled": true,
    "profile_background_image_url": "http://a1.twimg.com/profile_background_...gif",
    "name": "Tim O'Reilly",
    "lang": "en",
    "profile_background_tile": false,
    "favourites_count": 10,
    "screen_name": "timoreilly",
    "url": "http://radar.oreilly.com",
    "created_at": "Tue Mar 27 01:14:05 +0000 2007",
    "contributors_enabled": false,
    "time_zone": "Pacific Time (US & Canada)",
    "profile_sidebar_border_color": "87bc44",
    "following": false
}
The refactored logic for handling HTTP errors and obtaining user information in batches is provided in the following sections. Note that the handleTwitterHTTPError function intentionally doesn’t include error handling for every conceivable error case, because the action you may want to take will vary from situation to situation. For example, in the event of a urllib2.URLError (operation timed out) that is triggered because someone unplugged your network cable, you might want to prompt the user for a specific course of action.
Example 4-5 brings to light some good news and some not-so-good news. The good news is that resolving the user IDs to user objects containing a byline, location information, the latest tweet, etc. is a treasure trove of information. The not-so-good news is that it’s quite expensive to do this in terms of rate limiting, given that you can only get data back in batches of 100. For Tim O’Reilly’s friends, that’s only seven API calls. For his followers, however, it’s over 14,000, which would take nearly two days to collect, given a rate limit of 350 calls per hour (and no glitches in harvesting).
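The batching itself is mechanical. Here’s a sketch of the chunking that a function like getUserInfo presumably performs internally; the chunk and lookupUsers helpers are hypothetical names:

from twitter__login import login
from twitter__util import makeTwitterRequest

t = login()

def chunk(ids, size=100):
    # /users/lookup accepts at most 100 IDs or screen names per request
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def lookupUsers(user_ids):
    info = []
    for batch in chunk(user_ids):
        response = makeTwitterRequest(t, t.users.lookup,
                user_id=','.join([str(i) for i in batch]))
        info.extend(response)
    return info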
However, given a full collection of anyone’s friend and follower ID values, you can randomly sample and calculate measures of statistical significance to your heart’s content. Redis provides the srandmember function, which fits the bill perfectly: you pass it the name of a set, such as timoreilly$follower_ids, and it returns a random member of that set.
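For example, the following sketch draws a sample of 100 follower IDs; note that repeated calls sample with replacement, and the exact key depends on the naming scheme that getRedisIdByScreenName implements:

import redis

r = redis.Redis()

# Draw 100 random follower IDs (with replacement)
sample = [r.srandmember('timoreilly$follower_ids') for _ in xrange(100)]
print sample[:10]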
Another piece of low-hanging fruit that we can go after is computing the friends and followers that two or more Twitterers have in common. Within a given universe, these folks might be interesting for a couple of reasons. One reason is that they’re the “common thread” connecting various disparate networks; you might interpret this to be a type of similarity metric. For example, if two users were both following a large number of the same people, you might conclude that those two users had very similar interests. From there, you might start to analyze the information embedded in the tweets of the common friends to gain more insight into what those people have in common, if anything, or make other conclusions. It turns out that computing common friends and followers is just a set operation away.
Example 4-7 illustrates the use of Redis’s sinterstore function, which stores the result of a set intersection, and introduces locale.format for pretty-printing so that the output is easier to read.
Example 4-7. Finding common friends/followers for multiple Twitterers, with output that’s easier on the eyes (friends_followers__friends_followers_in_common.py)
# -*- coding: utf-8 -*-

import sys
import redis

from twitter__util import getRedisIdByScreenName

# A pretty-print function for numbers
from twitter__util import pp

r = redis.Redis()

def friendsFollowersInCommon(screen_names):
    r.sinterstore('temp$friends_in_common',
                  [getRedisIdByScreenName(screen_name, 'friend_ids')
                   for screen_name in screen_names])

    r.sinterstore('temp$followers_in_common',
                  [getRedisIdByScreenName(screen_name, 'follower_ids')
                   for screen_name in screen_names])

    print 'Friends in common for %s: %s' % (', '.join(screen_names),
            pp(r.scard('temp$friends_in_common')))

    print 'Followers in common for %s: %s' % (', '.join(screen_names),
            pp(r.scard('temp$followers_in_common')))

    # Clean up scratch workspace
    r.delete('temp$friends_in_common')
    r.delete('temp$followers_in_common')

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print >> sys.stderr, "Please supply at least two screen names."
        sys.exit(1)

    # Note:
    # The assumption is that the screen names you are
    # supplying have already been added to Redis.
    # See friends_followers__get_friends__refactored.py

    friendsFollowersInCommon(sys.argv[1:])
Note that although the values in the working sets are ID values, you could easily use Redis’s srandmember function to sample friends and followers, and then use the getUserInfo function from Example 4-5 to resolve useful information such as screen names, most recent tweets, locations, etc.
When someone shares information via a service such as Twitter, it’s only natural to wonder how far the information penetrates into the overall network by means of being retweeted. It should be fair to assume that the more followers a person has, the greater the potential is for that person’s tweets to be retweeted. Users who have a relatively high overall percentage of their originally authored tweets retweeted can be said to be more influential than users who are retweeted infrequently. Users who have a relatively high percentage of their tweets retweeted, even if they are not originally authored, might be said to be mavens—people who are exceptionally well connected and like to share information.[26] One trivial way to measure the relative influence of two or more users is to simply compare their number of followers, since every follower will have a direct view of their tweets. We already know from Example 4-6 that we can get the number of followers (and friends) for a user via the /users/lookup and /users/show APIs. Extracting that information from these APIs is trivial enough:
for screen_name in screen_names:
    _json = json.loads(r.get(getRedisIdByScreenName(screen_name, "info.json")))
    n_friends, n_followers = _json['friends_count'], _json['followers_count']
Counting numbers of followers is interesting, but there’s so much more that can be done. For example, a given user may not have the popularity of an information maven like Tim O’Reilly, but if you have him as a follower and he retweets you, you’ve suddenly tapped into a vast network of people who might just start to follow you once they’ve determined that you’re also interesting. Thus, a much better approach that you might take in calculating users’ potential influence is to not only compare their numbers of followers, but to spider out into the network a couple of levels. In fact, we can use the very same breadth-first approach that was introduced in Example 2-4.
Example 4-8 illustrates a generalized crawl function that accepts a list of screen names, a crawl depth, and parameters that control how many friends and followers to retrieve. The friends_limit and followers_limit parameters control how many items to fetch from the social graph APIs (in batches of 5,000), while friends_sample and followers_sample control how many user objects to retrieve (in batches of 100). An updated version of getUserInfo is also included to reflect the pass-through of the sampling parameters.
Example 4-8. Crawling friends/followers connections (friends_followers__crawl.py)
# -*- coding: utf-8 -*-

import sys
import redis
import functools
from twitter__login import login
from twitter__util import getUserInfo
from twitter__util import _getFriendsOrFollowersUsingFunc

SCREEN_NAME = sys.argv[1]

t = login()
r = redis.Redis()

# Some wrappers around _getFriendsOrFollowersUsingFunc that
# create convenience functions

getFriends = functools.partial(_getFriendsOrFollowersUsingFunc,
                               t.friends.ids, 'friend_ids', t, r)
getFollowers = functools.partial(_getFriendsOrFollowersUsingFunc,
                                 t.followers.ids, 'follower_ids', t, r)

def crawl(
    screen_names,
    friends_limit=10000,
    followers_limit=10000,
    depth=1,
    friends_sample=0.2,  #XXX
    followers_sample=0.0,
    ):

    getUserInfo(t, r, screen_names=screen_names)
    for screen_name in screen_names:
        friend_ids = getFriends(screen_name, limit=friends_limit)
        follower_ids = getFollowers(screen_name, limit=followers_limit)

        friends_info = getUserInfo(t, r, user_ids=friend_ids,
                                   sample=friends_sample)
        followers_info = getUserInfo(t, r, user_ids=follower_ids,
                                     sample=followers_sample)

        next_queue = [u['screen_name'] for u in friends_info + followers_info]

        d = 1
        while d < depth:
            d += 1
            (queue, next_queue) = (next_queue, [])
            for _screen_name in queue:
                friend_ids = getFriends(_screen_name, limit=friends_limit)
                follower_ids = getFollowers(_screen_name, limit=followers_limit)
                next_queue.extend(friend_ids + follower_ids)

            # Note that getUserInfo takes a kw between 0.0 and 1.0 called
            # sample that allows you to crawl only a random sample of nodes
            # at any given level of the graph
            getUserInfo(t, r, user_ids=next_queue)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "Please supply at least one screen name."
    else:
        crawl([SCREEN_NAME])

    # The data is now in the system. Do something interesting. For example,
    # find someone's most popular followers as an indicator of potential
    # influence. See friends_followers__calculate_avg_influence_of_followers.py
Assuming you’ve run crawl with high enough values for friends_limit and followers_limit to get all of a user’s friend IDs and follower IDs, all that remains is to take a large enough random sample and calculate interesting metrics, such as the average number of followers one level out. It could also be fun to look at a user’s top N followers to get an idea of whom he might be influencing. Example 4-9 demonstrates one possible approach that pulls the data out of Redis and calculates Tim O’Reilly’s most popular followers.
Example 4-9. Calculating a Twitterer’s most popular followers (friends_followers__calculate_avg_influence_of_followers.py)
# -*- coding: utf-8 -*-

import sys
import json
import locale
import redis
from prettytable import PrettyTable

# Pretty printing numbers
from twitter__util import pp

# These functions create consistent keys from
# screen names and user id values
from twitter__util import getRedisIdByScreenName
from twitter__util import getRedisIdByUserId

SCREEN_NAME = sys.argv[1]

locale.setlocale(locale.LC_ALL, '')

def calculate():
    r = redis.Redis()  # Default connection settings on localhost

    follower_ids = list(r.smembers(getRedisIdByScreenName(SCREEN_NAME,
                        'follower_ids')))

    followers = r.mget([getRedisIdByUserId(follower_id, 'info.json')
                       for follower_id in follower_ids])
    followers = [json.loads(f) for f in followers if f is not None]

    freqs = {}
    for f in followers:
        cnt = f['followers_count']
        if not freqs.has_key(cnt):
            freqs[cnt] = []

        freqs[cnt].append({'screen_name': f['screen_name'], 'user_id': f['id']})

    # It could take a few minutes to calculate freqs, so store a snapshot
    # for later use

    r.set(getRedisIdByScreenName(SCREEN_NAME, 'follower_freqs'),
          json.dumps(freqs))

    keys = freqs.keys()
    keys.sort()

    print 'The top 10 followers from the sample:'

    fields = ['Follower', 'Followers Count']
    pt = PrettyTable(fields=fields)
    [pt.set_field_align(f, 'l') for f in fields]

    for (user, freq) in reversed([(user['screen_name'], k) for k in keys[-10:]
                                 for user in freqs[k]]):
        pt.add_row([user, pp(freq)])

    pt.printt()

    all_freqs = [k for k in keys for user in freqs[k]]
    avg = reduce(lambda x, y: x + y, all_freqs) / len(all_freqs)

    print "\nThe average number of followers for %s's followers: %s" \
        % (SCREEN_NAME, pp(avg))

# psyco can only compile functions, so wrap code in a function

try:
    import psyco
    psyco.bind(calculate)
except ImportError, e:
    pass  # psyco not installed

calculate()
Note

In many common number-crunching situations, the psyco module can dynamically compile code and produce dramatic speed improvements. It’s totally optional, but definitely worth a hard look if you’re performing calculations that take more than a few seconds.
Output follows for a sample size of about 150,000 (approximately 10%) of Tim O’Reilly’s followers. For statistical analysis, a sample this large relative to the population ensures a tiny margin of error and a very high confidence level.[27] That is, the results can be considered very representative, though not quite the same thing as the absolute truth about the population:
The top 10 followers from the sample:

aplusk            4,993,072
BarackObama       4,114,901
mashable          2,014,615
MarthaStewart     1,932,321
Schwarzenegger    1,705,177
zappos            1,689,289
Veronica          1,612,827
jack              1,592,004
stephenfry        1,531,813
davos             1,522,621

The average number of followers for timoreilly's followers: 445
Interestingly, a few familiar names show up on the list, including some of the most popular Twitterers of all time: Ashton Kutcher (@aplusk), Barack Obama, Martha Stewart, and Arnold Schwarzenegger, among others. Removing these top 10 followers and recalculating lowers the average number of followers of Tim’s followers to approximately 284. Removing any follower with fewer than 10 followers of her own, however, dramatically increases the number to more than 1,000! Noting that there are tens of thousands of followers in this range and briefly perusing their profiles, however, does bring some reality into the situation: many of these users are spam accounts, users who are protecting their tweets, etc. Culling out both the top 10 followers and all followers having fewer than 10 followers of their own might be a reasonable metric to work with; doing both of these things results in a number around 800, which is still quite high. There must be something to be said for the idea of getting retweeted by a popular Twitterer who has lots of connections to other popular Twitterers.
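A sketch of that culling, building on the all_freqs list computed in Example 4-9, might look like this:

# Drop the 10 largest follower counts and any follower with fewer than
# 10 followers of their own, then recompute the average
trimmed = [f for f in sorted(all_freqs)[:-10] if f >= 10]
avg = reduce(lambda x, y: x + y, trimmed) / len(trimmed)
print 'Trimmed average:', avg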
[25] Whenever Twitter goes over capacity, an HTTP 503 error is issued. In a browser, the error page displays an image of the now infamous “fail whale.” See http://twitter.com/503.
[26] See The Tipping Point by Malcolm Gladwell (Back Bay Books) for a great discourse on mavens.
[27] It’s about a 0.14 margin of error for a 99% confidence level.