Analyzing Tweets (One Entity at a Time)
CouchDB makes a great storage medium for collecting tweets because,
just like the email messages we looked at in Chapter 3,
they are conveniently represented as JSON-based documents and lend
themselves to map/reduce analysis with very little effort. Our next
example script harvests tweets from time lines, is relatively robust, and
should be easy to understand because all of the modules and much of the
code have already been introduced in earlier chapters. One subtle
consideration in reviewing it is that it uses a simple map/reduce job to
compute the maximum ID value for a tweet and passes this in as a query
constraint so as to avoid pulling duplicate data from Twitter’s API. See
the information associated with the
since_id parameter of the time line APIs for more details.
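The max-ID map/reduce idea can be sketched in plain Python without a running CouchDB server. This is an illustrative simulation of the view logic, not the book's actual script: the function names and the document shape (a dict with an "id" field, mirroring Twitter's JSON tweets) are assumptions made for the example.

```python
# Hypothetical sketch: compute the maximum tweet id from stored documents
# so it can be passed as the since_id query constraint. In CouchDB this
# would be a map function emitting ids and a reduce function taking max;
# here both steps are simulated in memory.

def map_tweet_ids(doc):
    """Map step: emit (key, value) pairs, here just the tweet's id."""
    if "id" in doc:
        yield (None, doc["id"])

def reduce_max(values):
    """Reduce step: fold the emitted ids down to their maximum."""
    return max(values)

def max_tweet_id(docs):
    """Run the simulated map/reduce over an iterable of tweet documents."""
    emitted = [v for doc in docs for _, v in map_tweet_ids(doc)]
    return reduce_max(emitted) if emitted else None

tweets = [{"id": 100}, {"id": 250}, {"id": 175}]
since_id = max_tweet_id(tweets)  # pass this as since_id on the next request
```

The value returned by `max_tweet_id` is what would be handed to the time line API as `since_id`, so only tweets newer than anything already stored come back.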
It may also be informative to note that the maximum number of most recent tweets available from the user time line is around 3,200, while the home time line returns around 800 statuses; thus, it’s not very expensive (in terms of counting toward your rate limit) to pull all of the data that’s available. Perhaps not so intuitive when first interacting with the time line APIs is the fact that requests for data on the public time line only return 20 tweets, and those tweets are updated only every 60 seconds. To collect larger amounts of data you need to use the streaming API.
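Paging through a time line within those limits can be sketched as a loop that walks backwards with `max_id` while honoring `since_id`. The `fetch_page` stub below stands in for a real Twitter API call (such as a user time line request); its in-memory behavior, the 200-per-page default, and all names here are assumptions made so the paging logic can be shown without network access.

```python
# Hypothetical sketch of harvesting a time line: page backwards with
# max_id, stop when a request comes back empty, and never reach below
# since_id. fetch_page is a stub over an in-memory list, mimicking
# Twitter's "since_id < id <= max_id, newest first" semantics.

def fetch_page(all_tweets, since_id=None, max_id=None, count=200):
    """Stub API call: return up to `count` tweets newest-first."""
    page = [t for t in all_tweets
            if (since_id is None or t["id"] > since_id)
            and (max_id is None or t["id"] <= max_id)]
    page.sort(key=lambda t: t["id"], reverse=True)
    return page[:count]

def harvest(all_tweets, since_id=None, count=200):
    """Collect everything newer than since_id, one page per 'request'."""
    harvested, max_id = [], None
    while True:
        page = fetch_page(all_tweets, since_id=since_id,
                          max_id=max_id, count=count)
        if not page:
            break
        harvested.extend(page)
        max_id = page[-1]["id"] - 1  # step just below the oldest id seen

    return harvested

timeline = [{"id": i} for i in range(1, 501)]
new_tweets = harvest(timeline, since_id=100, count=200)
```

Because each page costs one request, pulling the full ~3,200 tweets available from a user time line takes only a handful of API calls against the rate limit.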
For example, if you wanted to learn a little more about Tim O’Reilly, “Silicon Valley’s favorite smart ...