Pulling results from Google Groups searches into a comma-delimited file.
It’s easy to look at the Internet and say it’s web pages, or it’s computers, or it’s networks. But look a little deeper and you’ll see that the core of the Internet is discussions: mailing lists, online forums, and even web sites, where people hold forth in glorious HTML, waiting for people to drop by, consider their philosophies, make contact, or buy their products and services.
Nowhere is the Internet-as-conversation idea more prevalent than in Usenet newsgroups. Google Groups has an archive of over 700 million messages from years of Usenet traffic. If you’re doing timely research, searching and saving Google Groups message pointers comes in really handy.
Because Google Groups is not searchable by the current version of the Google API, you can’t build an automated Google Groups query tool without violating Google’s TOS. However, you can scrape the HTML of a page you visit personally and save to your hard drive.
The first thing you need to do is run a Google Groups search. See the Google Groups [Hack #30] discussion for some hints on best practices for searching this message archive.
It’s best to put pages you’re going to scrape in order of date; that way if you’re going to scrape more pages later, it’s easy to look at them and check the last date the search results changed. Let’s say you’re trying to keep up with uses of Perl in programming the Google API; your query might look like this:
perl group:google.public.web-apis
On the righthand side of the results page is an option to sort either by relevance or date. Sort it by date. Your results page should look something like Figure 4-1.
Save this page to your hard drive, naming it something memorable like groups.html.
There are a couple of things to keep in mind when it comes to scraping pages, Google or not:
Scraping is brittle. A scraper is based on the way a page is formatted at the time the scraper is written. This means one minor change in the page, and things break down rather quickly.
There are myriad ways of scraping any particular page. This is just one of them, so experiment!
# groups2csv.pl
# Google Groups results exported to CSV suitable for import into Excel
# Usage: perl groups2csv.pl < groups.html > groups.csv

# The CSV Header
print qq{"title","url","group","date","author","number of articles"\n};

# The base URL for Google Groups
my $url = "http://groups.google.com";

# Rake in those results
my($results) = (join '', <>);

# Perform a regular expression match to glean individual results
while ( $results =~ m!<p><a href="?(.+?)"?>(.+?)</a><font size=-1(.+?)<br> <font color=green><a href=.+?>(.+?)</a>\s+-\s+(.+?)\s+by\s+(.+?)\s+-.+?\((\d+) articles!mgis ) {
    my($path, $title, $snippet, $group, $date, $author, $articles) =
        ($1||'',$2||'',$3||'',$4||'',$5||'',$6||'',$7||'');
    $title =~ s!"!""!g;   # double escape " marks
    $title =~ s!<.+?>!!g; # drop all HTML tags
    print qq{"$title","$url$path","$group","$date","$author","$articles"\n};
}
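The script doubles embedded quote marks only in the title; the other fields are printed exactly as captured. A minimal sketch of a helper that applies the same quoting to every field might look like the following. The names csv_field and csv_row are hypothetical and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): strip HTML tags,
# double any embedded quote marks, and wrap the value in quotes.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/<.+?>//gs;   # drop HTML tags
    $value =~ s/"/""/g;      # CSV escapes a quote mark by doubling it
    return qq{"$value"};
}

# Hypothetical helper: join already-captured fields into one CSV row.
sub csv_row {
    return join(',', map { csv_field($_) } @_) . "\n";
}

# Sample values chosen for illustration only.
print csv_row('Re: <b>Perl</b> "Problem"?', 'google.public.web-apis', 'S Anand');
```

With a helper like this, the print line inside the loop could pass all of its captured values through csv_row, so that a quote mark in an author name or group would not break the row.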
Run the script from the command line, specifying the Google Groups results filename you saved earlier and the name of the CSV file you wish to create or to which you wish to append additional results. For example, using groups.html as your input and groups.csv as your output:
$ perl groups2csv.pl < groups.html > groups.csv
Leaving off the > and CSV filename sends the results to the screen for your perusal.
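If only the header line shows up on screen, the pattern most likely no longer matches the saved page. One way to make that obvious is to count matches and complain on standard error, which keeps the message out of any redirected CSV. This is only a sketch: the sub name count_results is hypothetical, and the pattern is a simplified stand-in for the longer one in the script.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many times a result pattern matches.
sub count_results {
    my ($html, $pattern) = @_;
    my $count = 0;
    $count++ while $html =~ /$pattern/g;
    return $count;
}

# Simplified stand-in pattern; the real script uses a longer one.
my $pattern = qr{<p><a href="?(.+?)"?>(.+?)</a>}is;

# Made-up sample markup, for illustration only.
my $html = qq{<p><a href="/groups?selm=1">First</a>\n}
         . qq{<p><a href=/groups?selm=2>Second</a>};

my $found = count_results($html, $pattern);

# warn writes to STDERR, so redirected CSV output stays clean.
warn "No results matched; has the page layout changed?\n" unless $found;
print "$found results matched\n";
```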
Using a double >> before the CSV filename appends the current set of results to the CSV file, creating it if it doesn’t already exist. This is useful for combining more than one set of results, represented by more than one saved results page:
$ perl groups2csv.pl < results_1.html > results.csv
$ perl groups2csv.pl < results_2.html >> results.csv
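Because the script prints the CSV header on every run, a file built up with >> ends up with one header line per saved page. A small filter along these lines could keep only the first one. This is a sketch, not part of the original hack; the sub name drop_repeated_headers is hypothetical, and the sample rows are shortened placeholders.

```perl
use strict;
use warnings;

# Hypothetical helper: treat the first line as the header and drop
# any later line that is identical to it.
sub drop_repeated_headers {
    my @lines = @_;
    return () unless @lines;
    my $header = $lines[0];
    my @rest = grep { $_ ne $header } @lines[1 .. $#lines];
    return ($header, @rest);
}

# Placeholder rows standing in for two appended runs of the script.
my @combined = (
    qq{"title","url"\n},
    qq{"A","http://groups.google.com/a"\n},
    qq{"title","url"\n},
    qq{"B","http://groups.google.com/b"\n},
);

print drop_repeated_headers(@combined);
```

To run it over a real file, the sample list could be replaced with lines read from standard input, for example print drop_repeated_headers(<STDIN>); and the combined CSV redirected in with <.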
Scraping the results of a search for perl group:google.public.web-apis, anything mentioning the Perl programming language on the Google APIs discussion forum, looks like:
$ perl groups2csv.pl < groups.html
"title","url","group","date","author","number of articles"
"Re: Perl Problem?",
"http://groups.google.com/groups?q=perl+group:google.public.
web-apis&hl=en&lr=&ie=UTF-8&output=search&selm=5533bb12.0208230215.
365a093d%40posting.google.com&rnum=1",
"google.public.web-apis","Aug. 23, 2002","S Anand","2"
"Proxy usage from Perl script",
"http://groups.google.com/groups?q=perl+group:goo
gle.public.web-apis&hl=en&lr=&ie=UTF-8&output=search&selm=575db61f.
0206290446.1dfe4ea7%40posting.google.com&rnum=2",
"google.public.web-apis","Jun. 29, 2002","Varun","3"
...
"The Google Velocity",
"http://groups.google.com/groups?q=perl+group:google.public.web-apis&hl
=en&lr=&ie=UTF-8&output=search&selm=18a1ac72.0204221336.47fdee71%
40posting.google.com&rnum=29",
"google.public.web-apis","Apr. 22, 2002","John Graham-Cumming","2"