By Kevin Hemenway, Tara Calishain
October 2003
Pages: 424
Series: Hacks
ISBN 10: 0-596-00577-6
ISBN 13: 9780596005771
(Average of 5 Customer Reviews)
Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content.
Full Description
- Aggregate and associate data from disparate locations, then store and manipulate the data as you like
- Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
- Integrate third-party data into your own applications or web sites
- Make your own site easier to scrape and more usable to others
- Keep up to date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day
Book details
First Edition: October 2003
Featured customer reviews
Good book, March 18 2007
Gives a lot of great ideas for spidering. Emphasis is on Perl, with some occasional diversions into other languages for specific functions.
Personally, I'd prefer either a broader mix of languages, or restriction to one language. Still, overall a great book to give you a lot of ideas.
Spidering Hacks Review, February 10 2004
I enjoy the Hacks series a load! The toys you can use immediately are great fun. I immediately borrowed the idea from the "automatically find blogs of your interest" chapter, and modified it to find "friends of friends" for a blog-happy girlfriend.
What I liked most about the book is that it really broadened my Perl horizons, especially the section "building a toolkit". It is a great start to using some Perl modules that help you get the job done -fast-.
Being someone who has built a variety of spiders/scrapers, I appreciated the insight from the authors, and appreciate finding the info in a concise, condensed reference... something unknown to builders (and would-be builders) of crawlers in the past.
Spidering Hacks Review, January 15 2004
Spidering Hacks
Authors: Kevin Hemenway & Tara Calishain
Publisher: O'Reilly & Associates
Price: $24.95
Pages: 402
Web site: <http://www.oreilly.com/catalog/spiderhks/>
Reviewed by Bill Day,
Grand Rapids (Michigan) PerlMongers
4.5 stars (5 star scale). This book is not perfect; the authors may have tried to cover too much material. The material is very time sensitive, hence the book needed to be rushed together, and it will have little value in 5 years. I wanted to give the book a higher rating, but I tried to think of a better way to present the material in 400 pages and couldn't. There are just too many rough edges for a 5 star book.
As a member of O'Reilly's Hacks series, Spidering Hacks is different from the typical O'Reilly book. This book presents breadth of topic rather than depth. The format is 100 hacks (mostly Perl on Linux, with an odd Python, Java, or Windows hack), some written by Hemenway & Calishain and many written by guest authors, organized into 6 chapters. The number of authors leads to a variety of styles in both English and Perl. If you treat the book as a super magazine (time sensitive short articles), you won't be disappointed.
Chapter 1 Walking Softly (Hacks 1-7)
Chapter 1 provides general guidelines on spider/scraper etiquette and good practices, which the rest of the book seems to ignore.
Chapter 2 Assembling a toolkit (Hacks 8-32)
An overview of several modules and techniques with working examples. More experienced Perl mongers may find this material remedial.
Chapter 3 Collecting media files (Hacks 33-42)
The hacks on POP3 attachments and Usenet may be worth the price of the book for those trying to solve a particular problem.
Chapter 4 Gleaning data from databases (Hacks 43-89)
Over ½ the book is dedicated to this chapter. Initially it appears that these are very specific solutions for a narrow audience. Closer reading reveals a variety of techniques that can be used in many circumstances.
Chapter 5 Maintaining your collections (Hacks 90-93)
Not much here. Cron is covered much better in other works.
Chapter 6 Giving back to the world (Hacks 94-100)
Essentially how to be nice to spiders. Why Net::AIM is covered here seems arbitrary. Hack #100, "Going beyond the book", is nothing but fluff.
An example of how I used the book may be illustrative. I wanted to scrape TV listings, but hack #73, "Scraping TV Listings", has been made obsolete by a modification to tvguide.com. I was able to quickly use the toolkit presented in chapter 2 to scrape one of the many other web sites with TV listings. I expect this to be typical: sites change, and spiders and scrapers need to adapt.
Spidering Hacks is an odd collection of articles that seem to cover the remedial to intermediate skill ranges. Nobody will benefit from all 100 hacks, but most of us will find $24.95 of value in the hacks that cause us to go "How cool!"
Spidering Hacks Review, November 24 2003
I have been trying to find a Java book that offered me tips and tricks on how to scrape the Internet, glean the most tasty bits of it, and put them to good use. I ran across "Spidering Hacks", by Kevin Hemenway and Tara Calishain, which was exactly what I wanted - only its base language is Perl.
To my delight, the authors' writing is so lucid, their support and encouragement so welcome, and their examples so closely matched to my needs - that I immediately picked up this book, and dove headlong into the vast and beautiful world that is Perl.
Despite my preference for programming in Java for Internet-related tasks, I highly recommend this book, even for those unfamiliar with the Perl programming language, as this book is written so well that you can get up and running purely on the strength of the authors' talents. I am very impressed with this book.
Kudos to the authors.
Spidering Hacks Review, November 19 2003
Excellent job in explaining the real-world solutions to data spidering, scraping and manipulation of the data. I have educated the Internet community about the positive benefits of bots for years, and this book does an extraordinary job of giving industrial strength tips, tools and hacks, highlighted in an easy to understand format with concrete step-by-step instructions on the code, running the hack and hacking the hack. Great job Kevin and Tara!
Media reviews
"Lots of great ideas - 5 stars. Once in a long while you get a book that inspires you with a lot of great small ideas. Spidering Hacks is just that type of book...This book demonstrates everything I like in a technical book. It not only describes how things are done. It also gives practical examples of how the technology can be useful in the real world, and presents them enthusiastically. It makes you want to go out and implement all of the ideas and to keep on going with some of your own...Have to say, O'Reilly is on a roll with the Hacks series. They have all been fine books."
--Jack Herrington, Code Generation Network, March 2004
http://www.codegeneration.net/br_list.php?search=publisher&id=1
"'Spidering Hacks' is an example-filled, easy-to-follow, highly recommended computer shelf resource."
--Library Bookwatch, March 2004
"We keep recommending the O'Reilly Hacks series because these books are just so darned useful. This book is not about arachnids but about programmatically retrieving information from the Web. The focus here is on the Perl programming language, primarily because of the vast and useful collection of Perl tools that exist specifically for downloading and parsing Web content. Using the right modules, you can fetch the contents of an entire Web site in only a couple of lines worth of code. With a few more lines, you can parse that information and extract just the bit you need - say a stock quote, or a picture, or an array of Amazon links, or all the URLs on a page. In its collection of 100 tips and tricks, the book hits on just about every conceivable method of gathering and analyzing Web data. It's clearly a must for anybody who wants to automate the gathering of Web data at any level, from one-off Web spiders to complex, database-driven Web scraping applications."
--Netsurfer Digest, Volume 9 Issue 45, November 2003
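The Netsurfer Digest review above mentions pulling "all the URLs on a page" with a few lines of code. The following is only a rough sketch of that idea and is not taken from the book, whose hacks are mostly written in Perl. It uses Python's standard library instead, and the names `LinkCollector`, `extract_links`, and `SAMPLE_HTML`, along with the example.com base URL, are hypothetical and chosen purely for illustration. It parses an in-memory page rather than fetching one, so it makes no network requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are of interest here.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in the given HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page, standing in for a fetched document.
SAMPLE_HTML = """
<html><body>
<a href="/catalog/spiderhks/">Spidering Hacks</a>
<a href="http://www.example.com/news.html">News</a>
<a name="top">No href here</a>
</body></html>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML, "http://www.example.com/"):
        print(link)
```

A real spider would fetch the page first and, in keeping with the etiquette the book's first chapter is said to cover, check the site's robots.txt and pace its requests.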







