Spidering Hacks

by Morbus Iff, Tara Calishain

October 2003

Beginner to intermediate

428 pages

11h 9m

English

O'Reilly Media, Inc.

A Note Regarding Supplemental Files
Credits
    About the Authors; Contributors; Acknowledgments; Kevin; Tara
Preface
    Why Spidering Hacks?; How This Book Is Organized; How to Use This Book; Conventions Used in This Book; How to Contact Us; Got a Hack?
1. Walking Softly
Hacks #1-7
Hack #1. A Crash Course in Spidering and Scraping
    Why Spider?
Hack #2. Best Practices for You and Your Spider
    Be Liberal in What You Accept; Don’t Limit Your Dataset; Don’t Reinvent the Wheel; Best Practices for You; Choose the most structured format available; If you must scrape HTML, do so sparingly; Use the right tool for the job; Don’t go where you’re not wanted; Choose a good identifier; Make information on your spider readily available; Don’t demand unlimited site access or support; Best Practices for Your Spider; Respect robots.txt; Go light on the bandwidth; Take just enough, and don’t take too often
Hack #3. Anatomy of an HTML Page
    Anatomy of an HTML Page; Header Information with the H Tags; List Information with Special HTML Tags; Non-HTML Files
Hack #4. Registering Your Spider
    Naming Your Spider; A Web Page About Your Spider; Places to Register Your Spider
Hack #5. Preempting Discovery
    Making Contact; Making the Arguments for Your Spider; Making Your Spider Easy to Find and Learn About; Considering Legal Issues
Hack #6. Keeping Your Spider Out of Sticky Situations
    Bad Spider, No Biscuit!; Violating Copyright; Aggregating Data; Competitive Intelligence; Possible Consequences of Misbehaving Spiders; Tracking Legal Issues
Hack #7. Finding the Patterns of Identifiers
    Arbitrary Classification Systems Within a Collection; Classification Systems that Use an Established Universal Taxonomy Within a Collection; Classification Systems that Identify Documents Across a Wide Number of Collections; Some Large Collections with ID Numbers
2. Assembling a Toolbox
Hacks #8-32
    Perl Modules; Resources You May Find Helpful
Hack #8. Installing Perl Modules
    Example: Installing LWP; Unix and Mac OS X installation via CPAN; Unix and Mac OS X installation by hand; Windows installation via PPM
Hack #9. Simply Fetching with LWP::Simple
Hack #10. More Involved Requests with LWP::UserAgent
Hack #11. Adding HTTP Headers to Your Request
Hack #12. Posting Form Data with LWP
Hack #13. Authentication, Cookies, and Proxies
    Authentication; Enabling Cookies; Using Proxies
Hack #14. Handling Relative and Absolute URLs
Hack #15. Secured Access and Browser Attributes
    Other Browser Attributes
Hack #16. Respecting Your Scrapee’s Bandwidth
    If-Modified-Since; ETags; Compressed Data
Hack #17. Respecting robots.txt
Hack #18. Adding Progress Bars to Your Scripts
    The Code
Hack #19. Scraping with HTML::TreeBuilder
    Hacking the Hack
Hack #20. Parsing with HTML::TokeParser
    The Code; Running the Hack; See Also
Hack #21. WWW::Mechanize 101
    Introducing WWW::Mechanize; Using Mech’s Navigation Tools; The Code; Running the Hack
Hack #22. Scraping with WWW::Mechanize
    The Code; Running the Hack
Hack #23. In Praise of Regular Expressions
    Using Modules to Parse HTML; Watching the Printers: Score One for Regular Expressions; The Code; Not Fragile, but Probably Not Permanent Either
Hack #24. Painless RSS with Template::Extract
Hack #25. A Quick Introduction to XPath
    Using LibXML’s xmllint; The Code; Running the Hack
Hack #26. Downloading with curl and wget
Hack #27. More Advanced wget Techniques
Hack #28. Using Pipes to Chain Commands
    Browsing for Links with lynx; grepping for Patterns; wgetting the Files; Hacking the Hack
Hack #29. Running Multiple Utilities at Once
    Shell Scripts; Perl Equivalence
Hack #30. Utilizing the Web Scraping Proxy
    The Code; Running the Hack; Hacking the Hack
Hack #31. Being Warned When Things Go Wrong
Hack #32. Being Adaptive to Site Redesigns
3. Collecting Media Files
Hacks #33-42
Hack #33. Detective Case Study: Newgrounds
    The Code; Running the Hack; Hacking the Hack
Hack #34. Detective Case Study: iFilm
    The Code; Running the Hack
Hack #35. Downloading Movies from the Library of Congress
    Directory Indexes; An Example: Origins of American Animation; Another Example: America at Work, America at Leisure
Hack #36. Downloading Images from Webshots
    The Code; Running the Hack; Hacking the Hack; Starting on a given page; Downloading from other areas; Modifying filenames; Bypassing the adult content warning
Hack #37. Downloading Comics with dailystrips
    Getting the Code; Running the Hack; Hacking the Hack; Defining strips by URL; Finding strips with a search; Gathering strips into a group
Hack #38. Archiving Your Favorite Webcams
    The Code; Running the Hack; Hacking the Hack
Hack #39. News Wallpaper for Your Site
    The Code; Running the Hack; Hacking the Hack; Picture limits; RSS version; Image::Size
Hack #40. Saving Only POP3 Email Attachments
    The Code; Running the Hack; Hacking the Hack; Changing the hardcoded file extensions; Shortening or eliminating the subject line; Saving attachments to the current directory; Specifying the size of saved messages
Hack #41. Downloading MP3s from a Playlist
    The Code; Running the Hack; Hacking the Hack
Hack #42. Downloading from Usenet with nget
4. Gleaning Data from Databases
Hacks #43-89
Hack #43. Archiving Yahoo! Groups Messages with yahoo2mbox
    Running the Hack; Hacking the Hack
Hack #44. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
    The Code; Running the Hack; Hacking the Hack
Hack #45. Gleaning Buzz from Yahoo!
    The Code; Running the Hack; Hacking the Hack
Hack #46. Spidering the Yahoo! Catalog
    The Code; Running the Hack; Hacking the Hack; See Also
Hack #47. Tracking Additions to Yahoo!
    The Code; Running the Hack; Hacking the Hack
Hack #48. Scattersearch with Yahoo! and Google
    The Code; Running the Hack; Hacking the Hack
Hack #49. Yahoo! Directory Mindshare in Google
    The Code; Running the Hack; Hacking the Hack
Hack #50. Weblog-Free Google Results
    The Code; Hacking the Hack
Hack #51. Spidering, Google, and Multiple Domains
    Example: Top 20 Searching on Google; The Code; Running the Hack; Hacking the Hack
Hack #52. Scraping Amazon.com Product Reviews
    The Code; Running the Hack; See Also
Hack #53. Receive an Email Alert for Newly Added Amazon.com Reviews
    The Code; Running the Hack; See Also
Hack #54. Scraping Amazon.com Customer Advice
    The Code; Running the Hack; See Also
Hack #55. Publishing Amazon.com Associates Statistics
    The Code; Running the Hack; See Also
Hack #56. Sorting Amazon.com Recommendations by Rating
    The Code; Running the Hack; See Also
Hack #57. Related Amazon.com Products with Alexa
    The Code; Running the Hack; Hacking the Hack
Hack #58. Scraping Alexa’s Competitive Data with Java
    The Code; Running the Hack; Hacking the Hack
Hack #59. Finding Album Information with FreeDB and Amazon.com
    Getting Started; Checking Your Disc ID; Digging Up the FreeDB Details; Rocking with Amazon.com; Presenting the Results; Hacking the Hack
Hack #60. Expanding Your Musical Tastes
    The Code; Running the Hack; Hacking the Hack; Changing the number of results returned; Looking up artists; See Also
Hack #61. Saving Daily Horoscopes to Your iPod
    The Code; Running the Hack; Hacking the Hack; See Also
Hack #62. Graphing Data with RRDTOOL
    The Code; Running the Hack; Hacking the Hack
Hack #63. Stocking Up on Financial Quotes
    The Code; Running the Hack; Hacking the Hack
Hack #64. Super Author Searching
    Gathering Tools; Hacking the Library of Congress; Perusing Project Gutenberg; Navigating the Amazon; Presenting the Results; Running the Hack; Hacking the Hack
Hack #65. Mapping O’Reilly Best Sellers to Library Popularity
    The Code; Running the Hack; Hacking the Hack
Hack #66. Using All Consuming to Get Book Lists
    The SOAP Code; Most-mentioned lists; Personal book lists; Book metadata and weblog mentions; Friends and recommendations; The REST Code; Most-mentioned lists; Personal book lists; Book metadata and weblog mentions; Friends and recommendations; Running the Hack; The XML Results; Hacking the Hack
Hack #67. Tracking Packages with FedEx
    The Code; Running the Hack; Hacking the Hack
Hack #68. Checking Blogs for New Comments
    The Code; Running the Hack; Hacking the Hack
Hack #69. Aggregating RSS and Posting Changes
    The Code; Running the Hack; Hacking the Hack; See Also
Hack #70. Using the Link Cosmos of Technorati
    Need Some REST?; A Skeleton Key for Words
Hack #71. Finding Related RSS Feeds
    Filling Up the Toolbox; Getting the Dirt on Feeds; Reporting on Our Findings; Hacking the Hack
Hack #72. Automatically Finding Blogs of Interest
    The Code; Running the Hack; Hacking the Hack
Hack #73. Scraping TV Listings
    The Code; Running the Hack
Hack #74. What’s Your Visitor’s Weather Like?
    The Code; Running the Hack; Using and Hacking the Hack
Hack #75. Trendspotting with Geotargeting
    The Code; Running the Hack; Hacking the Hack
Hack #76. Getting the Best Travel Route by Train
    The Code; Running the Hack; Hacking the Hack
Hack #77. Geographic Distance and Back Again
    The Latitude/Longitude Question; Hacking the Latitude Out of MapPoint; The Code; Running the Hack; Hacking the Hack
Hack #78. Super Word Lookup
    The Code; Running the Hack; Hacking the Hack; Using specific dictionaries; Clarifying the thesaurus
Hack #79. Word Associations with Lexical Freenet
    The Code; Running the Hack
Hack #80. Reformatting Bugtraq Reports
    The Code; Running the Hack; Hacking the Hack
Hack #81. Keeping Tabs on the Web via Email
    Planning for Change; Calling In Outside Help; Send Out the News; Hacking the Hack
Hack #82. Publish IE’s Favorites to Your Web Site
    IE’s Favorites; What It Does and How It Works; The Code; Running the Hack; Hacking the Hack
Hack #83. Spidering GameStop.com Game Prices
    The Code; Running the Hack; Hacking the Hack; GameStop by keyword; Putting the results in a different format
Hack #84. Bargain Hunting with PHP
    The Code; Running the Hack; Hacking the Hack
Hack #85. Aggregating Multiple Search Engine Results
    The Code; Running the Hack
Hack #86. Robot Karaoke
    The Code; Running the Hack
Hack #87. Searching the Better Business Bureau
    The Code; Running the Hack; Hacking the Hack
Hack #88. Searching for Health Inspections
    The Code; Running the Hack; Hacking the Hack
Hack #89. Filtering for the Naughties
    The Code; Running the Hack; Hacking the Hack
5. Maintaining Your Collections
Hacks #90-93
Hack #90. Using cron to Automate Tasks
    See Also
Hack #91. Scheduling Tasks Without cron
    Do You Really Need Anything cron-Like?; Running Scripts on the Client Side; Using Perl’s sleep Function; Scheduling with Something Besides cron; Using Hosted cron Services
Hack #92. Mirroring Web Sites with wget and rsync
    Mirroring via the Web; Mirroring Directly with the Server; Hacking the Hack
Hack #93. Accumulating Search Results Over Time
    The Code; Running the Hack; Hacking the Hack; See Also
6. Giving Back to the World
Hacks #94-100
Hack #94. Using XML::RSS to Repurpose Data
    See Also
Hack #95. Placing RSS Headlines on Your Site
    The Code; Running the Hack
Hack #96. Making Your Resources Scrapable with Regular Expressions
    The Challenge of Web Scraping; Navigating between web resources; Extracting specific information; How to Be Nicer to Scrapers; Make resources easier to locate and acquire; Making data easier to extract; Hacking the Hack
Hack #97. Making Your Resources Scrapable with a REST Interface
    Navigating One URI at a Time; Negotiating Better Content; See Also
Hack #98. Making Your Resources Scrapable with XML-RPC
    Enter Web Services; Building the service; Making the service useful; Using the service from the client side; Hacking a scrape together with a service; Hacking the Hack
Hack #99. Creating an IM Interface
    The Code; Running the Hack
Hack #100. Going Beyond the Book
    Using Google and Other Search Engines; Mailing Lists; Web Sites
Index

About the Authors
Colophon
Copyright

Content preview from Spidering Hacks

Preface

When the Web began, it was a pretty small place. It didn’t take much to keep abreast of new sites, and with subject indexes like the fledgling Yahoo! and NCSA’s “What’s New” page, you could actually give keeping up with newly added pages the old college try.

Now, even the biggest search engines—yes, even Google—admit they don’t index the entire Web. It’s simply not possible. At the same time, the Web is more compelling than ever. More information is being put online at a faster clip—be it up-to-the-minute data or large collections of old materials finding an online home. The Web is more browsable, more searchable, and more useful than it ever was when it was still small. That said, we, its users, can only go so fast when searching, processing, and taking in information.

Thankfully, spidering allows us to bring a bit of sanity to the wealth of information available. Spidering is the process of automating the grabbing and sifting of information on the Web, saving us the trouble of having to browse it all manually. Spiders range in complexity from the simplest script to grab the latest weather information from a web page, to the armies of complex spiders working in concert with one another, searching, cataloging, and indexing the Web’s more than three billion resources for a search engine like Google.
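
To make the "grabbing and sifting" idea concrete, here is a minimal sketch in Python (the book itself works in Perl; Python is used here only for illustration). It is not code from the book. The names `LinkSifter`, `sift_links`, and `grab`, the `example-spider/0.1` agent string, and the sample HTML are all hypothetical, and the sketch assumes that "sifting" means pulling link targets out of a page.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkSifter(HTMLParser):
    """Collect the href of every <a> tag seen while parsing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep non-empty hrefs only.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def sift_links(html):
    """Sifting step: return the link targets found in a string of HTML."""
    sifter = LinkSifter()
    sifter.feed(html)
    return sifter.links


def grab(url, agent="example-spider/0.1"):
    """Grabbing step: fetch a page, identifying the spider by name.

    The agent string is an assumption for illustration only.
    """
    request = Request(url, headers={"User-Agent": agent})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Sift an in-memory page so the sketch runs without network access.
    sample = '<p><a href="/weather">Weather</a> <a href="/news">News</a></p>'
    print(sift_links(sample))
```

Against a live site, `sift_links(grab(url))` would chain the two steps. A real spider would also add the politeness measures listed in the table of contents, such as respecting robots.txt and going light on bandwidth.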

This book teaches you the methodologies and algorithms behind spiders and the variety of ways that spiders can be used. Hopefully, it will inspire you to come up with some useful ...

Publisher Resources

ISBN: 0596005776
