Hack 1. Hacks #1-7

Hack #1. A Crash Course in Spidering and Scraping
  Why Spider?

Hack #2. Best Practices for You and Your Spider
  Be Liberal in What You Accept
  Don’t Limit Your Dataset
  Don’t Reinvent the Wheel
  Best Practices for You
    Choose the most structured format available
    If you must scrape HTML, do so sparingly
    Use the right tool for the job
    Don’t go where you’re not wanted
    Choose a good identifier
    Make information on your spider readily available
    Don’t demand unlimited site access or support
  Best Practices for Your Spider
    Respect robots.txt
    Go light on the bandwidth
    Take just enough, and don’t take too often

Hack #3. Anatomy of an HTML Page
  Anatomy of an HTML Page
  Header Information with the H Tags
  List Information with Special HTML Tags
  Non-HTML Files

Hack #4. Registering Your Spider
  Naming Your Spider
  A Web Page About Your Spider
  Places to Register Your Spider

Hack #5. Preempting Discovery
  Making Contact
  Making the Arguments for Your Spider
  Making Your Spider Easy to Find and Learn About
  Considering Legal Issues

Hack #6. Keeping Your Spider Out of Sticky Situations
  Bad Spider, No Biscuit!
  Violating Copyright
  Aggregating Data
  Competitive Intelligence
  Possible Consequences of Misbehaving Spiders
  Tracking Legal Issues

Hack #7. Finding the Patterns of Identifiers
  Arbitrary Classification Systems Within a Collection
  Classification Systems that Use an Established Universal Taxonomy Within a Collection
  Classification Systems that Identify Documents Across a Wide Number of Collections
  Some Large Collections with ID Numbers