Chapter 3. Collecting Media Files

Hacks #33-42

The easiest things to scrape and spider are entire files, rather than specific pieces of information inside them. With one line of download code, you can have the grandeur of a movie, the sound of music, or the beauty of an image. Getting to that one line of power, however, often involves some detective work: finding out exactly where the files you want are stored, and working out the simplest, laziest way of getting your own copy.
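To make the "one line of download code" concrete, here is a minimal Python sketch using only the standard library. The `download` helper and the `index.html` fallback name are assumptions made for illustration; they are not taken from the book's own scripts.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download(url, dest_dir="."):
    """Save the file at `url` into `dest_dir` and return the local path.

    Hypothetical helper for illustration; not the book's own code.
    """
    # Name the local copy after the last path segment of the URL,
    # falling back to an assumed default when the path ends in a slash.
    filename = os.path.basename(urlparse(url).path) or "index.html"
    local_path = os.path.join(dest_dir, filename)
    # The "one line" itself: fetch the URL and write its bytes to disk.
    urllib.request.urlretrieve(url, local_path)
    return local_path
```

Everything apart from the `urlretrieve` call is bookkeeping for choosing a local filename. The detective work the chapter describes is what produces the `url` argument in the first place.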

In this chapter, we’ll explore techniques for building your media collection: archiving freeware clip art, watching old movies from the Library of Congress, and saving historic images from a scenic webcam.

Hack #33. Detective Case Study: Newgrounds

Learn how to gumshoe your way through a site’s workflow, regardless of whether there are pop-up windows, JavaScript, frames, or other bits of obscuring technology.

In this hack, we’re going to create a script to suck down the media files of Newgrounds (http://newgrounds.com), a site that specializes in odd Flash animations and similar videos. Before we can get to the code, we have to do a little bit of sleuthing to see how Newgrounds handles its operation.

Anytime we prepare to suck data from a site, especially one that isn’t just plain old static pages, the first thing we should keep in mind is the URL. Even though we don’t have a manual to the coding prowess that went into the design, we really don’t need one; we just need to pay attention, make some guesses, and get enough of ...
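As a rough illustration of paying attention to the URL, the sketch below splits an address into the parts most worth studying: the host, the path segments, and the query parameters. The `dissect` helper and the example.com address are hypothetical and chosen only for illustration; they say nothing about how Newgrounds actually structures its URLs.

```python
from urllib.parse import urlparse, parse_qs


def dissect(url):
    """Break a URL into the pieces worth studying while sleuthing."""
    parts = urlparse(url)
    return {
        "host": parts.netloc,
        # Non-empty path segments, in order.
        "path_segments": [s for s in parts.path.split("/") if s],
        # Query parameters, each mapped to a list of its values.
        "query": parse_qs(parts.query),
    }


# Hypothetical URL, for illustration only; not a real Newgrounds address.
info = dissect("http://www.example.com/media/view.php?id=1234&format=swf")
print(info)
```

Comparing these pieces across a handful of pages is one way to start making guesses: a numeric parameter that changes from page to page, for instance, may be an item identifier.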
