The Web contains information galore. Much of this information is freely available: simply surf over to an organization's web site and read its pages or search results. However, it can be difficult to separate the dross from the gems. The vast majority of a web page's visual components are typically dedicated to menus, logos, advertising banners, and fancy applets or Flash movies. What if all you are interested in is a tiny nugget of data awash in an ocean of HTML?
The answer lies in using Java to parse a web page to extract only certain pieces of information from it. The web terms for this task are scraping or harvesting information from a web page. Perhaps web services (Chapter 27) will eventually replace the need to harvest web data. But until most major sites have their web services APIs up and running, you can use Java and certain subpackages to pull specified text from a web page.
How does it work? Basically, your Java program uses HTTP to connect with a web page and pull in its HTML text.
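The fetch step can be sketched with `java.net.URL`, whose `openStream()` method returns the raw bytes behind a URL. To keep the example runnable offline, it reads a small local page through a `file:` URL; an `http://` URL pointing at a real site works the same way. The file name and the sample markup here are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.net.URL;

public class FetchPage {

    // Read the entire text behind a URL, line by line.
    static String fetch(URL url) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Offline stand-in for a real web page; swap in an
        // http:// URL to fetch from an actual site.
        File page = File.createTempFile("page", ".html");
        try (FileWriter w = new FileWriter(page)) {
            w.write("<html><body><b>Quote: 42.50</b></body></html>");
        }
        String html = fetch(page.toURI().toURL());
        System.out.println(html);
    }
}
```

The same `fetch()` method works unchanged for a remote page, because `URL.openStream()` hides the difference between local and HTTP sources behind a single stream interface.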
Parsing the HTML from web sites still involves transferring the entire web page over the network, even if you are interested in only a fraction of its information. This is why web services are a much more efficient way to share specific data from a web site.
Then you use Java code to parse the HTML and pull out only the piece of data you are interested in, such as weather data or a stock quote. ...
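One minimal way to do this extraction step is with `java.util.regex`, matching a pattern against the fetched HTML and capturing just the value you want. The page markup and the "Last trade" label below are invented for illustration; a real site's markup will differ, and for anything beyond a simple pattern a proper HTML parser is a safer choice than regular expressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteScraper {
    public static void main(String[] args) {
        // A cut-down, made-up page; in practice this string would
        // come from fetching the page over HTTP as described above.
        String html = "<html><body>"
                + "<div class=\"banner\">Big Logo Here</div>"
                + "<td>Last trade: <b>123.45</b></td>"
                + "</body></html>";

        // Capture just the number between <b> and </b>
        // following the "Last trade:" label.
        Pattern p = Pattern.compile("Last trade: <b>([0-9.]+)</b>");
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println("Stock quote: " + m.group(1));
        }
    }
}
```

Because `m.group(1)` returns only the captured digits, the menus, logos, and banners surrounding the value are simply ignored, which is the whole point of scraping.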