
Java Servlet & JSP Cookbook by Bruce W. Perry


Chapter 26. Harvesting Web Information

Introduction

The Web contains information galore. Much of this information is freely available by simply surfing over to an organization’s web site and reading its pages or search results. However, it can be difficult to separate the dross from the gems. The vast majority of a web page’s visual components are typically dedicated to menus, logos, advertising banners, and fancy applets or Flash movies. What if all you are interested in is a tiny nugget of data awash in an ocean of HTML?

The answer lies in using Java to parse a web page to extract only certain pieces of information from it. The web terms for this task are harvesting or scraping information from a web page. Perhaps web services (Chapter 27) will eventually replace the need to harvest web data. But until most major sites have their web services APIs up and running, you can use Java and certain javax.swing.text subpackages to pull specified text from web pages.

How does it work? Basically, your Java program uses HTTP to connect with a web page and pull in its HTML text.
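As a rough illustration of this first step, the following sketch opens an HTTP connection and reads the page's HTML into a String using java.net.HttpURLConnection. The class name PageFetcher and its methods are hypothetical names chosen for this example, and the code assumes the page is encoded as UTF-8, which may not hold for every site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class PageFetcher {

    // Reads every character from the reader and returns it as one String.
    static String readAll(Reader in) throws IOException {
        StringBuilder html = new StringBuilder();
        char[] buffer = new char[4096];
        int count;
        while ((count = in.read(buffer)) != -1) {
            html.append(buffer, 0, count);
        }
        return html.toString();
    }

    // Connects to the given address with an HTTP GET and returns the response body.
    // Assumption: the response is text encoded as UTF-8.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("GET");
        try (Reader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }

    // Usage: java PageFetcher <url>
    public static void main(String[] args) throws IOException {
        System.out.println(fetch(args[0]));
    }
}
```

Keeping readAll separate from fetch means the reading logic can be exercised with an in-memory Reader, without touching the network.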

Tip

Parsing the HTML from web sites still involves transferring the entire web page over the network, even if you are only interested in a fraction of its information. This is why web services are a much more efficient way to share specific data from a web site.

You then use Java code to parse the HTML page and pull from it only the piece of data you are interested in, such as weather data or a stock quote. ...
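One way this parsing step might look, using the javax.swing.text.html classes, is sketched below. It collects only the text that appears inside a chosen tag and ignores the rest of the page. The class name TagTextHarvester and the method extractText are hypothetical, and picking the data by tag type alone is a simplifying assumption; a real page usually needs a more specific test, such as checking attributes or surrounding text.

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
class TagTextHarvester {

    // Parses the HTML and returns the text found inside elements of the wanted tag type.
    static String extractText(Reader html, final HTML.Tag wanted) throws IOException {
        final StringBuilder found = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Nesting depth inside the wanted tag; text is kept only while depth > 0.
            private int depth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == wanted) {
                    depth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == wanted && depth > 0) {
                    depth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    found.append(data);
                }
            }
        };

        // The third argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(html, callback, true);
        return found.toString();
    }
}
```

For instance, calling extractText with a reader over a made-up fragment such as `<p>Temp: <b>72F</b></p>` and HTML.Tag.B should yield only the text inside the b element.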
