O'Reilly logo

Java Cookbook by Ian F. Darwin

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Extracting HTML from a URL

Problem

You need to extract all the HTML tags from a URL.

Solution

Use this simple HTML tag extractor.

Discussion

A simple HTML extractor can be made by reading a character at a time and looking for < and > tags. This is reasonably efficient if a BufferedReader is used.

The ReadTag program shown in Example 17-5 implements this; given a URL, it opens the file (similar to TextBrowser in Section 17.7) and extracts the HTML tags. Each tag is printed to the standard output.

Example 17-5. ReadTag.java

/** A simple but reusable HTML tag extractor. */ public class ReadTag { /** The URL that this ReadTag object is reading */ protected URL myURL = null; /** The Reader for this object */ protected BufferedReader inrdr = null; /* Simple main showing one way of using the ReadTag class. */ public static void main(String[] args) throws MalformedURLException, IOException { if (args.length == 0) { System.err.println("Usage: ReadTag URL [...]"); return; } for (int i=0; i<args.length; i++) { ReadTag rt = new ReadTag(args[0]); String tag; while ((tag = rt.nextTag( )) != null) { System.out.println(tag); } rt.close( ); } } /** Construct a ReadTag given a URL String */ public ReadTag(String theURLString) throws IOException, MalformedURLException { this(new URL(theURLString)); } /** Construct a ReadTag given a URL */ public ReadTag(URL theURL) throws IOException { myURL = theURL; // Open the URL for reading inrdr = new BufferedReader(new InputStreamReader(myURL.openStream( ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required