June 2001
You need to extract all the HTML tags from a URL.
Use this simple HTML tag extractor.
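A simple HTML tag extractor can be made by reading one character at a time
and looking for the < and > characters that delimit each tag. This is
reasonably efficient if a BufferedReader is used, because the reader fetches
input in blocks and most single-character reads are served from its buffer.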
The ReadTag program shown in Example 17-5 implements this; given a URL, it
opens the file (similar to TextBrowser in Section 17.7) and extracts the HTML
tags. Each tag is printed to the standard output.
Example 17-5. ReadTag.java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

/** A simple but reusable HTML tag extractor. */
public class ReadTag {
    /** The URL that this ReadTag object is reading */
    protected URL myURL = null;
    /** The Reader for this object */
    protected BufferedReader inrdr = null;

    /* Simple main showing one way of using the ReadTag class. */
    public static void main(String[] args)
            throws MalformedURLException, IOException {
        if (args.length == 0) {
            System.err.println("Usage: ReadTag URL [...]");
            return;
        }

        for (int i = 0; i < args.length; i++) {
            // Read the tags from each URL named on the command line
            ReadTag rt = new ReadTag(args[i]);
            String tag;
            while ((tag = rt.nextTag()) != null) {
                System.out.println(tag);
            }
            rt.close();
        }
    }

    /** Construct a ReadTag given a URL String */
    public ReadTag(String theURLString)
            throws IOException, MalformedURLException {
        this(new URL(theURLString));
    }

    /** Construct a ReadTag given a URL */
    public ReadTag(URL theURL) throws IOException {
        myURL = theURL;
        // Open the URL for reading
        inrdr = new BufferedReader(new InputStreamReader(myURL.openStream( ...
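The character-at-a-time scan described in the discussion can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the book's ReadTag code: the class name TagScanSketch and its nextTag and extractTags helpers are hypothetical, and the input comes from a StringReader rather than a URL so that the sketch runs without network access.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the scanning technique; not the book's ReadTag class. */
public class TagScanSketch {

    /** Returns the next "<...>" run from the reader, or null when no complete tag remains. */
    public static String nextTag(Reader in) throws IOException {
        int c;
        // Skip ordinary text until a tag opens or the input ends.
        while ((c = in.read()) != -1 && c != '<') {
            // nothing to do: characters outside tags are discarded
        }
        if (c == -1) {
            return null;
        }
        StringBuilder tag = new StringBuilder("<");
        // Collect characters up to and including the closing '>'.
        while ((c = in.read()) != -1) {
            tag.append((char) c);
            if (c == '>') {
                return tag.toString();
            }
        }
        // Input ended inside an unterminated tag; this sketch drops it.
        return null;
    }

    /** Collects every tag found in the given HTML text (helper assumed for illustration). */
    public static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        // BufferedReader keeps the single-character reads cheap.
        try (BufferedReader in = new BufferedReader(new StringReader(html))) {
            String tag;
            while ((tag = nextTag(in)) != null) {
                tags.add(tag);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }

    public static void main(String[] args) {
        String sample = "<html><body><p class=\"x\">Hi</p></body></html>";
        for (String tag : extractTags(sample)) {
            System.out.println(tag);
        }
    }
}
```

A scan this simple treats every > as the end of a tag, so a > inside a quoted attribute value or an HTML comment would end the tag early. That may be acceptable for a quick extractor but not for a full parser.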