Java Cookbook by Ian F. Darwin

Extracting URLs from a File

Problem

You need to extract just the URLs from a file.

Solution

Use ReadTag from Section 17.8, and just look for tags that might contain URLs.

Discussion

The program in Example 17-6 uses ReadTag from the previous recipe and checks each tag to see if it is a “wanted tag” listed in the array wantTags. These are the A (anchor), IMG (image), APPLET, and FRAME tags. If a tag is a wanted tag, the URL is extracted from it and printed.
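The wanted-tag check described above can also be written without listing each prefix in both cases. The following is a minimal sketch, not the book's code: the class name WantedTagCheck and the method isWanted are hypothetical, and it assumes each tag string begins with "<" followed by the tag name, as the prefixes in the recipe suggest.

```java
public class WantedTagCheck {
    /** Lowercase prefixes only; the trailing space keeps "<a " from matching "<abbr ...>". */
    static final String[] WANTED = { "<a ", "<applet ", "<img ", "<frame " };

    /** Returns true if the tag starts with any wanted prefix, ignoring case. */
    public static boolean isWanted(String tag) {
        for (String prefix : WANTED) {
            // regionMatches with ignoreCase=true compares the start of the tag to the prefix
            if (tag.regionMatches(true, 0, prefix, 0, prefix.length())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isWanted("<A HREF=\"index.html\">")); // true
        System.out.println(isWanted("<Img src=\"logo.png\">"));  // true
        System.out.println(isWanted("<abbr title=\"x\">"));      // false
    }
}
```

Comparing case-insensitively also accepts mixed-case tags such as "<Img ", which a list of only all-lowercase and all-uppercase prefixes would miss.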

Example 17-6. GetURLs.java

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;

public class GetURLs {
    /** The tag reader */
    ReadTag reader;

    public GetURLs(URL theURL) throws IOException {
        reader = new ReadTag(theURL);
    }

    public GetURLs(String theURL) throws MalformedURLException, IOException {
        reader = new ReadTag(theURL);
    }

    /* The tags we want to look at */
    public final static String[] wantTags = {
        "<a ", "<A ", "<applet ", "<APPLET ",
        "<img ", "<IMG ", "<frame ", "<FRAME ",
    };

    /** Collect every tag that starts with one of the wanted prefixes. */
    public ArrayList getURLs() throws IOException {
        ArrayList al = new ArrayList();
        String tag;
        while ((tag = reader.nextTag()) != null) {
            for (int i = 0; i < wantTags.length; i++) {
                if (tag.startsWith(wantTags[i])) {
                    al.add(tag);
                    break; // optimization: tag matched, skip the remaining prefixes
                }
            }
        }
        return al;
    }

    public void close() throws IOException {
        if (reader != null)
            reader.close();
    }

    public static void main(String[] argv) throws MalformedURLException, IOException {
        String theURL = argv.length == 0 ? "http://localhost/" : argv[0];
        GetURLs gu = new GetURLs(theURL);
        ArrayList urls = gu.getURLs();
        Iterator urlIterator = urls.iterator( ...
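The listing stops before the step where a URL is pulled out of each collected tag. As an illustration only, here is one way that step might look. The class TagUrlExtractor and its method extractUrl are hypothetical names, not part of the book's code, and the sketch assumes the URL sits in an href attribute (A), a src attribute (IMG, FRAME), or a code attribute (APPLET).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagUrlExtractor {
    // Matches href=, src=, or code= followed by a double-quoted,
    // single-quoted, or unquoted value (groups 1, 2, and 3 respectively).
    private static final Pattern URL_ATTR = Pattern.compile(
        "\\s(?:href|src|code)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    /** Returns the first URL-bearing attribute value in the tag, or null if there is none. */
    public static String extractUrl(String tag) {
        Matcher m = URL_ATTR.matcher(tag);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the three capture groups is set, depending on the quoting style
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractUrl("<a href=\"http://example.com/index.html\">")); // http://example.com/index.html
        System.out.println(extractUrl("<IMG SRC='logo.png' ALT=x>"));                  // logo.png
        System.out.println(extractUrl("<frame src=menu.html>"));                       // menu.html
        System.out.println(extractUrl("<a name=\"top\">"));                            // null
    }
}
```

The returned value is whatever the page author wrote, so a relative URL such as logo.png would still need to be resolved against the page's own URL before it could be fetched.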
