The URL Class

Bringing this down to a more concrete level is the Java URL class. The URL class represents a URL address and provides a simple API for accessing web resources, such as documents and applications on servers. It can use an extensible set of protocol and content handlers to perform the necessary communication and in theory even data conversion. With the URL class, an application can open a connection to a server on the network and retrieve content with just a few lines of code. As new types of servers and new formats for content evolve, additional URL handlers can be supplied to retrieve and interpret the data without modifying your applications.

A URL is represented by an instance of the java.net.URL class. A URL object manages all the component information within a URL string and provides methods for retrieving the object it identifies. We can construct a URL object from a URL string or from its component parts:

try {
    URL aDoc =
      new URL( "http://foo.bar.com/documents/homepage.html" );
    URL sameDoc =
      new URL("http","foo.bar.com","documents/homepage.html");
} catch ( MalformedURLException e ) { ... }

These two URL objects point to the same network resource, the homepage.html document on the server foo.bar.com. Whether the resource actually exists and is available isn’t known until we try to access it. When initially constructed, the URL object contains only data about the object’s location and how to access it. No connection to the server has been made. We can examine the various parts of the URL with the getProtocol(), getHost(), and getFile() methods. We can also compare it to another URL with the sameFile() method (which has an unfortunate name for something that may not point to a file). sameFile() determines whether two URLs point to the same resource. It can be fooled, but sameFile() does more than compare the URL strings for equality; it takes into account the possibility that one server may have several names as well as other factors. (It doesn’t go as far as to fetch the resources and compare them, however.)

When a URL is created, its specification is parsed to identify just the protocol component. If the protocol doesn’t make sense, or if Java can’t find a protocol handler for it, the URL constructor throws a MalformedURLException. A protocol handler is a Java class that implements the communications protocol for accessing the URL resource. For example, given an http URL, Java prepares to use the HTTP protocol handler to retrieve documents from the specified web server.

As of Java 7, URL protocol handlers are guaranteed to be provided for http, https (secure HTTP), and ftp, as well as local file URLs and jar URLs that refer to files inside JAR archives. Outside of that, it gets a little dicey. We’ll talk more about the issues surrounding content and protocol handlers a bit later in this chapter.

Stream Data

The lowest-level and most general way to get data back from a URL is to ask for an InputStream from the URL by calling openStream(). Getting the data as a stream may also be useful if you want to receive continuous updates from a dynamic information source. The drawback is that you have to parse the contents of the byte stream yourself. Working in this mode is basically the same as working with a byte stream from socket communications, but the URL protocol handler has already dealt with all of the server communications and is providing you with just the content portion of the transaction. Not all types of URLs support the openStream() method because not all types of URLs refer to concrete data; you’ll get an UnknownServiceException if the URL doesn’t.

The following code prints the contents of an HTML file on a web server:

try {
    URL url = new URL("http://server/index.html");
  
    BufferedReader bin = new BufferedReader (
        new InputStreamReader( url.openStream() ));
  
    String line;
    while ( (line = bin.readLine()) != null ) {
        System.out.println( line );
    }
    bin.close();
} catch (Exception e) { }

We ask for an InputStream with openStream() and wrap it in a BufferedReader to read the lines of text. Because we specify the http protocol in the URL, we enlist the services of an HTTP protocol handler. Note that we haven’t talked about content handlers yet. In this case, because we’re reading directly from the input stream, no content handler (no transformation of the content data) is involved.

Getting the Content as an Object

As we said previously, reading raw content from a stream is the most general mechanism for accessing data over the Web. openStream() leaves the parsing of data up to you. The URL class, however, was intended to support a more sophisticated, pluggable, content-handling mechanism. We’ll discuss this now, but be aware that it is not widely used because of lack of standardization and limitations in how you can deploy new handlers. Although the Java community made some progress in recent years in standardizing a small set of protocol handlers, no such effort was made to standardize content handlers. This means that although this part of the discussion is interesting, its usefulness is limited.

The way it’s supposed to work is that when Java knows the type of content being retrieved from a URL and a proper content handler is available, you can retrieve the URL content as an appropriate Java object by calling the URL’s getContent() method. In this mode of operation, getContent() initiates a connection to the host, fetches the data for you, determines the type of data, and then invokes a content handler to turn the bytes into a Java object. It acts sort of as if you had read a serialized Java object, as in Chapter 13. Java will try to determine the type of the content by looking at its MIME type, its file extension, or even by examining the bytes directly.

For example, given the URL http://foo.bar.com/index.html , a call to getContent() uses the HTTP protocol handler to retrieve data and might use an HTML content handler to turn the data into an appropriate document object. Similarly, a GIF file might be turned into an AWT ImageProducer object using a GIF content handler. If we access the GIF file using an FTP URL, Java would use the same content handler but a different protocol handler to receive the data.

Since the content handler must be able to return any type of object, the return type of getContent() is Object. This might leave us wondering what kind of object we got. In a moment, we’ll describe how we could ask the protocol handler about the object’s MIME type. Based on this, and whatever other knowledge we have about the kind of object we are expecting, we can cast the Object to its appropriate, more specific type. For example, if we expect an image, we might cast the result of getContent() to ImageProducer:

try  {
    ImageProducer ip = (ImageProducer)myURL.getContent();
} catch ( ClassCastException e ) { ... }

Various kinds of errors can occur when trying to retrieve the data. For example, getContent() can throw an IOException if there is a communications error. Other kinds of errors can occur at the application level: some knowledge of how the application-specific content and protocol handlers deal with errors is necessary. One problem that could arise is that a content handler for the data’s MIME type wouldn’t be available. In this case, getContent() invokes a special “unknown type” handler that returns the data as a raw InputStream (back to square one).

In some situations, we may also need knowledge of the protocol handler. For example, consider a URL that refers to a nonexistent file on an HTTP server. When requested, the server returns the familiar “404 Not Found” message. To deal with protocol-specific operations like this, we may need to talk to the protocol handler, which we’ll discuss next.

Managing Connections

Upon calling openStream() or getContent() on a URL, the protocol handler is consulted and a connection is made to the remote server or location. Connections are represented by a URLConnection object, subtypes of which manage different protocol-specific communications and offer additional metadata about the source. The HttpURLConnection class, for example, handles basic web requests and also adds some HTTP-specific capabilities such as interpreting “404 Not Found” messages and other web server errors. We’ll talk more about HttpURLConnection later in this chapter.

We can get a URLConnection from our URL directly with the openConnection() method. One of the things we can do with the URLConnection is ask for the object’s content type before reading data. For example:

URLConnection connection = myURL.openConnection();
String mimeType = connection.getContentType();
InputStream in = connection.getInputStream();

Despite its name, a URLConnection object is initially created in a raw, unconnected state. In this example, the network connection was not actually initiated until we called the getContentType() method. The URLConnection does not talk to the source until data is requested or its connect() method is explicitly invoked. Prior to connection, network parameters and protocol-specific features can be set up. For example, we can set timeouts on the initial connection to the server and on reads:

URLConnection connection = myURL.openConnection();
connection.setConnectTimeout( 10000 ); // milliseconds
connection.setReadTimeout( 10000 ); // milliseconds
InputStream in = connection.getInputStream();

As we’ll see in the section “Using the POST Method,” we can get at the protocol-specific information by casting the URLConnection to its specific subtype.

Handlers in Practice

The content- and protocol-handler mechanisms we’ve described are very flexible; to handle new types of URLs, you need only add the appropriate handler classes. One interesting application of this would be Java-based web browsers that could handle new and specialized kinds of URLs by downloading them over the Net. The idea for this was touted in the earliest days of Java. Unfortunately, it never came to fruition. There is no API for dynamically downloading new content and protocol handlers. In fact, there is no standard API for determining what content and protocol handlers exist on a given platform.

Java currently mandates protocol handlers for HTTP, HTTPS, FTP, FILE, and JAR. While in practice you will generally find these basic protocol handlers with all versions of Java, that’s not entirely comforting, and the story for content handlers is even less clear. The standard Java classes don’t, for example, include content handlers for HTML, GIF, JPEG, or other common data types. Furthermore, although content and protocol handlers are part of the Java API and an intrinsic part of the mechanism for working with URLs, specific content and protocol handlers aren’t defined. Even those protocol handlers that have been bundled in Java are still packaged as part of the Sun implementation classes and are not truly part of the core API for all to see.

In summary, the Java content- and protocol-handler mechanism was a forward-thinking approach that never quite materialized. The promise of web browsers that dynamically extend themselves for new types of protocols and new content is, like flying cars, always just a few years away. Although the basic mechanics of the protocol-handler mechanism are useful (especially now with some standardization) for decoding content in your own applications, you should probably turn to other, newer frameworks that have a bit more specificity.

Useful Handler Frameworks

The idea of dynamically downloadable handlers could also be applied to other kinds of handler-like components. For example, the Java XML community is fond of referring to XML as a way to apply semantics (meaning) to documents and to Java as a portable way to supply the behavior that goes along with those semantics. It’s possible that an XML viewer could be built with downloadable handlers for displaying XML tags.

The JavaBeans APIs touch upon this subject with the Java Activation Framework (JAF), which provides a way to detect the data stream type and “encapsulate access to it” in a Java bean. If this sounds suspiciously like the content handler’s job, it is. Unfortunately, it looks like these APIs will not be merged and, outside of the Java Mail API, the JAF has not been widely used.

Fortunately, for working with URL streams of images, music, and video, very mature APIs are available. The Java Advanced Imaging API (JAI) includes a well-defined, extensible set of handlers for most image types, and the Java Media Framework (JMF) can play most common music and video types found online.

Get Learning Java, 4th Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.