Bringing this down to a more concrete level is the Java URL class. The URL class represents a URL address and provides a simple API for accessing web resources, such as documents and applications on servers. It can use an extensible set of protocol and content handlers to perform the necessary communication and in theory even data conversion. With the URL class, an application can open a connection to a server on the network and retrieve content with just a few lines of code. As new types of servers and new formats for content evolve, additional URL handlers can be supplied to retrieve and interpret the data without modifying your applications.
A URL is represented by an instance of the java.net.URL class. A URL object manages all the component information within a URL string and provides methods for retrieving the object it identifies. We can construct a URL object from a URL string or from its component parts:
try {
    URL aDoc = new URL("http://foo.bar.com/documents/homepage.html");
    // the file component must include its leading slash
    URL sameDoc = new URL("http", "foo.bar.com", "/documents/homepage.html");
} catch (MalformedURLException e) {
    ...
}
These two URL objects point to the same network resource, the homepage.html document on the server foo.bar.com. Whether the resource actually exists and is available isn’t known until we try to access it. When initially constructed, the URL object contains only data about the object’s location and how to access it. No connection to the server has been made. We can examine the various parts of the URL with the getProtocol(), getHost(), and getFile() methods. We can also compare it to another URL with the sameFile() method (which has an unfortunate name for something that may not point to a file). sameFile() determines whether two URLs point to the same resource. It can be fooled, but sameFile() does more than compare the URL strings for equality; it takes into account the possibility that one server may have several names as well as other factors. (It doesn’t go as far as to fetch the resources and compare them, however.)
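As a minimal sketch of the accessors and of sameFile(), the class below splits a URL string into its parts and compares two URLs. The class name UrlParts and its helper methods are hypothetical names chosen for illustration; only the java.net.URL calls come from the standard API.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlParts {
    // Parses a URL string, turning the checked exception into an unchecked one.
    static URL parse(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Joins the protocol, host, and file components for display.
    static String describe(String spec) {
        URL url = parse(spec);
        return url.getProtocol() + " | " + url.getHost() + " | " + url.getFile();
    }

    // Compares two URL strings with sameFile(), which ignores any #fragment.
    static boolean sameFile(String first, String second) {
        return parse(first).sameFile(parse(second));
    }

    public static void main(String[] args) {
        String doc = "http://foo.bar.com/documents/homepage.html";
        System.out.println(describe(doc));

        // The second URL differs only by its fragment, so sameFile() treats
        // the two as the same resource, while equals() does not.
        System.out.println(sameFile(doc, doc + "#intro"));
        System.out.println(parse(doc).equals(parse(doc + "#intro")));
    }
}
```

Note that comparing hosts may trigger a name lookup, so sameFile() and equals() can be slow on http URLs when the network is unavailable.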
When a URL is created, its specification is parsed to identify just the protocol component. If the protocol doesn’t make sense, or if Java can’t find a protocol handler for it, the URL constructor throws a MalformedURLException. A protocol handler is a Java class that implements the communications protocol for accessing the URL resource. For example, given an http URL, Java prepares to use the HTTP protocol handler to retrieve documents from the specified web server.
As of Java 7, URL protocol handlers are guaranteed to be provided for http, https (secure HTTP), and ftp, as well as local file URLs and jar URLs that refer to files inside JAR archives. Outside of that, it gets a little dicey. We’ll talk more about the issues surrounding content and protocol handlers a bit later in this chapter.
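One way to see the protocol check in action is to try constructing URLs and note which ones are rejected. The following is a sketch only: the class name ProtocolCheck, the method hasHandler(), and the made-up "bogus" protocol are assumptions for illustration, and the paths shown need not exist because constructing a URL does not contact anything.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ProtocolCheck {
    // Returns true if a URL object can be constructed for the string, that is,
    // if it parses and Java finds a protocol handler for its protocol.
    static boolean hasHandler(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] specs = {
            "http://foo.bar.com/index.html",
            "file:/tmp/notes.txt",
            "jar:file:/tmp/lib.jar!/META-INF/MANIFEST.MF",
            "bogus://foo.bar.com/"   // hypothetical protocol with no handler
        };
        for (String spec : specs) {
            System.out.println(spec + " -> " + hasHandler(spec));
        }
    }
}
```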
The lowest-level and most general way to get data back from a URL is to ask for an InputStream from the URL by calling openStream(). Getting the data as a stream may also be useful if you want to receive continuous updates from a dynamic information source. The drawback is that you have to parse the contents of the byte stream yourself. Working in this mode is basically the same as working with a byte stream from socket communications, but the URL protocol handler has already dealt with all of the server communications and is providing you with just the content portion of the transaction. Not all types of URLs support the openStream() method because not all types of URLs refer to concrete data; you’ll get an UnknownServiceException if the URL doesn’t.
The following code prints the contents of an HTML file on a web server:
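try {
    URL url = new URL("http://server/index.html");

    // try-with-resources closes the reader even if reading fails
    try (BufferedReader bin = new BufferedReader(
            new InputStreamReader(url.openStream()))) {
        String line;
        while ((line = bin.readLine()) != null) {
            System.out.println(line);
        }
    }
} catch (IOException e) {
    // covers MalformedURLException as well as connection and read errors
    System.err.println("Could not read URL: " + e);
}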
We ask for an InputStream with openStream() and wrap it in a BufferedReader to read the lines of text. Because we specify the http protocol in the URL, we enlist the services of an HTTP protocol handler. Note that we haven’t talked about content handlers yet. In this case, because we’re reading directly from the input stream, no content handler (no transformation of the content data) is involved.
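The same stream-reading pattern works with any protocol handler. As a sketch that runs without a web server, the class below writes some text to a temporary file and reads it back through a file: URL. The class name LocalStreamDemo, its methods, and the choice of UTF-8 are assumptions made for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalStreamDemo {
    // Reads everything the URL's stream delivers, decoding it as UTF-8 text.
    static String readAll(URL url) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    // Writes the text to a temporary file, then reads it back via a file: URL.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("urldemo", ".html");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return readAll(tmp.toUri().toURL());
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(roundTrip("<html>\n<body>Hello</body>\n</html>\n"));
    }
}
```

Naming the character set explicitly avoids depending on the platform default when the bytes are turned into text.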
As we said previously, reading raw content from a stream is the most general mechanism for accessing data over the Web. openStream() leaves the parsing of data up to you. The URL class, however, was intended to support a more sophisticated, pluggable, content-handling mechanism. We’ll discuss this now, but be aware that it is not widely used because of lack of standardization and limitations in how you can deploy new handlers. Although the Java community made some progress in recent years in standardizing a small set of protocol handlers, no such effort was made to standardize content handlers. This means that although this part of the discussion is interesting, its usefulness is limited.
The way it’s supposed to work is that when Java knows the type of content being retrieved from a URL and a proper content handler is available, you can retrieve the URL content as an appropriate Java object by calling the URL’s getContent() method. In this mode of operation, getContent() initiates a connection to the host, fetches the data for you, determines the type of data, and then invokes a content handler to turn the bytes into a Java object. It acts sort of as if you had read a serialized Java object, as in Chapter 13. Java will try to determine the type of the content by looking at its MIME type, its file extension, or even by examining the bytes directly.
For example, given the URL http://foo.bar.com/index.html, a call to getContent() uses the HTTP protocol handler to retrieve data and might use an HTML content handler to turn the data into an appropriate document object. Similarly, a GIF file might be turned into an AWT ImageProducer object using a GIF content handler. If we access the GIF file using an FTP URL, Java would use the same content handler but a different protocol handler to receive the data.
Since the content handler must be able to return any type of object, the return type of getContent() is Object. This might leave us wondering what kind of object we got. In a moment, we’ll describe how we could ask the protocol handler about the object’s MIME type. Based on this, and whatever other knowledge we have about the kind of object we are expecting, we can cast the Object to its appropriate, more specific type. For example, if we expect an image, we might cast the result of getContent() to ImageProducer:
try {
    ImageProducer ip = (ImageProducer) myURL.getContent();
} catch (ClassCastException e) {
    ...
}
Various kinds of errors can occur when trying to retrieve the data. For example, getContent() can throw an IOException if there is a communications error. Other kinds of errors can occur at the application level: some knowledge of how the application-specific content and protocol handlers deal with errors is necessary. One problem that could arise is that a content handler for the data’s MIME type wouldn’t be available. In this case, getContent() invokes a special “unknown type” handler that returns the data as a raw InputStream (back to square one).
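Because getContent() may hand back either a converted object or just a stream, a defensive caller can test the result before casting. The sketch below does this for a temporary local file; the class name ContentDemo and its methods are hypothetical, and the exact object returned for a given type depends on which content handlers the Java runtime happens to provide.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

public class ContentDemo {
    // Classifies whatever getContent() returned without assuming its type.
    static String classify(Object content) {
        if (content instanceof InputStream) {
            return "stream";   // no converting handler: we must parse the bytes
        }
        return "object of type " + content.getClass().getName();
    }

    // Creates an empty temporary file with the given suffix and asks its
    // file: URL for the content.
    static String classifyTempFile(String suffix) {
        try {
            Path tmp = Files.createTempFile("urldemo", suffix);
            Object content = null;
            try {
                URL url = tmp.toUri().toURL();
                content = url.getContent();
                return classify(content);
            } finally {
                if (content instanceof InputStream) {
                    ((InputStream) content).close();
                }
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(classifyTempFile(".txt"));
    }
}
```

Testing with instanceof avoids the ClassCastException that a blind cast would risk when the expected handler is missing.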
In some situations, we may also need knowledge of the protocol handler. For example, consider a URL that refers to a nonexistent file on an HTTP server. When requested, the server returns the familiar “404 Not Found” message. To deal with protocol-specific operations like this, we may need to talk to the protocol handler, which we’ll discuss next.
Upon calling openStream() or getContent() on a URL, the protocol handler is consulted and a connection is made to the remote server or location. Connections are represented by a URLConnection object, subtypes of which manage different protocol-specific communications and offer additional metadata about the source. The HttpURLConnection class, for example, handles basic web requests and also adds some HTTP-specific capabilities such as interpreting “404 Not Found” messages and other web server errors. We’ll talk more about HttpURLConnection later in this chapter.
We can get a URLConnection from our URL directly with the openConnection() method. One of the things we can do with the URLConnection is ask for the object’s content type before reading data. For example:
URLConnection connection = myURL.openConnection();
String mimeType = connection.getContentType();
InputStream in = connection.getInputStream();
Despite its name, a URLConnection object is initially created in a raw, unconnected state. In this example, the network connection was not actually initiated until we called the getContentType() method. The URLConnection does not talk to the source until data is requested or its connect() method is explicitly invoked. Prior to connection, network parameters and protocol-specific features can be set up. For example, we can set timeouts on the initial connection to the server and on reads:
URLConnection connection = myURL.openConnection();
connection.setConnectTimeout(10000); // milliseconds
connection.setReadTimeout(10000); // milliseconds
InputStream in = connection.getInputStream();
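To try the configure-then-connect sequence without a server, the following sketch opens a URLConnection on a temporary local file, sets the timeouts while it is still unconnected, and only then asks for the content type. The class name ConnectionInfo and its method are hypothetical; the timeouts are set purely to show the ordering, and this example assumes they have little practical effect on a local file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;

public class ConnectionInfo {
    // Creates an empty temporary file with the given suffix and reports the
    // content type that its URLConnection determines for it.
    static String contentTypeOfTempFile(String suffix) {
        try {
            Path tmp = Files.createTempFile("urldemo", suffix);
            try {
                // Still unconnected: nothing has been opened yet.
                URLConnection connection = tmp.toUri().toURL().openConnection();
                connection.setConnectTimeout(10000); // milliseconds
                connection.setReadTimeout(10000);    // milliseconds

                // Asking for metadata is what triggers the connection.
                String type = connection.getContentType();

                // Close the underlying stream so the file can be deleted.
                connection.getInputStream().close();
                return type;
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(contentTypeOfTempFile(".html"));
    }
}
```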
As we’ll see in the section “Using the POST Method,” we can get at the protocol-specific information by casting the URLConnection to its specific subtype.
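A cautious way to perform that cast is to check the connection’s type first, since only http and https URLs yield an HttpURLConnection. This is a minimal sketch: the class name HttpCast and the method isHttpConnection() are hypothetical, and nothing is sent over the network because the connection is opened but never connected.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

public class HttpCast {
    // Opens (but does not connect) a URLConnection and reports whether the
    // protocol handler supplied the HTTP-specific subtype.
    static boolean isHttpConnection(String spec) {
        try {
            URLConnection connection = new URL(spec).openConnection();
            return connection instanceof HttpURLConnection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        URLConnection connection =
            new URL("http://foo.bar.com/index.html").openConnection();

        if (connection instanceof HttpURLConnection) {
            HttpURLConnection http = (HttpURLConnection) connection;
            // HTTP-specific setup is available only on the subtype.
            http.setRequestMethod("HEAD");
            http.setConnectTimeout(10000); // milliseconds
            // A later call such as getResponseCode() would connect and
            // expose the server's status, for example 404.
            System.out.println(http.getRequestMethod());
        }
    }
}
```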
The content- and protocol-handler mechanisms we’ve described are very flexible; to handle new types of URLs, you need only add the appropriate handler classes. One interesting application of this would be Java-based web browsers that could handle new and specialized kinds of URLs by downloading them over the Net. The idea for this was touted in the earliest days of Java. Unfortunately, it never came to fruition. There is no API for dynamically downloading new content and protocol handlers. In fact, there is no standard API for determining what content and protocol handlers exist on a given platform.
Java currently mandates protocol handlers for HTTP, HTTPS, FTP, FILE, and JAR. While in practice you will generally find these basic protocol handlers with all versions of Java, that’s not entirely comforting, and the story for content handlers is even less clear. The standard Java classes don’t, for example, include content handlers for HTML, GIF, JPEG, or other common data types. Furthermore, although content and protocol handlers are part of the Java API and an intrinsic part of the mechanism for working with URLs, specific content and protocol handlers aren’t defined. Even those protocol handlers that have been bundled in Java are still packaged as part of the Sun implementation classes and are not truly part of the core API for all to see.
In summary, the Java content- and protocol-handler mechanism was a forward-thinking approach that never quite materialized. The promise of web browsers that dynamically extend themselves for new types of protocols and new content is, like flying cars, always just a few years away. Although the basic mechanics of the protocol-handler mechanism are useful (especially now with some standardization) for decoding content in your own applications, you should probably turn to other, newer frameworks that have a bit more specificity.
The idea of dynamically downloadable handlers could also be applied to other kinds of handler-like components. For example, the Java XML community is fond of referring to XML as a way to apply semantics (meaning) to documents and to Java as a portable way to supply the behavior that goes along with those semantics. It’s possible that an XML viewer could be built with downloadable handlers for displaying XML tags.
The JavaBeans APIs touch upon this subject with the Java Activation Framework (JAF), which provides a way to detect the data stream type and “encapsulate access to it” in a Java bean. If this sounds suspiciously like the content handler’s job, it is. Unfortunately, it looks like these APIs will not be merged and, outside of the Java Mail API, the JAF has not been widely used.
Fortunately, for working with URL streams of images, music, and video, very mature APIs are available. The Java Advanced Imaging API (JAI) includes a well-defined, extensible set of handlers for most image types, and the Java Media Framework (JMF) can play most common music and video types found online.