13.8. Obtaining the HTML from a URL
Problem
You need to get the HTML returned from a web server in order to examine it for items of interest. For example, you could examine the returned HTML for links to other pages or for headlines from a news site.
Solution
We can use the web communication methods set up in Recipe 13.5 and Recipe 13.6 to make the HTTP request and verify the response; then we can get at the HTML via the GetResponseStream method of the HttpWebResponse object:
// requires: using System; using System.IO; using System.Net; using System.Text;
public static string GetHTMLFromURL(string url)
{
    // reject a null or empty URL before making the request
    if (url == null || url.Length == 0)
        throw new ArgumentException("Invalid URL", "url");
    string html = "";
    HttpWebRequest request = GenerateGetOrPostRequest(url, "GET", null);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse( );
    try
    {
        if (VerifyResponse(response) == ResponseCategories.Success)
        {
            // get the response stream.
            Stream responseStream = response.GetResponseStream( );
            // use a stream reader that understands UTF8
            StreamReader reader = new StreamReader(responseStream, Encoding.UTF8);
            try
            {
                html = reader.ReadToEnd( );
            }
            finally
            {
                // close the reader
                reader.Close( );
            }
        }
    }
    finally
    {
        // always close the response to release the connection
        response.Close( );
    }
    return html;
}
Discussion
The GetHTMLFromURL method gets a web page using the GenerateGetOrPostRequest and GetResponse methods and verifies the response using the VerifyResponse method. Once we have a valid response, we read the returned HTML from the response stream; if the response is not a success, the method returns an empty string.
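The stream-reading step can be tried without a network connection. The following is only a minimal sketch, not part of the recipe: the class HtmlStreamSketch and the helpers ReadHtml and FindLinks are hypothetical names, a MemoryStream stands in for the stream a real response would supply, and the regular expression is a deliberately simple way to look for links (it only matches double-quoted href values).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlStreamSketch
{
    // Hypothetical helper: decodes a stream as UTF-8 text, the same step
    // GetHTMLFromURL performs on the response stream.
    public static string ReadHtml(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical helper: collects double-quoted href values with a simple
    // regular expression. A real HTML parser would be more robust.
    public static List<string> FindLinks(string html)
    {
        List<string> links = new List<string>();
        string pattern = "href\\s*=\\s*\"([^\"]*)\"";
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }

    public static void Main()
    {
        // Assumption: an in-memory stream replaces the network response stream.
        byte[] bytes = Encoding.UTF8.GetBytes(
            "<html><body><a href=\"http://example.com/news\">News</a></body></html>");
        string html = ReadHtml(new MemoryStream(bytes));
        foreach (string link in FindLinks(html))
        {
            Console.WriteLine(link); // prints http://example.com/news
        }
    }
}
```

In real use, the string returned by GetHTMLFromURL could be passed to a routine like FindLinks in the same way.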
The GetResponseStream method on the HttpWebResponse provides access ...