Python & XML by Christopher A. Jones, Fred L. Drake

Building a Web Application

Now you can use your new knowledge of the DOM to create a simple web application. Let’s build one that allows for the posting and viewing of articles. The articles are submitted and viewed via a web browser, but stored by the web server as XML, which allows the articles to be leveraged into different information systems that process XML. HTML articles, on the other hand, are unusable outside of a web browser.

Preparing the Web Server

In order to run the examples in this chapter, you must have a web server available that lets you execute CGI scripts. These examples were designed on Apache, so each CGI script begins with a sh-bang line that specifies the path to the Python executable (the #! expression at the top of the file, such as #!/usr/bin/python or the #!/usr/local/bin/python used in Example 4-9) so that Apache can run it just like any other CGI script. (Understanding the term “sh-bang” requires a little bit of knowledge of Unix history. The traditional command-line environment for Unix was originally implemented using the sh program. The exclamation point was named the “bang” character because it was always used after words such as “bang” and “pow” in comic books and cartoons. Since the lines at the top of scripts that started with #! were interpreted by the sh program, they came to be known as sh-bang lines.)

Ensuring the script’s execution

You must enable the execution of your Python scripts on your web server. On Apache, this means enabling CGI within the web directory, ensuring that the actual CGI scripts contain the pointer to the Python interpreter so they run correctly, and setting the “execute” permission on the script. This last item can be accomplished using the chmod program:

$> chmod +x start.cgi

On other web servers and on Windows, you need to assign a handler to your CGI scripts so that they are executed by the Python interpreter. This may require that you name your scripts with a .py extension, as opposed to a .cgi extension, if .cgi is already assigned to another handler.

Enabling write permission

Beyond just being able to execute scripts within a web directory, the web user must also have write access to the directory for the examples to work. The examples are meant to illustrate the manipulation of XML and the ability to repurpose accessible XML into different applications.

To avoid dependency on a database in this chapter, and to provide easy access to the XML, these examples use the filesystem directly for storage. Articles are stored to disk as .xml files.

For Apache, you must give the user nobody write access to the specific web directory. If you are serving pages out of /home/httpd/myXMLApplication, you need to set up something like the following:

$> mkdir /home/httpd/myXMLApplication
$> chown nobody /home/httpd/myXMLApplication
$> chmod 755 /home/httpd/myXMLApplication

This gives the user nobody (the user ID that Apache runs under) write access to the directory. There are many other ways to securely set this up; this is simply one option. In general, for production web applications, it’s a good idea not to give write access to web users.

The Web Application Structure

The web application is driven mainly by one script, start.cgi. The script does most of the processing, serves the content templates, and invokes the objects capable of storing and retrieving your XML articles. The primary components consist of the article object, the storage object, the article manager, the SAX-based article handler, and the start.cgi script that manages the whole process. Figure 4-2 shows a diagram of the major components.

Figure 4-2. The site architecture

In the next few sections, we examine the code and operation of the CGI components in detail.

The Article class

The Article class represents an article as XML information. It’s a thin class with methods only for creating an article from existing XML, or for retrieving the XML that makes up the article as a string. In addition, it has modifiable attributes that allow you to manipulate the content of the article:

  def __init__(self):
    """Set initial data attributes."""
    self.reset(  )

  def reset(self):
    self.title       = ""
    self.size        = 0
    self.time        = "" # pretty-printing time string
    self.author      = ""
    self.contributor = ""
    self.contents    = ""

The attributes can be modified during the life of an article to keep you from having to create XML in your program. For example:

>>> from article import Article
>>> art = Article(  )
>>> art.title = "FirstPost"
>>> art.contents = "This is the first article."
>>> print art.getXML(  )
<?xml version="1.0"?>
<article title="FirstPost">
  <contents>
This is the first article.
  </contents>
</article>

The getXML method call has the logic to recreate the XML when necessary. You can create articles with a well-formed string of XML, or by loading a string of XML from a disk file. The getXML method exists as a means for you to pull the XML back out of the object. Note the use of the escape function, which we imported from the xml.sax.saxutils module; this ensures that characters that are syntactically significant to XML are properly encoded in the result.

  def getXML(self):
    """Returns XML after re-assembling from data
    members that may have changed."""

    attr = ''
    if self.title:
      attr = ' title="%s"' % escape(self.title)
    s = '<?xml version="1.0"?>\n<article%s>\n' % attr
    if self.author:
      s = '%s  <author name="%s" />\n' % (s, escape(self.author))
    if self.contributor:
      s = '%s  <contributor name="%s" />\n' % (s, escape(self.contributor))
    if self.contents:
      s = ('%s  <contents>\n%s\n  </contents>\n'
           % (s, escape(self.contents)))
    return s + "</article>\n"
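It may help to see what escape does and does not touch. The following is a small sketch written for Python 3 (the chapter's listings use the Python 2 print statement), and the sample strings are made up for illustration. By default escape replaces only &, <, and >, so a title containing a double quote would still need the optional entities argument, or the quoteattr function from the same module, to be safe inside an attribute value.

```python
from xml.sax.saxutils import escape, quoteattr

# Default behavior: only &, < and > are replaced.
body = escape("AT&T <rocks>")

# A double quote passes through untouched by default...
unsafe_title = escape('The "First" Post')

# ...unless an extra entity mapping is supplied.
safe_title = escape('The "First" Post', {'"': "&quot;"})

# quoteattr escapes the value and also adds the surrounding quotes.
quoted = quoteattr("Tom & Jerry")

print(body)
print(unsafe_title)
print(safe_title)
print(quoted)
```

Whether getXML needs the extra mapping depends on what titles and author names the application is expected to accept.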

The fromXML method of the article class populates the current XML article object with the values from the supplied string. This method uses the convenience function parseString, from xml.dom.minidom, to load the XML data into a document object, and then uses the content retrieval methods of the DOM to collect the required information:

  def fromXML(self, data):
    """Initialize using an XML document passed as a string."""
    self.reset()
    dom              = xml.dom.minidom.parseString(data)
    self.title       = get_attribute(dom, "article", "title")
    self.size        = int(get_attribute(dom, "size", "bytes") or 0)
    self.time        = get_attribute(dom, "time", "stime")
    self.author      = get_attribute(dom, "author", "name")
    self.contributor = get_attribute(dom, "contributor", "name")
    nodelist         = dom.getElementsByTagName("contents")
    if nodelist:
      assert len(nodelist) == 1
      contents = nodelist[0]
      contents.normalize()
      if contents.childNodes:
        self.contents = contents.firstChild.data.strip()

This method uses a convenience function defined elsewhere in the module. The function get_attribute looks into the document for an attribute and returns the value it finds; if the attribute it is looking for does not exist (or the element it expects to find it on does not exist), it returns an empty string instead. If it finds more than one element that matches the requested element type, it complains loudly using the assert statement. (For a real application, you would not use assert in this way, but this is sufficient for our examples since we’re mainly interested in the XML aspect.)
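As the parenthetical note says, assert is a shortcut here. A minimal sketch of the same lookup that raises an ordinary exception instead might look like the following. The name get_attribute_checked and the choice of ValueError are assumptions made for illustration, not part of the chapter's code, and the sketch is written for Python 3.

```python
import xml.dom.minidom


def get_attribute_checked(dom, tagname, attrname):
    """Return the attribute value from a solitary element, or "".

    Hypothetical variant of get_attribute: raises ValueError rather
    than asserting when more than one matching element is found.
    """
    nodelist = dom.getElementsByTagName(tagname)
    if not nodelist:
        return ""
    if len(nodelist) > 1:
        raise ValueError("expected at most one <%s> element, found %d"
                         % (tagname, len(nodelist)))
    # minidom returns "" when the attribute itself is absent
    return nodelist[0].getAttribute(attrname).strip()


doc = xml.dom.minidom.parseString(
    '<article title=" FirstPost "><author name="Fred"/></article>')
title = get_attribute_checked(doc, "article", "title")
missing = get_attribute_checked(doc, "contributor", "name")
print(repr(title), repr(missing))
```

Unlike assert, the explicit check still runs when Python is started with the -O option.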

When working with the web site logic, most manipulation on article objects occurs by either using the Storage class to load an article from disk, or by parsing a form submission to create an article for a user and then using the Storage class to save the XML file to disk. Example 4-6 shows the complete listing of the Article class.

Example 4-6. Article class from article.py
import xml.dom.minidom
from xml.sax.saxutils import escape

class Article:
  """Represents a block of text and metadata created from XML."""

  def __init__(self):
    """Set initial data properties."""
    self.reset(  )

  def reset(self):
    """Re-initialize data properties."""
    self.title       = ""
    self.size        = 0
    self.time        = ""    # pretty-printing time string
    self.author      = ""
    self.contributor = ""
    self.contents    = ""

  def getXML(self):
    """Returns XML after re-assembling from data
    members that may have changed."""

    attr = ''
    if self.title:
      attr = ' title="%s"' % escape(self.title)
    s = '<?xml version="1.0"?>\n<article%s>\n' % attr
    if self.author:
      s = '%s  <author name="%s" />\n' % (s, escape(self.author))
    if self.contributor:
      s = '%s  <contributor name="%s" />\n' % (s, escape(self.contributor))
    if self.contents:
      s = ('%s  <contents>\n%s\n  </contents>\n'
           % (s, escape(self.contents)))
    return s + "</article>\n"

  def fromXML(self, data):
    """Initialize using an XML document passed as a string."""
    self.reset(  )
    dom = xml.dom.minidom.parseString(data)
    self.title       = get_attribute(dom, "article", "title")
    self.size        = int(get_attribute(dom, "size", "bytes") or 0)
    self.time        = get_attribute(dom, "time", "stime")
    self.author      = get_attribute(dom, "author", "name")
    self.contributor = get_attribute(dom, "contributor", "name")
    nodelist         = dom.getElementsByTagName("contents")
    if nodelist:
      assert len(nodelist) == 1
      contents = nodelist[0]
      contents.normalize(  )
      if contents.childNodes:
        self.contents = contents.firstChild.data.strip(  )

# Helper function:

def get_attribute(dom, tagname, attrname):
  """Return the value of a solitary element & attribute,
  if available."""
  nodelist = dom.getElementsByTagName(tagname)
  if nodelist:
    assert len(nodelist) == 1
    node = nodelist[0]
    return node.getAttribute(attrname).strip(  )
  else:
    return ""
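The call to contents.normalize() in fromXML matters because a DOM tree may hold the text of an element as several adjacent text nodes, in which case firstChild.data would return only the first piece. The sketch below (Python 3, with made-up sample text) builds such a split by hand to show the effect. It is an illustration only, not part of article.py.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<contents/>")
contents = doc.documentElement

# Simulate text that arrived as two separate text nodes.
contents.appendChild(doc.createTextNode("This is "))
contents.appendChild(doc.createTextNode("the first article."))
nodes_before = len(contents.childNodes)
first_piece = contents.firstChild.data

# normalize() merges adjacent text nodes into one.
contents.normalize()
nodes_after = len(contents.childNodes)
full_text = contents.firstChild.data

print(nodes_before, repr(first_piece))
print(nodes_after, repr(full_text))
```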

The Storage class

The Storage class is used to place an article on disk as an XML file, and to create article objects from XML files that are already on disk:

>>> from article import Article
>>> from storage import Storage
>>> a = Article(  )
>>> a.title = "FirstPost"
>>> a.contents = "This is the FirstPost."
>>> a.author = "Fred L. Drake, Jr."
>>> s = Storage(  )
>>> s.save(a)
>>>
>>> b = s.load("FirstPost.xml")
>>> print b.getXML(  )
<?xml version="1.0"?>
<article title="FirstPost">
  <author name="Fred L. Drake, Jr." />
  <contents>
This is the FirstPost.
  </contents>
</article>

Here, you create an article from scratch as a, store it to disk using the Storage object, and then reincarnate the article as b using Storage’s load method. Note that the load method takes the actual filename, which is the article’s title with the .xml extension appended.

The Storage.save method takes an article instance as its only parameter and saves the article to disk as an XML file named <article.title>.xml:

    sFilename = article.title + ".xml"
    fd = open(sFilename, "w")

    # write file to disk with data from getXML(  ) call
    fd.write(article.getXML(  ))
    fd.close(  )

The getXML method is used to retrieve an XML string containing an XML version of the article; the string is then saved to the disk file. The Storage.load method takes an XML file from disk, reads in the data from the file, and then creates an article using the fromXML method of the Article class:

    fd = open(sName, "r")
    sxml = fd.read(  )
    fd.close(  )

    # create an article instance
    a = Article(  )
    a.fromXML(sxml)

    # return article object to caller
    return a

The return result is an Article instance. Example 4-7 shows storage.py in its entirety.

Example 4-7. storage.py
# storage.py
from article import Article

class Storage:
  """Stores and retrieves article objects as XML files
  -- should be easy to migrate to a database."""

  def save(self, article):
    """Save as <article.title>.xml."""
    sFilename = article.title + ".xml"
    fd = open(sFilename, "w")

    # write file to disk with data from getXML(  ) call
    fd.write(article.getXML(  ))
    fd.close(  )

  def load(self, sName):
    """Name must be filename.xml--Returns an article object."""
    fd = open(sName, "r")
    sxml = fd.read(  )

    # create an article instance
    a = Article(  )

    # use fromXML to create an article object
    # from the file's XML
    a.fromXML(sxml)
    fd.close(  )

    # return article object to caller
    return a
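The same save and load steps can be tried without the Article class. The sketch below is a variation under stated assumptions rather than the book's storage.py: the function names save_xml and load_xml are hypothetical, it is written for Python 3, it uses with blocks so the files are closed even if an error occurs, and it writes into a temporary directory instead of the current one.

```python
import os
import tempfile


def save_xml(directory, title, xml_text):
    """Write xml_text to <directory>/<title>.xml and return the path."""
    path = os.path.join(directory, title + ".xml")
    with open(path, "w") as fd:
        fd.write(xml_text)
    return path


def load_xml(path):
    """Return the text of the XML file at path."""
    with open(path, "r") as fd:
        return fd.read()


sample = '<?xml version="1.0"?>\n<article title="FirstPost">\n</article>\n'

with tempfile.TemporaryDirectory() as workdir:
    saved_path = save_xml(workdir, "FirstPost", sample)
    saved_name = os.path.basename(saved_path)
    restored = load_xml(saved_path)

print(saved_name)
```

Because the title becomes the filename, a title containing a path separator would end up outside the intended directory. A real application would need to validate or map titles before using them this way.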

Implementing Site Logic

The Article and Storage classes are not web-oriented. They could be used in any type of application, as the articles are represented in XML, and the Storage class just handles their I/O to disk. Conceptually at least, you could use these classes anywhere to create an XML-based information store.

On the other hand, you could write a single CGI script that has all of the logic to store articles to disk and read them, as well as parse the XML, but then your articles and their utility would be trapped within the CGI script. By breaking core functionality off into discrete components, you’re free to use the Article and Storage classes from any type of application you envision.

In order to manage web interaction with the article classes, we will create one additional class (ArticleManager) and one additional script (start.cgi). The ArticleManager class builds a web interface for article manipulation. It has the ability to display articles as HTML, to accept posted articles from a web form, and to handle user interaction with the site. The start.cgi script handles I/O from the web server and drives the ArticleManager.

The ArticleManager class

The ArticleManager class contains four methods for dealing with articles. The manager acts as a liaison between the article objects and the actual CGI script that interfaces with the web server (and, indirectly, the user’s browser).

The viewAll method picks all of the XML articles off the disk and creates a section of HTML hyperlinks linking to the articles. This method is called by the CGI script to create a page showing all of the article titles as links:

def viewAll(self):
  """Displays all XML files in the current 
  working directory."""
  print "<p>View All<br><br>"

  # grab list of files in current directory
  fl = os.listdir(".")
  for xmlFile in fl:
    # weed out XML files
    tname, ext = os.path.splitext(xmlFile)
    if ext == ".xml":
      # create HTML link surrounding article name
      print '<br><a href="start.cgi?cmd=v1a&af=%s">%s</a><br>' \
            % (quote(xmlFile), tname)

The method is not terribly elegant. It simply reads the contents of the current directory, picks out the XML files, and strips the .xml extension off the name before displaying it as a link. The link connects back again to the same page (start.cgi), but this time with query string parameters that instruct start.cgi to invoke the viewOne method to view the content of a single article. The quote function imported from urllib is used to escape special characters in the filename that may cause problems for the browser. URL construction and quoting are discussed in more detail in Chapter 8.
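To see what the filtering and quoting steps produce for a single filename, here is a small sketch written for Python 3, where quote lives in urllib.parse rather than urllib. The function name article_link and the sample filenames are assumptions for illustration. The sketch returns the link as a string instead of printing it, and it writes the ampersand in the URL as &amp;, the strictly valid form inside an HTML attribute.

```python
import os.path
from urllib.parse import quote


def article_link(xml_file):
    """Return an HTML link for an .xml file, or None for other files."""
    tname, ext = os.path.splitext(xml_file)
    if ext != ".xml":
        return None
    # quote() percent-encodes characters such as spaces in the filename
    return ('<a href="start.cgi?cmd=v1a&amp;af=%s">%s</a>'
            % (quote(xml_file), tname))


link = article_link("My First Post.xml")
skipped = article_link("notes.txt")
print(link)
```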

The viewOne method uses the storage object to reanimate an article stored on disk. Once the article instance is created, its data members are mined (one by one), and wrapped with HTML for display in the browser:

def viewOne(self, articleFile):
  """ takes an article file name as a parameter and
      creates and displays an article object for it. 
  """
  # create storage and article objects
  store = Storage(  )
  art = store.load(articleFile)

  # Write HTML to browser (standard output)
  print "<p>Title: " + art.title + "<br>"
  print "Author: " + art.author + "<br>"
  print "Date: " + art.time + "<br>"
  print "<table width=500><tr><td>" + art.contents
  print "</td></tr></table></p>"

It’s important to note here that the parameter handed to viewOne is a real filename, not just the title of the XML document.
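viewOne writes the article's fields into the page exactly as they were stored. If an article can contain characters such as < or &, one option is to escape each field before wrapping it in HTML. The following sketch assumes Python 3 and its html.escape function. The name render_article is hypothetical, and the function returns the markup as a string rather than printing it.

```python
from html import escape


def render_article(title, author, time, contents):
    """Return the article markup with every field HTML-escaped."""
    fields = tuple(escape(value)
                   for value in (title, author, time, contents))
    return ("<p>Title: %s<br>\n"
            "Author: %s<br>\n"
            "Date: %s<br>\n"
            "<table width=500><tr><td>%s\n"
            "</td></tr></table></p>" % fields)


page = render_article("A & B", "Fred <fred>", "", "1 < 2")
print(page)
```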

The postArticle method is probably the simplest method discussed yet, as its job is simply to create HTML. The HTML represents a submittal form whereby users can write new articles and present them to the server for ultimate storage in XML. Since the HTML form does not change, this method can simply print the value of a constant that contains the form as a string.

The postArticleData method is slightly more complicated. Its job is to extract key/value pairs from a submitted HTTP form, and create an XML article based on the obtained values. Once the XML is created, it must be stored to disk. It does this by creating an article object and setting the members to values retrieved from the form, then using the Storage class to save the article.

def postArticleData(self, form):
     """Accepts actual posted form data, creates and
     stores an article object."""
     # populate an article with information from the form
     art = Article()
     art.title       = form["title"].value
     art.author      = form["author"].value
     art.contributor = form["contrib"].value
     art.contents    = form["contents"].value

     # store the article
     store = Storage()
     store.save(art)
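Note that form["author"] raises KeyError when the field is absent, and cgi.FieldStorage leaves out fields that were submitted blank unless keep_blank_values is set. The sketch below shows one way to supply defaults. To stay self-contained it is written for Python 3 and uses urllib.parse.parse_qs on a made-up form body in place of cgi.FieldStorage. The names FIELD_TO_ATTRIBUTE and article_fields are assumptions for illustration.

```python
from urllib.parse import parse_qs

# form field name -> Article attribute name (note "contrib")
FIELD_TO_ATTRIBUTE = {
    "title": "title",
    "author": "author",
    "contrib": "contributor",
    "contents": "contents",
}


def article_fields(body):
    """Decode a URL-encoded form body into Article attribute values.

    Fields that are missing or blank come back as empty strings.
    """
    form = parse_qs(body)  # blank values are dropped by default
    return dict((attr, form.get(field, [""])[0])
                for field, attr in FIELD_TO_ATTRIBUTE.items())


fields = article_fields(
    "title=FirstPost&author=Fred+L.+Drake%2C+Jr.&contrib=&contents=Hello")
print(fields)
```

Each value could then be assigned to the matching attribute of an Article instance before it is saved.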

Example 4-8 shows ArticleManager.py in its entirety.

Example 4-8. ArticleManager.py
# ArticleManager.py
import os
from urllib  import quote
from article import Article
from storage import Storage

class ArticleManager:
  """Manages articles for the web page.

  Responsible for creating, loading, saving, and displaying
  articles."""

  def viewAll(self):
    """Displays all XML files in the current working directory."""
    print "<p>View All<br><br>"

    # grab list of files in current directory
    fl = os.listdir(".")
    for xmlFile in fl:
      # weed out XML files
      tname, ext = os.path.splitext(xmlFile)
      if ext == ".xml":
        # create HTML link surrounding article name
        print '<br><a href="start.cgi?cmd=v1a&af=%s">%s</a><br>' \
              % (quote(xmlFile), tname)

  def viewOne(self, articleFile):
    """Takes an article file name as a parameter and
    creates and displays an article object for it. 
    """
    # create storage and article objects
    store = Storage(  )
    art = store.load(articleFile)

    # Write HTML to browser (standard output)
    print "<p>Title: " + art.title + "<br>"
    print "Author: " + art.author + "<br>"
    print "Date: " + art.time + "<br>"
    print "<table width=500><tr><td>" + art.contents
    print "</td></tr></table></p>"
      
  def postArticle(self):
    """Displays the article posting form."""
    print POSTING_FORM

  def postArticleData(self,form):
    """Accepts actual posted form data, creates and
    stores an article object."""
    # populate an article with information from the form
    art = Article()
    art.title       = form["title"].value
    art.author      = form["author"].value
    art.contributor = form["contrib"].value
    art.contents    = form["contents"].value

    # store the article
    store = Storage()
    store.save(art)

POSTING_FORM = '''\
<form method="POST" action="start.cgi?cmd=pd">
<p>
Title:      <br><input type="text" length="40" name="title"><br>
Contributor:<br><input type="text" length="40" name="contrib"><br>
Author:     <br><input type="text" length="40" name="author"><br>
Contents:   <br><textarea rows="15" cols="80" name="contents"></textarea><br>
<input type="submit">
</form>
'''

Controlling the Application

The CGI script is the main program for the web application. It is also the only “page” that will ever be in the browser. When the user types start.cgi in the address bar, Apache runs the script on the server.

The script begins by importing the cgi and os modules:

import cgi
import os

The script then prints the content header, as well as the opening HTML. This HTML is the same regardless of the type of operation start.cgi is performing; therefore, it is defined as the constant HEADER (not shown) and printed for every request:

# content-type header
print "Content-type: text/html"
print
print HEADER

After the common portion of the result page is printed, the query string is checked for the cmd parameter, which specifies what action start.cgi should perform. Every hyperlink that start.cgi sends to the browser carries this parameter with a specific instruction such as view or post. The script uses the cgi module to parse the query string and tests whether it contains cmd. If so, processing continues; if not, the user is presented with an error message.

query = cgi.FieldStorage(  )
if query.has_key("cmd"):
  cmd = query["cmd"].value

  # instantiate an ArticleManager
  am = ArticleManager(  )

The ArticleManager is instantiated as am, and command processing continues by checking cmd for its four possible values. For viewing article titles, the command sequence va is used:

# Command: viewAll - list all articles
if cmd == "va":
  am.viewAll(  )

For viewing a specific article, the command sequence v1a is used:

# Command: viewOne - view one article
if cmd == "v1a":
  aname = query["af"].value
  am.viewOne(aname)

For posting articles, a form is displayed. The CGI script looks for the pa sequence:

# Command: postArticle - view the post-article page
if cmd == "pa":
  am.postArticle(  )

When the user submits the article form, the data is posted to the web server. The CGI script looks for the command sequence pd to indicate that the article data is posted. It then passes the CGI form to the ArticleManager’s postArticleData method:

# Command: postData - take an actual article post
if cmd == "pd":
  print "<p>Thank you for your post!</p>"
  am.postArticleData(query)

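If cmd is not present in the query string, an error message is presented by the else clause of the first if statement, the one that tests for the presence of cmd. (A cmd value that is not one of the four matches none of the command tests and produces no message at all.) The else clause looks like this: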

else:
  # Invalid selection 
  print "<p>Your selection was not recognized.</p>"
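In the complete listing, this else clause belongs to the test for the cmd parameter, so an unrecognized cmd value simply falls through the four command tests. One way to report that case as well is to look the command up in a dictionary. The sketch below is an assumption-based alternative written for Python 3, not the book's start.cgi: the function name dispatch is hypothetical, and the handlers are stand-ins that return a label instead of calling ArticleManager methods.

```python
NOT_RECOGNIZED = "<p>Your selection was not recognized.</p>"


def dispatch(cmd, handlers):
    """Run the handler registered for cmd, or return the error text."""
    handler = handlers.get(cmd)
    if handler is None:
        # covers both a missing cmd (None) and an unknown value
        return NOT_RECOGNIZED
    return handler()


# Stand-in handlers; a real script would call ArticleManager methods.
handlers = {
    "va": lambda: "viewAll",
    "pa": lambda: "postArticle",
}

print(dispatch("va", handlers))
print(dispatch("bogus", handlers))
```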

The HTML is then closed by a final print statement:

# close the HTML
print "</body></html>"

The complete listing of start.cgi is shown in Example 4-9.

Example 4-9. start.cgi
#!/usr/local/bin/python
#
# start.cgi - a Python CGI script

import cgi
import os

from ArticleManager import ArticleManager

HEADER = """\
<html>
<body>
<p>
<table cellspacing="0" cellpadding="1">
  <tr><td>
      <h1>XML Articles</h1>
    </td></tr>
  <tr><td>
      <h3><a href="start.cgi?cmd=va">View All</a>&nbsp;|&nbsp;
          <a href="start.cgi?cmd=pa">Post Article</a></h3>
    </td></tr>
</table>
"""

#
# MAIN
#

# content-type header
print "Content-type: text/html"
print
print HEADER

# retrieve query string
query = cgi.FieldStorage(  )
if query.has_key("cmd"):
  cmd = query["cmd"].value

  # instantiate an ArticleManager
  am = ArticleManager(  )

  # do something for each command

  # Command: viewAll - list all articles
  if cmd == "va":
    am.viewAll(  )

  # Command: viewOne - view one article
  if cmd == "v1a":
    aname = query["af"].value
    am.viewOne(aname)

  # Command: postArticle - view the post-article page
  if cmd == "pa":
    am.postArticle(  )

  # Command: postData - take an actual article post
  if cmd == "pd":
    print "<p>Thank you for your post!</p>"
    am.postArticleData(query)

else:
  # Invalid selection 
  print "<p>Your selection was not recognized.</p>"

# close the HTML
print "</body></html>"

Take note of the initial #!/usr/local/bin/python expression. As this is a CGI script, the operating system needs a hint on how to run it. If it is compiled C code, it could be executed by the web server; however, if it is a script, it likely needs to be handed off to the services of a script interpreter. Such is the case with Python. Note that we did not use the sh-bang line #!/usr/bin/env python; that could open a security hole when used with CGI scripts. See the documentation of Python’s cgi module for more information about CGI security issues and how to address them properly when using Python.
