Now you can use your new knowledge of the DOM to create a simple web application. Let’s build one that allows for the posting and viewing of articles. The articles are submitted and viewed via a web browser, but stored by the web server as XML, which allows the articles to be leveraged into different information systems that process XML. HTML articles, on the other hand, are unusable outside of a web browser.
In order to run the examples in this chapter, you must have a web server available that lets you execute CGI scripts. These examples were designed on Apache, so the CGI scripts contain a sh-bang line that specifies the path to the Python executable (the #!/usr/local/bin/python expression at the top of the file; adjust the path to match your installation) so that Apache can run them just like any other CGI script.
(Understanding the term “sh-bang” requires a little bit of knowledge of Unix history. The traditional command-line environment for Unix was originally implemented using the sh program. The exclamation point was named the “bang” character because it was always used after words such as “bang” and “pow” in comic books and cartoons. Since the lines at the top of scripts that started with #! were interpreted by the sh program, they came to be known as sh-bang lines.)
You must enable the execution of your Python scripts on your web server. On Apache, this means enabling CGI within the web directory, ensuring that the actual CGI scripts contain the pointer to the Python interpreter so they run correctly, and setting the “execute” permission on the script. This last item can be accomplished using the chmod program:
$> chmod +x start.cgi
On other web servers and on Windows, you need to assign a handler to your CGI scripts so that they are executed by the Python interpreter. This may require that you name your scripts with a .py extension, as opposed to a .cgi extension, if .cgi is already assigned to another handler.
Beyond just being able to execute scripts within a web directory, the web user must also have write access to the directory for the examples to work. The examples are meant to illustrate the manipulation of XML and the ability to repurpose accessible XML into different applications.
To avoid dependency on a database in this chapter, and to provide easy access to the XML, these examples use the filesystem directly for storage. Articles are stored to disk as .xml files.
For Apache, you must give the user nobody write access to the specific web directory. If you are serving pages out of /home/httpd/myXMLApplication, you need to set up something like the following:
$> mkdir /home/httpd/myXMLApplication
$> chown nobody /home/httpd/myXMLApplication
$> chmod 755 /home/httpd/myXMLApplication
This gives the user nobody (the user ID that Apache runs under) write access to the directory. There are many other ways to securely set this up; this is simply one option. In general, for production web applications, it’s a good idea not to give write access to web users.
The web application is driven mainly by one script, start.cgi. The script does most of the processing, serves the content templates, and invokes the objects capable of storing and retrieving your XML articles. The primary components consist of the article object, the storage object, the article manager, the SAX-based article handler, and the start.cgi script that manages the whole process. Figure 4-2 shows a diagram of the major components.
In the next few sections, we examine the code and operation of the CGI components in detail.
The Article class represents an article as XML information. It’s a thin class with methods only for creating an article from existing XML, or for retrieving the XML that makes up the article as a string. In addition, it has modifiable attributes that allow you to manipulate the content of the article:
    def __init__(self):
        """Set initial data attributes."""
        self.reset()

    def reset(self):
        self.title = ""
        self.size = 0
        self.time = ""  # pretty-printing time string
        self.author = ""
        self.contributor = ""
        self.contents = ""
The attributes can be modified during the life of an article to keep you from having to create XML in your program. For example:
>>> from article import Article
>>> art = Article()
>>> art.title = "FirstPost"
>>> art.contents = "This is the first article."
>>> print art.getXML()
<?xml version="1.0"?>
<article title="FirstPost">
 <contents>
This is the first article.
 </contents>
</article>
The getXML method call has the logic to recreate the XML when necessary. You can create articles with a well-formed string of XML, or by loading a string of XML from a disk file. The getXML method exists as a means for you to pull the XML back out of the object. Note the use of the escape function, which we imported from the xml.sax.saxutils module; this ensures that characters that are syntactically significant to XML are properly encoded in the result.
    def getXML(self):
        """Returns XML after re-assembling from data
        members that may have changed."""
        attr = ''
        if self.title:
            attr = ' title="%s"' % escape(self.title)
        s = '<?xml version="1.0"?>\n<article%s>\n' % attr
        if self.author:
            s = '%s <author name="%s" />\n' % (s, escape(self.author))
        if self.contributor:
            s = '%s <contributor name="%s" />\n' % (s, escape(self.contributor))
        if self.contents:
            s = ('%s <contents>\n%s\n </contents>\n'
                 % (s, escape(self.contents)))
        return s + "</article>\n"
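One detail is worth illustrating: by default, escape replaces only &, <, and >, and leaves double quotes alone, so a title that contains a double quote would end the title="..." attribute early. The following is a minimal sketch, written in Python 3 syntax rather than the Python 2 used by the chapter listings, of two ways the xml.sax.saxutils module can handle attribute values. The sample title is made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

title = 'Tom & "Jerry" <live>'  # hypothetical title with special characters

# Default behavior: &, < and > are replaced; the double quotes are untouched.
plain = escape(title)

# Passing an extra entity map also replaces the double quote, which makes
# the result safe inside a double-quoted attribute value.
attr_safe = escape(title, {'"': "&quot;"})

# quoteattr() escapes the value and wraps it in suitable quote characters.
quoted = quoteattr(title)

print(plain)
print('<article title="%s">' % attr_safe)
print("<article title=%s>" % quoted)
```

Either form could replace the plain escape call wherever getXML builds an attribute; element content such as the article contents does not need the extra quoting.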
The fromXML method of the article class populates the current XML article object with the values from the supplied string. This method uses the convenience function parseString, from xml.dom.minidom, to load the XML data into a document object, and then uses the content retrieval methods of the DOM to collect the required information:
    def fromXML(self, data):
        """Initialize using an XML document passed as a string."""
        self.reset()
        dom = xml.dom.minidom.parseString(data)
        self.title = get_attribute(dom, "article", "title")
        self.size = int(get_attribute(dom, "size", "bytes") or 0)
        self.time = get_attribute(dom, "time", "stime")
        self.author = get_attribute(dom, "author", "name")
        self.contributor = get_attribute(dom, "contributor", "name")
        nodelist = dom.getElementsByTagName("contents")
        if nodelist:
            assert len(nodelist) == 1
            contents = nodelist[0]
            contents.normalize()
            if contents.childNodes:
                self.contents = contents.firstChild.data.strip()
This method uses a convenience function defined elsewhere in the module. The function get_attribute looks into the document for an attribute and returns the value it finds; if the attribute it is looking for does not exist (or the element it expects to find it on does not exist), it returns an empty string instead. If it finds more than one element that matches the requested element type, it complains loudly using the assert statement. (For a real application, you would not use assert in this way, but this is sufficient for our examples since we’re mainly interested in the XML aspect.)
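One reason to avoid assert for this kind of check is that Python drops assert statements when it runs with the -O option. As a minimal sketch of the alternative, the hypothetical get_attribute_strict below behaves like the helper just described but raises ValueError when it finds more than one matching element. It is written in Python 3 syntax; the function name and the sample documents are assumptions for illustration, not part of the chapter's code.

```python
import xml.dom.minidom


def get_attribute_strict(dom, tagname, attrname):
    """Return the attribute value of a solitary element, or "" if absent.

    Hypothetical variant of get_attribute: raises ValueError instead of
    relying on assert when more than one matching element is found.
    """
    nodelist = dom.getElementsByTagName(tagname)
    if not nodelist:
        return ""
    if len(nodelist) > 1:
        raise ValueError("expected one <%s> element, found %d"
                         % (tagname, len(nodelist)))
    # getAttribute() returns "" when the attribute itself is missing.
    return nodelist[0].getAttribute(attrname).strip()


doc = xml.dom.minidom.parseString(
    '<article title="FirstPost"><author name=" Fred "/></article>')

print(get_attribute_strict(doc, "article", "title"))
print(get_attribute_strict(doc, "author", "name"))
print(repr(get_attribute_strict(doc, "size", "bytes")))
```

A caller can then catch ValueError and report a malformed article instead of letting the whole request fail.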
When working with the web site logic, most manipulation on article objects occurs by either using the Storage class to load an article from disk, or by parsing a form submission to create an article for a user and then using the Storage class to save the XML file to disk. Example 4-6 shows the complete listing of the Article class.
Example 4-6. Article class from article.py
import xml.dom.minidom

from xml.sax.saxutils import escape

class Article:
    """Represents a block of text and metadata created from XML."""

    def __init__(self):
        """Set initial data properties."""
        self.reset()

    def reset(self):
        """Re-initialize data properties."""
        self.title = ""
        self.size = 0
        self.time = ""  # pretty-printing time string
        self.author = ""
        self.contributor = ""
        self.contents = ""

    def getXML(self):
        """Returns XML after re-assembling from data
        members that may have changed."""
        attr = ''
        if self.title:
            attr = ' title="%s"' % escape(self.title)
        s = '<?xml version="1.0"?>\n<article%s>\n' % attr
        if self.author:
            s = '%s <author name="%s" />\n' % (s, escape(self.author))
        if self.contributor:
            s = '%s <contributor name="%s" />\n' % (s, escape(self.contributor))
        if self.contents:
            s = ('%s <contents>\n%s\n </contents>\n'
                 % (s, escape(self.contents)))
        return s + "</article>\n"

    def fromXML(self, data):
        """Initialize using an XML document passed as a string."""
        self.reset()
        dom = xml.dom.minidom.parseString(data)
        self.title = get_attribute(dom, "article", "title")
        self.size = int(get_attribute(dom, "size", "bytes") or 0)
        self.time = get_attribute(dom, "time", "stime")
        self.author = get_attribute(dom, "author", "name")
        self.contributor = get_attribute(dom, "contributor", "name")
        nodelist = dom.getElementsByTagName("contents")
        if nodelist:
            assert len(nodelist) == 1
            contents = nodelist[0]
            contents.normalize()
            if contents.childNodes:
                self.contents = contents.firstChild.data.strip()

# Helper function:

def get_attribute(dom, tagname, attrname):
    """Return the value of a solitary element & attribute,
    if available."""
    nodelist = dom.getElementsByTagName(tagname)
    if nodelist:
        assert len(nodelist) == 1
        node = nodelist[0]
        return node.getAttribute(attrname).strip()
    else:
        return ""
The Storage class is used to place an article on disk as an XML file, and to create article objects from XML files that are already on disk:
>>> from article import Article
>>> from storage import Storage
>>> a = Article()
>>> a.title = "FirstPost"
>>> a.contents = "This is the FirstPost."
>>> a.author = "Fred L. Drake, Jr."
>>> s = Storage()
>>> s.save(a)
>>>
>>> b = s.load("FirstPost.xml")
>>> print b.getXML()
<?xml version="1.0"?>
<article title="FirstPost">
 <author name="Fred L. Drake, Jr." />
 <contents>
This is the FirstPost.
 </contents>
</article>
Here, you create an article from scratch as a, store it to disk using the Storage object, and then reincarnate the article as b using Storage’s load method. Note that the load method takes the actual filename, which is the concatenation of article.title and the .xml extension.
The Storage.save method takes an article instance as the only parameter and saves the article to disk as an XML file using the form article.title.xml:
        sFilename = article.title + ".xml"
        fd = open(sFilename, "w")

        # write file to disk with data from getXML() call
        fd.write(article.getXML())
        fd.close()
The getXML method is used to retrieve an XML string containing an XML version of the article; the string is then saved to the disk file. The Storage.load method takes an XML file from disk, reads in the data from the file, and then creates an article using the fromXML method of the Article class:
        fd = open(sName, "r")
        sxml = fd.read()
        fd.close()

        # create an article instance
        a = Article()
        a.fromXML(sxml)

        # return article object to caller
        return a
The return result is an Article instance. Example 4-7 shows storage.py in its entirety.
Example 4-7. storage.py
# storage.py
from article import Article

class Storage:
    """Stores and retrieves article objects as XML files
    -- should be easy to migrate to a database."""

    def save(self, article):
        """Save as <article.title>.xml."""
        sFilename = article.title + ".xml"
        fd = open(sFilename, "w")

        # write file to disk with data from getXML() call
        fd.write(article.getXML())
        fd.close()

    def load(self, sName):
        """Name must be filename.xml--Returns an article object."""
        fd = open(sName, "r")
        sxml = fd.read()

        # create an article instance
        a = Article()

        # use fromXML to create an article object
        # from the file's XML
        a.fromXML(sxml)
        fd.close()

        # return article object to caller
        return a
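Because save builds the filename directly from the article title, a title containing path separators could cause a file to be written outside the article directory. The sketch below shows one possible way to guard against that; safe_filename and save_text are hypothetical helpers, not part of storage.py, and the code uses Python 3 syntax (including the with statement, which closes the file even when an error occurs). Note that this approach changes the mapping between titles and filenames, since characters such as spaces become underscores.

```python
import os
import re
import tempfile


def safe_filename(title):
    """Hypothetical helper: derive a flat .xml filename from a title."""
    # Keep letters, digits, dash, and underscore; replace anything else.
    stem = re.sub(r"[^A-Za-z0-9_-]", "_", title).strip("_")
    if not stem:
        raise ValueError("title yields an empty filename")
    return stem + ".xml"


def save_text(directory, title, xml_text):
    """Write xml_text under directory; the file is closed even on error."""
    path = os.path.join(directory, safe_filename(title))
    with open(path, "w", encoding="utf-8") as fd:
        fd.write(xml_text)
    return path


# Demonstration in a throwaway directory with a deliberately hostile title.
workdir = tempfile.mkdtemp()
path = save_text(workdir, "../../etc/passwd", "<article/>\n")
print(os.path.basename(path))
```

The same idea applies when loading: a filename received from the browser can be checked against the helper's pattern before it is opened.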
The Article and Storage classes are not web-oriented. They could be used in any type of application, as the articles are represented in XML, and the Storage class just handles their I/O to disk. Conceptually at least, you could use these classes anywhere to create an XML-based information store.
On the other hand, you could write a single CGI script that has all of the logic to store articles to disk and read them, as well as parse the XML, but then your articles and their utility would be trapped within the CGI script. By breaking core functionality off into discrete components, you’re free to use the Article and Storage classes from any type of application you envision.
In order to manage web interaction with the article classes, we will create one additional class (ArticleManager) and one additional script (start.cgi). The ArticleManager class builds a web interface for article manipulation. It has the ability to display articles as HTML, to accept posted articles from a web form, and to handle user interaction with the site. The start.cgi script handles I/O from the web server and drives the ArticleManager.
The ArticleManager class contains four methods for dealing with articles. The manager acts as a liaison between the article objects and the actual CGI script that interfaces with the web server (and, indirectly, the user’s browser).
The viewAll method picks all of the XML articles off the disk and creates a section of HTML hyperlinks linking to the articles. This method is called by the CGI script to create a page showing all of the article titles as links:
    def viewAll(self):
        """Displays all XML files in the current working directory."""
        print "<p>View All<br><br>"

        # grab list of files in current directory
        fl = os.listdir(".")
        for xmlFile in fl:
            # weed out XML files
            tname, ext = os.path.splitext(xmlFile)
            if ext == ".xml":
                # create HTML link surrounding article name
                print '<br><a href="start.cgi?cmd=v1a&af=%s">%s</a><br>' \
                      % (quote(xmlFile), tname)
The method is not terribly elegant. It simply reads the contents of the current directory, picks out the XML files, and strips the .xml extension off the name before displaying it as a link. The link connects back again to the same page (start.cgi), but this time with query string parameters that instruct start.cgi to invoke the viewOne method to view the content of a single article. The quote function imported from urllib is used to escape special characters in the filename that may cause problems for the browser. URL construction and quoting are discussed in more detail in Chapter 8.
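To make the effect of quoting concrete, here is a small sketch in Python 3 syntax, where quote lives in urllib.parse rather than directly in urllib as in the chapter's Python 2 listings. The filename is a made-up example chosen because it contains characters that have special meaning in a URL.

```python
from urllib.parse import quote, unquote

filename = "Q&A notes #1.xml"  # hypothetical article filename

# quote() percent-encodes characters that are unsafe in a URL component.
quoted = quote(filename)
href = "start.cgi?cmd=v1a&af=" + quoted

print(quoted)
print(href)

# unquote() reverses the encoding and recovers the original name.
print(unquote(quoted) == filename)
```

Without the quoting step, the & in this filename would be read as the start of another query parameter and the # as the start of a fragment.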
The viewOne method uses the storage object to reanimate an article stored on disk. Once the article instance is created, its data members are mined (one by one), and wrapped with HTML for display in the browser:
    def viewOne(self, articleFile):
        """Takes an article file name as a parameter and
        creates and displays an article object for it.
        """
        # create storage and article objects
        store = Storage()
        art = store.load(articleFile)

        # Write HTML to browser (standard output)
        print "<p>Title: " + art.title + "<br>"
        print "Author: " + art.author + "<br>"
        print "Date: " + art.time + "<br>"
        print "<table width=500><tr><td>" + art.contents
        print "</td></tr></table></p>"
It’s important to note here that the parameter handed to viewOne is a real filename, not just the title of the XML document.
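Since viewOne writes the article fields straight into the page, any <, >, or & characters in an article would be interpreted by the browser as markup. The following is a minimal sketch of escaping the fields first, written in Python 3 syntax with the standard html.escape function (Python 2 offered cgi.escape for the same purpose). The render_article function and its sample arguments are assumptions for illustration; it returns the HTML as a string rather than printing it so that the result is easy to inspect.

```python
from html import escape


def render_article(title, author, time, contents):
    """Hypothetical renderer: return the article as an HTML fragment."""
    parts = [
        "<p>Title: %s<br>" % escape(title),
        "Author: %s<br>" % escape(author),
        "Date: %s<br>" % escape(time),
        "<table width=500><tr><td>%s" % escape(contents),
        "</td></tr></table></p>",
    ]
    return "\n".join(parts)


# Field values containing markup are shown as text, not executed.
print(render_article("1 < 2", "A & B", "", "<script>alert(1)</script>"))
```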
The postArticle method is probably the simplest method discussed yet, as its job is simply to create HTML. The HTML represents a submittal form whereby users can write new articles and present them to the server for ultimate storage in XML. Since the HTML form does not change, this method can simply print the value of a constant that contains the form as a string.
The postArticleData method is slightly more complicated. Its job is to extract key/value pairs from a submitted HTTP form, and create an XML article based on the obtained values. Once the XML is created, it must be stored to disk. It does this by creating an article object and setting the members to values retrieved from the form, then using the Storage class to save the article.
    def postArticleData(self, form):
        """Accepts actual posted form data, creates and
        stores an article object."""
        # populate an article with information from the form
        art = Article()
        art.title = form["title"].value
        art.author = form["author"].value
        art.contributor = form["contrib"].value
        art.contents = form["contents"].value

        # store the article
        store = Storage()
        store.save(art)
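One thing to watch for in this method: indexing the form with form["contrib"] raises KeyError when that field is missing from the submission, and by default cgi.FieldStorage leaves out fields that were submitted blank. With the cgi module, form.getvalue("contrib", "") is one way to supply a default. The sketch below illustrates the same idea in Python 3 syntax using urllib.parse.parse_qs, which also drops blank values by default; the first_value helper and the sample POST body are assumptions for illustration.

```python
from urllib.parse import parse_qs


def first_value(fields, name, default=""):
    """Return the first submitted value for name, or default if absent."""
    values = fields.get(name)
    return values[0].strip() if values else default


# Hypothetical POST body; the contributor field was left blank.
body = "title=FirstPost&contrib=&author=Fred&contents=Hello+world"
fields = parse_qs(body)  # blank values are dropped by default

article_data = {
    "title": first_value(fields, "title"),
    "author": first_value(fields, "author"),
    "contributor": first_value(fields, "contrib"),
    "contents": first_value(fields, "contents"),
}
print(article_data)
```

With defaults in place, an article posted without a contributor is stored with an empty contributor attribute instead of causing the script to fail.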
Example 4-8 shows ArticleManager.py in its entirety.
Example 4-8. ArticleManager.py
# ArticleManager.py
import os
from urllib import quote

from article import Article
from storage import Storage

class ArticleManager:
    """Manages articles for the web page.

    Responsible for creating, loading, saving,
    and displaying articles."""

    def viewAll(self):
        """Displays all XML files in the current working directory."""
        print "<p>View All<br><br>"

        # grab list of files in current directory
        fl = os.listdir(".")
        for xmlFile in fl:
            # weed out XML files
            tname, ext = os.path.splitext(xmlFile)
            if ext == ".xml":
                # create HTML link surrounding article name
                print '<br><a href="start.cgi?cmd=v1a&af=%s">%s</a><br>' \
                      % (quote(xmlFile), tname)

    def viewOne(self, articleFile):
        """Takes an article file name as a parameter and
        creates and displays an article object for it.
        """
        # create storage and article objects
        store = Storage()
        art = store.load(articleFile)

        # Write HTML to browser (standard output)
        print "<p>Title: " + art.title + "<br>"
        print "Author: " + art.author + "<br>"
        print "Date: " + art.time + "<br>"
        print "<table width=500><tr><td>" + art.contents
        print "</td></tr></table></p>"

    def postArticle(self):
        """Displays the article posting form."""
        print POSTING_FORM

    def postArticleData(self, form):
        """Accepts actual posted form data, creates and
        stores an article object."""
        # populate an article with information from the form
        art = Article()
        art.title = form["title"].value
        art.author = form["author"].value
        art.contributor = form["contrib"].value
        art.contents = form["contents"].value

        # store the article
        store = Storage()
        store.save(art)

POSTING_FORM = '''\
<form method="POST" action="start.cgi?cmd=pd">
<p>
Title: <br><input type="text" length="40" name="title"><br>
Contributor:<br><input type="text" length="40" name="contrib"><br>
Author: <br><input type="text" length="40" name="author"><br>
Contents: <br><textarea rows="15" cols="80" name="contents"></textarea><br>
<input type="submit">
</form>
'''
The CGI script is the main program for the web application. It is also the only “page” that will ever be in the browser. When the user types start.cgi in the address bar, Apache runs the script on the server.
The script begins by importing the cgi and os modules:
import cgi
import os
The script then prints the content header, as well as the opening HTML. This HTML is the same regardless of the type of operation start.cgi is performing; therefore, it is defined as the constant HEADER (not shown) and printed for every request:
# content-type header
print "Content-type: text/html"
print
print HEADER
After the common portion of the result page is printed, the query string is checked for the cmd parameter, which specifies what action start.cgi should perform. The hyperlinks produced and sent to the browser by start.cgi are all fitted with this same parameter indicating a specific instruction such as view or post. The query string is parsed using the cgi module and inspected to see if it contains the cmd parameter. If so, processing continues; if not, the user is presented with an error message.
query = cgi.FieldStorage()
if query.has_key("cmd"):
    cmd = query["cmd"].value

    # instantiate an ArticleManager
    am = ArticleManager()
The ArticleManager is instantiated as am, and command processing continues by checking cmd for its four possible values. For viewing article titles, the command sequence va is used:
    # Command: viewAll - list all articles
    if cmd == "va":
        am.viewAll()
For viewing a specific article, the command sequence v1a is used:
    # Command: viewOne - view one article
    if cmd == "v1a":
        aname = query["af"].value
        am.viewOne(aname)
For posting articles, a form is displayed. The CGI script looks for the pa sequence:
    # Command: postArticle - view the post-article page
    if cmd == "pa":
        am.postArticle()
When the user submits the article form, the data is posted to the web server. The CGI script looks for the command sequence pd to indicate that the article data is posted. It then passes the CGI form to the ArticleManager’s postArticleData method:
    # Command: postData - take an actual article post
    if cmd == "pd":
        print "<p>Thank you for your post!</p>"
        am.postArticleData(query)
If cmd is not present in the query string, an error message is presented by the else clause of the first if statement. (Because the four command tests are independent if statements inside that first block, a cmd value that matches none of them simply produces no further output.)
else:
    # Invalid selection
    print "<p>Your selection was not recognized</p>"
The HTML is then closed by a final print statement:
# close the HTML
print "</body></html>"
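A dispatch table is one alternative to a chain of if tests, and it makes it straightforward to report an unrecognized cmd value as well as a missing one. The following is a minimal sketch in Python 3 syntax; the handler functions are stand-ins that return strings (in the chapter's design they would call the ArticleManager methods), and COMMANDS and dispatch are hypothetical names introduced only for this illustration.

```python
from urllib.parse import parse_qs


# Stand-in handlers; each receives the parsed query parameters.
def view_all(params):
    return "list of articles"


def view_one(params):
    return "article " + params.get("af", ["?"])[0]


def post_form(params):
    return "posting form"


def post_data(params):
    return "stored"


# Map each command sequence to the function that handles it.
COMMANDS = {"va": view_all, "v1a": view_one, "pa": post_form, "pd": post_data}


def dispatch(query_string):
    """Look up the cmd parameter and call the matching handler."""
    params = parse_qs(query_string)
    cmd = params.get("cmd", [""])[0]
    handler = COMMANDS.get(cmd)
    if handler is None:
        # Covers both a missing cmd and an unknown value.
        return "Your selection was not recognized."
    return handler(params)


for qs in ("cmd=va", "cmd=v1a&af=FirstPost.xml", "cmd=bogus", ""):
    print(repr(qs), "->", dispatch(qs))
```

Adding a new command then means adding one entry to the table rather than another if block.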
The complete listing of start.cgi is shown in Example 4-9.
Example 4-9. start.cgi
#!/usr/local/bin/python
#
# start.cgi - a Python CGI script

import cgi
import os

from ArticleManager import ArticleManager

HEADER = """\
<html>
<body>
<p>
<table cellspacing="0" cellpadding="1">
<tr><td>
<h1>XML Articles</h1>
</td></tr>
<tr><td>
<h3><a href="start.cgi?cmd=va">View All</a> |
<a href="start.cgi?cmd=pa">Post Article</a></h3>
</td></tr>
</table>
"""

#
# MAIN
#

# content-type header
print "Content-type: text/html"
print
print HEADER

# retrieve query string
query = cgi.FieldStorage()

if query.has_key("cmd"):
    cmd = query["cmd"].value

    # instantiate an ArticleManager
    am = ArticleManager()

    # do something for each command
    # Command: viewAll - list all articles
    if cmd == "va":
        am.viewAll()

    # Command: viewOne - view one article
    if cmd == "v1a":
        aname = query["af"].value
        am.viewOne(aname)

    # Command: postArticle - view the post-article page
    if cmd == "pa":
        am.postArticle()

    # Command: postData - take an actual article post
    if cmd == "pd":
        print "<p>Thank you for your post!</p>"
        am.postArticleData(query)
else:
    # Invalid selection
    print "<p>Your selection was not recognized.</p>"

# close the HTML
print "</body></html>"
Take note of the initial #!/usr/local/bin/python expression. As this is a CGI script, the operating system needs a hint on how to run it. If it were compiled C code, it could be executed directly by the web server; however, since it is a script, it needs to be handed off to a script interpreter, in this case Python. Note that we did not use the sh-bang line #!/usr/bin/env python; that could open a security hole when used with CGI scripts. See the documentation of Python’s cgi module for more information about CGI security issues and how to address them properly when using Python.