O'Reilly logo

Embedding Perl in HTML with Mason by Ken Williams, Dave Rolsky

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Syntax: Your Very Own Lexer

A request heard every so often on the Mason users list is for some way to create an XML-based markup language that can be used with Mason and that can be compiled to a Mason component object.

Despite the panic the thought of such a thing inspires in us, in the interests of good documentation, we will show the beginnings of such a lexer.

This lexer object will make use of several modules from CPAN, including XML::SAX::ParserFactory and XML::SAX::Base . The former is what it sounds like, a factory for SAX parsers (SAX2 parsers, actually). The latter is what any SAX2 handler should use as a base class. It implements a default no-op method for all the possible SAX2 methods, allowing you to simply implement those that you need. Our lexer will be a SAX2 handler, so we will inherit from XML::SAX::Base.

A quick side note on SAX (Simple API for XML): SAX is an event-based API for parsing XML. As the parser finds XML constructs, such as tags or character data, it calls appropriate methods in a SAX handler, such as start_element() or characters(). The parser is an event producer and the handler, like our Lexer, is an event consumer. In our case, the Lexer will also be generating events for the Compiler, though these will not be SAX events.

For more information on Perl’s implementation of SAX2, see the perl-xml project on Sourceforge at http://perl-xml.sourceforge.net/.

For the purposes of our example, let’s assume that any element that is not in the mason XML namespace will be output verbatim, as will any text. For tags, we’ll just implement <mason:args>, <mason:init>, <mason:perl>, and <mason:output> in this example.[25] The <mason:init> tag will contain XML-escaped Perl code, while the <mason:args> tag will contain zero or more <mason:arg> tags. Each <mason:arg> tag will have the attributes name and default, with name being required.

We will also implement a <mason:component> tag in order to provide a single top-level containing tag for the component, which is an XML requirement.

This is only a subset of the Mason syntax set, but it’s enough to show you how to customize a fairly important part of the system.

Using these tags, we might have some XML like this:

<?xml version="1.0"?>
<mason:component xmlns:mason="http://www.masonbook.com/">
 This is plain text.
 <b>This is text in an HTML tag</b>
  my $x;
  if ($y &gt; 10) {
      $x = 10;
  } else {
      $x = 100;
 $x is <mason:output>$x</mason:output>
 $y is <mason:output>$y</mason:output>

  <mason:arg name="$y" />
  <mason:arg name="@z" default="(2,3)" />
  $y *= $_ foreach @z;

OK, that looks just beautiful!

Let’s start with the preliminaries.

package HTML::Mason::Lexer::XML;
$VERSION = '0.01';

use strict;

use HTML::Mason::Exceptions( abbr => [ qw( param_error syntax_error error ) ] );

use HTML::Mason::Lexer;
use Params::Validate qw(:all);
use XML::SAX::Base;
use XML::SAX::ParserFactory;
use base qw(HTML::Mason::Lexer XML::SAX::Base);  # Lexer comes first

As mentioned before, XML::SAX::Base provides default no-op methods for all of the possible SAX2 events, of which there are many. Since we’re not interested in most of them, it’s nice to have them safely ignored. We inherit from HTML::Mason::Lexer because it provides a few methods that the compiler class needs, such as object_id( ).

Because we’re staunch generalists, we won’t insist that the XML namespace of our tags needs to be ' mason' . We’ll let the user override this with a parameter if desired:

_ _PACKAGE_ _->valid_params
   xml_namespace => { parse => 'string', type => SCALAR, default => 'mason',
                      descr => "Prefix of XML tags indicating Mason sections" },

We don’t need to make a separate new( ) method in our module, since we can just inherit the one provided by our base Lexer class. The main action will happen in the lex( ) method:

sub lex {
    my ($self, %p) = @_;

    local $self->{name} = $p{name};
    local $self->{compiler} = $p{compiler};

We need a convenient place to keep these, so we stick them into $self for the duration of lexing. Perl’s local( ) function makes sure these entries expire at the end of the lex( ) method:

$self->{state} = [ ];

We’ll need to keep a stack of what tags we’ve seen so we can check that tags aren’t improperly nested and in order to handle characters( ) events correctly:

my $parser = XML::SAX::ParserFactory->parser( Handler => $self );

We could have created the parser object in our new() method, but to store it we would have had to save it in the lexer object’s structure, which would have created a circular reference. Doing it this way guarantees that the reference to the parser will go out of scope when we’re finished using it.

      $parser->parse_string( $p{comp_source} );

The last bit tells the parser to parse the component text we were given. That will cause the parser to in turn call methods for each SAX event that occurs while parsing the string.

Now we’ll take a look at our event-handling methods. The first is start_element( ) , which will be called whenever an XML tag is first encountered:

sub start_element {
    my $self = shift;
    my $elt  = shift;

    if ( ! defined $elt->{Prefix} ||
         $elt->{Prefix} ne $self->{xml_namespace} ) {

If we got something that isn’t in our designated namespace we’ll just pass it through to the compiler as text to be output:

if ( $elt->{LocalName} eq 'component' ) {

When the component starts, we notify the compiler so it can do any initialization that it needs to do:

foreach my $block ( qw( init perl args ) ) {
    if ( $elt->{LocalName} eq $block ) {

if ( $elt->{LocalName} eq 'output' ) {

if ( $elt->{LocalName} eq 'arg' ) {

The rest of this method is basically a switch statement. Depending on what type of element we receive, we call the appropriate internal method to handle that element.

Let’s look at some of the individual methods that are called:

sub _verbatim_start_element {
    my $self = shift;
    my $elt  = shift;
    my $xml = '<' . $elt->{Name};

    my @att;
    foreach my $att ( values %{ $elt->{Attributes} } ) {
        push @att, qq|$att->{Name}="$att->{Value}"|;

    if (@att) {
        $xml .= ' ';
        $xml .= join ' ', @att;

    $xml .= '>';

    $self->{compiler}->text( text => $xml );

Basically, this method goes through some contortions to regenerate the original XML element and then passes it on to the compiler as plain text. It should be noted that this implementation will end up converting tags like <foo/> into tag pairs like <foo></foo>. This is certainly valid XML but it may be a bit confusing to users. Unfortunately, there is no easy way to retrieve the exact text of the source document to determine how a tag was originally written, and with XML you’re not supposed to care anyway.

Back to our subclass. The next method to implement is our _start_block( ) method. This will handle the beginning of a number of blocks in a simple generic fashion:

sub _start_block {
    my $self  = shift;
    my $block = shift;

    if ( $self->{state}[-1] &&
         $self->{state}[-1] ne 'def' &&
         $self->{state}[-1] ne 'method' ) {
        syntax_error "Cannot nest a $block tag inside a $self->{state}[-1] tag";

What we are doing here is making it impossible to do something like nest a <mason:init> tag inside a <mason:perl> block. In fact, the only tags that can contain other tags are method and subcomponent definition tags, which are unimplemented in this example.

We notify the compiler that a new block has started and then push the block name onto our internal stack so we have access to it later:

$self->{compiler}->start_block( block_type => $block );

push @{ $self->{state} }, $block;

Again, we check for basic logical errors:

sub _start_output {
    my $self = shift;

    if ( $self->{state}[-1] &&
         $self->{state}[-1] ne 'def' &&
         $self->{state}[-1] ne 'method' ) {
        syntax_error "Cannot nest an output tag inside a $self->{state}[-1] tag";

Again, we push this onto the stack so we know that this was the last tag we saw:

push @{ $self->{state} }, 'output';

The variable name and default are expressed as attributes of the element. The weird '{}name' syntax is intentional. Read the Perl SAX2 spec mentioned earlier for more details on what this means.

sub _handle_argument {
    my $self = shift;
    my $elt  = shift;

    my $var = $elt->{Attributes}{'{}name'}{Value};
    my $default = $elt->{Attributes}{'{}default'}{Value};

We want to check that the variable name is a valid Perl variable name:

unless ( $var =~ /^[\$\@%][^\W\d]\w*/ ) {
    syntax_error "Invalid variable name: $var";

Then we tell the compiler that we just found a variable declaration.

$self->{compiler}->variable_declaration( block_type => 'args',
                                         type => substr( $var, 0, 1 ),
                                         name => substr( $var, 1 ),
                                         default => $default );

That wraps up all the methods that start_element( ) calls. Now let’s move on to handling a characters( ) SAX event. This happens whenever the SAX parser encounters data outside of an XML tag.

sub characters {
    my $self  = shift;
    my $chars = shift;

    if ( ! $self->{state}[-1] ||
         $self->{state}[-1] eq 'def' ||
         $self->{state}[-1] eq 'method' ) {
        $self->{compiler}->text( text => $chars->{Data} );

If we’re in the main body of a component, subcomponent, or method, we simply pass the character data on as text:

if ( $self->{state}[-1] eq 'init' ||
     $self->{state}[-1] eq 'perl' ) {
    $self->{compiler}->raw_block( block_type => $self->{state}[-1],
                                  block => $chars->{Data} );

Character data in a <mason:init> or <mason:perl> section is passed to the compiler as the contents of that block. The compiler knows what type of tag is currently being processed and handles it appropriately.

if ( $self->{state}[-1] eq 'output' ) {
    $self->{compiler}->substitution( substitution => $chars->{Data} );

If we are in a substitution tag, we call a different compiler method instead. Otherwise, we’ll simply end up discarding the contents.

Since we may be dealing with text where whitespace is significant (as opposed to HTML), we’ll want to pass on whitespace as if it were character data:

sub ignorable_whitespace { $_[0]->characters($_[1]->{Data}) }

This method may be called if the XML parser finds a chunk of “ignorable whitespace.” Frankly, we can never ignore whitespace, because it is just so cool, and without it our code would be unreadable. But apparently XML parsers can ignore it.[26]

The last thing we need to handle is an end_element( ) event:

sub end_element {
    my $self = shift;
    my $elt  = shift;

    if ( ! defined $elt->{Prefix} ||
         $elt->{Prefix} ne $self->{xml_namespace} ) {

Again, XML elements not in our designated namespace are passed on verbatim to the compiler:

if ( $elt->{LocalName} eq 'component' ) {

If we have reached the end tag of the component, we inform the compiler that the component is complete and we return:

return if $elt->{LocalName} eq 'arg';

We don’t need to do anything to end argument declarations. The work needed to handle this element happened when we called _handle_argument( ) from our start_element( ) method.

if ( $self->{state}[-1] ne $elt->{LocalName} ) {
  syntax_error "Something very weird happened.  " .
               "We encountered an ending tag for a $elt->{LocalName} tag " .
               "before ending our current tag ($self->{state}[-1]).";

Actually, this should just never happen: XML does not allow tag overlap and, if the parser finds overlapping tags, it should die rather than passing them to us. But we believe in being paranoid. If there is an error in the logic of this lexer code, this might help us in catching it.

if ( $elt->{LocalName} eq 'output' ) {
    pop @{ $self->{state} };

Any output that needed to be sent has already been dealt with via the characters( ) method so we simply need to change our state if the end tag was </mason:output>:

$self->{compiler}->end_block( block_type => $elt->{LocalName} );

pop @{ $self->{state} };

The only remaining possibilities at this point are either <mason:perl>, <mason:init>, or <mason:args>. For these we simply tell the compiler that the block is over, change our state, and finish.

The last method we need to write is _verbatim_end_element() to pass through tag endings for non-Mason tags:

sub _verbatim_end_element {
    my $self = shift;
    my $elt  = shift;

    $self->{compiler}->text( text => "</$elt->{Name}>" );

This concludes our sample lexer subclass. Note that there are a couple of things missing here. First of all, there is no handling of subcomponents or methods. This wouldn’t be too terribly hard as it’s mostly an issue of calling the right methods on the compiler.

We also would want to handle line numbers. The default Mason lexer keeps track of line numbers in the source file so that the compiler can output appropriate #line directives in the object file, meaning that errors are reported relative to the source file. This feature isn’t required but can be very nice to have.

Some of the unhandled potential tags like <mason:text> would be extremely trivial to implement. The <mason:flags> and <mason:attr> tags could be modeled on the code for handling <mason:args>. And of course, we need to handle component calls too. This is the point in this example where we say, “finishing this is left as an exercise to the reader.”

To use this new lexer class, we would either place the following in the httpd.conf file:

PerlSetVar MasonLexerClass HTML::Mason::Lexer::XML

or, when creating the ApacheHandler object, we would simply pass in 'HTML::Mason::Lexer::XML ' as the value of the lexer_class parameter.

[25] The equivalent of <% %> in the sane world where people don’t use XML for everything!

[26] See Section 2.10 of the W3C XML 1.0 Recommendation for the definition of “ignorable whitespace.”

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required