XML Hacks by Michael Fitzgerald

Convert Text to XML with Uphill

This hack is a little different. It shows you how to convert plain text to XML using Dave Pawson’s Java program, Uphill. Along the way, Dave also explains how and why he developed the software, which may be helpful for those developing their own text-to-XML packages in Java.

Marking up unformatted text as XML by hand is boring and repetitive, which is just the sort of problem that a computer is good at. The catch is that most text is not regular, and handling the irregularities is the cost side of automation. I decided to try to create an automated solution that would cost less than a by-hand conversion. That’s why I wrote Uphill (http://www.dpawson.co.uk/java/uphill/), a Java program for converting plain text into XML.

The goal for the program was to output a new file containing the XML markup for headings, paragraphs, and acronyms (needed for Braille output). First, I prototyped a solution with Python (http://www.python.org/) because Python has dictionaries that can be preloaded. I had a list of acronyms that I quickly converted into a Python structure to initialize a dictionary. The match I used was:

if acrs.has_key(str[i:i+4]):

I walked the input string, testing for four-letter, then three-letter, then two-letter acronyms. It worked, and though it was weak, it gave me enough confidence to move on.
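The walk described above can be sketched in current Python roughly as follows. This is an illustration, not the original prototype: the function name mark_acronyms, its parameters, and the NATO and UK entries are assumptions made for the example. It uses the in operator because dict.has_key() was removed in Python 3.

```python
def mark_acronyms(text, acrs, max_len=4, min_len=2):
    """Walk text left to right, replacing the longest acronym found at each position.

    acrs maps an acronym to its marked-up replacement (hypothetical structure).
    """
    out = []
    i = 0
    while i < len(text):
        # Try four-letter, then three-letter, then two-letter candidates.
        for n in range(max_len, min_len - 1, -1):
            chunk = text[i:i + n]
            if len(chunk) == n and chunk in acrs:
                out.append(acrs[chunk])
                i += n
                break
        else:
            # No acronym starts here; copy one character and move on.
            out.append(text[i])
            i += 1
    return "".join(out)


# Sample dictionary; only the USA entry comes from the text.
acrs = {
    "NATO": "<acr>NATO</acr>",
    "USA": "<acr>USA</acr>",
    "UK": "<acr>UK</acr>",
}

print(mark_acronyms("The USA and UK joined NATO.", acrs))
```

A plain substring match like this one also tags USA inside a longer word such as BUSAN, which is one way a sketch of this kind can be weak; checking word boundaries would be a natural refinement.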

A line from my acronym file looks like this:

USA:<acr>USA</acr>

That is, the acronym USA is marked up with the acr tag. I realized that some acronyms may be generalized. ...
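Reading lines in the format shown above into a Python dictionary could be sketched as follows. The function name load_acronyms is hypothetical, and the sketch assumes one entry per line with the first colon separating the acronym from its markup; the actual file handling in the prototype or in Uphill may differ.

```python
def load_acronyms(lines):
    """Build a dict from lines shaped like 'USA:<acr>USA</acr>'.

    Assumes the first colon on a line separates key from markup.
    """
    acrs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, sep, markup = line.partition(":")
        if not sep:
            continue  # skip lines without a colon
        acrs[key] = markup
    return acrs


sample = ["USA:<acr>USA</acr>", "", "UK:<acr>UK</acr>\n"]
print(load_acronyms(sample))
```

Because the function accepts any iterable of strings, an open file object can be passed in directly.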
