
Breaking the Flow of Text
Markup can be applied even to parts of words. Should it affect the way in which the
textual content is processed, such as the recognition of words? Consider the
(old-fashioned) HTML markup <b>F</b>oo, intended to make the word Foo appear with its
first letter in bold. Could search engines, for example, treat it as two words, “F” and “oo”?
Search engines generally parse HTML in a manner that effectively ignores most tags.
It is, however, possible that some programs do otherwise, either because they have poorly
written parsers or because they have intentionally been programmed to honor markup
to some extent. The latter would be quite natural for markup like <p>xxx</p><p>yyy</p>,
where the two elements should be treated as paragraphs and the strings xxx and yyy
as separate, not as xxxyyy.
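The two behaviors described above can be compared with a small experiment. The following is only a sketch, written with Python's standard html.parser module; it is not how any particular search engine works. The names WordExtractor, extract_words, honor_blocks, and break_on_all_tags, as well as the short list in BLOCK_TAGS, are assumptions made for this illustration.

```python
from html.parser import HTMLParser

# Assumption: a small, illustrative set of elements treated as separators.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "br"}


class WordExtractor(HTMLParser):
    """Collects text content, optionally inserting breaks at tags."""

    def __init__(self, honor_blocks=True, break_on_all_tags=False):
        super().__init__()
        self.honor_blocks = honor_blocks            # break at tags in BLOCK_TAGS
        self.break_on_all_tags = break_on_all_tags  # break at every tag
        self.parts = []

    def _maybe_break(self, tag):
        # Insert a space where this extractor considers the text to be split.
        if self.break_on_all_tags or (self.honor_blocks and tag in BLOCK_TAGS):
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._maybe_break(tag)

    def handle_endtag(self, tag):
        self._maybe_break(tag)

    def handle_data(self, data):
        self.parts.append(data)


def extract_words(html, **options):
    """Return the list of words that the extractor sees in the markup."""
    parser = WordExtractor(**options)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).split()


print(extract_words("<b>F</b>oo"))
print(extract_words("<b>F</b>oo", break_on_all_tags=True))
print(extract_words("<p>xxx</p><p>yyy</p>"))
print(extract_words("<p>xxx</p><p>yyy</p>", honor_blocks=False))
```

With the default settings, the b element leaves the word intact and the p elements separate xxx from yyy. Setting break_on_all_tags imitates a program that splits text at every tag, and turning honor_blocks off imitates one that ignores all tags, so that the two paragraphs run together.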
In practice, search engines differ. Google treats <b>F</b>oo as “Foo,” whereas AltaVista
treats it as two words, “F oo.” Moreover, search engine behavior may vary by situation
and version. It is thus best to avoid using markup that breaks words, unless you have
a real need for it.
For a markup language like HTML, it would be natural to think that inline (text-level)
markup (like b for boldface) does not separate characters in any way, whereas
block-level markup (like p for paragraph) acts as a separator. However, neither HTML
specifications nor the Unicode standard discuss this issue, and ...