
The concept is sometimes confused with a combining character sequence, i.e., a
sequence consisting of a base character and one or more combining characters (such as
combining accents). Although a combining character sequence can also be a text
element, that is incidental. A text element is whatever an application regards as a text
element.
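To make the distinction concrete, here is a minimal sketch in Python (a language chosen for this illustration only) that tests whether a string is a combining character sequence. The helper name is_combining_sequence is hypothetical, and the test is an approximation: it treats any character whose General Category starts with "M" as a combining character and anything else as a possible base.

```python
import unicodedata

def is_mark(ch: str) -> bool:
    # Approximation: combining characters have General Category Mn, Mc, or Me.
    return unicodedata.category(ch).startswith("M")

def is_combining_sequence(s: str) -> bool:
    """True if s is one non-mark character followed by one or more marks."""
    if len(s) < 2:
        return False
    return not is_mark(s[0]) and all(is_mark(ch) for ch in s[1:])

seq = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
print(len(seq))                       # 2 (two code points)
print(is_combining_sequence(seq))     # True
print(is_combining_sequence("ab"))    # False
```

Whether an application then treats such a sequence as one text element (for example, when moving a cursor) is the application's decision and is not determined by this check.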
Unicode Strings
The term “Unicode string” has a more technical meaning than you might expect. It
does not refer to a string (sequence) of Unicode characters (code points) but to a
sequence of code units. Thus, the components of the string are of fixed size in bits (in
practice, 8, 16, or 32 bits). In many programming languages, Unicode strings have a
code unit size of 16 bits. This does not limit the range of characters, since such a string
can be interpreted according to UTF-16, in which a character outside the 16-bit range
is represented by a pair of code units.
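The difference between code points and code units can be sketched in Python. Python's own str type counts code points, so this example encodes the text and counts fixed-size units in the result. The function name code_unit_count is hypothetical, and the sample text is an assumption made for illustration.

```python
def code_unit_count(text: str, bits: int) -> int:
    """Number of code units needed for text in UTF-8, UTF-16, or UTF-32."""
    # The "-le" variants are used so that no byte order mark is added.
    encoding = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}[bits]
    return len(text.encode(encoding)) // (bits // 8)

text = "a\U0001D11E"  # "a" followed by U+1D11E MUSICAL SYMBOL G CLEF
print(len(text))                  # 2 code points
print(code_unit_count(text, 8))   # 5 code units (1 + 4 octets)
print(code_unit_count(text, 16))  # 3 code units (1 + a pair)
print(code_unit_count(text, 32))  # 2 code units
```

With 16-bit code units, the character outside the 16-bit range takes two components of the string, which is why the length of a Unicode string in such a language can exceed the number of characters.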
Thus, a component of a Unicode string need not correspond to a character. A code unit
could be part of the representation of a character (say, the second octet of a two-octet
representation in UTF-8). Even if a code unit as such represents a code point, it can be
a noncharacter or an unassigned code point.
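The following Python sketch illustrates both cases. The first part shows the second octet of a two-octet UTF-8 representation, which is not a character by itself. The second part classifies individual 16-bit code units; the function describe_utf16_unit is hypothetical, and its "unassigned" result depends on the version of the Unicode database bundled with the Python interpreter in use.

```python
import unicodedata

# U+00E9 (e with acute accent) has a two-octet representation in UTF-8.
octets = "\u00e9".encode("utf-8")
print([hex(b) for b in octets])  # ['0xc3', '0xa9']

# The second octet alone is a code unit but does not represent a character.
try:
    octets[1:].decode("utf-8")
except UnicodeDecodeError:
    print("0xa9 alone is not a character")

def describe_utf16_unit(unit: int) -> str:
    """Rough classification of a single 16-bit code unit."""
    if 0xD800 <= unit <= 0xDFFF:
        return "surrogate"       # only half of a pair, not a character
    if 0xFDD0 <= unit <= 0xFDEF or unit in (0xFFFE, 0xFFFF):
        return "noncharacter"    # a code point, but never a character
    if unicodedata.category(chr(unit)) == "Cn":
        return "unassigned"      # per the bundled Unicode database
    return "assigned"

for unit in (0x0061, 0xD834, 0xFFFF):
    print(hex(unit), describe_utf16_unit(unit))
```

The loop reports 0x61 as assigned, 0xd834 as a surrogate, and 0xffff as a noncharacter.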
Although a Unicode string is often in some encoding, this is not a requirement. It is
possible to consider any sequence of octets as a Unicode string, even if the sequence
does not conform to the rules of any Unicode encoding (in practice, UTF‑8 in this
case). You could also ...