Tokens
All source code is divided into a stream of tokens. The compiler tries to collect as many contiguous characters as it can to build a valid token. (This is sometimes called the “max munch” rule.) It stops when the next character it would read cannot possibly be part of the token it is reading.
A token can be an identifier, a reserved keyword, a literal, or an operator or punctuation symbol. Each kind of token is described later in this section.
Step 3 of the compilation process reads preprocessor tokens. These tokens are converted automatically to ordinary compiler tokens as part of the main compilation in Step 7. The differences between a preprocessor token and a compiler token are small:
The preprocessor and the compiler might use different encodings for character and string literals.
The compiler treats integer and floating-point literals differently; the preprocessor does not.
The preprocessor recognizes <header> as a single token (for #include directives); the compiler does not.
Identifiers
An identifier is a name that you define or that is defined in a library. An identifier begins with a nondigit character and is followed by any number of digits and nondigits. A nondigit character is a letter, an underscore, or one of a set of universal characters. The exact set of nondigit universal characters is defined in the C++ standard and in ISO/IEC PDTR 10176. Basically, this set contains the universal characters that represent letters. Most programmers restrict themselves to the characters a...z, A...Z, and underscore, but the standard permits letters in other languages.
Not all compilers support universal characters in identifiers.
Certain identifiers are reserved for use by the standard library:
Any identifier that contains two consecutive underscores (like__this) is reserved; that is, you cannot use such an identifier for macros, class members, global objects, or anything else.
Any identifier that starts with an underscore followed by a capital letter (A-Z) is reserved.
Any identifier that starts with an underscore is reserved in the global namespace. You can use such names in other contexts (i.e., class members and local names).
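As a rough illustration of these rules, the following sketch uses only names that the rules permit and mentions reserved spellings in comments. The names Counter, next_value, count_twice, and the underscore-prefixed members are hypothetical, chosen for this example only.

```cpp
#include <cassert>

// Hypothetical class, used only to illustrate the reservation rules.
class Counter {
public:
    Counter() : _count(0) {}

    int next_value() {
        int _previous = _count;   // local name with a leading underscore: permitted
        _count = _previous + 1;
        return _count;
    }

private:
    int _count;                   // class member with a leading underscore: permitted
};

// Reserved spellings, shown as comments only:
//   int my__value;   // contains two consecutive underscores
//   int _Value;      // underscore followed by a capital letter
//   int _value;      // leading underscore in the global namespace

// Increments a fresh counter twice and returns the second value.
int count_twice() {
    Counter c;
    int first = c.next_value();
    assert(first == 1);
    return c.next_value();
}
```

Many style guides avoid leading underscores altogether, which sidesteps the question of which context a name appears in.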
The C standard reserves some identifiers for future use. These identifiers fall into two categories: function names and macro names. Function names are reserved and should not be used as global function or object names; you should also avoid using them as "C" linkage names in any namespace. Note that the C standard reserves these names regardless of which headers you #include. The reserved function names are:
is followed by a lowercase letter, such as isblank
mem followed by a lowercase letter, such as memxyz
str followed by a lowercase letter, such as strtof
to followed by a lowercase letter, such as toxyz
wcs followed by a lowercase letter, such as wcstof
In <cmath> with f or l appended, such as cosf and sinl
Macro names are reserved in all contexts. Do not use any of the following reserved macro names:
Identifiers that start with E followed by a digit or an uppercase letter
Identifiers that start with LC_ followed by an uppercase letter
Identifiers that start with SIG or SIG_ followed by an uppercase letter
Keywords
A keyword is an identifier that is reserved in all contexts for special use by the language. The following is a list of all the reserved keywords. (Note that some compilers do not implement all of the reserved keywords; these compilers allow you to use certain keywords as identifiers. See Section 1.5 later in this chapter for more information.)
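and | and_eq | asm | auto | bitand |
bitor | bool | break | case | catch |
char | class | compl | const | const_cast |
continue | default | delete | do | double |
dynamic_cast | else | enum | explicit | export |
extern | false | float | for | friend |
goto | if | inline | int | long |
mutable | namespace | new | not | not_eq |
operator | or | or_eq | private | protected |
public | register | reinterpret_cast | return | short |
signed | sizeof | static | static_cast | struct |
switch | template | this | throw | true |
try | typedef | typeid | typename | union |
unsigned | using | virtual | void | volatile |
wchar_t | while | xor | xor_eq |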
Literals
A literal is an integer, floating-point, Boolean, character, or string constant.
Integer literals
An integer literal can be a decimal, octal, or hexadecimal constant. A prefix specifies the base or radix: 0x or 0X for hexadecimal, 0 for octal, and nothing for decimal. An integer literal can also have a suffix that is a combination of U and L, for unsigned and long, respectively. The suffix can be uppercase or lowercase and can be in any order. The suffix and prefix are interpreted as follows:
If the suffix is UL (or ul, LU, etc.), the literal’s type is unsigned long.
If the suffix is L, the literal’s type is long or unsigned long, whichever fits first. (That is, if the value fits in a long, the type is long; otherwise, the type is unsigned long. An error results if the value does not fit in an unsigned long.)
If the suffix is U, the type is unsigned or unsigned long, whichever fits first.
Without a suffix, a decimal integer has type int or long, whichever fits first.
An octal or hexadecimal literal has type int, unsigned, long, or unsigned long, whichever fits first.
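The suffix rules can be checked at compile time. The sketch below assumes a compiler that supports C++11, because decltype, static_assert, and std::is_same are newer than the rules described here; the function name same_value_in_three_bases is made up for the example. The literal values are small enough to fit in the first candidate type on any conforming implementation.

```cpp
#include <cassert>
#include <type_traits>

// Each assertion pairs a literal with the type its suffix selects.
static_assert(std::is_same<decltype(314), int>::value,
              "no suffix, value fits in int");
static_assert(std::is_same<decltype(314u), unsigned>::value,
              "U suffix, value fits in unsigned");
static_assert(std::is_same<decltype(314LU), unsigned long>::value,
              "U and L in either order give unsigned long");
static_assert(std::is_same<decltype(0xFeeL), long>::value,
              "L suffix, value fits in long");
static_assert(std::is_same<decltype(0ul), unsigned long>::value,
              "lowercase suffix letters behave the same way");

// The prefix selects the base in which the digits are read;
// all three literals below denote the value 255.
bool same_value_in_three_bases() {
    return 255 == 0xFF && 255 == 0377;
}
```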
Some compilers offer other suffixes as extensions to the standard. See Appendix A for examples.
Here are some examples of integer literals:
314     // Legal
314u    // Legal
314LU   // Legal
0xFeeL  // Legal
0ul     // Legal
078     // Illegal: 8 is not an octal digit
032UU   // Illegal: cannot repeat a suffix
Floating-point literals
A floating-point literal has an integer part, a decimal point, a fractional part, and an exponent part. You must include the decimal point, the exponent, or both. You must include the integer part, the fractional part, or both. The signed exponent is introduced by e or E. The literal’s type is double unless there is a suffix: F for type float and L for long double. The suffix can be uppercase or lowercase.
Here are some examples of floating-point literals:
3.14159     // Legal
.314159F    // Legal
314159E-5L  // Legal
314.        // Legal
314E        // Illegal: incomplete exponent
314f        // Illegal: no decimal or exponent
.e24        // Illegal: missing integer or fraction
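The types of the legal literals above can be confirmed with a short sketch. It assumes a C++11 compiler for decltype, static_assert, and std::is_same; the function name exponent_scales_value is hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// The suffix, or its absence, determines the literal's type.
static_assert(std::is_same<decltype(3.14159), double>::value,
              "no suffix gives double");
static_assert(std::is_same<decltype(.314159F), float>::value,
              "F suffix gives float");
static_assert(std::is_same<decltype(314159E-5L), long double>::value,
              "L suffix gives long double");
static_assert(std::is_same<decltype(314.), double>::value,
              "the fractional part may be omitted");

// 314e-2 and 3.14 denote the same real number. The comparison uses a
// tolerance so that it does not depend on exact rounding behavior.
bool exponent_scales_value() {
    double diff = 314e-2 - 3.14;
    return diff < 1e-12 && diff > -1e-12;
}
```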
Character literals
Character literals are enclosed in single quotes. If the literal begins with L (uppercase only), it is a wide character literal (e.g., L'x'). Otherwise, it is a narrow character literal (e.g., 'x'). Narrow characters are used more frequently than wide characters, so the “narrow” adjective is usually dropped.
The value of a narrow or wide character literal is the value of the character’s encoding in the execution character set. If the literal contains more than one character, the literal value is implementation-defined. Note that a character might have different encodings in different locales. Consult your compiler’s documentation to learn which encoding it uses for character literals.
A narrow character literal with a single character has type char. With more than one character, the type is int (e.g., 'abc'). The type of a wide character literal is always wchar_t.
Tip
In C, a character literal always has type int. C++ changed the type of character literals to support overloading, especially for I/O (e.g., cout << '\n' starts a new line and does not print the integer value of the newline character).
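The effect on overload resolution can be seen in a small sketch. The overloaded function kind and the helper newline_selects_char_overload are hypothetical names, and the static assertions assume a C++11 compiler.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical overload set: each overload reports which one was chosen.
const char* kind(char)    { return "char"; }
const char* kind(wchar_t) { return "wchar_t"; }
const char* kind(int)     { return "int"; }

static_assert(std::is_same<decltype('x'), char>::value,
              "a narrow character literal has type char");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value,
              "a wide character literal has type wchar_t");
static_assert(sizeof('x') == 1, "sizeof(char) is 1 by definition");

// Because '\n' has type char, the char overload is an exact match.
bool newline_selects_char_overload() {
    return std::strcmp(kind('\n'), "char") == 0;
}
```

Arithmetic on a character promotes it to int, so an expression such as 'x' + 0 selects the int overload instead.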
A character literal can be a plain character (e.g., 'x'), an escape sequence (e.g., '\b'), or a universal character (e.g., '\u03C0'). Table 1-1 lists the possible escape sequences. Note that you must use an escape sequence for a backslash or single-quote character literal. Using an escape for a double quote or question mark is optional. Only the characters shown in Table 1-1 are allowed in an escape sequence. (Some compilers extend the standard and recognize other escape sequences.)
Escape sequence | Meaning |
\\ | \ character |
\' | ' character |
\" | " character |
\? | ? character |
\a | Alert or bell |
\b | Backspace |
\f | Form feed |
\n | Newline |
\r | Carriage return |
\t | Horizontal tab |
\v | Vertical tab |
\ooo | Octal number of one to three digits |
\xhh... | Hexadecimal number of one or more digits |
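The following sketch compares several escape sequences with their numeric values. It assumes an ASCII-compatible execution character set, which the standard does not guarantee; the function names are made up for this example.

```cpp
#include <cassert>

// Assumption: the execution character set is ASCII-compatible.
bool escapes_match_ascii() {
    return '\a' == 7
        && '\b' == 8
        && '\t' == 9
        && '\n' == 10
        && '\v' == 11
        && '\f' == 12
        && '\r' == 13
        && '\101' == 'A'    // octal 101 is decimal 65
        && '\x41' == 'A';   // hexadecimal 41 is decimal 65
}

// The escape is optional for a double quote or question mark, so the
// escaped and unescaped spellings denote the same character.
bool optional_escapes_agree() {
    return '\"' == '"' && '\?' == '?';
}
```

The named escapes such as '\t' stay correct on any character set, which is why they are usually preferred over numeric escapes.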
String literals
String literals are enclosed in double quotes. A string contains characters that are similar to character literals: plain characters, escape sequences, and universal characters. A string cannot cross a line boundary in the source file, but it can contain escaped line endings (backslash followed by newline).
A wide string literal is prefaced with L (always uppercase). In a wide string literal, a single universal character always maps to a single wide character. In a narrow string literal, the implementation determines whether a universal character maps to one or multiple characters (called a multibyte character). See Chapter 8 for more information on multibyte characters.
Two adjacent string literals (possibly separated by whitespace, including new lines) are concatenated at compile time into a single string. This is often a convenient way to break a long string across multiple lines. Do not try to combine a narrow string with a wide string in this way.
After concatenating adjacent strings, the null character ('\0' or L'\0') is automatically appended after the last character in the string literal.
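A short sketch of concatenation and the single terminating null character follows. The variable and function names are hypothetical, and the static assertion assumes a C++11 compiler.

```cpp
#include <cassert>
#include <cstring>

// Three spellings of the same string; adjacent literals are joined
// at compile time, even across line breaks.
const char* const plain  = "hello, reader";
const char* const pieces = "hello, " "rea" "der";
const char* const split  =
    "hello, "
    "reader";

// One null character is appended after concatenation, not one per piece:
// four characters plus the terminator.
static_assert(sizeof("ab" "cd") == 5, "a single terminating null");

bool all_forms_equal() {
    return std::strcmp(plain, pieces) == 0
        && std::strcmp(plain, split) == 0;
}
```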
Here are some examples of string literals. Note that the first three form identical strings.
"hello, reader"
"hello, \
reader"
"hello, " "rea" "der"
"Alert: \a; ASCII tab: \011; portable tab: \t"
"illegal: unterminated string
L"string with \"quotes\""
A string literal’s type is an array of const char. For example, "string"’s type is const char[7]. Wide string literals are arrays of const wchar_t. All string literals have static lifetimes (see Chapter 2 for more information about lifetimes).
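The array sizes can be checked with sizeof, as in the sketch below. It assumes a C++11 compiler for static_assert; greeting and literal_length are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>

// Six characters plus the terminating null character.
static_assert(sizeof("string") == 7, "const char[7]");
static_assert(sizeof(L"string") == 7 * sizeof(wchar_t), "const wchar_t[7]");

// Returning a pointer to a string literal is safe because the literal
// has static lifetime; it outlives the function call.
const char* greeting() {
    return "string";
}

// The length without the terminating null character.
std::size_t literal_length() {
    return sizeof("string") - 1;
}
```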
As with an array of const anything, the compiler can automatically convert the array to a pointer to the array’s first element. You can, for example, assign a string literal to a suitable pointer object:
const char* ptr;
ptr = "string";
As a special case, you can also convert a string literal to a non-const pointer. Attempting to modify the string results in undefined behavior. This conversion is deprecated, and well-written code does not rely on it.
Symbols
Nonalphabetic symbols are used as operators and as punctuation (e.g., statement terminators). Some symbols are made of multiple adjacent characters. The following are all the symbols used for operators and punctuation:
{ } [ ] # ## ( ) <: :>
<% %> %: %:%: ; : ... ? :: .
.* + - * / % ^ & | ~
! = < > += -= *= /= %= ^=
&= |= << >> >>= <<= == != <= >=
&& || ++ -- , ->* ->
You cannot insert whitespace between characters that make up a symbol, and C++ always collects as many characters as it can to form a symbol before trying to interpret the symbol. Thus, an expression such as x+++y is read as x ++ + y. A common error when first using templates is to omit a space between closing angle brackets in a nested template instantiation. The following is an example with that space:
std::list<std::vector<int> > list;
                          ↑ Note the space here.
The example is incorrect without the space character because the adjacent greater than signs would be interpreted as a single right-shift operator, not as two separate closing angle brackets. Another, slightly less common, error is instantiating a template with a template argument that uses the global scope operators:
::std::list< ::std::list<int> > list;
            ↑                ↑ Space here and here
Again, a space is needed, this time between the angle bracket (<) and the scope operator (::), to prevent the compiler from seeing the first token as <: rather than <. The <: token is an alternative token, as described in Section 1.5 later in this chapter.
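The tokenization of x+++y and the two template declarations can be exercised in a short sketch. The names add_with_munch, list_of_vectors, list_of_lists, and nested_size are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// x+++y is tokenized as x ++ + y: the result is the old value of x
// plus y, and x is incremented afterward.
int add_with_munch(int& x, int y) {
    return x+++y;
}

// The spaces keep the tokens separate: "> >" is not read as the
// right-shift operator, and "< ::" is not read as the <: token.
typedef std::list<std::vector<int> > list_of_vectors;
typedef ::std::list< ::std::list<int> > list_of_lists;

// Stores one vector of three elements and reports that vector's size.
std::size_t nested_size() {
    list_of_vectors lv;
    lv.push_back(std::vector<int>(3, 0));
    return lv.front().size();
}
```

Writing x++ + y with explicit spaces produces the same tokens and is easier to read.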