You have to count the number of characters, words, and lines (or some other type of text element) in a text file.
Use an input stream to read the characters in, one at a time, and increment local statistics as you encounter characters, words, and line breaks. Example 4-26 contains the function countStuff, which does exactly that.
Example 4-26. Calculating statistics about a text file
```cpp
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <cctype>

using namespace std;

void countStuff(istream& in, int& chars, int& words, int& lines) {

   char cur = '\0';
   char last = '\0';
   chars = words = lines = 0;

   while (in.get(cur)) {
      if (cur == '\n')               // '\n' ends both "\n" and "\r\n" lines
         lines++;
      else
         chars++;
      // Cast to unsigned char: a negative char is undefined for isalnum
      if (!std::isalnum(static_cast<unsigned char>(cur)) &&  // This is the end of a
          std::isalnum(static_cast<unsigned char>(last)))    // word
         words++;
      last = cur;
   }
   if (chars > 0) {                                          // Adjust word and line
      if (std::isalnum(static_cast<unsigned char>(last)))    // counts for special
         words++;                                            // case
      lines++;
   }
}

int main(int argc, char** argv) {

   if (argc < 2)
      return(EXIT_FAILURE);

   ifstream in(argv[1]);

   if (!in)
      exit(EXIT_FAILURE);

   int c, w, l;

   countStuff(in, c, w, l);

   cout << "chars: " << c << '\n';
   cout << "words: " << w << '\n';
   cout << "lines: " << l << '\n';
}
```
The algorithm here is straightforward. Characters are easy: increment the character count each time get returns a character from the input stream that is not a line break. Lines are only slightly more difficult, since the way a line ends depends on the operating system. Thankfully, it’s usually either a newline character (\n) or a carriage return/line feed sequence (\r\n). By keeping track of the current and last characters, you can recognize this sequence, and since both forms end in \n, counting \n covers either one. Words are easy or hard, depending on your definition of a word.
For Example 4-26, I consider a word to be a contiguous sequence of alphanumeric characters. As I look at each character in the input stream, when I encounter a nonalphanumeric character, I look at the previous character to see if it was alphanumeric. If it was, then a word has just ended and I can increment the word count. I can tell if a character is alphanumeric by using isalnum from <cctype>. But that’s not all: you can test characters for a number of different qualities with similar functions. See Table 4-3 for the functions you can use to test character qualities. For wide characters, use the functions of the same name but with a “w” after the “is,” e.g., iswspace. The wide-character versions are declared in the header <cwctype>.
Table 4-3. Character test functions from <cctype> and <cwctype>
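| Function | Description |
|---|---|
| isalpha, iswalpha | Alpha characters: a-z, A-Z (upper- or lowercase). |
| isupper, iswupper | Alpha characters in uppercase only: A-Z. |
| islower, iswlower | Alpha characters in lowercase only: a-z. |
| isdigit, iswdigit | Numeric characters: 0-9. |
| isxdigit, iswxdigit | Hexadecimal numeric characters: 0-9, a-f, A-F. |
| isspace, iswspace | Whitespace characters: ' ', \n, \t, \v, \f, \r. |
| iscntrl, iswcntrl | Control characters: ASCII 0-31 and 127. |
| ispunct, iswpunct | Punctuation characters that don’t belong to the previous groups. |
| isalnum, iswalnum | Alphanumeric characters: isalpha or isdigit is true. |
| isprint, iswprint | Printable ASCII characters, including the space character. |
| isgraph, iswgraph | Printable characters other than space: isalnum or ispunct is true. |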
After all characters have been read in and the end of the stream has been reached, there is a bit of adjustment to do. First, the loop only counts line breaks, and not, strictly speaking, lines. Therefore, the count will always be one less than the actual number of lines. To make this problem go away, I just increment the line count by one if there are more than zero characters in the file. Second, if the stream ends with an alphanumeric character, the test for the end of the last word will never occur because I can’t test the next character. To account for this, I check whether the last character in the stream is alphanumeric (also only when there are more than zero characters in the file) and, if it is, increment the word count by one.
The technique in Example 4-26 of using streams is nearly identical to that described in Recipe 4.14 and Recipe 4.15, but simpler since it’s just inspecting the file and not making any changes.