4.17. Counting the Number of Characters, Words, and Lines in a Text File

Problem

You need to count the number of characters, words, and lines (or some other type of text element) in a text file.

Solution

Use an input stream to read the characters in, one at a time, and increment local statistics as you encounter characters, words, and line breaks. Example 4-26 contains the function countStuff, which does exactly that.

Example 4-26. Calculating statistics about a text file

#include <iostream>
#include <fstream>
#include <cstdlib>
#include <cctype>

using namespace std;

void countStuff(istream& in,
                int& chars,
                int& words,
                int& lines) {

   char cur = '\0';
   char last = '\0';
   chars = words = lines = 0;

   while (in.get(cur)) {
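      if (cur == '\n')            // '\n' ends both "\n" and "\r\n" lines
         lines++;
      else
         chars++;
      // The cast avoids undefined behavior for negative char values
      if (!std::isalnum(static_cast<unsigned char>(cur)) &&  // This is the end
          std::isalnum(static_cast<unsigned char>(last)))    // of a word
         words++;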
      last = cur;
   }
   if (chars > 0) {               // Adjust word and line counts
      if (std::isalnum(static_cast<unsigned char>(last)))
         words++;                 // for the final word and line
      lines++;
   }
}

int main(int argc, char** argv) {

   if (argc < 2)
      return(EXIT_FAILURE);

   ifstream in(argv[1]);

   if (!in)
      exit(EXIT_FAILURE);

   int c, w, l;

   countStuff(in, c, w, l);
   cout << "chars: " << c << '\n';
   cout << "words: " << w << '\n';
   cout << "lines: " << l << '\n';
}

Discussion

The algorithm here is straightforward. Characters are easy: increment the character count each time get returns a character that is not a line break. Lines are only slightly more difficult, since the way a line ends depends on the operating system. Thankfully, it’s usually either a newline character (\n) or a carriage return/line feed sequence (\r\n). Both end in \n, so counting \n characters covers either convention. Keeping track of the current and last characters is what lets you detect where a word ends. Words are easy or hard, depending on your definition of a word.
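If you need to handle line endings explicitly, for example when reading a file in binary mode where no newline translation takes place, the current/last bookkeeping can be sketched as follows. This is only an illustration: the helper names countLineBreaks and countLineBreaksInString are hypothetical, and treating a lone \r as a line break is an assumption that goes beyond Example 4-26.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts line breaks, treating "\n", "\r\n",
// and a lone "\r" each as a single break.
int countLineBreaks(std::istream& in) {
   int breaks = 0;
   char cur = '\0';
   char last = '\0';

   while (in.get(cur)) {
      if (cur == '\r')
         ++breaks;          // A lone "\r" or the start of "\r\n"
      else if (cur == '\n' && last != '\r')
         ++breaks;          // A "\n" not already counted via "\r"
      last = cur;
   }
   return breaks;
}

// Hypothetical convenience wrapper for in-memory text.
int countLineBreaksInString(const std::string& text) {
   std::istringstream in(text);
   return countLineBreaks(in);
}
```

Calling countLineBreaksInString("a\nb\r\nc") counts two breaks, because the \r\n pair is counted once rather than twice.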

For Example 4-26, I consider a word to be a contiguous sequence of alphanumeric characters. As I look at each character in the input stream, when I encounter a nonalphanumeric character, I look at the previous character to see if it was alphanumeric. If it was, then a word has just ended and I can increment the word count. I can tell if a character is alphanumeric by using isalnum from <cctype>. But that’s not all—you can test characters for a number of different qualities with similar functions. See Table 4-3 for the functions you can use to test character qualities. For wide characters, use the functions of the same name but with a “w” after the “is,” e.g., iswspace. The wide-character versions are declared in the header <cwctype>.
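To see how much the definition of a word matters, here is a minimal sketch that counts words under two definitions: runs of alphanumeric characters (as in Example 4-26) and runs of non-whitespace characters. The names countRuns, countAlnumWords, and countSpaceDelimitedWords are hypothetical, and the sketch assumes a C++11 compiler for the lambdas and the range-based for loop.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: counts maximal runs of characters for which
// isWordChar returns true.
template <typename Pred>
int countRuns(const std::string& text, Pred isWordChar) {
   int runs = 0;
   bool inRun = false;

   for (char ch : text) {
      // Cast to unsigned char first: passing a negative value to the
      // <cctype> functions is undefined behavior.
      bool wordChar = isWordChar(static_cast<unsigned char>(ch));
      if (wordChar && !inRun)
         ++runs;            // A new run starts here
      inRun = wordChar;
   }
   return runs;
}

// A word is a run of alphanumeric characters.
int countAlnumWords(const std::string& text) {
   return countRuns(text,
      [](unsigned char c) { return std::isalnum(c) != 0; });
}

// A word is a run of non-whitespace characters.
int countSpaceDelimitedWords(const std::string& text) {
   return countRuns(text,
      [](unsigned char c) { return std::isspace(c) == 0; });
}
```

For the text "don't stop-me now", the alphanumeric definition finds five words (don, t, stop, me, now), while the whitespace definition finds three. Because this sketch counts a word where a run starts rather than where it ends, it needs no special case at the end of the input.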

Table 4-3. Character test functions from <cctype> and <cwctype>

Function	Description
`isalpha`, `iswalpha`	Alpha characters: a-z, A-Z (upper- or lowercase).
`isupper`, `iswupper`	Alpha characters in uppercase only: A-Z.
`islower`, `iswlower`	Alpha characters in lowercase only: a-z.
`isdigit`, `iswdigit`	Numeric characters: 0-9.
`isxdigit`, `iswxdigit`	Hexadecimal numeric characters: 0-9, a-f, A-F.
`isspace`, `iswspace`	Whitespace characters: ' ', \n, \t, \v, \r, \f.
`iscntrl`, `iswcntrl`	Control characters: ASCII 0-31 and 127.
`ispunct`, `iswpunct`	Punctuation characters that don’t belong to the previous groups.
`isalnum`, `iswalnum`	`isalpha` or `isdigit` is true.
`isprint`, `iswprint`	Printable ASCII characters, including the space character.
`isgraph`, `iswgraph`	`isalpha` or `isdigit` or `ispunct` is true.
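The wide-character functions are used the same way as their narrow counterparts. As a rough sketch, assuming the text is already in a std::wstring, the alphanumeric word count could be written with iswalnum from <cwctype>. The function name countWideWords is hypothetical, and which non-ASCII characters count as alphanumeric depends on the current locale.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: counts runs of alphanumeric wide characters.
int countWideWords(const std::wstring& text) {
   int words = 0;
   bool inWord = false;

   for (wchar_t ch : text) {
      // iswalnum takes a wint_t; the result depends on the locale
      bool alnum = std::iswalnum(static_cast<std::wint_t>(ch)) != 0;
      if (alnum && !inWord)
         ++words;           // A new word starts here
      inWord = alnum;
   }
   return words;
}
```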

After all characters have been read in and the end of the stream has been reached, there is a bit of adjustment to do. First, the loop only counts line breaks, and not, strictly speaking, lines. Therefore, it will always be one less than the actual number of lines. To make this problem go away I just increment the line count by one if there are more than zero characters in the file. Second, if the stream ends with an alphanumeric character, the test for the end of the last word will never occur because I can’t test the next character. To account for this, I check if the last character in the stream is alphanumeric (also only when there are more than zero characters in the file) and increment the word count by one.
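The effect of this adjustment is easiest to see on a small in-memory input. The following sketch repeats the word and line logic of Example 4-26 over a std::istringstream, with a flag that switches the end-of-stream correction on or off. The Counts struct, the function name countWordsAndLines, and the adjust parameter are hypothetical, and the sketch treats only \n as a line break.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical result type for this sketch.
struct Counts {
   int words;
   int lines;
};

// Counts words and lines in text; adjust enables the end-of-stream
// correction described above.
Counts countWordsAndLines(const std::string& text, bool adjust) {
   std::istringstream in(text);
   char cur = '\0';
   char last = '\0';
   int chars = 0;
   Counts c = {0, 0};

   while (in.get(cur)) {
      if (cur == '\n')
         c.lines++;         // Counts line breaks, not lines
      else
         chars++;
      if (!std::isalnum(static_cast<unsigned char>(cur)) &&
          std::isalnum(static_cast<unsigned char>(last)))
         c.words++;         // A word has just ended
      last = cur;
   }
   if (adjust && chars > 0) {
      if (std::isalnum(static_cast<unsigned char>(last)))
         c.words++;         // Final word had no character after it
      c.lines++;            // Line breaks + 1 = lines
   }
   return c;
}
```

For the input "one two\nthree", the loop alone yields two words and one line break; with the adjustment the result is three words and two lines.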

The technique in Example 4-26 of using streams is nearly identical to that described in Recipe 4.14 and Recipe 4.15, but simpler since it’s just inspecting the file and not making any changes.

C++ Cookbook by D. Ryan Stephens, Christopher Diggins, Jonathan Turkanis, Jeff Cogswell
