Use operator>>
, defined in <string>
, to read contiguous chunks of text from the
input file, and use a map
, defined in <map>
, to store each word and its frequency in the file.
Example 4-27 demonstrates how to do
this.
Example 4-27. Counting word frequencies
1 #include <iostream> 2 #include <fstream> 3 #include <map> 4 #include <string> 5 6 typedef std::map<std::string, int> StrIntMap; 7 8 void countWords(std::istream& in, StrIntMap& words) { 9 10 std::string s; 11 12 while (in >> s) { 13 ++words[s]; 14 } 15 } 16 17 int main(int argc, char** argv) { 18 19 if (argc < 2) 20 return(EXIT_FAILURE); 21 22 std::ifstream in(argv[1]); 23 24 if (!in) 25 exit(EXIT_FAILURE); 26 27 StrIntMap w; 28 countWords(in, w); 29 30 for (StrIntMap::iterator p = w.begin(); 31 p != w.end(); ++p) { 32 std::cout << p->first << " occurred " 33 << p->second << " times.\n"; 34 } 35 }
Example 4-27 looks simple enough, but
there is more going on than it appears. Most of the subtleties have to do with map
s, so let’s talk about them first.
If you’re not familiar with map
s, you should be. A
map
is a container class template that is part of the
STL. It stores key-value pairs in order according to std::less
, or your custom comparison function, should you supply one. The
kinds of keys and values you can store in it depend only on your imagination; in this
example, we are just going to store string
s and
int
s.
I used a typedef
on line 6 to make the code
cleaner:
typedef map<string, int> StrIntMap;
Thus, a StrIntMap
is a map
that stores string
/int
pairs. Each string
is a
unique word—which is why I’m using it as the key—that has been read in and its associated
int
is the number of times it occurs. All that’s left
is to read in each of the words one-at-a-time, add it to the map
if it’s not already there, and increment its associated count value if it
is.
This is what countWords
does. The essential logic
is brief:
while (in >> s) { ++words[s]; }
operator>>
reads in contiguous chunks of
non-whitespace from its lefthand side operand (an istream
) and places them in the righthand side operand (a string
). Once I’ve read a “word,” all I have to do is update
the statistics in the map, and that is done with the following line:
++words[s];
map
defines operator[]
for retrieving a value given a key (it actually returns a
reference to the value itself), so to increment it, just increment the value indexed at
the particular key. But something about this might seem a little weird. What if the key
isn’t already in the map? Don’t we try to increment a nonexistent index, and crash like we
would with an array? No, map
does operator[]
differently than other STL containers or ordinary,
C-style arrays.
In a map
, operator[]
does two things: if the key does not already exist, it creates a
value by using that value’s type’s default constructor and adds that key/value pair to the
map, if the key already exists in the map, no modification is made. In both cases, a
reference to the value for the specified key is returned, even if that value was just
created with its default constructor. This is a handy feature (if you know it’s there),
because it eliminates the need for client code to check for a key’s existence before
inserting it.
Now, look at lines 32 and 33. The iterator refers to members called first
and second--
what are
those? map
s play a trick on you by using another class
template to store your name value pairs: the pair
class
template defined in <utility>
(included by
<map>
already). If you are iterating through
the items stored in a map, you will be pointing to pair
objects. Working with pair
s is simple, the first item
in a pair is stored in the first
member, and the second
is stored in, well, second
.
I used operator>>
in Example 4-27 to read in contiguous chunks of
text from the input stream, which is different than some of the other examples. I did this
to demonstrate that it can be done, but you will almost certainly need to customize the
behavior based on your definition of a “word” in a text file. For example, consider an
excerpt of the output produced by Example
4-27:
with occurred 5 times. work occurred 3 times. workers occurred 3 times. workers. occurred 1 times. years occurred 2 times. years. occurred 1 times.
Notice that the periods on the end of words are included as part of each word. Most
likely, you will want to change the definition of words to mean only alpha or alphanumeric
characters, as I did in Recipe
Recipe 4.17 by using some of the
character-testing functions in <cctype>
and
<cwctype>
.
Recipe 4.17 and Table 4-3
Get C++ Cookbook now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.