Many social theorists have wondered to what extent gender is reflected in language. Our data set lets us explore this at the word level: we can find which description tags are most characteristic of male or female faces. We could just count the words that occur most often for men and the words that occur most often for women, but this mostly surfaces words that are frequent everywhere. A better approach is to score tags by their ratio of occurrences between genders. That is, to determine how characteristic a tag T is for gender G, look at the fraction of T's occurrences that fall on faces of gender G:

score(T, G) = count(T, G) / count(T)
This has a flaw: rare tags introduce noise. For example, any tag that appears just once automatically gets a perfect score of 1 for whichever gender it appeared with. (This is the same small-sample-size error that we saw with sparse age buckets.) A simple way around this is to use a frequency threshold. In this case, we'll only look at tags that occur more than 100 times.
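The thresholded scoring can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: it reads the score as the fraction of a tag's occurrences that fall on a given gender, and it assumes the data arrive as (tag, gender) pairs, one per tag occurrence. The function names `gender_scores` and `top_tags` and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def gender_scores(observations, min_count=100):
    """Return {tag: {gender: score}} for tags seen more than min_count times.

    observations: iterable of (tag, gender) pairs, one per tag occurrence.
    This input layout is an assumption, not the data set's actual format.
    """
    tag_counts = Counter()   # occurrences of each tag, any gender
    pair_counts = Counter()  # occurrences of each (tag, gender) pair
    for tag, gender in observations:
        tag_counts[tag] += 1
        pair_counts[tag, gender] += 1

    scores = {}
    for (tag, gender), n in pair_counts.items():
        total = tag_counts[tag]
        if total > min_count:  # strict: "more than" the threshold
            scores.setdefault(tag, {})[gender] = n / total
    return scores


def top_tags(scores, gender, k=10):
    """Return the k tags with the highest score for the given gender."""
    ranked = sorted(
        scores.items(),
        key=lambda item: item[1].get(gender, 0.0),
        reverse=True,
    )
    return [(tag, by_gender.get(gender, 0.0)) for tag, by_gender in ranked[:k]]
```

Calling `top_tags(gender_scores(observations), "male")` would then produce a ranking for one gender. A tag that never appears with a gender is treated as having score 0 for it, and a tag that appears only once is dropped by the threshold instead of receiving a spurious score of 1.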
Calculating these scores—in statistical terminology, they're maximum likelihood estimates of the conditional probabilities Pr(G|T)—we get the following tables.
Words most characteristic of men are shown in the following table.