Which Words Are Gendered?

Many social theorists have asked to what extent gender is reflected in language. Our data set lets us explore this at the word level: we can find which description tags are most characteristic of male or female faces. We could simply count the tags that occur most often for men and those that occur most often for women, but that mostly surfaces tags that are frequent for everyone. A better approach is to score each tag by the fraction of its occurrences that fall with each gender. That is, to determine how characteristic a tag T is for gender G, look at:

    count(T, G) / count(T)

where count(T, G) is the number of times T was applied to a face of gender G and count(T) is the total number of times T occurs.

This score has a flaw: rare tags introduce noise. For example, any tag that appears just once automatically gets a perfect score of 1 for whichever gender it appeared with. (This is another instance of the small-sample-size error that we saw for sparse age buckets.) A simple way around this is to use a frequency threshold. In this case, we'll only look at tags that occur more than 100 times.
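The ratio score and the frequency threshold can be sketched in a few lines of Python. This is only an illustration: the function name `gender_ratio_scores`, the `(tag, gender)` input format, and the toy data are assumptions made for the example, not the code or data behind the tables.

```python
from collections import Counter

def gender_ratio_scores(observations, gender, min_count=100):
    """Score tags by the fraction of their occurrences that fall with `gender`.

    `observations` is an iterable of (tag, gender) pairs (a hypothetical
    input format). Only tags occurring more than `min_count` times are kept.
    Returns {tag: (count_with_gender, total_count, ratio)}.
    """
    total = Counter()        # count(T)
    with_gender = Counter()  # count(T, G)
    for tag, g in observations:
        total[tag] += 1
        if g == gender:
            with_gender[tag] += 1
    return {
        tag: (with_gender[tag], n, with_gender[tag] / n)
        for tag, n in total.items()
        if n > min_count  # the frequency threshold drops noisy rare tags
    }

# Made-up toy data: "bald" occurs 100 times, "rare" only once.
observations = [("bald", "M")] * 98 + [("bald", "F")] * 2 + [("rare", "F")]
scores = gender_ratio_scores(observations, "M", min_count=50)
print(scores)
```

Without the threshold, the single occurrence of "rare" would earn a score of 1 for women; with it, the tag is dropped before scoring.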

Calculating these scores—in statistical terminology, they're maximum likelihood estimates of the conditional probabilities Pr(G|T)—we get the following tables.
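Turning per-tag counts into a ranked table is then a matter of sorting by the ratio. The sketch below uses a few (G, T) count pairs copied from the table for men; the helper name `top_tags` and the alphabetical tie-break are assumptions made for the example.

```python
def top_tags(counts, k=10):
    """Return the k rows (tag, g, t, ratio) with the highest ratio g / t.

    `counts` maps tag -> (g, t), where g is the number of occurrences with
    the gender of interest and t is the total number of occurrences.
    Ties are broken alphabetically by tag (an assumption of this sketch).
    """
    rows = [(tag, g, t, g / t) for tag, (g, t) in counts.items()]
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows[:k]

# A few (G, T) pairs taken from the table of tags most characteristic of men.
counts = {
    "bald": (343, 350),
    "daddy": (122, 122),
    "father": (172, 173),
    "dad": (341, 343),
}

for tag, g, t, ratio in top_tags(counts, k=3):
    print(f"{tag}\t{g}\t{t}\t{ratio:.7f}")
```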

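Words most characteristic of men are shown in the following table. G is the number of times the tag was applied to male faces, T is the total number of times the tag occurs, and Ratio is G divided by T.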

Tag	G	T	Ratio
daddy	122	122	1.0000000
fatherly	115	115	1.0000000
fratboy	177	177	1.0000000
father	172	173	0.9942197
dad	341	343	0.9941691
douche	229	231	0.9913420
Handsome	110	111	0.9909910
scruffy	149	151	0.9867550
bald	343	350	0.9800000
jock	395	404	0.9777228
...
