Looking at Tags

In addition to all this ordinal and numeric data, we have a set of free-form tags that users are able enter about a person's picture. The tags range from descriptive ("freckles", "nosering") to crass ("takemetobed", "dirtypits") to friendly ("you.look.good.in.red") to advice ("cutyourhair", "avoidsun") to editorial ("awwdorable!!!!!", "EnoughUploadsNancy") to mean ("Thefatfriend") to nonsensical ("…", "plokmnjiuhbygvtfcrdxeszwaq"). In general, free-text data is more complicated to process.

The first thing to do is examine the distribution of the tags. What's the most common tag?

Load our tags               > face_tags = read.delim("face_tags.tsv",sep="\t",as.is=T)
then count                  > counts = table(face_tags$tag)
and rank them.              > sorted_counts = sort(counts, decreasing=T)
Show the most common tags.  > sorted_counts[1:20]

The following table contains the output.

cute 81333	pretty 40954	happy 36263	nice 33221	fun 30622	young 27900
sweet 20362	friendly 14895	cool 14709	weird 12731	hot 12662	gay 12409
Cute 12132	funny 11508	scary 11445	sexy 11287	old 10958	goofy 10511
emo 10292	shy 10207

What are the least common tags?

Show the least common tags.    > tail(sorted_count, 20)

Glancing at a few of the tags raises questions about normalization. Should "cute" and "Cute" be merged into the same tag? Should punctuation be dropped entirely? Should that funny-looking full-width question mark for Asian languages be considered ...

Get Beautiful Data now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Beautiful Data by Toby Segaran, Jeff Hammerbacher

Looking at Tags

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly