Chapter 18. Human Labeling

We’ve mentioned human labeling in parts of this book. In this chapter we will consider how humans can actually do labeling for different kinds of NLP tasks. Some of the principles—for example, guidelines—are applicable to general labeling. Most special consideration required for NLP labeling tasks is around the technical aspects and the hidden caveats when dealing with language tasks. For example, asking someone to label parts of speech requires that they understand what parts of speech are. Let’s first consider some basic issues.

It is probably worth some thought as to what your actual input is. For example, if you are labeling a document for a classification task, the input is obvious—the document. However, if you are marking named entities, humans do not need to see the whole document to find them, so you can break this up by paragraphs or even by sentences. On the other hand, coreference resolution, which we discussed in Chapter 9, may have long-distance coreferents, so you likely need to human the whole document.

Another thing to think about is whether your task requires domain expertise or just general knowledge. If you require expertise, gathering labels is likely to take more time and money. If you are unsure, you can run an experiment to find out. Have a group of nonexperts, as well as an expert (or a group of experts if possible), label a subset of the data. If the nonexperts and experts have a high enough level of agreement, then you can get ...

Get Natural Language Processing with Spark NLP now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.