Chapter 2. Defining Your Goal and Dataset

Creating a clear definition of your annotation goal is vital for any project aiming to incorporate machine learning. When you are designing your annotation tagsets, writing guidelines, working with annotators, and training algorithms, it can be easy to become sidetracked by details and lose sight of what you want to achieve. Having a clear goal to refer back to can help. In this chapter we will go over what you need to create a good definition of your goal and discuss how your goal can influence your dataset. In particular, we will look at:

  • What makes a good annotation goal

  • Where to find related research

  • How your dataset reflects your annotation goals

  • Preparing the data for annotators to use

  • How much data you will need for your task

By the end of this chapter, you should have clear answers to three questions: “What am I trying to do?”, “How am I trying to do it?”, and “Which resources best fit my needs?” As you progress through the MATTER cycle, the answers to these questions will probably change, because corpus creation is an iterative process, but having a stated goal will help keep you from getting off track.

Defining Your Goal

In terms of the MATTER cycle, at this point we’re right at the start of “M”—being able to clearly explain what you hope to accomplish with your corpus is the first step in creating your model. While you probably already have a good idea about what you want to do, in this section we’ll give you some pointers on ...
