Chapter 5. Applying and Adopting Annotation Standards

Now that you’ve created the spec for your annotation goal, you’re almost ready to start annotating your corpus. Before you begin, however, you need to consider what form your annotated data will take: you know what you want your annotators to do, but you still have to decide how you want them to do it. In this chapter we’ll examine the different formats annotation can take and discuss the pros and cons of each one by answering the following questions:

  • What does annotation look like?

  • Are different types of tasks represented differently? If so, how?

  • How can you ensure that your annotation can be used by other people and in conjunction with other tasks?

  • What considerations go into deciding on an annotation environment and data format, both for the annotators and for machine learning?

Before getting into the details of how to apply your spec to your corpus, you need to understand what annotation actually looks like when it has been applied to a document or text. So now let’s look at the spec examples from Chapter 4 and see how they can be applied to an actual corpus.

There are many different ways to represent information about a corpus. The examples we show you won’t be exhaustive, but they will give you an overview of some of the different formats that annotated data can take.
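To make the idea of "different formats" concrete, here is a minimal Python sketch of one annotation stored two ways: as stand-off records that point into the text by character offset, and as tags written inline into the text itself. The sentence, the PERSON and LOCATION labels, and the to_inline helper are all hypothetical illustrations and are not taken from the Chapter 4 specs; the sketch also assumes the annotated spans do not overlap.

```python
# Hypothetical sample text and labels, used only for illustration.
text = "Maria Garcia visited Boston."

# Stand-off form: the text is left untouched, and each annotation
# records the character span it covers (end offset is exclusive).
annotations = [
    {"start": 0, "end": 12, "label": "PERSON"},
    {"start": 21, "end": 27, "label": "LOCATION"},
]


def to_inline(text, annotations):
    """Render stand-off annotations as inline XML-style tags.

    Assumes spans do not overlap. Spans are applied from the end of
    the text backward so that earlier offsets stay valid as tags are
    inserted.
    """
    result = text
    for ann in sorted(annotations, key=lambda a: a["start"], reverse=True):
        span = result[ann["start"]:ann["end"]]
        tagged = "<{0}>{1}</{0}>".format(ann["label"], span)
        result = result[:ann["start"]] + tagged + result[ann["end"]:]
    return result


inline = to_inline(text, annotations)
print(inline)
```

Both forms carry the same information. The stand-off form keeps the source text unchanged, while the inline form is easier to read at a glance but alters the text it annotates.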


Keep your data accessible. Your annotation project will be much easier to manage if you choose a format for your data that’s ...
