Once you’ve selected an algorithm and picked out your features, you can start testing your algorithm against your gold standard corpus and evaluating the results—the “Training through Evaluation” (TE) portion of the MATTER cycle. Like other parts of MATTER, the training, testing, and evaluation phases form their own smaller cycle: after you train your algorithm on the features you select, you can begin the testing and evaluation processes.
In this chapter we’ll answer the following questions:
When is testing performed?
Why are there both a dev-test corpus and a separate final test corpus?
What’s being evaluated once the algorithm is run?
How do you obtain an evaluation score?
What do the evaluation scores mean?
What should evaluators be aware of during these phases of the MATTER cycle?
Which scores get reported at the end of these phases?
Keep in mind that the purpose of evaluating your algorithm is not just to get a good score on your own data! The purpose is to provide testing conditions that convincingly suggest that your algorithm will perform well on other people’s data, out in the real world. So it’s important to keep track of the testing conditions, any modifications you make to your algorithm, and places in your annotation scheme that you think could be changed to improve performance later. Your algorithm getting a good “score” on your test doesn’t really matter if no one else can take the same exam!
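To make the held-out-data idea concrete, here is a minimal sketch in Python. The function names, split fractions, and toy corpus are illustrative assumptions, not part of the MATTER text: the point is simply that the gold standard corpus is partitioned into training, dev-test, and final-test portions, and that evaluation scores are computed only on data the algorithm never trained on.

```python
import random

def split_corpus(corpus, train_frac=0.8, devtest_frac=0.1, seed=42):
    """Shuffle and partition a gold-standard corpus into
    train / dev-test / final-test portions (fractions are illustrative)."""
    items = list(corpus)
    random.Random(seed).shuffle(items)  # fixed seed so the split is reproducible
    n = len(items)
    n_train = int(n * train_frac)
    n_dev = int(n * devtest_frac)
    train = items[:n_train]
    devtest = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # touched only once, for the final report
    return train, devtest, test

def accuracy(predicted, gold):
    """Fraction of held-out labels the algorithm got right --
    one simple kind of evaluation score."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Toy corpus of (features, label) pairs -- a stand-in for real annotated data.
corpus = [({"token_id": i}, i % 2) for i in range(100)]
train, devtest, test = split_corpus(corpus)
print(len(train), len(devtest), len(test))  # -> 80 10 10
```

The dev-test portion is where you iterate: tune features, adjust the algorithm, re-score. The final test portion is scored once, at the end, so that the reported number reflects how the algorithm might fare on data it has never influenced—the “other people’s data” the paragraph above is concerned with.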