Data
The dataset most commonly used to test and benchmark NER is CoNLL-2003, which was released for the CoNLL-2003 shared task on language-independent NER. It contains a training file, a development file, and a test file, along with a large file of unannotated data. The development file is used to tune the parameters of the learning method. The model is then trained on the training data with the tuned parameters and evaluated on the test file.
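The tune-on-development, evaluate-on-test protocol described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the book: the helper names (`read_conll`, `train`, `accuracy`, `tune_then_test`), the `min_count` parameter, and the toy sentences are all hypothetical. The reader assumes the commonly distributed column layout, with one token per line, the NER tag in the last column, and blank lines between sentences. The "model" is a simple most-frequent-tag lookup, and token accuracy stands in for the entity-level F1 score that NER systems are normally evaluated with.

```python
from collections import Counter, defaultdict


def read_conll(text):
    """Parse CoNLL-style text into sentences of (token, ner_tag) pairs.

    Assumes one token per line, whitespace-separated columns with the
    NER tag last, blank lines between sentences, and optional
    -DOCSTART- marker lines, which are skipped.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))
    if current:
        sentences.append(current)
    return sentences


def train(sentences, min_count):
    """Toy model: map each word seen at least min_count times to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {
        word: tags.most_common(1)[0][0]
        for word, tags in counts.items()
        if sum(tags.values()) >= min_count
    }


def accuracy(model, sentences):
    """Token-level accuracy; words unknown to the model are tagged 'O'."""
    pairs = [(word, tag) for sentence in sentences for word, tag in sentence]
    correct = sum(model.get(word, "O") == tag for word, tag in pairs)
    return correct / len(pairs)


def tune_then_test(train_set, dev_set, test_set, candidates):
    """Pick min_count on the development set, then score once on the test set."""
    best = max(candidates, key=lambda k: accuracy(train(train_set, k), dev_set))
    final_model = train(train_set, best)
    return best, accuracy(final_model, test_set)


if __name__ == "__main__":
    # Made-up lines in the assumed format; not taken from the real dataset.
    toy = (
        "-DOCSTART- -X- -X- O\n\n"
        "Alice NNP B-NP B-PER\nvisited VBD B-VP O\n\n"
        "Paris NNP B-NP B-LOC\n"
    )
    data = read_conll(toy)
    # In practice the three arguments would come from three separate files.
    print(tune_then_test(data, data, data, candidates=[1, 2]))
```

The point of the design is that the test set is touched exactly once, after the parameter has been fixed using the development set, so the reported test score is not inflated by tuning.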
The held-out CoNLL data is split into separate development and test files so that systems are not tuned on the test data. The English data is taken from Reuters Corpus news stories published between August 1996 and August 1997. A sample sentence from the CoNLL dataset, with its accompanying ...