Hands-On Natural Language Processing with Python by Rajalingappaa Shanmugamani, Rajesh Arumugam

Data

The data most commonly used to test and benchmark NER is the CoNLL2003 dataset, which comes from a shared task on language-independent NER. The dataset contains training, development, and test files, along with a large file of unannotated data. The development file is used to tune the parameters of the learning method. The model is then trained on the training data with the tuned parameters and evaluated on the test dataset.
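
As a rough illustration of how such files can be read, the following is a minimal sketch rather than the book's own loader. It assumes the usual CoNLL2003 layout: one token per line with four space-separated columns (token, part-of-speech tag, chunk tag, NER tag), blank lines between sentences, and `-DOCSTART-` lines marking document boundaries. The function name `parse_conll` and the two-sentence `SAMPLE` string are hypothetical and made up for this example; they are not taken from the dataset.

```python
from typing import List, Tuple

# A sentence is a list of (token, NER tag) pairs.
Sentence = List[Tuple[str, str]]


def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (token, NER tag) pairs.

    Assumes the token is the first column and the NER tag is the last.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and -DOCSTART- markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        current.append((fields[0], fields[-1]))
    # Flush the last sentence if the text does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Made-up sample in the assumed format (not real CoNLL2003 data).
SAMPLE = """-DOCSTART- -X- -X- O

Alice NNP B-NP B-PER
visited VBD B-VP O
Paris NNP B-NP B-LOC
. . O O

She PRP B-NP O
left VBD B-VP O
. . O O
"""

sentences = parse_conll(SAMPLE)
for sentence in sentences:
    print(sentence)
```

The same function could be applied to the training, development, and test files in turn, keeping the three resulting lists separate so that the test sentences are only touched for the final evaluation.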

The development and test data are kept separate so that systems are not tuned on the test data. The data for the English language is taken from Reuters Corpus news stories published between August 1996 and August 1997. A sample sentence from the CoNLL dataset, with its accompanying ...
