O'Reilly logo

Machine Learning in Java by Boštjan Kaluža

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Topic modeling for BBC news

As discussed earlier, the goal of topic modeling is to identify patterns in a text corpus that correspond to document topics. In this example, we will use a dataset originating from BBC news. This dataset is one of the standard benchmarks in machine learning research, and is available for non-commercial and research purposes.

The goal is to build a classifier that is able to assign a topic to an uncategorized document.

BBC dataset

Greene and Cunningham (2006) collected the BBC dataset to study a particular document-clustering challenge using support vector machines. The dataset consists of 2,225 documents from the BBC News website from 2004 to 2005, corresponding to the stories collected from five topical areas: business, ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required