Clojure Data Analysis Cookbook - Second Edition by Eric Rochester

Chapter 10. Working with Unstructured and Textual Data

In this chapter, we will cover the following recipes:

  • Tokenizing text
  • Finding sentences
  • Focusing on content words with stoplists
  • Getting document frequencies
  • Scaling document frequencies by document size
  • Scaling document frequencies with TF-IDF
  • Finding people, places, and things with Named Entity Recognition
  • Mapping documents to a sparse vector space representation
  • Performing topic modeling with MALLET
  • Performing naïve Bayesian classification with MALLET

Introduction

We've been talking about all of the data that's out there in the world. However, structured or semistructured data—the kind you'd find in spreadsheets or in tables on web pages—is vastly overshadowed by the unstructured data that's being ...
