How to do it...

  1. Initialize a new Python file by importing the following packages:
import numpy as np 
from nltk.corpus import brown 
from chunking import splitter 
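The splitter function comes from the chunking module built in an earlier recipe; its exact code is not shown here, but a minimal sketch of what it is assumed to do (break a text into chunks of roughly num_words words each) looks like this:

```python
# Hypothetical sketch of chunking.splitter: split a text into
# chunks of num_words words each (the last chunk may be shorter).
def splitter(content, num_words):
    words = content.split(' ')
    chunks = []
    current_chunk = []
    count = 0
    for word in words:
        current_chunk.append(word)
        count += 1
        if count == num_words:
            # A chunk is full: join it back into a string and start over
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            count = 0
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
```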
  2. Define the main function and read the input data from the Brown corpus:
if __name__=='__main__': 
        content = ' '.join(brown.words()[:10000]) 
  3. Split the text content into chunks:
    num_of_words = 2000 
    num_chunks = [] 
    count = 0 
    texts_chunk = splitter(content, num_of_words) 
  4. Collect the chunks into a list of dictionaries, tagging each chunk with its index:
    for text in texts_chunk: 
      num_chunk = {'index': count, 'text': text} 
      num_chunks.append(num_chunk) 
      count += 1
  5. Extract a document-word matrix, which counts the number of occurrences of each word in each document:
  from sklearn.feature_extraction.text import ...
