Chapter 5. Processing Words

This chapter focuses on the basic word-processing techniques you can apply to get started with NLP, including tokenization, vocabulary reduction, bag-of-words, and N-grams. You can solve many tasks with these techniques plus some basic machine learning. Knowing how, when, and why to use these techniques will help you with simple and complicated NLP tasks. This is why the discussion of the linguistics technique covers implementation. We will focus on working with English for now, though we will mention some things that should be considered when working with other languages. We are focusing on English because it would be very difficult to cover these techniques in depth across different languages.

Let’s load the data from the mini_newsgroups again, and then we will explore tokenization.

import os

from pyspark.sql.types import *
from pyspark.ml import Pipeline

import sparknlp
from sparknlp import DocumentAssembler, Finisher

spark = sparknlp.start()
space_path = os.path.join('data', 'mini_newsgroups', 'sci.space')
texts = spark.sparkContext.wholeTextFiles(space_path)

schema = StructType([
    StructField('path', StringType()),
    StructField('text', StringType()),
])

texts = spark.createDataFrame(texts, schema=schema).persist()
 ## excerpt from mini newsgroups modified for examples example = ''' Nick's right about this. It's always easier to obtian forgiveness than permission. Not many poeple remember that Britan's Kng George III expressly forbade his american ...

Get Natural Language Processing with Spark NLP now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.