Chapter 5. Processing Words
This chapter focuses on the basic word-processing techniques you can apply to get started with NLP, including tokenization, vocabulary reduction, bag-of-words, and N-grams. You can solve many tasks with these techniques plus some basic machine learning. Knowing how, when, and why to use them will help you with both simple and complicated NLP tasks. This is why the discussion of each linguistic technique is paired with its implementation. We will focus on working with English for now, though we will mention some considerations that apply when working with other languages. We focus on English because covering these techniques in depth across multiple languages would be impractical.
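Before turning to Spark NLP, it may help to see these terms in miniature. The following is a framework-free sketch in plain Python (the whitespace tokenizer and toy sentence are our own illustration, not the chapter's pipeline): tokenization splits text into tokens, bag-of-words counts tokens regardless of order, and N-grams are contiguous token sequences of length N.

```python
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenization; real tokenizers also handle
    # punctuation, contractions, and other edge cases
    return text.lower().split()

def bag_of_words(tokens):
    # Bag-of-words: unordered token counts
    return Counter(tokens)

def ngrams(tokens, n):
    # Contiguous token sequences of length n
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("the cat sat on the mat")
bow = bag_of_words(tokens)   # e.g., bow['the'] == 2
bigrams = ngrams(tokens, 2)  # first bigram: ('the', 'cat')
```

Spark NLP provides production-grade versions of each of these stages; this sketch only fixes the vocabulary we will use.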
Let’s load the data from the mini_newsgroups again, and then we will explore tokenization.
import os

from pyspark.sql.types import *
from pyspark.ml import Pipeline

import sparknlp
from sparknlp import DocumentAssembler, Finisher

spark = sparknlp.start()
space_path = os.path.join('data', 'mini_newsgroups', 'sci.space')
texts = spark.sparkContext.wholeTextFiles(space_path)

schema = StructType([
    StructField('path', StringType()),
    StructField('text', StringType()),
])

texts = spark.createDataFrame(texts, schema=schema).persist()
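For reference, `wholeTextFiles` yields one `(path, contents)` pair per file, which is why the schema above has exactly two string fields. A pure-Python analogue of that behavior (using hypothetical sample files written to a temporary directory, since we cannot assume the dataset is on disk here) looks like this:

```python
import os
import tempfile

def whole_text_files(directory):
    # Read every file in a directory into a (path, contents) pair,
    # mirroring the per-file records Spark's wholeTextFiles produces
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, encoding='utf-8') as f:
            pairs.append((path, f.read()))
    return pairs

# Hypothetical sample files standing in for the newsgroup posts
tmp = tempfile.mkdtemp()
for i, body in enumerate(['first post', 'second post']):
    with open(os.path.join(tmp, f'{i}.txt'), 'w', encoding='utf-8') as f:
        f.write(body)

pairs = whole_text_files(tmp)  # [(path, 'first post'), (path, 'second post')]
```

Unlike this sketch, Spark reads the files lazily and in parallel across the cluster, but the shape of each record is the same.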
## excerpt from mini newsgroups modified for examples
example = '''
Nick's right about this. It's always easier to obtian forgiveness
than permission. Not many poeple remember that Britan's Kng
George III expressly forbade his american ...