
Hands-On Natural Language Processing with Python by Rajalingappaa Shanmugamani, Rajesh Arumugam


Preparation of text data

As is typical in NLP tasks, all strings are converted to lowercase. Since the model considers sequences of characters (not sequences of words), the training vocabulary is the set of unique characters used in the dataset. We also add a character, P, that represents padding: the model needs a fixed input length, NB_CHARS_MAX, so strings shorter than that are padded. Because the text has been lowercased, the uppercase P cannot be confused with a character from the data:

list_of_existing_chars = list(set(texts.str.cat(sep=' ')))
vocabulary = ''.join(list_of_existing_chars)
vocabulary += 'P'  # add padding character
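For a self-contained view of this step, the sketch below applies the same idea to a small hypothetical sample. It assumes a plain Python list stands in for the pandas Series texts, and it sorts the characters so that the vocabulary order is reproducible across runs; both choices are illustrative assumptions, not part of the original code.

```python
# Hypothetical sample data; a plain list stands in for the pandas Series `texts`.
sample_texts = ["Hello world", "Deep NLP"]

# Lowercase every string before collecting characters.
lowered = [text.lower() for text in sample_texts]

# Unique characters across the dataset, sorted so the order is reproducible
# (an assumption; a bare set has no guaranteed iteration order).
list_of_existing_chars = sorted(set(' '.join(lowered)))
vocabulary = ''.join(list_of_existing_chars)

# Uppercase 'P' cannot collide with the data because everything is lowercase.
vocabulary += 'P'  # add padding character

print(vocabulary)
```

Sorting is optional, but without it the character order, and therefore any integer ids derived from it, may differ between runs.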

Each character is then associated with an integer that will represent it:

# Create association between vocabulary and id
vocabulary_id = {}
i = 0
for char in list(vocabulary): ...
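The loop body is truncated in this excerpt, so the following is only one possible way to build such a mapping and use it, not necessarily the authors' code. The helper name encode_text, the example vocabulary, the value of NB_CHARS_MAX, and the choice to truncate strings longer than NB_CHARS_MAX are all assumptions made for illustration.

```python
# Illustrative values only; this excerpt does not state the real NB_CHARS_MAX.
NB_CHARS_MAX = 8
vocabulary = " dehlnoprw" + "P"  # small example vocabulary ending with padding 'P'

# One possible mapping: each character gets its position in the vocabulary.
vocabulary_id = {char: i for i, char in enumerate(vocabulary)}


def encode_text(text, max_len=NB_CHARS_MAX):
    """Hypothetical helper: lowercase, truncate or pad with 'P', map to ids."""
    text = text.lower()[:max_len]    # truncate long strings (assumption)
    text = text.ljust(max_len, 'P')  # pad short strings with 'P'
    return [vocabulary_id[char] for char in text]


encoded = encode_text("Hello")
print(encoded)
```

Every encoded string has exactly NB_CHARS_MAX integers, and the id of P marks the padded positions. A character that is missing from the vocabulary would raise a KeyError in this sketch.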
