Natural Language Processing with Python

○ Find out more about sequence objects using Python’s help facility. In the interpreter, type help(str), help(list), and help(tuple). This will give you a full list of the functions supported by each type. Some functions have special names flanked with underscores; as the help documentation shows, each such function corresponds to something more familiar. For example x.__getitem__(y) is just a long-winded way of saying x[y].

○ Identify three operations that can be performed on both tuples and lists. Identify three list operations that cannot be performed on tuples. Name a context where using a list instead of a tuple generates a Python error.

○ Find out how to create a tuple consisting of a single item. There are at least two ways to do this.

○ Create a list words = ['is', 'NLP', 'fun', '?']. Use a series of assignment statements (e.g., words[1] = words[2]) and a temporary variable tmp to transform this list into the list ['NLP', 'is', 'fun', '!']. Now do the same transformation using tuple assignment.

○ Read about the built-in comparison function cmp, by typing help(cmp). How does it differ in behavior from the comparison operators?

○ Does the method for creating a sliding window of n-grams behave correctly for the two limiting cases: n = 1 and n = len(sent)?

○ We pointed out that when empty strings and empty lists occur in the condition part of an if clause, they evaluate to False. In this case, they are said to be occurring in a Boolean context. Experiment with different kinds of non-Boolean expressions in Boolean contexts, and see whether they evaluate as True or False.

○ Use the inequality operators to compare strings, e.g., 'Monty' < 'Python'. What happens when you do 'Z' < 'a'? Try pairs of strings that have a common prefix, e.g., 'Monty' < 'Montague'. Read up on “lexicographical sort” in order to understand what is going on here. Try comparing structured objects, e.g., ('Monty', 1) < ('Monty', 2). Does this behave as expected?

○ Write code that removes whitespace at the beginning and end of a string, and normalizes whitespace between words to be a single-space character.

Do this task using split() and join().
Do this task using regular expression substitutions.

○ Write a program to sort words by length. Define a helper function cmp_len which uses the cmp comparison function on word lengths.

Create a list of words and store it in a variable sent1. Now assign sent2 = sent1. Modify one of the items in sent1 and verify that sent2 has changed.

Now try the same exercise, but instead assign sent2 = sent1[:]. Modify sent1 again and see what happens to sent2. Explain.
Now define text1 to be a list of lists of strings (e.g., to represent a text consisting of multiple sentences). Now assign text2 = text1[:], assign a new value to one of the words, e.g., text1[1][1] = 'Monty'. Check what this did to text2. Explain.
Load Python’s deepcopy() function (i.e., from copy import deepcopy), consult its documentation, and test that it makes a fresh copy of any object.

Initialize an n-by-m list of lists of empty strings using list multiplication, e.g., word_table = [[''] * n] * m. What happens when you set one of its values, e.g., word_table[1][2] = "hello"? Explain why this happens. Now write an expression using range() to construct a list of lists, and show that it does not have this problem.

Write code to initialize a two-dimensional array of sets called word_vowels and process a list of words, adding each word to word_vowels[l][v] where l is the length of the word and v is the number of vowels it contains.

Write a function novel10(text) that prints any word that appeared in the last 10% of a text that had not been encountered earlier.

Write a program that takes a sentence expressed as a single string, splits it, and counts up the words. Get it to print out each word and the word’s frequency, one per line, in alphabetical order.

Read up on Gematria, a method for assigning numbers to words, and for mapping between words having the same number to discover the hidden meaning of texts (http://en.wikipedia.org/wiki/Gematria, http://essenes.net/gemcal.htm).

Write a function gematria() that sums the numerical values of the letters of a word, according to the letter values in letter_vals:

>>> letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8,
... 'i':10, 'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100,
... 'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}

Process a corpus (e.g., nltk.corpus.state_union) and for each document, count how many of its words have the number 666.
Write a function decode() to process a text, randomly replacing words with their Gematria equivalents, in order to discover the “hidden meaning” of the text.

Write a function shorten(text, n) to process a text, omitting the n most frequently occurring words of the text. How readable is it?

Write code to print out an index for a lexicon, allowing someone to look up words according to their meanings (or their pronunciations; whatever properties are contained in the lexical entries).

Write a list comprehension that sorts a list of WordNet synsets for proximity to a given synset. For example, given the synsets minke_whale.n.01, orca.n.01, novel.n.01, and tortoise.n.01, sort them according to their path_distance() from right_whale.n.01.

Write a function that takes a list of words (containing duplicates) and returns a list of words (with no duplicates) sorted by decreasing frequency. E.g., if the input list contained 10 instances of the word table and 9 instances of the word chair, then table would appear before chair in the output list.

Write a function that takes a text and a vocabulary as its arguments and returns the set of words that appear in the text but not in the vocabulary. Both arguments can be represented as lists of strings. Can you do this in a single line, using set.difference()?

Import the itemgetter() function from the operator module in Python’s standard library (i.e., from operator import itemgetter). Create a list words containing several words. Now try calling: sorted(words, key=itemgetter(1)), and sorted(words, key=itemgetter(-1)). Explain what itemgetter() is doing.

Write a recursive function lookup(trie, key) that looks up a key in a trie, and returns the value it finds. Extend the function to return a word when it is uniquely determined by its prefix (e.g., vanguard is the only word that starts with vang-, so lookup(trie, 'vang') should return the same thing as lookup(trie, 'vanguard')).

Read up on “keyword linkage” (Chapter 5 of (Scott & Tribble, 2006)). Extract keywords from NLTK’s Shakespeare Corpus and using the NetworkX package, plot keyword linkage networks.

Read about string edit distance and the Levenshtein Algorithm. Try the implementation provided in nltk.edit_dist(). In what way is this using dynamic programming? Does it use the bottom-up or top-down approach? (See also http://norvig.com/spell-correct.html.)

The Catalan numbers arise in many applications of combinatorial mathematics, including the counting of parse trees (Grammar Development). The series can be defined as follows: C₀ = 1, and C_n+1 = Σ_0..n (C_iC_n-i).

Write a recursive function to compute nth Catalan number C_n.
Now write another function that does this computation using dynamic programming.
Use the timeit module to compare the performance of these functions as n increases.

● Reproduce some of the results of (Zhao & Zobel, 2007) concerning authorship identification.

● Study gender-specific lexical choice, and see if you can reproduce some of the results of http://www.clintoneast.com/articles/words.php.

● Write a recursive function that pretty prints a trie in alphabetically sorted order, for example:

chair: 'flesh'
---t: 'cat'
--ic: 'stylish'
---en: 'dog'

● With the help of the trie data structure, write a recursive function that processes text, locating the uniqueness point in each word, and discarding the remainder of each word. How much compression does this give? How readable is the resulting text?

● Obtain some raw text, in the form of a single, long string. Use Python’s textwrap module to break it up into multiple lines. Now write code to add extra spaces between words, in order to justify the output. Each line must have the same width, and spaces must be approximately evenly distributed across each line. No line can begin or end with a space.

● Develop a simple extractive summarization tool, that prints the sentences of a document which contain the highest total word frequency. Use FreqDist() to count word frequencies, and use sum to sum the frequencies of the words in each sentence. Rank the sentences according to their score. Finally, print the n highest-scoring sentences in document order. Carefully review the design of your program, especially your approach to this double sorting. Make sure the program is written as clearly as possible.

● Read the following article on semantic orientation of adjectives. Use the NetworkX package to visualize a network of adjectives with edges to indicate same versus different semantic orientation (see http://www.aclweb.org/anthology/P97-1023).

● Design an algorithm to find the “statistically improbable phrases” of a document collection (see http://www.amazon.com/gp/search-inside/sipshelp.html).

● Write a program to implement a brute-force algorithm for discovering word squares, a kind of n × n: crossword in which the entry in the nth row is the same as the entry in the nth column. For discussion, see http://itre.cis.upenn.edu/~myl/languagelog/archives/002679.html.

Natural Language Processing with Python by Steven Bird, Ewan Klein, Edward Loper

Exercises

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly