○ Find out more about sequence objects using Python’s help facility. In the interpreter, type
help(str)
,help(list)
, andhelp(tuple)
. This will give you a full list of the functions supported by each type. Some functions have special names flanked with underscores; as the help documentation shows, each such function corresponds to something more familiar. For examplex.__getitem__(y)
is just a long-winded way of sayingx[y]
.○ Identify three operations that can be performed on both tuples and lists. Identify three list operations that cannot be performed on tuples. Name a context where using a list instead of a tuple generates a Python error.
○ Find out how to create a tuple consisting of a single item. There are at least two ways to do this.
○ Create a list
words = ['is', 'NLP', 'fun', '?']
. Use a series of assignment statements (e.g.,words[1] = words[2]
) and a temporary variabletmp
to transform this list into the list['NLP', 'is', 'fun', '!']
. Now do the same transformation using tuple assignment.○ Read about the built-in comparison function
cmp
, by typinghelp(cmp)
. How does it differ in behavior from the comparison operators?○ Does the method for creating a sliding window of n-grams behave correctly for the two limiting cases: n = 1 and n =
len(sent)
?○ We pointed out that when empty strings and empty lists occur in the condition part of an
if
clause, they evaluate toFalse
. In this case, they are said to be occurring in a Boolean context. Experiment with different kinds of non-Boolean expressions in Boolean contexts, and see whether they evaluate asTrue
orFalse
.○ Use the inequality operators to compare strings, e.g.,
'Monty' < 'Python'
. What happens when you do'Z' < 'a'
? Try pairs of strings that have a common prefix, e.g.,'Monty' < 'Montague'
. Read up on “lexicographical sort” in order to understand what is going on here. Try comparing structured objects, e.g.,('Monty', 1) < ('Monty', 2)
. Does this behave as expected?○ Write code that removes whitespace at the beginning and end of a string, and normalizes whitespace between words to be a single-space character.
Do this task using
split()
andjoin()
.Do this task using regular expression substitutions.
○ Write a program to sort words by length. Define a helper function
cmp_len
which uses thecmp
comparison function on word lengths.Create a list of words and store it in a variable
sent1
. Now assignsent2 = sent1
. Modify one of the items insent1
and verify thatsent2
has changed.Now try the same exercise, but instead assign
sent2 = sent1[:]
. Modifysent1
again and see what happens tosent2
. Explain.Now define
text1
to be a list of lists of strings (e.g., to represent a text consisting of multiple sentences). Now assigntext2 = text1[:]
, assign a new value to one of the words, e.g.,text1[1][1] = 'Monty'
. Check what this did totext2
. Explain.Load Python’s
deepcopy()
function (i.e.,from copy import deepcopy
), consult its documentation, and test that it makes a fresh copy of any object.
Initialize an n-by-m list of lists of empty strings using list multiplication, e.g.,
word_table = [[''] * n] * m
. What happens when you set one of its values, e.g.,word_table[1][2] = "hello"
? Explain why this happens. Now write an expression usingrange()
to construct a list of lists, and show that it does not have this problem.Write code to initialize a two-dimensional array of sets called
word_vowels
and process a list of words, adding each word toword_vowels[l][v]
wherel
is the length of the word andv
is the number of vowels it contains.Write a function
novel10(text)
that prints any word that appeared in the last 10% of a text that had not been encountered earlier.Write a program that takes a sentence expressed as a single string, splits it, and counts up the words. Get it to print out each word and the word’s frequency, one per line, in alphabetical order.
Read up on Gematria, a method for assigning numbers to words, and for mapping between words having the same number to discover the hidden meaning of texts (http://en.wikipedia.org/wiki/Gematria, http://essenes.net/gemcal.htm).
Write a function
gematria()
that sums the numerical values of the letters of a word, according to the letter values inletter_vals
:>>> letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8, ... 'i':10, 'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100, ... 'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}
Process a corpus (e.g.,
nltk.corpus.state_union
) and for each document, count how many of its words have the number 666.Write a function
decode()
to process a text, randomly replacing words with their Gematria equivalents, in order to discover the “hidden meaning” of the text.
Write a function
shorten(text, n)
to process a text, omitting the n most frequently occurring words of the text. How readable is it?Write code to print out an index for a lexicon, allowing someone to look up words according to their meanings (or their pronunciations; whatever properties are contained in the lexical entries).
Write a list comprehension that sorts a list of WordNet synsets for proximity to a given synset. For example, given the synsets
minke_whale.n.01
,orca.n.01
,novel.n.01
, andtortoise.n.01
, sort them according to theirpath_distance()
fromright_whale.n.01
.Write a function that takes a list of words (containing duplicates) and returns a list of words (with no duplicates) sorted by decreasing frequency. E.g., if the input list contained 10 instances of the word
table
and 9 instances of the wordchair
, thentable
would appear beforechair
in the output list.Write a function that takes a text and a vocabulary as its arguments and returns the set of words that appear in the text but not in the vocabulary. Both arguments can be represented as lists of strings. Can you do this in a single line, using
set.difference()
?Import the
itemgetter()
function from theoperator
module in Python’s standard library (i.e.,from operator import itemgetter
). Create a listwords
containing several words. Now try calling:sorted(words, key=itemgetter(1))
, andsorted(words, key=itemgetter(-1))
. Explain whatitemgetter()
is doing.Write a recursive function
lookup(trie, key)
that looks up a key in a trie, and returns the value it finds. Extend the function to return a word when it is uniquely determined by its prefix (e.g.,vanguard
is the only word that starts withvang-
, solookup(trie, 'vang')
should return the same thing aslookup(trie, 'vanguard')
).Read up on “keyword linkage” (Chapter 5 of (Scott & Tribble, 2006)). Extract keywords from NLTK’s Shakespeare Corpus and using the NetworkX package, plot keyword linkage networks.
Read about string edit distance and the Levenshtein Algorithm. Try the implementation provided in
nltk.edit_dist()
. In what way is this using dynamic programming? Does it use the bottom-up or top-down approach? (See also http://norvig.com/spell-correct.html.)The Catalan numbers arise in many applications of combinatorial mathematics, including the counting of parse trees (Grammar Development). The series can be defined as follows: C0 = 1, and Cn+1 = Σ0..n (CiCn-i).
Write a recursive function to compute nth Catalan number Cn.
Now write another function that does this computation using dynamic programming.
Use the
timeit
module to compare the performance of these functions as n increases.
● Reproduce some of the results of (Zhao & Zobel, 2007) concerning authorship identification.
● Study gender-specific lexical choice, and see if you can reproduce some of the results of http://www.clintoneast.com/articles/words.php.
● Write a recursive function that pretty prints a trie in alphabetically sorted order, for example:
chair: 'flesh' ---t: 'cat' --ic: 'stylish' ---en: 'dog'
● With the help of the trie data structure, write a recursive function that processes text, locating the uniqueness point in each word, and discarding the remainder of each word. How much compression does this give? How readable is the resulting text?
● Obtain some raw text, in the form of a single, long string. Use Python’s
textwrap
module to break it up into multiple lines. Now write code to add extra spaces between words, in order to justify the output. Each line must have the same width, and spaces must be approximately evenly distributed across each line. No line can begin or end with a space.● Develop a simple extractive summarization tool, that prints the sentences of a document which contain the highest total word frequency. Use
FreqDist()
to count word frequencies, and usesum
to sum the frequencies of the words in each sentence. Rank the sentences according to their score. Finally, print the n highest-scoring sentences in document order. Carefully review the design of your program, especially your approach to this double sorting. Make sure the program is written as clearly as possible.● Read the following article on semantic orientation of adjectives. Use the NetworkX package to visualize a network of adjectives with edges to indicate same versus different semantic orientation (see http://www.aclweb.org/anthology/P97-1023).
● Design an algorithm to find the “statistically improbable phrases” of a document collection (see http://www.amazon.com/gp/search-inside/sipshelp.html).
● Write a program to implement a brute-force algorithm for discovering word squares, a kind of n × n: crossword in which the entry in the nth row is the same as the entry in the nth column. For discussion, see http://itre.cis.upenn.edu/~myl/languagelog/archives/002679.html.
Get Natural Language Processing with Python now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.