You’ve seen some important elements of the Python programming language. Let’s take a few moments to review them systematically.
What is a text? At one level, it is a sequence of symbols on a page such as this one. At another level, it is a sequence of chapters, made up of a sequence of sections, where each section is a sequence of paragraphs, and so on. However, for our purposes, we will think of a text as nothing more than a sequence of words and punctuation. Here’s how we represent text in Python, in this case the opening sentence of Moby Dick:
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>>
After the prompt we’ve given a name we made up, sent1, followed by the equals sign, and then some quoted words, separated with commas, and surrounded with brackets. This bracketed material is known as a list in Python: it is how we store a text. We can inspect it by typing the name. We can ask for its length. We can even apply our own lexical_diversity() function to it.
>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> len(sent1)
4
>>> lexical_diversity(sent1)
1.0
>>>
Some more lists have been defined for you, one for the opening sentence of each of our texts, sent2 … sent9. We inspect two of them here; you can see the rest for yourself using the Python interpreter (if you get an error saying that sent2 is not defined, you need to first type from nltk.book import *).
>>> sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
>>> sent3
['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']
>>>
Note
Your Turn: Make up a few sentences of your own, by typing a name, equals sign, and a list of words, like this: ex1 = ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']. Repeat some of the other Python operations we saw earlier in Computing with Language: Texts and Words, e.g., sorted(ex1), len(set(ex1)), ex1.count('the').
A pleasant surprise is that we can use Python’s addition operator on lists. Adding two lists creates a new list with everything from the first list, followed by everything from the second list:
>>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
>>>
Note
This special use of the addition operation is called concatenation; it combines the lists together into a single list. We can concatenate sentences to build up a text.
We don’t have to literally type the lists either; we can use short names that refer to pre-defined lists.
>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']
>>>
What if we want to add a single item to a list? This is known as appending. When we append() to a list, the list itself is updated as a result of the operation.
>>> sent1.append("Some")
>>> sent1
['Call', 'me', 'Ishmael', '.', 'Some']
>>>
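The difference between concatenation and appending can be made concrete with a small sketch. The lists below are made-up examples, not the ones defined in nltk.book; the point is only that + builds a new list while append() changes an existing one in place.

```python
# Hypothetical example lists (not taken from nltk.book).
greeting = ['Call', 'me']
name = ['Ishmael', '.']

# Concatenation builds a new list; both operands are left unchanged.
combined = greeting + name

# append() modifies the list in place and returns None.
result = greeting.append('Ishmael')

print(combined)   # ['Call', 'me', 'Ishmael', '.']
print(greeting)   # ['Call', 'me', 'Ishmael']
print(name)       # ['Ishmael', '.']
print(result)     # None
```

Because append() returns None, writing something like greeting = greeting.append('x') would discard the list, which is a common slip.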
As we have seen, a text in Python is a list of words, represented using a combination of brackets and quotes. Just as with an ordinary page of text, we can count up the total number of words in text1 with len(text1), and count the occurrences in a text of a particular word, say heaven, using text1.count('heaven').
With some patience, we can pick out the 1st, 173rd, or even 14,278th word in a printed text. Analogously, we can identify the elements of a Python list by their order of occurrence in the list. The number that represents this position is the item’s index. We instruct Python to show us the item that occurs at an index such as 173 in a text by writing the name of the text followed by the index inside square brackets:
>>> text4[173]
'awaken'
>>>
We can do the converse: given a word, find the index of its first occurrence:
>>> text4.index('awaken')
173
>>>
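As a small illustration of how index() behaves on an ordinary list, here is a sketch using a made-up word list (an assumption for illustration, not one of the book's texts). It shows that only the first occurrence is reported, and that asking for a word that is absent raises a ValueError.

```python
# Hypothetical word list used only for illustration.
words = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# index() reports the position of the first occurrence only.
first_the = words.index('the')
print(first_the)          # 0
print(words[first_the])   # 'the'

# Looking up a word that is not in the list raises ValueError.
try:
    words.index('dog')
    found_dog = True
except ValueError:
    found_dog = False
print(found_dog)          # False

# Testing membership first avoids the exception.
position = words.index('mat') if 'mat' in words else None
print(position)           # 5
```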
Indexes are a common way to access the words of a text, or, more generally, the elements of any list. Python permits us to access sublists as well, extracting manageable pieces of language from large texts, a technique known as slicing.
>>> text5[16715:16735]
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good', 'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without', 'buying', 'it']
>>> text6[1600:1625]
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We', 'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive', 'officer', 'for', 'the', 'week']
>>>
Indexes have some subtleties, and we’ll explore these with the help of an artificial sentence:
>>> sent = ['word1', 'word2', 'word3', 'word4', 'word5',
...         'word6', 'word7', 'word8', 'word9', 'word10']
>>> sent[0]
'word1'
>>> sent[9]
'word10'
>>>
Notice that our indexes start from zero: sent element zero, written sent[0], is the first word, 'word1', whereas sent element 9 is 'word10'. The reason is simple: the moment Python accesses the content of a list from the computer’s memory, it is already at the first element; we have to tell it how many elements forward to go. Thus, zero steps forward leaves it at the first element.
Note
This practice of counting from zero is initially confusing, but typical of modern programming languages. You’ll quickly get the hang of it if you’ve mastered the system of counting centuries where 19XY is a year in the 20th century, or if you live in a country where the floors of a building are numbered from 1, and so walking up n-1 flights of stairs takes you to level n.
Now, if we accidentally use an index that is too large, we get an error:
>>> sent[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>
This time it is not a syntax error, because the program fragment is syntactically correct. Instead, it is a runtime error, and it produces a Traceback message that shows the context of the error, followed by the name of the error, IndexError, and a brief explanation.
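One way to see that this is a runtime error rather than a syntax error is to catch it while the program runs. The sketch below uses a short made-up list and a hypothetical position variable; it is only an illustration of checking the valid index range (0 up to len(sent) - 1) or handling the IndexError.

```python
# Hypothetical short list; valid indexes are 0, 1, and 2.
sent = ['word1', 'word2', 'word3']
position = 10

# Option 1: check the index against the length before using it.
word = sent[position] if position < len(sent) else None
print(word)      # None

# Option 2: attempt the access and handle the runtime error.
try:
    sent[position]
    message = ''
except IndexError as error:
    message = str(error)
print(message)   # list index out of range
```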
Let’s take a closer look at slicing, using our artificial sentence again. Here we verify that the slice 5:8 includes sent elements at indexes 5, 6, and 7:
>>> sent[5:8]
['word6', 'word7', 'word8']
>>> sent[5]
'word6'
>>> sent[6]
'word7'
>>> sent[7]
'word8'
>>>
By convention, m:n means elements m…n-1. As the next example shows, we can omit the first number if the slice begins at the start of the list, and we can omit the second number if the slice goes to the end:
>>> sent[:3]
['word1', 'word2', 'word3']
>>> text2[141525:]
['among', 'the', 'merits', 'and', 'the', 'happiness', 'of', 'Elinor', 'and', 'Marianne', ',', 'let', 'it', 'not', 'be', 'ranked', 'as', 'the', 'least', 'considerable', ',', 'that', 'though', 'sisters', ',', 'and', 'living', 'almost', 'within', 'sight', 'of', 'each', 'other', ',', 'they', 'could', 'live', 'without', 'disagreement', 'between', 'themselves', ',', 'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.', 'THE', 'END']
>>>
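A few consequences of the m:n convention can be checked directly. The sketch below rebuilds the ten-word artificial sentence with a list comprehension (a construct not introduced in this section, used here only for brevity) and confirms that a slice m:n has n - m elements and that omitted bounds default to the start and end of the list.

```python
# Rebuild the artificial sentence: 'word1' ... 'word10'.
sent = ['word' + str(i) for i in range(1, 11)]

m, n = 5, 8
piece = sent[m:n]
print(piece)                           # ['word6', 'word7', 'word8']

# A slice m:n contains exactly n - m elements.
print(len(piece) == n - m)             # True

# Omitted bounds default to the start and the end of the list.
print(sent[:3] == sent[0:3])           # True
print(sent[7:] == sent[7:len(sent)])   # True

# Splitting at any point and concatenating restores the original list.
print(sent[:3] + sent[3:] == sent)     # True
```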
We can modify an element of a list by assigning to one of its index values. In the next example, we put sent[0] on the left of the equals sign. We can also replace an entire slice with new material. A consequence of this last change is that the list only has four elements, and accessing a later value generates an error.
>>> sent[0] = 'First'
>>> sent[9] = 'Last'
>>> len(sent)
10
>>> sent[1:9] = ['Second', 'Third']
>>> sent
['First', 'Second', 'Third', 'Last']
>>> sent[9]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>
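The change in length after a slice assignment follows from simple arithmetic: the slice removes stop - start elements and the replacement adds its own length. The sketch below works this out on a fresh copy of the artificial sentence; the names start, stop, and replacement are illustrative choices, not part of the original example.

```python
# Fresh copy of the artificial sentence: 'word1' ... 'word10'.
sent = ['word' + str(i) for i in range(1, 11)]
old_length = len(sent)

start, stop = 1, 9
replacement = ['Second', 'Third']

# Replace eight elements (indexes 1 to 8) with two new ones.
sent[start:stop] = replacement

# New length = old length - elements removed + elements inserted.
expected_length = old_length - (stop - start) + len(replacement)

print(sent)                        # ['word1', 'Second', 'Third', 'word10']
print(len(sent), expected_length)  # 4 4
```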
Note
Your Turn: Take a few minutes to define a sentence of your own and modify individual words and groups of words (slices) using the same methods used earlier. Check your understanding by trying the exercises on lists at the end of this chapter.
From the start of Computing with Language: Texts and Words, you have had access to texts called text1, text2, and so on. It saved a lot of typing to be able to refer to a 250,000-word book with a short name like this! In general, we can make up names for anything we care to calculate. We did this ourselves in the previous sections, e.g., defining a variable sent1, as follows:
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>>
Such lines have the form: variable = expression. Python will evaluate the expression, and save its result to the variable. This process is called assignment. It does not generate any output; you have to type the variable on a line of its own to inspect its contents. The equals sign is slightly misleading, since information is moving from the right side to the left. It might help to think of it as a left-arrow. The name of the variable can be anything you like, e.g., my_sent, sentence, xyzzy. It must start with a letter or an underscore, and can include numbers and underscores. Here are some examples of variables and assignments:
>>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
...            'forth', 'from', 'Camelot', '.']
>>> noun_phrase = my_sent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']
>>>
Remember that capitalized words appear before lowercase words in sorted lists.
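If the capitals-first ordering is not what you want, one option (not covered in this section) is to pass a key function to sorted(). The following sketch assumes you want to ignore case when ordering and uses the built-in str.lower for that purpose.

```python
noun_phrase = ['bold', 'Sir', 'Robin']

# Default ordering: capitalized words come before lowercase ones.
default_order = sorted(noun_phrase)
print(default_order)    # ['Robin', 'Sir', 'bold']

# Ignoring case: each word is compared by its lowercase form.
caseless_order = sorted(noun_phrase, key=str.lower)
print(caseless_order)   # ['bold', 'Robin', 'Sir']

# sorted() returns a new list; the original is unchanged.
print(noun_phrase)      # ['bold', 'Sir', 'Robin']
```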
Note
Notice in the previous example that we split the definition of my_sent over two lines. Python expressions can be split across multiple lines, so long as this happens within any kind of brackets. Python uses the ... prompt to indicate that more input is expected. It doesn’t matter how much indentation is used in these continuation lines, but some indentation usually makes them easier to read.
It is good to choose meaningful variable names to remind you (and anyone else who reads your Python code) what your code is meant to do. Python does not try to make sense of the names; it blindly follows your instructions, and does not object if you do something confusing, such as one = 'two' or two = 3. The only restriction is that a variable name cannot be any of Python’s reserved words, such as def, if, not, and import. If you use a reserved word, Python will produce a syntax error:
>>> not = 'Camelot'
  File "<stdin>", line 1
    not = 'Camelot'
        ^
SyntaxError: invalid syntax
>>>
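If you are unsure whether a name is reserved, Python's standard keyword module can tell you. The sketch below checks a handful of candidate names; the candidates list is made up for illustration.

```python
import keyword

# Hypothetical candidate names to check.
candidates = ['def', 'if', 'not', 'import', 'my_sent', 'xyzzy']

# keyword.iskeyword() is True for Python's reserved words.
reserved = [name for name in candidates if keyword.iskeyword(name)]
allowed = [name for name in candidates if not keyword.iskeyword(name)]

print(reserved)   # ['def', 'if', 'not', 'import']
print(allowed)    # ['my_sent', 'xyzzy']
```

The full list of reserved words for your Python version is available as keyword.kwlist.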
We will often use variables to hold intermediate steps of a computation, especially when this makes the code easier to follow. Thus len(set(text1)) could also be written:
>>> vocab = set(text1)
>>> vocab_size = len(vocab)
>>> vocab_size
19317
>>>
Caution!
Take care with your choice of names (or identifiers) for Python variables. You should start the name with a letter, optionally followed by digits (0 to 9) or letters. Thus, abc23 is fine, but 23abc will cause a syntax error. Names are case-sensitive, which means that myVar and myvar are distinct variables. Variable names cannot contain whitespace, but you can separate words using an underscore, e.g., my_var. Be careful not to insert a hyphen instead of an underscore: my-var is wrong, since Python interprets the - as a minus sign.
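These naming rules can be checked without triggering a syntax error by using the string method isidentifier(). The sketch below tests the names mentioned above; note that isidentifier() only checks the shape of the name, so it also returns True for reserved words such as not.

```python
# Names discussed above, plus one containing whitespace.
names = ['abc23', '23abc', 'myVar', 'my_var', 'my-var', 'my var']

# isidentifier() is True when the string has the form of a valid name.
validity = {name: name.isidentifier() for name in names}

for name, ok in validity.items():
    print(name, ok)

# Caveat: reserved words have a valid shape but still cannot be used.
print('not'.isidentifier())   # True
```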
Some of the methods we used to access the elements of a list also work with individual words, or strings. For example, we can assign a string to a variable, index a string, and slice a string.
>>> name = 'Monty'
>>> name[0]
'M'
>>> name[:4]
'Mont'
>>>
We can also perform multiplication and addition with strings:
>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
>>>
We can join the words of a list to make a single string, or split a string into a list, as follows:
>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']
>>>
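To see how join() and split() relate to the word lists used in this section, here is a brief sketch using the opening sentence of Moby Dick. It illustrates that joining with a space and then splitting restores the list, but that splitting ordinary written text on whitespace leaves punctuation attached to words.

```python
sent1 = ['Call', 'me', 'Ishmael', '.']

# Joining with a space and splitting again restores the list.
joined = ' '.join(sent1)
print(joined)                    # Call me Ishmael .
print(joined.split() == sent1)   # True

# Ordinary written text has no space before the full stop, so
# split() keeps the punctuation attached to the last word.
raw = 'Call me Ishmael.'
print(raw.split())               # ['Call', 'me', 'Ishmael.']
```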
We will come back to the topic of strings in Chapter 3. For the time being, we have two important building blocks—lists and strings—and are ready to get back to some language analysis.