Chapter 4. Data in Files and Arrays: Sort it out
As your programs develop, so do your data handling needs.
And when you have lots of data to work with, using an individual variable for each piece of data gets really old, really quickly. So programmers employ some rather awesome containers (known as data structures) to help them work with lots of data. More often than not, all that data comes from a file stored on a hard disk. So, how can you work with data in your files? Turns out it’s a breeze. Flip the page and let’s learn how!
Surf’s up in Codeville
The annual Codeville Surf-A-Thon is more popular than ever this year.
Because there are so many contestants, the organizers asked you to write a Python program to process the scores. Eager to please, you agreed.
The trouble is, even though the contest is over and the beach is now clear, you can’t hit the waves until the program is written. Your program has to work out the highest surfing scores. Despite your urge to surf, a promise is a promise, so writing the program has to come first.
Find the highest score in the results file
After the judges rate the competitors, the scores are stored in a file called results.txt. There is one line in the file for each competitor’s score. You need to write a program that reads through each of these lines, picks out the score, and then works out the highest score in the Surf-A-Thon.
It sounds simple enough, except for one small detail. You’ve written programs to read data from the Web, and read data that’s been typed in at the keyboard, but you haven’t yet written any code that reads data stored in a file.
Iterate through the file with the open, for, close pattern
If you need to read from a file using Python, one way is to use the built-in open() function. Open a file called results.txt like this:
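Here’s a minimal sketch of that call. The name result_f, the made-up scores, and the temporary folder are all assumptions, added only so the example runs on its own. In the real program, results.txt would already be sitting next to your code.

```python
import os
import tempfile

# Assumption: made-up scores written to a temporary folder, so this runs anywhere.
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, "results.txt")
sample = open(file_name, "w")
sample.write("8.65\n9.12\n8.45\n")
sample.close()

# Open the file for reading. open() hands back something to refer to the file by.
result_f = open(file_name)

# Always close the file once you are finished with it.
result_f.close()
```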
The call to open() creates a file handle, which is a shorthand that you’ll use to refer to the file you are working with within your code.
Because you’ll need to read the file one line at a time, Python gives you the for loop for just this purpose. Like while loops, the for loop runs repeatedly, running the loop code once for each item in a collection of data, such as the lines in a file. Think of a for loop as your very own custom-made data shredder.
Each time the body of the for loop runs, a variable is set to a string containing the current line of text in the file. This is referred to as iterating through the data in the file:
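Putting the three steps together, the open, for, close pattern might look something like the sketch below. It assumes a results file that holds nothing but one score per line. The sample data, the temporary folder, and the variable names are made up for illustration.

```python
import os
import tempfile

# Assumption: a made-up results file with one score per line.
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, "results.txt")
sample = open(file_name, "w")
sample.write("8.65\n9.12\n8.45\n")
sample.close()

highest_score = 0
result_f = open(file_name)             # open
for line in result_f:                  # for: the body runs once per line
    if float(line) > highest_score:    # float() turns the text into a number
        highest_score = float(line)
result_f.close()                       # close

print("The highest score was:")
print(highest_score)
```

This only works while every line holds a number and nothing else. Hand float() a line that contains anything more and it raises a ValueError.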
The file contains more than numbers...
To see what happened, let’s take another look at the judges’ score sheet to see if you missed anything.
The judges also recorded the name of each surf contestant next to his or her score. This is a problem for the program only if the name was added to the results.txt file. Let’s take a look.
Sure enough, the results.txt file also contains the contestant names. And that’s a problem for our code because, as it iterates through the file, the string you read is no longer just a number.
Split each line as you read it
Each line in the for loop represents a single string containing two pieces of information: a name and a score.
You need to somehow extract the score from the string. In each line, there is a name, followed by a space, followed by the score. You already know how to extract one string from another; you did it for Starbuzz back in Chapter 2. And you could do something similar here using the find() method and index manipulation, searching for the position of a space (‘ ’) character in each line and then extracting the substring that follows it.
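As a rough sketch of that find() approach, here is how one line might be pulled apart. The line itself is a made-up example of the name, space, score layout, and the variable names are only illustrative.

```python
# Assumption: a made-up line in the "name, space, score" layout.
line = "Johnny 8.65\n"

space_at = line.find(" ")            # index of the first space in the line
score_text = line[space_at + 1:]     # everything after that space
score = float(score_text)            # float() ignores the trailing newline

print(score)
```

It works, but you have to do the index arithmetic yourself, which is exactly the kind of chore a ready-made method can take off your hands.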
Programmers often have to deal with data in strings that contain several pieces of data separated by spaces. It’s so common, in fact, that Python provides a special string method to perform the cutting you need: split().
Python strings have a built-in split() method.
The split() method cuts the string
Imagine you have a string containing several words assigned to a variable named rock_band. Think of a variable as if it’s a labeled jar.
The rock_band string, like all Python strings, has a split() method that returns a collection of substrings: one for each word in the original string.
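To see the cut in action, here’s a small sketch. The four names in the string are made up for illustration.

```python
# Assumption: four made-up names stand in for the band members.
rock_band = "Al Carl Mike Brian"

# With no argument, split() cuts the string wherever it finds whitespace.
words = rock_band.split()

print(words)
```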
Using a programming feature called multiple assignment, you can take the result from the cut performed by split() and assign it to a collection of variables, one for each substring.
Each of the return values from the split() on rock_band is assigned to its own separately named variable, which then allows you to work with each word in whatever way you want. Note that the rock_band variable still exists and that it still contains the original string of four names.
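A sketch of multiple assignment with split() might look like this. The names in the string and the four variable names are assumptions chosen for the example.

```python
rock_band = "Al Carl Mike Brian"   # assumption: made-up names

# One variable on the left for each word that split() produces on the right.
(rhythm, lead, vocals, bass) = rock_band.split()

print(lead)
print(rock_band)   # the original string has not changed
```

The number of variables has to match the number of pieces. If it doesn’t, Python stops with a ValueError.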
Looks like you can use multiple assignment and split() to extract the scores from the results.txt file.
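Here’s one way the high-score program might look once each line is split into a name and a score. Treat it as a sketch: the contestants, their scores, and the temporary folder are made up so the example can run by itself.

```python
import os
import tempfile

# Assumption: a made-up results file, one "name score" pair per line.
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, "results.txt")
sample = open(file_name, "w")
sample.write("Johnny 8.65\nJuan 9.12\nJoseph 8.45\n")
sample.close()

highest_score = 0
result_f = open(file_name)
for line in result_f:
    # "Juan 9.12\n" is cut into the two strings "Juan" and "9.12".
    (name, score) = line.split()
    if float(score) > highest_score:
        highest_score = float(score)
result_f.close()

print("The highest score was:")
print(highest_score)
```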
But you need more than one top score
As soon as the top score appears, people start to wonder what the second and third highest scores are.
It seems that the organizers didn’t tell you everything you needed to know. The contest doesn’t just award a prize for the winner, but also honors those surfers in second and third place.
Our program currently iterates through each of the lines in the results.txt file and works out the highest score. But what it actually needs to do is keep track of the top three scores, perhaps in three separate variables.
Keeping track of 3 scores makes the code more complex
So how will you keep track of the extra scores? You could do something like this:
You can see that there’s a lot more logic here, because the program needs to “think” a bit more. Unfortunately, turning this logic into code will make the program longer and harder to change in the future. And, let’s be honest, it’s somewhat more difficult to understand what’s actually going on with the logic as shown here.
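Turned into code, the three-variable logic might look like the sketch below. To keep the focus on the comparisons, a handful of made-up scores stands in for the values read from results.txt, and the variable names are only illustrative.

```python
# Assumption: made-up scores standing in for the values read from the file.
scores_from_file = (8.65, 9.12, 8.45, 7.81, 8.05)

first = 0
second = 0
third = 0
for score in scores_from_file:
    if score > first:
        # A new leader: the old first and second each move down a place.
        third = second
        second = first
        first = score
    elif score > second:
        # A new runner-up: the old second moves down to third.
        third = second
        second = score
    elif score > third:
        third = score

print(first, second, third)
```

Now imagine the organizers ask for the top five instead.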
How could you make this simpler?
An ordered list makes code much simpler
If you had some way of reading the data from the file and then producing an ordered copy of the data, the program would be a lot simpler to write. Ordering data within a program is known as “sorting.”
But how do you order, or sort, your data? What happens to the original data in the file? Does it remain unsorted or is it sorted, too? Can the data even be sorted on disk and, if so, does this make things easier, faster, or slower?
Sorting sounds tricky... is there a “best” way?
Sorting is easier in memory
If you are writing a program that is going to deal with a lot of data, you need to decide where you need to keep that data while the program works with it. Most of the time, you will have two choices:
Keep the data in files on the disk.
If you have a very large amount of data, the obvious place to put it is on disk. Computers can store far more data on disk than they can in memory. Disk storage is persistent: if you yank the power cord, the computer doesn’t forget the information written on the disk. But there is one real problem with manipulating data on disk: it can be very slow.
Keep the data in memory.
Data is much quicker to access and change if it’s stored in the computer’s memory. But, it’s not persistent: data in memory disappears when your program exits, or when the computer is switched off (unless you remember to save it to a file, in which case it becomes persistent).
Keep the data in memory
If you want to sort a lot of data, you will need to shuffle data around quite a lot. This is much faster in memory than on disk.
Of course, before you sort the data, you need to read it into memory, perhaps into a large number of individual variables.
You can’t use a separate variable for each line of data
Programming languages use variables to give you access to data in memory. So if you are going to store the data from the results.txt file in memory, it makes sense that you’ll need to use lots of variables to access all the data, right?
But how many variables do you need?
Imagine the file just had three scores in it. You could write a program that read each of the lines from the file and stored them in variables called first_score, second_score, and third_score:
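A sketch of that one-variable-per-line idea follows. It assumes a made-up file holding exactly three scores and nothing else, and it uses the file handle’s readline() method to fetch one line at a time.

```python
import os
import tempfile

# Assumption: a made-up file containing exactly three scores.
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, "results.txt")
sample = open(file_name, "w")
sample.write("8.65\n9.12\n8.45\n")
sample.close()

result_f = open(file_name)
first_score = float(result_f.readline())    # line 1
second_score = float(result_f.readline())   # line 2
third_score = float(result_f.readline())    # line 3
result_f.close()

print(first_score, second_score, third_score)
```

Every extra line in the file needs another variable and another line of code.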
But what if there were four scores in the file? Or five? Even worse, what if there were 10,000 scores? You’d soon run out of variable names and (possibly) memory in your computer, not to mention the wear and tear on your fingers.
Sometimes, you need to deal with a whole bundle of data, all at once. To do that, most languages give you the array.
An array lets you manage a whole train of data
So far, you’ve used variables to store only a single piece of data. But sometimes, you want to refer to a whole bunch of data all at once. For that, you need a new type of variable: the array.
An array is a “collection variable” or data structure. It’s designed to group a whole bunch of data items together in one place and give them a name.
Think of an array as a data train. Each car in the train is called an array element and can store a single piece of data. If you want to store a number in one element and a string in another, you can.
You might think that as you are storing all of that data in an array, you still might need variables for each of the items it stores. But this is not the case. An array is itself just another variable, and you can give it its own variable name.
Even though an array contains a whole bunch of data items, the array itself is a single variable, which just so happens to contain a collection of data. Once your data is in an array, you can treat the array just like any other variable.
So how do you use arrays?
Python gives you arrays with lists
Sometimes, different programming languages have different names for roughly the same thing. For example, in Python most programmers think array when they are actually using a Python list. For our purposes, think of Python lists and arrays as essentially the same thing.
Note
Python coders typically use the word “array” to more correctly refer to a collection that contains only data of one type, like a bunch of numbers. And Python’s standard library comes with a module called “array” that stores numbers of a single type. However, as lists are very similar and much more flexible, we prefer to use them, so you don’t need to worry about this distinction for now.
You create an array in Python like this:
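For instance, a list is written as a pair of square brackets with the items separated by commas. The variable names and contents below are made up for illustration.

```python
# Assumption: made-up contents. One list can mix strings and numbers.
my_array = ["Johnny", 8.65, "Juan", 9.12]

# An empty list is just a pair of square brackets.
scores = []

print(my_array)
print(len(my_array))   # how many items the list holds
```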
You can read individual pieces of data from inside the array using an index, just like you read individual characters from inside a string.
As with strings, the index for the first piece of data is 0. The second piece has index 1, and so on.
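Here’s a quick sketch of indexing, side by side with the string version you already know. The scores and the name are made up.

```python
scores = [8.65, 9.12, 8.45]   # assumption: made-up scores
surfer = "Johnny"             # assumption: made-up name

first_score = scores[0]       # index 0 is the first item in the list
second_score = scores[1]      # index 1 is the second item
first_letter = surfer[0]      # the same square brackets work on a string

scores.append(7.81)           # append() adds a new item to the end

print(first_score, second_score, first_letter)
print(scores[3])
```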
Sort the array before displaying the results
The array is storing the scores in the order they were read from the file. However, you still need to sort them so that the highest scores appear first.
You could sort the array by comparing each of the elements with each of the other elements, and then swapping any that are in the wrong order.
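Done by hand, that compare-and-swap idea might look like the sketch below. It is only one of several ways to write it, and the scores are made up.

```python
scores = [8.65, 9.12, 8.45, 7.81]   # assumption: made-up scores

# Compare each element with every element after it.
for i in range(len(scores)):
    for j in range(i + 1, len(scores)):
        if scores[j] > scores[i]:
            # Wrong order for highest-first, so swap the two elements.
            (scores[i], scores[j]) = (scores[j], scores[i])

print(scores)
```

That’s two nested loops and a swap, all of which you would have to get exactly right.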
Arrays in Python have a whole host of methods that make many tasks easier.
Let’s see which ones might help.
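The sketch below tries out a selection of the methods that every Python list has. The choice of methods and the made-up scores are ours, and this is not the complete set.

```python
scores = [8.65, 9.12, 8.45]      # assumption: made-up scores

scores.append(7.81)              # add an item to the end
scores.insert(0, 9.50)           # put an item in at a given index
how_many = scores.count(9.12)    # how many times a value appears
where = scores.index(8.45)       # index of the first matching value
last = scores.pop()              # remove the last item and hand it back
scores.remove(9.50)              # remove the first matching value
scores.extend([7.20, 8.05])      # add every item from another list
scores.reverse()                 # flip the current order of the items
scores.sort()                    # arrange the items in order

print(scores)
```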
Brain Barbell
Can you work out which two methods you need to employ to allow you to sort the data in the order that you need?
Brain Barbell Solution
You were to work out which two methods you needed to employ to allow you to sort the data in the order that you needed.
The sort() and reverse() methods look the most useful. You need to use reverse() after you sort() the data, because the default ordering used by sort() is lowest-to-highest, the opposite of what you need.
Sort the scores from highest to lowest
You now need to add the two method calls into your code that will sort the array. The lines need to go between the code that reads the data into the list and the code that displays the first three elements:
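With the two calls in place, the whole program might look something like this sketch. The contestants, their scores, and the temporary folder are made up so the example can run on its own.

```python
import os
import tempfile

# Assumption: a made-up results file, one "name score" pair per line.
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, "results.txt")
sample = open(file_name, "w")
sample.write("Johnny 8.65\nJuan 9.12\nJoseph 8.45\nStacey 7.81\n")
sample.close()

# Read every score from the file into a list.
scores = []
result_f = open(file_name)
for line in result_f:
    (name, score) = line.split()
    scores.append(float(score))
result_f.close()

# Sort lowest-to-highest, then flip the list so the highest comes first.
scores.sort()
scores.reverse()

# Display the first three elements.
print("The top scores were:")
print(scores[0])
print(scores[1])
print(scores[2])
```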
Geek Bits
It was very simple to sort an array of data using just two lines of code. But it turns out you can do even better than that if you use an option with the sort() method. Instead of using these two lines:
scores.sort()
scores.reverse()
you could have used just one, which gives the same result:
scores.sort(reverse=True)
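You can check that claim for yourself with a sketch like this one, which sorts two copies of the same made-up scores in both ways and compares the results.

```python
# Assumption: the same made-up scores in two separate lists.
two_step = [8.65, 9.12, 8.45, 7.81]
one_step = [8.65, 9.12, 8.45, 7.81]

two_step.sort()                # lowest-to-highest...
two_step.reverse()             # ...then flipped to highest-to-lowest

one_step.sort(reverse=True)    # highest-to-lowest in a single call

print(two_step == one_step)
```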
And the winner is...?
It’s time for the award ceremony.
The prizes are lined up and the scores are on the scoreboard. There’s just one problem.
Nobody knows which surfer got which score.
You somehow forgot the surfer names
In your rush to catch some waves before the light is gone, you forgot about the other piece of data stored in the results.txt file: the name of each surfer.
Without the names, you can’t possibly know which score goes with which name, so the scoreboard is only half-complete.
The trouble is, your array stores one data item in each element, not two. Looks like you still have your work cut out for you. There’ll be no catching waves until this issue is resolved.
How do you think you can remember the names and the scores for each surfer in the contest?
Once you’ve thought about this problem, turn over to Chapter 5 and see if you can resolve this issue.
Your Programming Toolbox
You’ve got Chapter 4 under your belt. Let’s look back at what you’ve learned in this chapter:
Programming Tools
* files - reading data stored on disk
* arrays - a collection variable that holds multiple data items that can be accessed by index
* sorting - arranging a collection in a specific order
Python Tools
* open() - open a file for processing
* close() - close a file
* for - iterate over something
* string.split() - cut a string into multiple parts
* [] - the array index operator
* array.append() - add an item to the end of an array
* array.sort() - sort an array, lowest-to-highest
* array.reverse() - change the order of an array by reversing it