Chapter 9. Getting Data
To write it, it took three months; to conceive it, three minutes; to collect the data in it, all my life.
F. Scott Fitzgerald
In order to be a data scientist you need data. In fact, as a data scientist you will spend an embarrassingly large fraction of your time acquiring, cleaning, and transforming data. In a pinch, you can always type the data in yourself (or if you have minions, make them do it), but usually this is not a good use of your time. In this chapter, we’ll look at different ways of getting data into Python and into the right formats.
stdin and stdout
If you run your Python scripts at the command line, you can pipe data through them using sys.stdin and sys.stdout. For example, here is a script that reads in lines of text and spits back out the ones that match a regular expression:
# egrep.py
import sys, re

# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command line
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if it matches the regex, write it to stdout
    if re.search(regex, line):
        sys.stdout.write(line)
And here’s one that counts the lines it receives and then writes out the count:
# line_count.py
import sys

count = 0
for line in sys.stdin:
    count += 1

# print goes to sys.stdout
print(count)
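If you want to check the filtering and counting logic without opening a shell, a minimal sketch along these lines may help. It restates the two scripts as functions over any iterable of lines and uses io.StringIO as a stand-in for sys.stdin. The function names matching_lines and count_lines, and the sample text, are hypothetical and chosen only for illustration.

```python
import io
import re
from typing import Iterable, Iterator


def matching_lines(regex: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that contain a match for regex (the egrep.py logic)."""
    pattern = re.compile(regex)
    for line in lines:
        if pattern.search(line):
            yield line


def count_lines(lines: Iterable[str]) -> int:
    """Count the lines received (the line_count.py logic)."""
    return sum(1 for _ in lines)


# io.StringIO stands in for sys.stdin; the sample text is made up for illustration.
fake_stdin = io.StringIO("no digits here\nline 2\n42\nstill none\n")
digit_line_count = count_lines(matching_lines(r"[0-9]", fake_stdin))
print(digit_line_count)  # prints 2: only "line 2" and "42" contain a digit
```

Because both functions are lazy over their input, they process one line at a time, just as the scripts do when reading from a pipe. The two scripts themselves remain separate programs meant to be chained together at the command line.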
You could then use these to count how many lines of a file contain numbers. In Windows, you’d use:
type SomeFile.txt ...