Exploring the Data

There are many great tools for data analysis. Some of the most commonly used are compared in Table 17-2.

Table 17-2. Comparison of data analysis packages

Name	Advantages	Disadvantages	Open source?	Typical users
R	Library support; visualization	Steep learning curve	Yes	Statistics
Matlab	Elegant matrix support; visualization	Expensive; incomplete statistics support	No	Engineering
SciPy/NumPy/Matplotlib	Python: flexible and general-purpose programming language	Components poorly integrated	Yes	Engineering
Excel	Easy; visual; flexible	Struggles with large data sets; weak numeric and programming support	No	Business
SAS	Handles very large data sets	Very baroque; hardest to learn	No	Business
SPSS, Stata	Easy statistical analysis	Inflexible	No	Science (bio and social)

We like to use R, which is an open source statistical and visualization programming environment with a vibrant and growing development community. It's emerged as a de facto standard among statisticians. For exploratory data analysis, we prefer it to the other options because of its graphing libraries, convenient indexing notation, and an amazing array of statistically sophisticated, community-maintained packages. You can read about it and download it at http://www.r-project.org; also look at the references at the end of this chapter.
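To give a flavor of the indexing notation mentioned above, here is a minimal sketch. The data frame `scores` and its columns `rater` and `score` are invented for illustration and are not part of the data set used in this chapter.

```r
# Hypothetical data: the column names and values are invented for illustration.
scores <- data.frame(
  rater = c("a", "b", "c", "d", "e"),
  score = c(3, 7, 5, 9, 1),
  stringsAsFactors = FALSE
)

# Select rows with a logical condition and columns by name.
high <- scores[scores$score > 4, c("rater", "score")]

# Negative indices drop elements; here, the first row.
without_first <- scores[-1, ]

# Vectors are indexed the same way: sort, then take the first two values.
top_two <- sort(scores$score, decreasing = TRUE)[1:2]
```

The same bracket syntax covers filtering, column selection, and slicing, which is much of what makes interactive exploration quick.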

R provides many excellent tools for looking at what's in the data. From its interactive interpreter:

> # Load the data...
> data = read.delim("http://data.doloreslabs.com/face_scores.tsv", sep="\t")
> # ...and plot.
> plot(data)
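Before plotting, it often helps to check what was actually loaded. The sketch below uses a small inline table in place of the real file, so it runs without a network connection. The column names `face_id` and `score` and all of the values are assumptions made up for this example, not the contents of face_scores.tsv.

```r
# Hypothetical tab-separated data standing in for the real file;
# the column names and values are invented.
raw <- "face_id\tscore\n1\t4.5\n2\t7.0\n3\t5.5\n4\t8.0\n"
data <- read.delim(text = raw)  # tab is already read.delim's default separator

str(data)            # column types and a preview of the values
head(data, n = 2)    # the first two rows
summary(data$score)  # minimum, quartiles, mean, and maximum
dims <- dim(data)    # number of rows and columns
```

Functions such as `str` and `summary` are part of base R, so nothing beyond a standard installation is needed to try this.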

Given a basic table of ...

Beautiful Data by Toby Segaran, Jeff Hammerbacher
