Exploring the Data
There are many great tools for data analysis. Some of the most commonly used are compared in Table 17-2.
Table 17-2. Comparison of data analysis packages
Name | Advantages | Disadvantages | Open source? | Typical users |
---|---|---|---|---|
R | Library support; visualization | Steep learning curve | Yes | Statistics |
Matlab | Elegant matrix support; visualization | Expensive; incomplete statistics support | No | Engineering |
SciPy/NumPy/Matplotlib | Python: flexible and general-purpose programming language | Components poorly integrated | Yes | Engineering |
Excel | Easy; visual; flexible | Struggles with large data sets; weak numeric and programming support | No | Business |
SAS | Handles very large data sets | Very baroque; hardest to learn | No | Business |
SPSS, Stata | Easy statistical analysis | Inflexible | No | Science (bio and social) |
We like to use R, which is an open source statistical and visualization programming environment with a vibrant and growing development community. It's emerged as a de facto standard among statisticians. For exploratory data analysis, we prefer it to the other options because of its graphing libraries, convenient indexing notation, and an amazing array of statistically sophisticated, community-maintained packages. You can read about it and download it at http://www.r-project.org; also look at the references at the end of this chapter.
R provides many excellent tools for looking at what's in the data. From its interactive interpreter, load the data:

    > data = read.delim("http://data.doloreslabs.com/face_scores.tsv", sep="\t")

and plot:

    > plot(data)
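Before plotting, it often helps to look at the table's structure and a few summary numbers. The following is only a minimal sketch of that kind of first look, using base R. Because the layout of the face-scores file is not shown here, the example builds a tiny stand-in table inline; the column names `face_id`, `rater`, and `score` and all of the values are hypothetical and are not taken from the real data set.

```r
# Hypothetical stand-in for the face-scores file: a tiny tab-separated
# table defined inline so the example runs without network access.
# Column names and values are illustrative assumptions, not the real schema.
raw <- "face_id\trater\tscore
1\tA\t3
1\tB\t4
2\tA\t7
2\tB\t6
3\tA\t5
3\tB\t5"

# read.delim() accepts inline text via the `text` argument
data <- read.delim(text = raw, sep = "\t")

str(data)        # column types and a preview of the values
head(data, 3)    # first three rows
summary(data)    # per-column summaries (min, quartiles, mean, max)

# Mean score per face, using base R's aggregate()
mean_by_face <- aggregate(score ~ face_id, data = data, FUN = mean)
print(mean_by_face)
```

With a real file, the same calls would be applied to the data frame returned by `read.delim()` on the file's URL or path, and `plot(data)` would then draw a scatterplot matrix of its columns.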
Given a basic table of ...