If you’re reading this, I’d love to hear any feedback you have. Please send it to firstname.lastname@example.org. Thanks a lot.
You’ll also find a working copy of the Nobel visualization the book literally and figuratively builds toward at http://kyrandale.com/static/pyjsdataviz/index.html.
A primary motivation for writing the book is the belief that, whatever data you have and whatever story you want to tell with it, the natural home for the visualizations you transform it into is the Web. As a delivery platform, it is orders of magnitude more powerful than what came before, and this book aims to smooth the passage from desktop- or server-based data analysis and processing to getting the fruits of that labor out on the Web.
But the most ambitious aim of this book is to persuade you that working with these two powerful languages toward the goal of delivering powerful web visualizations is actually fun and engaging.
Many years ago, as an academic researcher, I came across Python and fell in love. I had been writing some fairly complex simulations in C++, and Python’s simplicity and power were a breath of fresh air after all the boilerplate Makefiles, declarations, definitions, and the like. Programming became fun. Python was the perfect glue, playing nicely with my C++ libraries (Python wasn’t then and still isn’t a speed demon) and doing, with consummate ease, all the stuff that is such a pain in low-level languages (e.g., file I/O, database access, and serialization). I started to write all my graphical user interfaces (GUIs) and visualizations in Python, using wxPython, PyQt, and a whole load of other refreshingly easy toolsets. Unfortunately, although I think some of these tools are pretty cool and would love to share them with the world, the effort required to package them, distribute them, and make sure they still work with modern libraries represents a hurdle I’m unlikely ever to overcome.
Toolkits like Tableau, although very impressive, are often, in my experience, ultimately frustrating for programmers. There’s no way to replicate in a GUI the expressive power of a good, general-purpose programming language. Plus, what if you want to create a little web server to deliver your processed data? That means learning at least one new web-development-capable language.
Automated code conversion may well do the job, but the code produced is usually pretty impenetrable for a human being.
You are limited to the subset of plot types currently available in the libraries.
Why you should choose Python for your data-processing needs is a little more involved. For a start, there are good alternatives as far as data processing is concerned. Let’s deal with a few candidates for the job, starting with the enterprise behemoth Java.
Among the other main general-purpose programming languages, only Java offers anything like the rich ecosystem of libraries that Python does, with considerably more native speed too. But while Java is a lot easier to program in than languages like C++, it isn’t, in my opinion, a particularly nice language to program in, having rather too much in the way of tedious boilerplate code and excessive verbiage. This sort of thing starts to weigh heavily after a while and makes for a hard slog at the code face. As for speed, Python’s default interpreter is slow, but Python is a great glue language that plays nicely with other languages. This ability is demonstrated by the big Python data-processing libraries like NumPy (and Pandas, which builds on it) and SciPy, which use C and Fortran libraries to do the heavy lifting while providing the ease of use of a simple scripting language.
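To make the glue-language point concrete, here’s a minimal sketch contrasting a pure-Python loop with its vectorized NumPy equivalent; the speed difference is indicative, not a benchmark:

```python
import numpy as np

# one million random floats
values = np.random.rand(1_000_000)

# pure Python: the loop runs in the (slow) interpreter
total = sum(v * v for v in values)

# vectorized NumPy: the same sum of squares, but the loop runs in
# compiled code, typically orders of magnitude faster
total_np = np.dot(values, values)
```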
The venerable R has, until recently, been the tool of choice for many data scientists and is probably Python’s main competitor in the space. Like Python, R benefits from a very active community, some great tools like the plotting library ggplot, and a syntax specially crafted for data science and statistics. But this specialism is a double-edged sword. Because R was developed for a specific purpose, if you wish, for example, to write a web server to serve your R-processed data, you have to skip out to another language, with all the attendant learning overheads, or try to hack something together in a round-hole/square-peg sort of way. Python’s general-purpose nature and its rich ecosystem mean you can do pretty much everything required of a data-processing pipeline (JS visuals aside) without leaving its comfort zone. Personally, I find a little syntactic clunkiness a small price to pay for that.
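For instance, the web server that would force an R user into a second language is only a few lines of Python with a micro-framework such as Flask; here’s a minimal sketch (the route and records are illustrative, not from the book):

```python
from flask import Flask, jsonify

app = Flask(__name__)

# stand-in for your processed data
winners = [
    {"name": "Albert Einstein", "category": "Physics", "year": 1921},
]

@app.route("/data/winners")
def get_winners():
    # serve the data as JSON, ready for a JavaScript front end
    return jsonify(winners)

if __name__ == "__main__":
    app.run(port=8000)
```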
There are other alternatives to doing your data processing with Python, but none of them come close to the flexibility and power afforded by a general-purpose, easy-to-use programming language with a rich ecosystem of libraries. While, for example, mathematical programming environments such as Matlab and Mathematica have active communities and a plethora of great libraries, they hardly count as general purpose, because they are designed to be used within a closed garden. They are also proprietary, which means a significant initial investment and a different vibe to Python’s resoundingly open source environment.
GUI-driven dataviz tools like Tableau are great creations, but they will quickly frustrate someone used to the freedom of programming. They tend to work well as long as you’re singing from their songsheet, as it were; deviations from the designated path get painful very quickly.
As things stand, I think a very good case can be made for Python being the budding data scientist’s language of choice. But things are not standing still; in fact, Python’s capabilities in this area are growing at an astonishing rate. To put it in perspective, I have been programming in Python for over 15 years and have grown used to being surprised if I can’t find a Python module to help solve the problem at hand, yet I still find myself surprised by the growth of Python’s data-processing abilities, with powerful new libraries appearing almost weekly. To give an example, Python has traditionally been weak on statistical-analysis libraries, an area where R was far ahead, but recently a number of powerful modules, such as statsmodels, have started to close the gap fast.
So Python offers a thriving data-processing ecosystem with pretty much unmatched general-purpose power, and it’s getting better week by week. It’s easy to understand why so many in the community are so excited; it’s pretty exhilarating.
In that sense, this book aims to give you a solid backbone of practical knowledge, strong enough to take the weight of future development. I aim to make the learning curve as shallow as possible and get you over the initial climb with the practical skills needed to start refining your art.
This book emphasizes pragmatism and best practice. It’s going to cover a fair amount of ground, so there isn’t space for many theoretical diversions. I will aim to cover the aspects of the libraries in the toolchain that are most commonly used, and point you to resources for the rest. Most libraries have a hard core of functions, methods, classes, and the like that form their chief, functional subset. With these at your disposal, you can actually do stuff. Eventually, you’ll find an itch you can’t scratch with those, at which point good books, documentation, and online forums will be your friends.
Open source and free as in beer—you shouldn’t have to invest any extra money to learn with this book.
Longevity—generally well-established, community-driven, and popular.
Best of breed (assuming good support and an active community), at the sweet spot between popularity and utility.
The skills you learn here should be relevant for a long time. Generally, the obvious candidates have been chosen—libraries that write their own ticket, as it were. Where appropriate, I will highlight the alternative choices and give a rationale for my selection.
A few preliminary chapters are needed before beginning the transformative journey of our Nobel dataset through the toolchain. They cover the basic skills that make the rest of the toolchain chapters run more fluidly, including the following:
How to pass around data with Python, through various file formats and databases (a small sketch follows below)
The basic web development needed for the rest of the book
These chapters are part tutorial, part reference, and it’s fine to skip straight to the beginning of the toolchain, dipping back where needed.
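As a small taste of that data passing, here’s a minimal sketch moving records from CSV to JSON using nothing but the standard library (the filenames are illustrative):

```python
import csv
import json

# read a CSV file with a header row into a list of dicts
with open("data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# write the same records back out as JSON
with open("data.json", "w") as f:
    json.dump(rows, f, indent=2)
```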
The first challenge for any data visualizer is getting hold of the data they need, whether by request or to scratch a personal itch. If you’re very lucky, it will be delivered to you in pristine form, but more often than not you have to go and find it. I’ll cover the various ways you can use Python to get data off the Web (e.g., via web APIs or Google spreadsheets). The Nobel Prize dataset for the toolchain demonstration is scraped from its Wikipedia pages using Scrapy.
Python’s Scrapy is an industrial-strength scraper that does all the data throttling and media pipelining, which are indispensable if you plan on scraping significant amounts of data. Scraping is often the only way to get the data you are interested in, and once you’ve mastered Scrapy’s workflow, all those previously off-limits datasets are only a spider away.3
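To give a flavor of that workflow, here’s a minimal spider sketch; the target URL and CSS selectors are assumptions for illustration and would need adapting to the real page markup:

```python
import scrapy

class NobelSpider(scrapy.Spider):
    name = "nobel_winners"
    start_urls = [
        "https://en.wikipedia.org/wiki/List_of_Nobel_laureates",
    ]

    def parse(self, response):
        # yield one item per table row; the selectors are illustrative
        for row in response.css("table.wikitable tr"):
            yield {
                "links": row.css("a::attr(href)").getall(),
                "text": row.css("td::text").getall(),
            }
```

Running it with `scrapy runspider nobel_spider.py -o winners.json` collects the yielded items into a JSON file.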
The dirty secret of dataviz is that pretty much all data is dirty, and turning it into something you can use may well occupy a lot more time than anticipated. This is an unglamorous process that can easily steal over half your time, which is all the more reason to get good at it and use the right tools.
Pandas is a huge player in the Python data-processing ecosystem. It’s a Python data-analysis library whose chief component is the DataFrame, essentially a programmatic spreadsheet. Pandas extends NumPy, Python’s powerful numeric library, into the realm of heterogeneous datasets, the kind of categorical, temporal, and ordinal information that data visualizers have to deal with. As well as being great for interactively exploring your data (using its built-in Matplotlib plots), Pandas is well suited to the drudge-work of cleaning data, making it easy to locate duplicate records, fix dodgy date-strings, find missing fields, and so on.
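Here’s a small sketch of those cleaning chores in action; the records are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Marie Curie", "Marie Curie", "Albert Einstein"],
    "date_of_birth": ["1867-11-07", "1867-11-07", "not known"],
    "country": ["Poland", "Poland", None],
})

dupes = df[df.duplicated()]  # locate duplicate records

# dodgy date-strings become NaT (not-a-time) instead of raising an error
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], errors="coerce")

missing = df[df["country"].isnull()]  # find rows with missing fields
```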
Before beginning the transformation of your data into a visualization, you need to understand it. The patterns, trends, and anomalies hidden in the data will inform the stories you are trying to tell with it, whether that’s explaining a recent rise in year-by-year widget sales or demonstrating global climate change.
Once the data is cleaned and refined, we have the visualization phase, where selected reflections of the dataset are presented, ideally allowing the user to explore them interactively. Depending on the data, this might involve bar charts, maps, or novel visualizations.
In addition to the big libraries covered, there is a large supporting cast of smaller libraries. These are the indispensable smaller tools, the hammers and spanners of the toolchain. Python in particular has an incredibly rich ecosystem, with small, specialized libraries for almost every conceivable job. Among the strong supporting cast, some particularly deserving of mention are:
A great addition to Python’s plotting powerhouse Matplotlib, adding some very useful plot types, including statistical ones of particular use to data visualizers. It also adds arguably superior aesthetics, overriding the Matplotlib defaults (a small sketch follows this list).
Crossfilter is a standout. It enables very fast filtering of row-columnar datasets and is ideally suited to dataviz work, which is unsurprising given that one of its creators is Mike Bostock, the father of D3.
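Here’s a minimal sketch of the Matplotlib addition described above, assuming the library in question is seaborn (which the description matches):

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()  # override Matplotlib's defaults with seaborn's aesthetics

tips = sns.load_dataset("tips")  # a small demo dataset shipped with seaborn

# one of the statistical plot types mentioned above: a violin plot
# showing a distribution broken down by category
sns.violinplot(data=tips, x="day", y="total_bill")
plt.show()
```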
The remaining parts of the book, following our toolchain as it transforms a fairly uninspiring web list into a fully fledged, interactive D3 visualization, are essentially self-contained. If you want to dive immediately into Part III and some data cleaning and exploration with Pandas, go right ahead, but be aware that it assumes the existence of a dirty Nobel Prize dataset; you can see how that was produced with Scrapy later, if that fits your schedule. Equally, if you want to dive straight into creating the Nobel-viz app in Parts IV and V, be aware that they assume a clean Nobel Prize dataset.
Whatever route you take, I suggest eventually aiming to acquire all the basic skills covered in the book if you intend to make dataviz your profession.
This is a practical book and assumes that the reader has a pretty good idea of what he or she wants to visualize and how that visualization should look and feel, as well as a desire to get cracking on it, unencumbered by too much theory. Nevertheless, drawing on the history of data visualization can both clarify the central themes of the book and add valuable context. It can also help explain why now is such an exciting time to be entering the field, as technological innovation is driving novel dataviz forms, and people are grappling with the problem of presenting the increasing amount of multidimensional data generated by the Internet.
Data visualization has an impressive body of theory behind it, and there are some great books out there that I recommend you read (see “Recommended Books” for a small selection). The practical benefit of understanding the way humans visually harvest information cannot be overstated. It can be easily demonstrated, for example, that a pie chart is almost always a bad way of presenting comparative data and that a simple bar chart is far preferable. Psychometric experiments have given us a pretty good idea of what tricks the human visual system and makes relationships in the data harder to grasp; conversely, we can show that some visual forms are close to optimal for amplifying contrast. The literature, at the very least, provides some useful rules of thumb that suggest good candidates for any particular data narrative.
In essence, good dataviz tries to present data, collected from measurements in the world (empirical) or as the product of abstract mathematical explorations (e.g., the beautiful fractal patterns of the Mandelbrot set), in such a way as to draw out or emphasize any patterns or trends that might exist. These patterns can be simple (e.g., average weight by country) or the product of sophisticated statistical analysis (e.g., data clustering in a higher-dimensional space).
In its untransformed state, we can imagine this data floating as a nebulous cloud of numbers or categories, any patterns or correlations entirely obscure. It’s easy to forget, but the humble spreadsheet (Figure P-3a) is a data visualization: the ordering of data into row-columnar form is an attempt to tame it, make its manipulation easier, and highlight discrepancies (e.g., in actuarial bookkeeping). Of course, most people are not adept at spotting patterns in rows of numbers, so more accessible, visual forms were developed to engage with our visual cortex, the prime human conduit for information about the world. Enter the bar chart, pie chart,6 and line chart. More imaginative ways were employed to distill statistical data into a more accessible form, one of the most famous being Charles Joseph Minard’s visualization of Napoleon’s disastrous Russian campaign of 1812 (Figure P-3b).
The tan-colored stream in Figure P-3b shows the advance of Napoleon’s army on Moscow; the black line shows the retreat. The thickness of the stream represents the size of Napoleon’s army, thinning as casualties mounted. A temperature chart below the stream indicates the temperature at locations along the way. Note the elegant way in which Minard combined multidimensional data (casualty statistics, geographical location, and temperature) to give an impression of the carnage that would be hard to grasp in any other way (imagine trying to jump between a chart of casualties and a list of locations, making the necessary connections yourself). I would argue that the chief problem of modern interactive dataviz is exactly the one Minard faced: how to move beyond conventional one-dimensional bar charts (perfectly good for many things) and develop new ways to communicate cross-dimensional patterns effectively.
Until quite recently, most of our experience of charts was not much different from that of Charles Minard’s audience. They were pre-rendered and inert, showing one reflection of the data, hopefully an important and insightful one but nevertheless under the total control of the author. In this sense, the replacement of real ink points with computer-screen pixels was only a change in the scale of distribution.
Up next is the first part of the book, covering the preliminary skills needed for the toolchain. You can work through it now or skip ahead to Part II and the start of the toolchain, referring back when needed.
Tufte, Edward. The Visual Display of Quantitative Information. Graphics Press, 1983.
Ware, Colin. Information Visualization: Perception for Design. Morgan Kaufmann, 2004.
Rosenberg, Daniel, and Anthony Grafton. Cartographies of Time: A History of the Timeline. Princeton Architectural Press, 2012.
Few, Stephen. Information Dashboard Design: Displaying Data for At-a-Glance Monitoring. Analytics Press, 2013.
Cairo, Alberto. The Functional Art. New Riders, 2012.
Bertin, Jacques. Semiology of Graphics: Diagrams, Networks, Maps. Esri Press, 2010.
3 Scrapy’s controllers are called spiders.
6 William Playfair’s Statistical Breviary of 1801 has the dubious distinction of originating the pie chart.