Chapter 1. Visualize Data Analytics

Gideon Goldin

Introduction

Let’s begin by imagining that you are an auto manufacturer, and you want to be sure you are getting a good deal when it comes to buying the parts you need to build your cars. Doing this means you need to run some analyses over the data you have about spend with your suppliers; this data includes invoices, receipts, contracts, individual transactions, industry reports, etc. You may learn, for example, that you are purchasing the same steel from multiple suppliers, one of which happens to be both the least expensive and the most reliable. With this newfound knowledge, you engage in some negotiations around your supply chain, saving a substantial amount of money.

As appealing as this vignette might sound in theory, practitioners may be skeptical. How do you discover and explore, let alone unify, an array of heterogeneous datasets? How do you solicit dozens or hundreds of experts’ opinions to clean your data and inform your algorithms? How do you visualize patterns that may change quarter-to-quarter, or even second-to-second? How do you foster communication and transparency around siloed research initiatives? Traditional data management systems, social processes, and the user interfaces that abstract them become less useful as you collect more and more data [21], while latent opportunity may grow exponentially. Organizations need better ways to reason about such data.

Many of these problems have motivated the field of Visual Analytics (VA)—the science of analytical reasoning facilitated by interactive visual interfaces [1]. The objective of this chapter is to provide a brief review of VA’s underpinnings, including data management & analysis, visualization, and interaction, before highlighting the ways in which a data-centric organization might approach visual analytics—holistically and collaboratively.

Defining Visual Analytics

Where humans reason slowly and effortfully, computers are quick; where computers lack intuition and creativity, humans are productive. Though this dichotomy is oversimplified, the details therein inspire the core of VA. Visual analytics employs a combination of technologies, some human, some human-made, to enable more powerful computation. As Keim et al. explain in Mastering the information age-solving problems with visual analytics, VA integrates “the best of both sides.” Visual analytics integrates scientific disciplines to optimize the division of cognitive labor between human and machine [7].

The need for visual analytics is not entirely new; a decade has now passed since the U.S. solicited leaders from academia, industry, and government to set an initial agenda for the field. This effort, sponsored by the Department of Homeland Security and led by the newly chartered National Visualization and Analytics Center, was motivated in part by a growing need to better utilize the enormous and enormously disparate stores of data that governments had been amassing for so long [1]. While the focus of this agenda was post-9/11 security,¹ similar organizations (like the European VisMaster CA) share many of its goals [3]. Today, applications for VA abound, spanning beyond national security to quantified self [5], digital art [2], and of course, business intelligence.

Keim et al. go on to expand on Thomas and Cook’s definition from Illuminating the path: The research and development agenda for visual analytics [1]—citing several goals in the process:

Synthesize information and derive insight from massive, dynamic, ambiguous, and often conflicting data
Detect the expected and discover the unexpected
Provide timely, defensible, and understandable assessments
Communicate assessment effectively for action

These are broad goals that call for a particularly multidisciplinary approach; the following are just some of the fields involved in the scope of visual analytics [11]:

Information analytics
Geospatial analytics
Scientific & statistical analytics
Knowledge discovery
Data management & knowledge representation
Presentation, production & dissemination
Cognitive & perceptual science
Interaction

Role of Data Management and Analysis

While traditional database research has focused on homogeneous, structured data, today’s research looks to solve problems like unification across disparate, heterogeneous sources (e.g., streaming sensors, HTML, log files, relational databases, etc.) [7].

Returning to our auto manufacturing example, this means our analyses need to integrate across a diverse set of sources—an effort that, as Michael Stonebraker [38] notes in Getting Data Right, is necessarily involved—requiring that we ingest the data, clean errors, transform attributes, match schemas, and remove duplicates.

Even with a small number of sources, doing this manually is slow, expensive, and prone to error. To scale, one must make use of statistics and machine learning to do as much of the work as possible, while keeping humans in the loop only for guidance (e.g., helping to align cryptic coding schemas). Managing and analyzing these kinds of data cannot be done in isolation; the task is multifaceted and often requires collaboration and visualization; meanwhile, visualization requires curated or prepared data. Ultimately, we need interactive systems with interfaces that support seamless data integration, enrichment, and cleaning [22].
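The integration steps named above (match schemas, clean errors, transform attributes, remove duplicates) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the SCHEMA_MAP table, and the sample records are all hypothetical, and a real system would learn much of this mapping statistically rather than hard-code it.

```python
# Hypothetical mapping from each source's column names to one shared schema.
SCHEMA_MAP = {
    "supplier_name": "supplier", "vendor": "supplier",
    "part": "part", "item": "part",
    "unit_price": "unit_price", "price_usd": "unit_price",
}

def match_schema(record):
    """Rename source-specific keys to the shared schema; drop unknown keys."""
    return {SCHEMA_MAP[k]: v for k, v in record.items() if k in SCHEMA_MAP}

def clean(record):
    """Normalize text fields and coerce the price to a float."""
    price = str(record["unit_price"]).replace("$", "").replace(",", "")
    return {
        "supplier": record["supplier"].strip().lower(),
        "part": record["part"].strip().lower(),
        "unit_price": float(price),
    }

def deduplicate(records):
    """Keep the first record seen for each (supplier, part, unit_price) key."""
    seen = {}
    for r in records:
        seen.setdefault((r["supplier"], r["part"], r["unit_price"]), r)
    return list(seen.values())

def integrate(*sources):
    """Ingest every source, unify its schema, clean it, then remove duplicates."""
    unified = [clean(match_schema(r)) for source in sources for r in source]
    return deduplicate(unified)

# Two made-up sources that describe the same purchase in different ways.
invoices = [
    {"supplier_name": " Acme Steel ", "part": "Steel Sheet", "unit_price": "$1,200.50"},
]
contracts = [
    {"vendor": "acme steel", "item": "steel sheet", "price_usd": 1200.5},
    {"vendor": "Baker Metals", "item": "steel sheet", "price_usd": 1150},
]
unified = integrate(invoices, contracts)
```

Exact-match deduplication is the simplest possible choice here; the cryptic or inconsistent codes the text mentions are where human guidance and machine learning would take over.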

Role of Data Visualization

Before elucidating the visual component of VA, it is helpful to define visualization. In information technology, visualization usually refers to something like that defined by Card et al. in Readings in information visualization: “the use of computer-supported, interactive visual representations of data to amplify cognition” [24].

Visualization is powerful because it fuels the human sense with the highest bandwidth: vision (300 Mb/s [28]). Roughly 20 billion of our brain’s neurons are devoted to visual analysis, more than any other sense [28], and cognitive science commonly refers to vision as a foundational representation in the human mind. Because of this, visualization is bound to play a critical role in any data-heavy context—in fact, the proliferation of data is what helped to popularize visualization.²

Today, data visualization (DataVis) serves two major categories of data: scientific measurements and abstract information.

Scientific Visualization: Scientific Visualization (SciVis) is typically concerned with the representation of physical phenomena, often 3D geometries or fields that span space and time [7]. The purpose of these visualizations is often exploratory in nature, ranging across a wide variety of topics—whether investigating the complex relationships in a rat brain or a supernova [27].
Information Visualization: Information Visualization (InfoVis), on the other hand, is useful when no explicit spatial references are provided [28]. These are often the bar graphs and scatter plots on the screens of visual analysts in finance, healthcare, media, etc. These diagrams offer numerous benefits, one of which is taking advantage of visual pattern recognition to aid in model finding during exploratory data analysis.

Many of the most successful corporations have been quick to adopt database technologies. As datasets grow larger at an ever faster rate, the corporations that have augmented their database management systems with information visualization have been better able to utilize their increasingly valuable assets.³ It can be said that VA does for data analysis what InfoVis did for databases [7].

While InfoVis may lay the foundation for VA, its scope falls far outside this book. Worth noting, however, is the challenge of visualizing “big data.” Many of today’s visualizations are born of multidimensional datasets (with hundreds or thousands of variables with different scales of measurement), where traditional or static, out-of-the-box diagrams do not suffice [7]. Research here constitutes a relatively new field that is constantly extending existing visualizations (e.g., parallel coordinates [30], treemaps [29], etc.), inventing new ones, and devising methods for interactive querying over improved visual summaries [19]. The bigger the data, the greater the need for DataVis; the tougher the analytics, the greater the need for VA.

Role of Interaction

Visual analytics is informed by technical achievements not just in data management, analysis, and visualization, but also in interface design. If VA is to unlock the opportunity behind information overload, then thoughtful interaction is key.

In addition to the InfoVis vs. SciVis distinction, there is sometimes a line drawn between exploratory and explanatory (or expository) visualization, though it grows more blurred with time. Traditionally, exploratory DataVis is done by people who rely on vision to perform hypothesis generation and confirmation, while explanatory DataVis comprises summaries over such analyses. Though both exercises are conducted by individuals, only the latter has a fundamentally social component: it generates an artifact to be shared.

VA is intimately tied with exploratory visualization, as it must facilitate reasoning (which is greatly enhanced by interaction). Causal reasoning, for example, describes how we predict effects from causes (e.g., forecasting a burst from a financial bubble) or how we infer causes from effects (e.g., diagnosing an epidemic from shared symptomologies). By interacting, or intervening, we are able to observe not just the passive world, but also the consequences of our actions. If I observe the grass to be wet, I may raise my subjective probability that it has rained. As Pearl [33] notes, though, observing that the grass is wet after I turn on the sprinklers would not allow me to draw the same inference.
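The difference between observing and intervening can be made concrete with a toy calculation. The sketch below enumerates a tiny rain/sprinkler/wet-grass model; the probabilities are invented for illustration and are not taken from Pearl [33]. The do_sprinkler argument is a hypothetical name for forcing the sprinkler on or off, which removes its dependence on rain.

```python
from itertools import product

P_RAIN = 0.2  # assumed prior probability of rain
# Assumed chance that the sprinkler runs, given whether it rained.
P_SPRINKLER_GIVEN_RAIN = {True: 0.05, False: 0.4}

def joint(do_sprinkler=None):
    """Yield (rain, sprinkler, wet, probability) for every possible world.

    If do_sprinkler is set, the sprinkler is forced to that value, so its
    state no longer carries any information about rain.
    """
    for rain, sprinkler in product([True, False], repeat=2):
        p = P_RAIN if rain else 1 - P_RAIN
        if do_sprinkler is None:
            p_s = P_SPRINKLER_GIVEN_RAIN[rain]
            p *= p_s if sprinkler else 1 - p_s
        elif sprinkler != do_sprinkler:
            continue  # this world cannot occur under the intervention
        wet = rain or sprinkler  # grass is wet if either source of water is on
        yield rain, sprinkler, wet, p

def p_rain_given_wet(do_sprinkler=None):
    """Probability of rain among the worlds in which the grass is wet."""
    worlds = [(rain, p) for rain, _, wet, p in joint(do_sprinkler) if wet]
    total = sum(p for _, p in worlds)
    return sum(p for rain, p in worlds if rain) / total

observed = p_rain_given_wet()                      # I merely see wet grass
intervened = p_rain_given_wet(do_sprinkler=True)   # I turned the sprinkler on
```

With these assumed numbers, seeing wet grass raises the probability of rain above the prior, while wet grass that follows my own intervention leaves it at the prior, which is the asymmetry the paragraph describes.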

The same is true in software; instead of manipulating the world, we manipulate a simulation before changing data, models, views, or our minds. In the visual analytics process, data from heterogeneous and disparate sources must somehow be integrated before we can begin visual and automated analysis methods [3].

The same big data challenges of InfoVis apply to interaction. The volume of modern data tends to actually discourage interaction, because users are not likely to wait more than a few seconds for a filter query to extract relevant evidence (and such delays can change usage even if users are unaware [23]). As Nielsen [34] noted in 1993, major guidelines regarding response times have not changed for thirty years. One such guideline is the notion that “0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.” Beyond this limit, the user will exchange the feeling of directly manipulating [35] the data for one of delegating jobs to the system. As these are psychological principles, they remain unlikely to change any time soon.

Wherever we draw the line for what qualifies as a large dataset, it’s safe to assume that datasets often become large in visualization before they become large in management or analysis. For this reason, Peter Huber, in “Massive datasets workshop: Four years after” wrote: “the art is to reduce size before one visualizes. The contradiction (and challenge) is that we may need to visualize first in order to find out how to reduce size” [36]. To try and help guide us, Ben Shneiderman, in “The eyes have it: A task by data type taxonomy for information visualizations” proposed the Visual Information Seeking Mantra, which says: “Overview first, zoom and filter, then details-on-demand” [37].⁴
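As a rough sketch of how the mantra maps onto code, the three functions below correspond to its three stages. The transaction records, field names, and bin width are hypothetical; the point is only that the overview reduces size (by binning) before anything is drawn, and that full records are fetched last.

```python
from collections import Counter

# Hypothetical transactions; ids and prices are made up for illustration.
PRICES = [95, 102, 104, 110, 118, 250, 260, 990]
transactions = [{"id": i, "unit_price": p} for i, p in enumerate(PRICES)]

def overview(records, bin_width):
    """Overview first: count records per price bin instead of drawing each one."""
    return Counter((r["unit_price"] // bin_width) * bin_width for r in records)

def zoom_and_filter(records, low, high):
    """Zoom and filter: keep only records whose price falls in [low, high)."""
    return [r for r in records if low <= r["unit_price"] < high]

def details_on_demand(records, record_id):
    """Details-on-demand: return the full record for one selected id, or None."""
    return next((r for r in records if r["id"] == record_id), None)

summary = overview(transactions, bin_width=100)   # a handful of bins to plot
subset = zoom_and_filter(transactions, 100, 200)  # the analyst picks one bin
detail = details_on_demand(transactions, 3)       # then inspects one record
```

Binning is just one way to reduce size; Huber's caveat still applies, since choosing a sensible bin width may itself require looking at the data first.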

Role of Collaboration

Within a business, the exploratory visualization an analyst uses is often the same as the visualization she will present to stakeholders. Explanatory visualizations, on the other hand, such as those seen in infographics, are often reserved for marketing materials. In both cases, visualization helps people communicate, not just because graphics can be appealing, but because there is seldom a more efficient representation of the information (according to Larkin and Simon, this is “Why a diagram is (sometimes) worth ten thousand words” [25]). Despite the communicative power underpinning both exploratory and explanatory visualizations, the collaboration in each is confined to activities before and after the production of the visualization. A more capable solution should allow teams of people to conduct visual analyses together, regardless of spatiotemporal constraints, since modern analytical challenges are far beyond the scope of any single person.

Large and multiscreen environments, like those supported by Jigsaw [14], can help. But in the past decade, an ever-growing need has motivated people to look beyond the office for collaborators—in particular, many of us have turned to the crowd. A traditional view of VA poses the computer as an aid to the human; however, the reverse can sometimes ring more true. When computer scientist Jim Gray went missing at sea, top scientists worked to point satellites over his presumed area. They then posted photos to Amazon’s crowdsourcing service, Mechanical Turk, in order to distribute visual processing across more humans. A number of companies have since come to appreciate the power of such collaboration,⁵ while a number of academic projects, such as CommentSpace [39] and IBM’s pioneering ManyEyes [41], have demonstrated the benefits of asynchronous commenting, tagging, and linking within a VA environment. This is not surprising, as sensemaking is supported by work parallelization, communication, and social organization [40].

Putting It All Together

Today’s most challenging VA applications require a combination of technologies: high-performance computing and database applications (which sometimes include cloud services for data storage and management) and powerful interactions so analysts can tackle large (e.g., even exabyte) scale datasets [10]. Issues remain, however. While datasets grow, and while computing resources become less expensive, human cognitive abilities remain constant. Because of this, it is anticipated that cognition will become the bottleneck for VA without substantial innovation. For example, systems need to be more thoughtful about how they represent evidence and uncertainty.

Next-generation systems will need to do more. As stated by Kristi Morton et al. in “Support the data enthusiast: Challenges for next-generation data-analysis systems” [22], VA must improve in terms of:

combining data visualization and cleaning
data enrichment
seamless data integration
a common formalism

For combining data visualization and cleaning, systems can represent suggestions around what data is not clean, and what cleaning others may have done. If my software informs me of a suspiciously inexpensive unit-price for steel sheets, I should be able to report the data or fix it without concern of invalidating other analysts’ work.
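One way a system might surface a suspiciously inexpensive unit price is a robust outlier check. The sketch below uses the median absolute deviation with the commonly used 0.6745 scaling and 3.5 cutoff; this technique is an assumption chosen for illustration, not a method the chapter or [22] prescribes, and the prices are invented. It returns flags rather than modifying the data, so other analysts' work is left untouched.

```python
from statistics import median

def flag_suspicious_prices(prices, threshold=3.5):
    """Return indices of prices whose modified z-score exceeds the threshold."""
    center = median(prices)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(p - center) for p in prices])
    if mad == 0:
        return []  # no spread to measure against, so flag nothing
    return [
        i for i, p in enumerate(prices)
        if 0.6745 * abs(p - center) / mad > threshold
    ]

# Hypothetical unit prices for steel sheets; the last one looks too cheap.
steel_sheet_prices = [102.0, 98.5, 101.0, 99.0, 100.5, 12.0]
flagged = flag_suspicious_prices(steel_sheet_prices)
```

A visual interface could then highlight the flagged rows and let the analyst report or correct them, recording the decision alongside the original value.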

For data enrichment, systems must know what dimensions to analyze so that they can find and suggest relevant, external datasets, which then must be prepared for incorporation. This kind of effort can help analysts find correlations that may otherwise go undiscovered. If I am considering an investment with a particular supplier, for example, I would likely benefit from a risk report released by a third-party vendor or website.

In seamless data integration, systems should take note of the context of the VA, so they can better pull in related data at the right time; for example, zooming-in on a sub-category of transactions can trigger the system to query data about competing or similar categories, nudging me to contemplate my options.

Finally, a common formalism implies a common semantics—one that enables data analysts and enthusiasts alike to visually interact with, clean, and augment underlying data.

Next-generation analytics will require next-generation data management, visualization, interaction design, and collaboration. We take a pragmatic stance in recommending that organizations build a VA infrastructure that will integrate with existing research efforts to solve interdisciplinary projects—this is possible at almost any size. Furthermore, grounding the structure with a real-world problem can facilitate rapid invention and evaluation, which can prove invaluable. Moving forward, organizations should be better-equipped to take advantage of the data they already maintain to make better decisions.

References

[1] Cook, Kristin A., and James J. Thomas. Illuminating the path: The research and development agenda for visual analytics. No. PNNL-SA-45230. Pacific Northwest National Laboratory (PNNL), Richland, WA (US), 2005.

[2] Viégas, Fernanda B., and Martin Wattenberg. “Artistic data visualization: Beyond visual analytics.” Online Communities and Social Computing. Springer Berlin Heidelberg, 2007. 182-191.

[3] Keim, Daniel A., et al., eds. Mastering the information age-solving problems with visual analytics. Florian Mansmann, 2010.

[5] Huang, Dandan, et al. “Personal visualization and personal visual analytics.” Visualization and Computer Graphics, IEEE Transactions on 21.3 (2015): 420-433.

[7] Keim, Daniel, et al. Visual analytics: Definition, process, and challenges. Springer Berlin Heidelberg, 2008.

[10] Wong, Pak Chung, et al. “The top 10 challenges in extreme-scale visual analytics.” IEEE computer graphics and applications 32.4 (2012): 63.

[11] Keim, Daniel, et al. “Challenges in visual data analysis.” Information Visualization, 2006. IV 2006. Tenth International Conference on. IEEE, 2006.

[12] Zhang, Leishi, et al. “Visual analytics for the big data era—A comparative review of state-of-the-art commercial systems.” Visual Analytics Science and Technology (VAST), 2012 IEEE Conference on. IEEE, 2012.

[14] Stasko, John, Carsten Görg, and Zhicheng Liu. “Jigsaw: supporting investigative analysis through interactive visualization.” Information visualization 7.2 (2008): 118-132.

[15] Fekete, Jean-Daniel, et al. “The value of information visualization.” Information visualization. Springer Berlin Heidelberg, 2008. 1-18.

[19] Liu, Zhicheng, Biye Jiang, and Jeffrey Heer. “imMens: Real‐time Visual Querying of Big Data.” Computer Graphics Forum. Vol. 32. No. 3pt4. Blackwell Publishing Ltd, 2013.

[21] Stonebraker, Michael, Sam Madden, and Pradeep Dubey. “Intel big data science and technology center vision and execution plan.” ACM SIGMOD Record 42.1 (2013): 44-49.

[22] Morton, Kristi, et al. “Support the data enthusiast: Challenges for next-generation data-analysis systems.” Proceedings of the VLDB Endowment 7.6 (2014): 453-456.

[23] Liu, Zhicheng, and Jeffrey Heer. “The effects of interactive latency on exploratory visual analysis.” Visualization and Computer Graphics, IEEE Transactions on 20.12 (2014): 2122-2131.

[24] Card, Stuart K., Jock D. Mackinlay, and Ben Shneiderman. Readings in information visualization: using vision to think. Morgan Kaufmann, 1999.

[25] Larkin, Jill H., and Herbert A. Simon. “Why a diagram is (sometimes) worth ten thousand words.” Cognitive science 11.1 (1987): 65-100.

[27] Ma, Kwan-Liu, et al. “Scientific discovery through advanced visualization.” Journal of Physics: Conference Series. Vol. 16. No. 1. IOP Publishing, 2005.

[28] Ware, Colin. Information visualization: perception for design. Elsevier, 2012.

[29] Shneiderman, Ben. “Tree visualization with tree-maps: 2-d space-filling approach.” ACM Transactions on graphics (TOG) 11.1 (1992): 92-99.

[30] Inselberg, Alfred, and Bernard Dimsdale. “Parallel coordinates: A tool for visualizing multivariate relations.” Human-Machine Interactive Systems (1991): 199-233.

[31] Stolte, Chris, Diane Tang, and Pat Hanrahan. “Polaris: A system for query, analysis, and visualization of multidimensional relational databases.” Visualization and Computer Graphics, IEEE Transactions on 8.1 (2002): 52-65.

[32] Unwin, Antony, Martin Theus, and Heike Hofmann. Graphics of large datasets: visualizing a million. Springer Science & Business Media, 2006.

[33] Pearl, Judea. Causality. Cambridge university press, 2009.

[34] Nielsen, Jakob. Usability engineering. Elsevier, 1994.

[35] Shneiderman, Ben. “Direct manipulation: A step beyond programming languages.” ACM SIGSOC Bulletin. Vol. 13. No. 2-3. ACM, 1981.

[36] Huber, Peter J. “Massive datasets workshop: Four years after.” Journal of Computational and Graphical Statistics 8.3 (1999): 635-652.

[37] Shneiderman, Ben. “The eyes have it: A task by data type taxonomy for information visualizations.” Visual Languages, 1996. Proceedings., IEEE Symposium on. IEEE, 1996.

[38] Stonebraker, Michael. “The Solution: Data Curation at Scale.” Getting Data Right: Tackling the Challenges of Big Data Volume and Variety. Ed. Shannon Cutt. California, 2015. 5-12. Print.

[39] Willett, Wesley, et al. “CommentSpace: structured support for collaborative visual analysis.” Proceedings of the SIGCHI conference on Human Factors in Computing Systems. ACM, 2011.

[40] Heer, Jeffrey, and Maneesh Agrawala. “Design considerations for collaborative visual analytics.” Information visualization 7.1 (2008): 49-62.

[41] Viégas, Fernanda B., et al. “Manyeyes: a site for visualization at internet scale.” Visualization and Computer Graphics, IEEE Transactions on 13.6 (2007): 1121-1128.

¹ The attacks of that date required real-time response at an unprecedented scale.

² Only a few decades ago, visualization was unrecognized as a mainstream academic discipline. John Tukey (co-developer of the FFT algorithm, inventor of the box plot, and more) played a key part in its broader adoption, highlighting its role in data analysis.

³ During this time, several academic visualization projects set the groundwork for new visualization techniques and tools. One example is Stanford’s Polaris [31], an extension of pivot tables that enabled interactive, visual exploration of large databases. In 2003, the project was spun into the commercially available Tableau software. A comparison of commercial systems is provided in [12].

⁴ Keim emphasizes VA in his modification: “Analyze first, show the important, zoom, filter and analyze further, details on demand” [7].

⁵ Tamr, for example, emphasizes collaboration within a VA framework, using machine learning to automate tedious tasks while keeping human experts in the loop for guidance.

Getting Analytics Right by Nidhi Aggarwal, Byron Berk, Gideon Goldin, Matt Holzapfel, Eliot Knudsen