In this special two-segment episode of the Data Show, I spoke with Dafna Shahaf, assistant professor at the School of Computer Science and Engineering at the Hebrew University of Jerusalem. Her research focuses on tools and techniques for overcoming information overload, an area of increasing importance in an attention economy. With the U.S. presidential election right around the corner, I also included a conversation between Jenn Webb, host of the O’Reilly Radar Podcast, and Sam Wang, co-founder of the Princeton Election Consortium and professor of neuroscience and molecular biology at Princeton University.
Below are highlights from my conversation with Dafna Shahaf:
The input for a metro map [a methodology for creating structured summaries of information that explicitly captures temporal dynamics, extracts major story lines, and shows how pieces of information relate to each other] is a set of articles that you want to summarize and organize. The output looks like a subway map (or a metro map), where each line is a coherent story line, and different lines focus on different aspects that can intersect and overlap.
One of my favorite examples of a metro map is when we applied it to the debt crisis in Europe, where you had one line about what Germany was doing and one line about the strikes and riots in Greece, and another line about the International Monetary Fund. You're supposed to look at this and guess that those are the major story lines and that this is how everything is connected.
The really hard part behind this map was crafting the objective function to mathematically optimize values.
Coming up with a map, you don't really know what you're looking for, right? If I show you a map, though, you know it's good. Everything is intuitive, so we had to define some properties that would make a map better—like connectivity, meaning that if two lines are related, then the map should show them intersecting.
The other property we needed to define was coverage. This meant that we should indeed cover things that are important to the user as well as other diverse items. In the case of news, we defined the items that we wanted to cover as words.
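To make the two properties described above a bit more concrete, here is a minimal sketch in Python. It is a toy illustration, not the objective function used in the actual metro maps work: the function names, the word weights, the article IDs, and the representation of a map as a list of lines (each a list of article IDs) are all assumptions made for this example. Coverage is approximated as the weighted share of important words that appear in at least one article on the map, and connectivity as the number of pairs of lines that intersect at a shared article.

```python
from itertools import combinations


def coverage(lines, article_words, word_weights):
    """Weighted fraction of important words covered by the map (toy definition)."""
    covered = set()
    for line in lines:
        for article in line:
            covered |= article_words[article]
    total = sum(word_weights.values())
    return sum(w for word, w in word_weights.items() if word in covered) / total


def connectivity(lines):
    """Number of pairs of lines that share at least one article (toy definition)."""
    return sum(1 for a, b in combinations(lines, 2) if set(a) & set(b))


# Hypothetical articles, each reduced to a set of words.
article_words = {
    "a1": {"germany", "bailout"},
    "a2": {"greece", "strikes"},
    "a3": {"imf", "bailout", "greece"},
    "a4": {"germany", "imf"},
}

# Hypothetical importance weights for the words a user cares about.
word_weights = {
    "germany": 1.0,
    "greece": 1.0,
    "imf": 1.0,
    "bailout": 0.5,
    "strikes": 0.5,
    "austerity": 1.0,  # not mentioned by any article, so it stays uncovered
}

# A candidate map with three lines; the first two intersect at article "a3".
metro_map = [["a1", "a3"], ["a2", "a3"], ["a4"]]

print(coverage(metro_map, article_words, word_weights))
print(connectivity(metro_map))
```

A real objective would also have to score how coherent each line is and trade these properties off against one another; this sketch only shows how properties like coverage and connectivity can be turned into numbers that an optimizer can compare across candidate maps.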
Pushing the envelope on what computers can do
There has always been this battle for mankind's sense of uniqueness. You know how psychology articles always say that humans are the only animals that can do X? It was thought that we were the only animals with language skills until they found that chimpanzees could do sign language. We're the only animals that can use tools, until we observed birds using tools. Originally, this battle was waged against animals, but more recently, it’s become against computers.
We keep on hearing those things like, computers will never be able to play chess, or to play Go, and we know how well that assumption went for them.
My point is that even today there are those areas that are considered outside the reach of computer science. Such as, computers will never be creative, or will never have a sense of humor. To me, those areas are the most interesting places to do research. When somebody tells me that computers can't do something (like have a sense of humor), I immediately start thinking about how to get them to, and what kind of data would be useful for me to work with.
The Aha! Moment: from data to insight, Dafna Shahaf’s Strata + Hadoop World 2014 presentation
From search to distributed computing to large-scale information extraction: a conversation with Hadoop co-founder Mike Cafarella (around minute 24:00, Mike starts describing applications of DeepDive)