Ask your data new questions

Consolidating data across silos improves business insight.

By Shannon Cutt and Ben Lorica

November 3, 2015

"33 Bridges" or "the Bridge of 33 Arches," also called the "Allah-Verdi Khan Bridge." (source: Reza Haji-pour on Wikimedia Commons)

During a special edition of The O’Reilly Podcast, host and O’Reilly chief data scientist Ben Lorica interviewed Nidhi Aggarwal, whose background is in hardware and software, and whose experience spans engineering, consulting, research, and entrepreneurship.

Aggarwal is now the strategy and marketing lead at Tamr, which uses data unification and analytics to help companies improve revenue growth through a consolidated view of their data.

The following is an edited transcript of their chat. For the full interview, see the player at the end of this post or download the episode on SoundCloud.

Key takeaways from their chat:

Everybody is producing a lot of data, and wants to make a decision based on a lot of data, but they don’t necessarily have a way of realizing that ambition. Companies fall into comfortable patterns of answering questions they have always answered.
Context is really important. Data in the absence of context is likely to lead to bad insight. Depending on the context, the meaning of the data itself even changes.
People want to use different tools — even spreadsheets. Tamr offers a thin layer of software that can analyze spreadsheet data in a cohesive manner.

Ben Lorica: One of the interesting things on your resume is your stint at McKinsey, which probably gave you insight into how some famous, large companies deal with data. Let’s talk a little bit about that.

Nidhi Aggarwal: What I saw at McKinsey was that famous, large companies have a huge desire to be data driven … Everybody is producing a lot of data, and wants to make a decision based on a lot of data, but they don’t necessarily have a way of realizing that ambition. … There are still a lot of silos around. Companies ask questions like: ‘Okay how much data can we use?’, ‘Do I even know where the data is?’, ‘What data is most relevant?’, and ‘How do I prepare the data to answer the questions?’. So, what I saw over and over again was that people intuitively want to use more data, and be more data driven, but they fall into the comfortable patterns of answering the questions that they have always answered.

BL: Right, because they already have the reports, and it’s convenient and maybe there is a certain amount of bureaucracy that prevents them from collecting and matching up disparate data sources, right?

NA: Exactly, right. And I think also what has happened in the market is there is so much focus on the technology around big data and less around the art of using the big data to answer the right questions. And that has also frozen business people because they do not know what to do about Spark, or YARN, or any of these buzzwords flying around, and it’s very confusing.

BL: What I’ve seen in an informal way is that spending at big data startups is moving away from IT more toward the line of business people. Is that the kind of trend you saw when you were at McKinsey — that business people are more empowered, now that tools are easier?

NA: I would say business people are frustrated by the lack of progress from IT. It was not their desire to take over the IT part of answering questions, but in the words of one of our customers, for example, if they ask a question, it takes IT sometimes eight months to stitch the data together before they can even begin the analysis. In that time, the business opportunity is lost, right? … IT organizations that have been very effective in partnering with business people haven’t lost their budget. … In fact, to the contrary, they are getting more on the table — they are the ones bringing the solutions to the business.

BL: At some point, you went from having your own startup to working at Tamr. What, in particular, about Tamr piqued your interest?

NA: The biggest thing for me was the problem Tamr was solving. The siloed data and siloed analytics problem is something I saw over and over again. … At McKinsey, we would routinely go into client situations, solve a problem for a client, and then see that the different business units won’t even talk to each other, won’t even have a process of sharing data. These are people who are working together toward the company being successful, and that lack of sharing was really mind-boggling. And it was not that people didn’t have a desire to share; it was more that the systems and the processes and the tools were set up in such a siloed way that it was really hard to bring the data together, even though the marketing people would like to know how what they are doing in marketing is affecting sales, and to get that feedback. … It was just challenging to do that.

… It’s not a problem we can solve just with humans, and it’s not a problem we can solve just with technology. We really need to connect the people and the technology together. … So, one of the things we say is, ‘machine driven, human guided.’

BL: What has the team at Tamr done to alleviate the whole problem of reproducibility/repeatability of complex data pipelines in data projects?

NA: At Tamr, there are two ways in which we think about repeatable analytics. … One is, the scientific definition of repeatability — given the same analytics, somebody should be able to reproduce it, so that you take out the inherent biases.

The second part of repeatability is, if somebody has done the work already, we take that work and are able to extend it — so repeatably, we are able to do better analytics using more data.

BL: So a typical company will have many different systems — many different data sources and IT systems. What are some good strategies for making an organization’s relevant data available for analysis?

NA: One of the central things that is missing is that everybody will talk about how the data is their asset. But really look at it. How are we managing it? For financial assets, you go to the CFO, and the CFO can tell you exactly all the bank accounts, how much money they have, who has access to that, who is using money in that way. If data truly is an asset, we should start managing it that way. We should have the CDO and the CIO being able to tell you where all the information is.

BL: Interesting. So, we have heard a lot about silos. What specific strategies would you give people for reducing analytic silos?

NA: Number one is breaking down the data silos so you have a catalog and know where all of your data is — where it lives, where it’s coming from, who uses it, and what it’s being used for. … If we give visibility for all of the data within the enterprise, your business analyst will be able to think: ‘Okay, what’s an interesting analysis I can do if I have this data available?’ So, having this transparent catalog, having a predictable way of preparing that data, and tying it to the business questions, end-to-end, is the best way to resolve these analytical silos.
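As a rough illustration of the catalog idea described above, here is a minimal Python sketch. It is not Tamr's product or API; the `CatalogEntry` fields, the `sources_for` helper, and the sample entries are all hypothetical, chosen only to mirror the four questions in the answer (where the data lives, where it comes from, who uses it, and what it is used for).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """One data source, described by the four questions raised above."""
    name: str
    location: str                                        # where it lives
    origin: str                                          # where it's coming from
    users: List[str] = field(default_factory=list)       # who uses it
    purposes: List[str] = field(default_factory=list)    # what it's being used for


def sources_for(catalog: List[CatalogEntry], keyword: str) -> List[str]:
    """Return names of sources whose recorded purposes mention the keyword."""
    keyword = keyword.lower()
    return [
        entry.name
        for entry in catalog
        if any(keyword in purpose.lower() for purpose in entry.purposes)
    ]


# Hypothetical entries, invented for illustration only.
catalog = [
    CatalogEntry("crm_accounts", "sales database", "CRM export",
                 ["sales"], ["pipeline forecasting"]),
    CatalogEntry("campaign_results", "marketing file share", "email tool export",
                 ["marketing"], ["campaign reporting", "sales attribution"]),
]

print(sources_for(catalog, "sales"))
```

With even this much recorded, an analyst can ask which existing sources already touch a business question before requesting a new data pull.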

BL: Tamr is one of the companies that really has done a good job of building a system that combines the best of humans and machines, and also has done a good job in educating the data space about this type of human-in-the-loop system. So, generally — how much of the recent developments in analytics are about this balance between human expertise and machine learning solutions?

NA: A huge part of it is this balance between human expertise and machine learning systems. Context is really important. Data in the absence of context is likely to lead to bad insight. … And, depending on the context, the meaning of the data itself changes. Financial services and manufacturing, for example, are places where, if you don’t have the right context, the same data can mislead you to bad insight.

So, a lot of the analytics that we are performing today, and that are actually effective, tend to have this combination of bringing the right human expertise, at the right time, but then scaling it using machine learning so that the human does not have to be constantly involved in order to scale to this large data set.

BL: We talked a lot about organizations having many different data sources, but actually, in many companies, there’s still just one source of data. The type of source data that is really prevalent is spreadsheets. Does your team at Tamr encounter that a lot?

NA: It’s a huge challenge and I will give you an example from one of our former customers who spent billions of dollars on research and development — they actually had a bunch of R&D scientists recording the experiments and observations in spreadsheets. … They had no cohesive place to determine whether it was effective or to determine whether different scientists were working on the same thing. Could they benefit from each of the experiments? There was just no place to do analysis like that. By using Tamr, they were able to bring the data from their 27,000 spreadsheets together, and for the first time, see comprehensively where they were spending their R&D dollars.

BL: So how did that story end? Are they still using spreadsheets?

NA: They are still using spreadsheets. One of the things that Tamr embraces is, we are not trying to rip and replace current processes. We don’t say that ‘if only you used a particular version of software on some database, then all of your problems will go away.’ Because that’s what the big enterprise database software companies try to sell, right? It hasn’t worked out, right? People want to use different tools. People want to use spreadsheets. At Tamr, we think that embracing that variety and skill is what is needed. Let people use the spreadsheets, but give them a thin layer of software that can help them analyze all of those spreadsheets in a cohesive manner, and that’s what our product does for them.
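To make the "thin layer over spreadsheets" idea concrete, here is a small Python sketch. It is an assumption-laden toy, not how Tamr works: the column names, the `COLUMN_ALIASES` mapping, and the sample sheets are invented, and the mapping is written by hand, whereas the interview describes machine learning guided by human experts. It only shows the general shape of reading differently labeled sheets into one shared layout and then asking a single question across them.

```python
import csv
import io

# Hypothetical mapping from each sheet's header spelling to one shared name.
# A real system would learn or curate this; here it is written by hand.
COLUMN_ALIASES = {
    "project": "project",
    "project name": "project",
    "cost": "spend_usd",
    "spend (usd)": "spend_usd",
}


def read_sheet(source_name, text):
    """Parse one CSV export and rename its columns to the shared names."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {"source": source_name}  # remember which sheet the row came from
        for header, value in raw.items():
            shared = COLUMN_ALIASES.get(header.strip().lower())
            if shared is not None:  # columns with no known alias are skipped
                row[shared] = value.strip()
        rows.append(row)
    return rows


def spend_by_project(rows):
    """Total the spend column per project across all combined rows."""
    totals = {}
    for row in rows:
        project = row["project"]
        totals[project] = totals.get(project, 0.0) + float(row["spend_usd"])
    return totals


# Two made-up sheets that record the same facts under different headers.
sheet_a = "Project,Cost\nAlpha,100\nBeta,250\n"
sheet_b = "Project Name,Spend (USD)\nAlpha,50\nGamma,75\n"

combined = read_sheet("lab_a.csv", sheet_a) + read_sheet("lab_b.csv", sheet_b)
print(spend_by_project(combined))
```

The scientists keep their own spreadsheets and headers; only the combined view is standardized, which is the "no rip and replace" point made in the answer.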

BL: In closing, tell us what you think are some of the best practices for organizations to be able to answer their most pressing business questions.

NA: Use all of your relevant data. Make efforts to make that data visible, available, and accessible. Do thoughtful data engineering around it, so that you don’t have to start from scratch for each new question. Do that data prep across the silos so that you can scale in a time-sensitive manner. And then answer questions repeatedly — don’t only have verifications of the questions that have been answered before, but build on that. Accelerate your analytics by doing these things so that you’re spending most of your time asking new questions, answering more questions, and getting to better insight using more of the data.

You can download Tamr’s Catalog product for free here, and you can listen to the complete podcast episode in the player below or download it through our SoundCloud playlist.

This post is a collaboration between O’Reilly and Tamr. See our statement of editorial independence.