Domino effect. (source: Pokipsy76~commonswiki on Wikimedia Commons)

In this episode of the O’Reilly Data Show, I spoke with John Akred, cofounder and CTO of Silicon Valley Data Science. Akred and his colleagues teach two of the more popular Strata + Hadoop World tutorials—“Developing a Modern Enterprise Data Strategy” and “Architecting a Data Platform.” We talked about his career in data science and consulting, and his penchant for bringing emerging technologies and tools into large enterprises.

Here are some highlights from our conversation:

Developing an enterprise data strategy

Our conception of what an enterprise data strategy should be is something we cover in our Strata tutorial. We start with the ambitions of a business and create a roadmap of technology, people, and capability investments to unlock that business potential. We go through a period of how you recognize and then sort of decompose a business's ambition into concrete business objectives, which are things you can actually go out and do, or build. And then from those, understand what technical workloads and things are required—“I need a database, I need an algorithm that does a recommendation, or …”—to unlock that business objective.

You don't actually have to be technical, because this is really about understanding and mapping business priority to your technical investments. The discussion of technical investments themselves stays at a pretty high level. You need to understand the modern application of data, and our customers typically understand their business and what's going on there very, very well. We bring the perspective of understanding the technology and the art of the possible in those engagements. If you were to actually leave our tutorial to try and go implement a data strategy yourself, the technical skills you would need would be an understanding of the art of the possible with respect to data technology.

Agile development methods in data science

The idea is that any data science activity has a kind of uncertainty, where you don't know exactly how you're going to solve the problem. You have probably a pretty good idea that you can because people have solved similar problems, and you probably have a pretty good idea of how you're going to start. But we can't say that it's going to take two months, that we will be at this point in week three of month two, and if we're not, we're behind schedule, and if we've already done that, we're ahead of schedule, because the projects don't play out that way.

The advantage of Agile, which is a concept that started in the software development world, is around rapid, iterative product development and getting rapid feedback cycles from customers.

… In the morning, you have stand-ups, so you have devices to share status, to coordinate with your team, and things like that, which are very efficient, quick, and lightweight. It's a process that's designed to manage rapid, iterative work. Now, engineering a product and engineering a data science solution are slightly different in that data science tends to be less deterministic, although both of them have plenty of creativity involved, and both groups spend a lot of time staring at the wall trying to think of their approach to something. But the benefit of Agile is that it manages the execution of these things in cycles where you learn something, you share those results with your customer, whether that's a physical company's customer or your internal stakeholder in your organization, and you take that feedback and move forward.

… It's definitely something that we use in our projects with our customers to be successful. … I get asked to talk to some companies’ internal teams about this, and it's interesting because a lot of the leadership want someone else to come in and help their teams understand why we do this. Because in data science, there's a lot of pushback on process, generally, and what I always say to my team and others is, "Look, this is the lightest-weight way we've come up with for doing these things. If you can come up with an easier, lighter-weight way, we're all for it."

… I think Agile will become more prevalent in data science because it's a nice, light way to manage progress. … We have customers telling us all the time that it's made them much more productive. It's not made them better data scientists; it's created a world where, for instance, people spend less time working on a model that ultimately is a dead end, because they have much more frequent conversations with the ultimate consumers.

Related resources: