Enabling reliable, secure collaboration on data science and machine learning projects
A conversation with Paul Taylor, chief architect in Watson Data and AI, and IBM Fellow.
Machine learning researchers often prototype new ideas using Jupyter, Scala, or RStudio notebooks, which is a great way for individuals to experiment and share their results. But in an enterprise setting, individuals cannot work in isolation: many developers, perhaps from different departments, need to collaborate on projects simultaneously and securely. I recently spoke with IBM’s Paul Taylor to find out how IBM Watson Studio is scaling machine learning to enterprise-level, collaborative projects.
First, a bit of background about Taylor. He has enjoyed a distinguished career at IBM over the past 17 years. He started off working on Db2 and Informix, and worked with big data and unstructured data well before those fields exploded. He has held many titles across different technology areas, including distinguished engineer, chief architect, master inventor, and CTO, and this year he was appointed an IBM Fellow.
Today, Taylor leads technology for the IBM Watson Data and AI components, where he is exploring the convergence of data, AI, and public cloud with IBM Watson Studio. Watson Studio provides a suite of tools for data scientists, application developers, and subject matter experts to collaborate and work with data to conduct analytics and data science, and to build, train, and deploy models at scale.
Frank Kane: Why is better collaboration in data science important? What sorts of opportunities do you see it creating for real-world developers and businesses?
Paul Taylor: A lot of times I go in to talk to C-suite folks who are running the data science teams. They face a real challenge because, traditionally, many of those clients and the scientists are using their own little tools, and they may be very sophisticated tools, or they may be very naïve ones. They’re all working in silos, and they’re using their own tools in their own way.
A lot of times what I’ve seen is that these enterprises are really trying to figure out how to consolidate that a little bit because it’s really hard for those teams to work together and share the same data. That’s the beauty of Watson Studio. You can have these different teams, even if they use different technologies. Some people want to just program themselves, and they’re going to use notebooks and they’re going to use some custom libraries they get from some place.
Other people know the data science algorithms and statistics and so forth, but they’re not really hard-core programmers. A lot of times those individuals want to work together as part of a team for a specific project. These companies, until we showed up, were thinking they were going to have to procure several different tools that would live in isolation and be used by different groups.
They actually got really excited by this notion of, “Wow, if we could actually bring these communities together, still knowing that they can use the special technologies they prefer, but not have to then copy the data, to be able to see each other’s work, and see the comments, and be able to annotate it and collaborate on it, and maybe invite other people into their projects. And do that in a secure way and a very easy way.”
I think that’s what really opens people’s eyes, particularly in an enterprise setting or place where collaboration really needs to happen. Not only the collaboration, but how do these companies learn from their expertise? By having the notion of projects and being able to have different people working on the projects, that sort of shared tribal knowledge starts to get consolidated in a common set of tools and practices.
FK: I’ve seen people try to just put a Jupyter Notebook, like an .ipynb file, on GitHub somewhere and call it collaboration. Obviously that is fraught with problems.
PT: Yes, that’s exactly it. We’re not saying don’t put those things in certain places, but we’re saying that if you can put it in the context of Watson Studio, then you have all of that other ecosystem around it, which lets you share with different users and different people. You can also control the security around that through these access control lists of who has access and what role [they have] within the project.
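The access control Taylor mentions can be illustrated with a minimal sketch. This is not the Watson Studio API. The `Project` class, the role names (`viewer`, `editor`, `admin`), and the permission sets below are hypothetical, chosen only to show how an access control list can map each collaborator to a role and each role to a set of allowed actions.

```python
from dataclasses import dataclass, field

# Hypothetical roles and the actions each one permits (assumed for illustration).
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "comment", "edit"},
    "admin": {"read", "comment", "edit", "manage_members"},
}


@dataclass
class Project:
    """A toy project whose access control list maps user -> role."""
    name: str
    owner: str
    acl: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The creator of the project starts out as its admin.
        self.acl[self.owner] = "admin"

    def can(self, user: str, action: str) -> bool:
        # Users absent from the ACL have no permissions at all.
        role = self.acl.get(user)
        return role is not None and action in ROLE_PERMISSIONS[role]

    def add_member(self, actor: str, user: str, role: str) -> None:
        # Only someone holding "manage_members" may change membership.
        if not self.can(actor, "manage_members"):
            raise PermissionError(f"{actor} may not manage members of {self.name}")
        if role not in ROLE_PERMISSIONS:
            raise ValueError(f"unknown role: {role}")
        self.acl[user] = role


project = Project(name="churn-model", owner="alice")
project.add_member("alice", "bob", "editor")
project.add_member("alice", "carol", "viewer")
print(project.can("bob", "edit"))       # True
print(project.can("carol", "comment"))  # False
```

Keeping the role-to-permission table separate from the user-to-role table means a collaborator's access changes in one place when their role changes, which is one common reason projects use roles instead of per-user permission lists.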
FK: Right. So it sounds like Jupyter Notebooks are just one of the things you can share across teams in Watson Studio. Are there more systems that surround those notebooks?
PT: That’s a really great question because anybody who’s used notebooks themselves realizes there are still opportunities. When you give a notebook to somebody else, it doesn’t mean that you can actually capture all the interaction that you want between two people who are collaborating.
At the same time, you don’t want to customize the notebook so much that it’s no longer the same thing they picked up out of open source and it becomes too dissimilar. So, one of the key things we tried to do was use the open source and preserve the integrity of that, but then put some framing around it.
You can have comments, and everybody can comment because it’s in the context of a project. You can put security around the project itself—who gets added, who’s allowed to access it, that kind of thing.
And then, of course, inside the project you have the framing of “Where is the data coming from?” So you can control more of the environment if you like through Watson Studio, which is above and beyond what you can do in a Jupyter Notebook, but at the same time you’re preserving the integrity of what people expect out of a Jupyter Notebook and having it run in a way that’s consistent with that.
FK: Are there any specific organizational challenges involved in having different teams collaborating on the same notebook or project within Watson Studio?
PT: The traditional issue that crops up anytime you’re dealing with very sensitive enterprise data is that people want to know how it is managed. If you have sensitive data in there, where is it and how would you know that, and how are you managing that?
So you get into those kinds of questions. I think this starts with the security of the collaborators, their identities, groups, and roles. In a Jupyter Notebook, for example, you can certainly write code that exposes credentials, which is another type of risk when dealing with enterprise data.
We’re introducing capabilities to minimize that, so that those credentials can be shielded from people. That minimizes the threat of accidental exposure, where somebody could then use those credentials in a way that wasn’t intended.
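One common way to keep credentials out of notebook cells is to read secrets from the environment at run time and mask them before anything is displayed. The sketch below shows that general practice. It does not describe how Watson Studio shields credentials, and the helper names `load_credential` and `redact`, along with the variable name `DB_PASSWORD`, are hypothetical.

```python
import os
from typing import Iterable


def load_credential(name: str) -> str:
    """Read a secret from the environment instead of hardcoding it in a cell."""
    value = os.environ.get(name)
    if not value:
        raise KeyError(f"credential {name!r} is not set in the environment")
    return value


def redact(text: str, secrets: Iterable[str]) -> str:
    """Replace each occurrence of a secret with a mask before text is shown or logged."""
    for secret in secrets:
        if secret:  # skip empty strings, which would match everywhere
            text = text.replace(secret, "****")
    return text


# The value is set here only so the sketch runs on its own;
# in practice it would be supplied from outside the notebook.
os.environ.setdefault("DB_PASSWORD", "s3cr3t-value")
password = load_credential("DB_PASSWORD")
message = f"connecting with password={password}"
print(redact(message, [password]))
```

Because the notebook file then contains only the variable name, sharing the `.ipynb` with collaborators does not share the secret itself.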
FK: Do you find this broader sharing makes the technology more accessible to a large organization and spreads a deeper understanding of what’s going on internally?
PT: Yes, definitely. You can do a lot more. I think that’s what people like in the notebooks—you can put comments in and images and other things, so they can be really expressive.
We’ve even had, inside a project, links to what you’d call a community—where samples are, where there are tutorials, and so forth. That really helps build teams as well, right? You may have somebody who’s a really good data scientist, but there are junior members or interns on the project. Sometimes they need some help.
You can link to those sources directly inside the project, and say, “Here’s a tutorial on how to connect to a Spark system.” Or, “Here’s a tutorial on how to use this type of an algorithm.” You can put that right in the project.
FK: I imagine just having the entire organization on a single, unified platform has a lot of advantages, too. You don’t have to worry about what version of notebooks you’re using and nonsense like that, right?
PT: Yes, absolutely. And the fact that it’s on the cloud means we’re delivering changes into that system, pretty much every day in some cases. That’s the whole agility aspect of the cloud and a fully managed service, which is really powerful because users keep getting more capabilities and we can see what people are doing and what they’re asking for. Even inside the tooling itself you can ask for help.
If you think back a few years, it was unheard of to have software updates applied continuously every day without any downtime and without involving a lot of IT skills to keep it all running and current. This notion of self service is really huge. You’re able to create a project yourself. You don’t have to ask somebody’s permission for that. You can start to do things that you previously had to go through various approvals to procure. I think that’s another major theme when you couple open source technologies with a cloud context: you can provision them yourself, and everybody can access them.
There are other interesting use cases where you’re working with partners. You need to work across organizations at that point. Having Watson Studio in the cloud makes it easier to do that type of collaboration with projects and catalogs, where you’ve got these hand-offs between different organizations.
FK: Being in the cloud makes it much more scalable as well. You don’t have to worry about somebody’s notebook server on their desktop not being up and things like that.
PT: Yes, it’s actually a lot of the operational aspects, right? Not having to worry about the project being available, online, reliable. Ensuring all the latest security and updates are in place, having all the various industry compliance certifications, backups, monitoring, etc.
Not just the elasticity, but is it being kept up? As you mentioned earlier, how are the upgrades happening? All those kinds of things they don’t have to worry about anymore.
FK: I want to congratulate you on being named an IBM Fellow this year. It’s a very exclusive club. Tell us what that means for someone outside of IBM.
PT: It’s a big honor and responsibility. Essentially, the role requires a lot of deep technical expertise. It’s really that combined with proven delivery of products and technologies that have helped companies and enterprises adopt the technology being worked on and actually deploy it.
Then the third part is around the technical mentoring and building teams, and all the teamwork that goes with it. So it’s a combination of leadership, the technical expertise across many projects that I’ve done and the ones I’m currently engaged in, and then around actually having a real business impact.
It’s for real impact across all of those dimensions, concurrently. That’s what IBM Fellows are all about, the combination of those three things and applying them broadly. Not just inside the company, but across the industry and across the ecosystem of partners and system integrators as well.
FK: Where does Watson Studio go from here?
PT: “Where does this go” is a natural question. We’ve got these really powerful tools. I think what you’ll see more of us doing is taking this to the next level on production use and productive use for the enterprise, even beyond what we have. Our approach to enterprise includes an intelligent catalog that assists in finding and recommending relevant assets to use in a project, and applies active policy management on the assets so collaborators in a project automatically stay within enterprise policies.
In addition, we have applied a lot of energy to three major themes. First, learning from “small data”: many enterprises don’t have as much data as some people imagine, particularly relative to the largest social media sites, yet the data they have is very valued. So we have developed techniques and strategies to quickly learn with less data, yet still retain high quality, and remove potential bias from that data.
Second, being able to explain the reasoning behind the results, providing complete transparency and traceability.
Third, being able to really scale the whole environment in an enterprise context. This tool is now really powerful, and people are asking us, “Okay, how do I really blow this out in my enterprise applications? How do I easily create these intelligent applications, AI-infused applications, into my key business processes with efficiency in all the languages, regions, and data centers I operate in?”
Part of that gets into more of the run time and operations side of the house. There are a lot of things coming that I think are very interesting here.
FK: It feels very much like day one again in the field of AI, and it’s an exciting time to be in.
PT: I think it’s really interesting because in the 80s, I realized that AI was very enticing, but it simply wasn’t practical. Now it’s turned around completely. It’s actually not practical to build applications without AI.
This post is a collaboration between O’Reilly and IBM. See our statement of editorial independence.