Five principles for applying data science for social good

How to go from well-intentioned efforts to lasting impact with your data projects.

By Jake Porway
October 1, 2015
Indiana World War Memorial stairs (source: Wikimedia Commons)

Editor’s note: Jake Porway expanded on the ideas outlined in this piece in his Strata + Hadoop World NYC 2015 keynote address, “What does it take to apply data science for social good?”

“We’re making the world a better place.” That line echoes from the parody of the Disrupt conference in the opening episode of HBO’s “Silicon Valley.” It’s a satirical take on our sector’s occasional tendency to equate narrow tech solutions like “software-defined data centers for cloud computing” with historical improvements to the human condition.

Whether you take it as parody or not, there is a very real swell in organizations hoping to use “data for good.” Every week, a data or technology company declares that it wants to “do good,” and there are countless workshops hosted by major foundations musing on what “big data can do for society.” Add to that a growing number of data-for-good programs, from Data Science for Social Good’s fantastic summer program to Bayes Impact’s data science fellowships to DrivenData’s data-science-for-good competitions, and you can see how quickly this idea of “data for good” is growing.

Yes, it’s an exciting time to be exploring the ways new datasets, new techniques, and new scientists could be deployed to “make the world a better place.” We’ve already seen deep learning applied to ocean health, satellite imagery used to estimate poverty levels, and cellphone data used to elucidate Nairobi’s hidden public transportation routes. And yet, for all this excitement about the potential of this “data for good movement,” we are still desperately far from creating lasting impact. Many efforts will not only fall short of lasting impact — they will make no change at all.

At DataKind, we’ve spent the last three years teaming data scientists with social change organizations, bringing the same algorithms that companies use to boost profits to mission-driven organizations so that they can boost their impact. It has become clear that using data science in the service of humanity requires much more than free software, free labor, and good intentions.

So how can these well-intentioned efforts reach their full potential for real impact? Embracing the following five principles can drastically accelerate a world in which we truly use data to serve humanity.

1. “Statistics” is so much more than “percentages”

We must convey what constitutes data, what it can be used for, and why it’s valuable.

There was a packed house for the March 2015 release of the No Ceilings Full Participation Report. Hillary Clinton, Melinda Gates, and Chelsea Clinton stood on stage and lauded the report, the culmination of a year-long effort to aggregate and analyze new and existing global data, as the biggest, most comprehensive data collection effort about women and gender ever attempted. One of the most trumpeted parts of the effort was the release of the data in an open and easily accessible way.

I ran home and excitedly pulled up the data from the No Ceilings GitHub, giddy to use it for our DataKind projects. As I downloaded each file, my heart sank. The 6MB size of the entire global dataset told me what I would find inside before I even opened the first file. Like a familiar ache, the first row of the spreadsheet said it all: “USA, 2009, 84.4%.”

What I’d encountered was a common situation when it comes to data in the social sector: the prevalence of inert, aggregate data. Huge tomes of indicators, averages, and percentages fill the landscape of international development data. These datasets are sometimes cutely referred to as “massive passive” data, because they are large, backward-looking, exceedingly coarse, and nearly impossible to make decisions from, much less actually perform any real statistical analysis upon.

The promise of a data-driven society lies in the sudden availability of more real-time, granular data, accessible as a resource for looking forward, not just a fossil record to look back upon. Mobile phone data, satellite data, even simple social media data or digitized documents can yield mountains of rich, insightful data from which we can build statistical models, create smarter systems, and adjust course to provide the most successful social interventions.

To effect social change, we must spread the idea beyond technologists that data is more than “spreadsheets” or “indicators.” We must consider any digital information, of any kind, as a potential data source that could yield new information.

2. Finding problems can be harder than finding solutions

We must scale the process of problem discovery through deeper collaboration between the problem holders, the data holders, and the skills holders.

In the immortal words of Henry Ford, “If I’d asked people what they wanted, they would have said a faster horse.” Right now, the field of data science is in a similar position. Framing data solutions for organizations that don’t realize how much is now possible can be a frustrating search for faster horses. If data cleaning is 80% of the hard work in data science, then problem discovery makes up nearly the remaining 20% when doing data science for good.

The plague here is one of education. Without a clear understanding that it is even possible to predict something from data, how can we expect someone to be able to articulate that need? Moreover, knowing what to optimize for is a crucial first step before even addressing how prediction could help you optimize it. This means that the organizations that can most easily take advantage of the data science fellowship programs and project-based work are those that are already fairly data savvy: they already understand what is possible, but may not have the skill set or resources to do the work on their own. As Nancy Lublin, founder of the very data-savvy Crisis Text Line, put it so well at Data on Purpose, “data science is not overhead.”

But there are many organizations doing tremendous work that still think of data science as overhead or don’t think of it at all, yet their expertise is critical to moving the entire field forward. As data scientists, we need to find ways of illustrating the power and potential of data science to address social sector issues, so that organizations and their funders see this untapped powerful resource for what it is. Similarly, social actors need to find ways to expose themselves to this new technology so that they can become familiar with it.

We also need to create more opportunities for good old-fashioned conversation between issue area and data experts. It’s in the very human process of rubbing elbows and getting to know each other that our individual expertise and skills can collide, uncovering the data challenges with the potential to create real impact in the world.

3. Communication is more important than technology

We must foster environments in which people can speak openly, honestly, and without judgment. We must be constantly curious about each other.

At the conclusion of one of our recent DataKind events, representatives of one of our partner nonprofit organizations lined up to hear the results from their volunteer team of data scientists. Everyone was all smiles: the nonprofit leaders had loved the project experience, and the data scientists were excited about their results. The presentations began. “We used Amazon Redshift to store the data, which allowed us to quickly build a multinomial regression. The p-value of 0.002 shows …” Eyes glazed over. The nonprofit leaders furrowed their brows in telegraphed concentration. The jargon was standing in the way of understanding the true utility of the project’s findings. It was clear that, like so many other well-intentioned efforts, the project was at risk of gathering dust on a shelf if the team of volunteers couldn’t help the organization understand what they had learned and how it could be integrated into the organization’s ongoing work.

In many of our projects, we’ve seen telltale signs that people are talking past each other. Social change representatives may be afraid to speak up if they don’t understand something, either because they feel intimidated by the volunteers or because they don’t feel comfortable asking things of volunteers who are so generously donating their time. Similarly, we often find volunteers who are excited to try out the most cutting-edge algorithms they can on these new datasets, either because they’ve fallen in love with a certain kind of recurrent neural net or because they want a dataset to learn it on. This excitement can cloud their judgment, and their findings get lost in translation. It may be that a simple bar chart is all that is needed to spur action.

Lastly, some volunteers assume nonprofits have the resources to operate like the for-profit sector. Nonprofits are, more often than not, resource-constrained, understaffed, underappreciated, and trying to tackle the world’s problems on a shoestring budget. Moreover, “free” technology and “pro bono” services often require an immense time investment on the nonprofit professionals’ part to manage and be responsive to these projects. They may not have a monetary cost, but they are hardly free.

Socially minded data science competitions and fellowship models will continue to thrive, but we must build empathy (strong communication through which diverse parties gain a greater understanding of and respect for each other) into those frameworks. Otherwise we’ll forever be “hacking” social change problems, creating tools that are “fun,” but not “functional.”

4. We need diverse viewpoints

To tackle sector-wide challenges, we need a range of voices involved.

One of the most challenging aspects of making change at the sector level is the range of diverse viewpoints necessary to understand a problem in its entirety. In the business world, profit, revenue, or output can be valid metrics of success. Rarely, if ever, are metrics for social change so cleanly defined.

Moreover, any substantial social, political, or environmental problem quickly expands beyond its bounds. Take, for example, a seemingly innocuous challenge like “providing healthier school lunches.” What initially appears to be a straightforward opportunity to improve the nutritional offerings available to schools quickly involves the complex educational budgeting system, which in turn is determined through even more politically fraught processes. As with most major humanitarian challenges, the central issue is like a string in a hairball wound around a nest of other related problems, and no single strand can be removed without tightening the whole mess. Oh, and halfway through you find out that the strings are actually snakes.

Challenging this paradigm requires diverse, or “collective impact,” approaches to problem solving. The idea has been around for a while (h/t Chris Diehl), but it has not yet been widely implemented because collective impact is difficult to do well. Moreover, while there are many diverse collectives committed to social change, few have the voice of expert data scientists involved. DataKind is piloting a collective impact model called DataKind Labs that seeks to bring together diverse problem holders, data holders, and data science experts to co-create solutions that can be applied across an entire sector-wide challenge. We just launched our first project with Microsoft to increase traffic safety and are hopeful that this effort will demonstrate how vital a role data science can play in a collective impact approach.

5. We must design for people

Data is not truth, and tech is not an answer in and of itself. Without designing for the humans on the other end, our work is in vain.

So many of the data projects making headlines — a new app for finding public services, a new probabilistic model for predicting weather patterns for subsistence farmers, a visualization of government spending — are great and interesting accomplishments, but don’t seem to have an end user in mind. The current approach appears to be “get the tech geeks to hack on this problem, and we’ll have cool new solutions!” I’ve opined that, though there are many benefits to hackathons, you can’t just hack your way to social change.

A big part of that argument centers on the fact that the “data for good” solutions we build must be co-created with the people at the other end. We need to embrace human-centered design, to begin with the questions, not the data. We have to build with the end in mind. When we tap into the social issue expertise that already exists in many mission-driven organizations, there is a powerful opportunity to create solutions that make real change. However, we must make sure those solutions are sustainable given the resource and data literacy constraints that social sector organizations face.

That means that we must design with people, accounting for their habits, their data literacy level, and, most importantly, what drives them. At DataKind, we start with the questions before we ever touch the data and strive to use human-centered design to create solutions that we feel confident our partners are going to use before we even begin. In addition, we build all of our projects on deep collaboration that takes the organization’s needs into account, first and foremost.

These problems are daunting, but not insurmountable. Data science is new, exciting, and largely misunderstood, but we have an opportunity to align our efforts and proceed forward together. If we incorporate these five principles into our efforts, I believe data science will truly play a key role in making the world a better place for all of humanity.

What’s next

Almost three years ago DataKind launched on the stage of Strata + Hadoop World NYC as Data Without Borders. True to their motto to “work on stuff that matters,” O’Reilly has not only been a huge supporter of our work, but arguably one of the main reasons that our organization can carry on its mission today.

That’s why we could think of no place more fitting to make our announcement that DataKind and O’Reilly are formally partnering to expand the ways we use data science in the service of humanity. Under this media partnership, we will be regularly contributing our findings to O’Reilly, bringing new and inspirational examples of data science across the social sector to our community, and giving you new opportunities to get involved with the cause, from volunteering on world-changing projects to simply lending your voice. We couldn’t be more excited to be sharing this partnership with an organization that so closely embodies our values of community, social change, and ethical uses of technology.

We’ll see you on the front lines!

