Chapter 22. Unlocking the Potential of AI with Open Data

—Anthony Cintron Roman and Kevin Xu

Executive Summary

With over 100 million users, GitHub is the world's largest platform for collaborative software development. It is also used extensively for open data collaboration, in which data are freely and readily available to users: GitHub hosts more than 800 million open data files, totaling 142 terabytes of data. Here, we consider the potential of open data on GitHub and how open data can accelerate AI research. We explore GitHub's open data landscape and the patterns in how users share datasets. We found that GitHub is one of the largest hosts of open data in the world and that its open data assets have grown at an accelerating rate in the recent past. Leading by example, we released the three datasets that we collected to support this analysis as open datasets. By examining the open data landscape on GitHub, we sought to empower users and organizations to leverage existing open datasets and improve their discoverability, ultimately contributing to the ongoing AI revolution and its opportunities to help address complex societal issues.

Why Is This Important?

Artificial intelligence has the potential to facilitate digital innovation, promote experimentation, improve efficiency, and accelerate progress in addressing societal issues. By providing large quantities of data that are readily available for use in developing AI-powered models, open data is foundational to realizing this ...
