Chapter 1. It’s Time to Rethink Data Management
In 2016, Microsoft deployed an AI chatbot named Tay on Twitter. Tay had to be shut down only 16 hours after launch because a number of its nearly 100,000 tweets mimicked the offensive and controversial language of the Twitter users that the bot was supposed to be learning from.
Peter Lee, corporate vice president of Microsoft Healthcare, posted the following apology: “We are deeply sorry for the unintended offensive and hurtful tweets from Tay, which do not represent who we are or what we stand for, nor how we designed Tay.”1
This is an example of how badly a brand, even a leader in its industry, can be hit when data-based decisions go “freestyle.” It also shows how a simple data issue, such as missing or incomplete data, can have significant repercussions.
With the tremendous growth in the volume of data that organizations collect, and in the scale at which they use it, data issues are occurring more often. By 2025, it’s estimated that the global volume of data will expand to 180 zettabytes, more than double the volume in 2020. Fueling this dramatic growth is the fact that almost every function of every organization now generates data. The growth is also driven by the advancement of open source machine learning, which relies on large datasets for training models.
To manage this exploding data growth, data teams have risen in importance and size. Small data teams with few stakeholders have been replaced by large teams that must ...