BayesDB: Data science is a communication problem
Query languages, like BQL, offer a bridge between domain experts and software experts.
Fred Brooks may be most famous for The Mythical Man-Month, but his greatest essay, in my opinion, is No Silver Bullet. Brooks recognized that our civilization, in adopting computing, is fascinated with tools that make things go faster. Tools are great. We understand tools. We like to build tools. We get to sell tools. But computing is not a tools problem. Computing is a communications problem. We need to enable communication between those who understand software and those who understand reality.
Data science is currently in a tools rush where we hope that the right graphical environment will eliminate the need for qualified data scientists. However, no graphical environment can overcome the fundamental information asymmetry and complexity inherent in modeling data. Responsibility for modeling data cannot be delegated to a marketing specialist or a clinician. If we try to delegate that responsibility, we will have simply abdicated it.
“Programs must be written for people to read, and only incidentally for machines to execute.”—Gerald Jay Sussman and Hal Abelson
Data science needs a query language
The solution to a communications problem is not a tool; it’s a language. Data science needs a language that facilitates understanding of what questions we want to ask, what questions can be answered, and how confident we are in those answers.
We’ve seen this before. Data management and reporting used to have this problem. Organizations needed hundreds or thousands of custom systems just to do report generation, extracting data from operational systems. Relational databases introduced a general purpose query abstraction with the relational model and the SQL language. SQL defined the questions it made sense to answer, such that data architecture could be handled by a smaller number of specialists. A new role for business analysts was created with the responsibility for formulating and understanding questions.
“Business intelligence made it easy for people to ask the questions databases already knew how to answer.”—Bruce Golden
The benefit of a linguistic abstraction is its constraints. It’s like poetry. Writing a reasonable haiku is accessible to people of only modest skill who take the time to learn the rules. Blank verse, at least where intended for others to enjoy, is best left to experts. SQL, for all its limitations, is a highly structured language that can be both read and written by people with a wide range of technical abilities. Relational algebra gave us the fundamental model for building the right kind of decision support and business intelligence tools, which made SQL and database queries even more accessible.
“Humans understand and communicate uncertainty with stories; reasoning from cases and examples. Probabilistic programming enables computers to do the same.”—Vikash Mansinghka
So what would a query language for data science look like and how would it work? BayesDB, from the MIT Probabilistic Computing Project, presents an answer. The Bayesian Query Language (BQL) provides three key verbs that encompass the full range of inference questions: SIMULATE, INFER, and ESTIMATE. Based on the abstraction of Generative Population Models, BQL insulates questions from the details of models and model creation, enabling models of populations to evolve independently of queries against those populations.
INFER salary WITH CONFIDENCE 0.7 FROM Candidates WHERE salary IS NULL
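To make the idea of verbs sitting on top of a replaceable model more concrete, here is a minimal Python sketch. It is not BayesDB's implementation or its API. ToyPopulationModel, infer, estimate_probability, the salary_band column, and the sample rows are all hypothetical, and the "model" simply resamples matching rows instead of learning anything. The sketch also assumes that "confidence" means the share of simulated values that agree with the most common one, which is a simplification chosen for illustration.

```python
import random
from collections import Counter


class ToyPopulationModel:
    """Hypothetical stand-in for a generative population model.

    Its only capability is drawing samples of a column, optionally
    conditioned on known values. A real model would be learned from data.
    """

    def __init__(self, rows, seed=0):
        self.rows = rows
        self.rng = random.Random(seed)

    def simulate(self, column, given=None, n=1000):
        # Resample observed values from rows that match the constraints.
        given = given or {}
        pool = [
            row[column]
            for row in self.rows
            if row[column] is not None
            and all(row[key] == value for key, value in given.items())
        ]
        if not pool:
            raise ValueError("no observed rows match the given constraints")
        return [self.rng.choice(pool) for _ in range(n)]


def infer(model, column, given, confidence, n=1000):
    """Toy reading of INFER ... WITH CONFIDENCE: return the most common
    simulated value, or None when its share falls below the threshold."""
    samples = model.simulate(column, given, n)
    value, count = Counter(samples).most_common(1)[0]
    return value if count / n >= confidence else None


def estimate_probability(model, column, value, given=None, n=1000):
    """Toy ESTIMATE: Monte Carlo estimate of P(column = value | given)."""
    samples = model.simulate(column, given, n)
    return sum(sample == value for sample in samples) / n


# Hypothetical data: two candidates have no recorded salary band.
candidates = [
    {"role": "engineer", "salary_band": "high"},
    {"role": "engineer", "salary_band": "high"},
    {"role": "engineer", "salary_band": None},
    {"role": "analyst", "salary_band": "low"},
    {"role": "analyst", "salary_band": "high"},
    {"role": "analyst", "salary_band": None},
]

model = ToyPopulationModel(candidates)
filled = {
    role: infer(model, "salary_band", {"role": role}, confidence=0.7)
    for role in ("engineer", "analyst")
}
```

In this toy data the engineers all share one band, so the missing value is filled in, while the analysts are split evenly and the value is left as None at a 0.7 threshold. The functions infer and estimate_probability only ever call simulate, so the model behind that method could be swapped for a better one without touching the questions being asked.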
By building on a query abstraction, we create a language that lets business analysts and domain experts participate in the data science process: to contribute to and explore the questions (such as the INFER query above) that are relevant to insight and decisions, to understand how well existing models and predictors perform, and to identify areas for improvement. We enable data scientists, statisticians, and systems programmers to work beneath this abstraction and deliver better models, more accurate results, and faster answers. This is the communication we need.
We cannot address our civilization’s need for more data insight by doing more of the same data science faster. We need a way to embrace the asymmetry of knowledge between domain experts and software experts. Languages give us a way to transfer knowledge between humans and computers. We’ve done this before with databases. BayesDB showcases a parallel approach to data science.
To learn more about BayesDB, sign up for updates or the alpha program, or download it, visit http://probcomp.csail.mit.edu/bayesdb.
Correction note: A previous version of this post incorrectly implied that Fred Brooks was deceased.