One of the trends we’re following is the rise of applications that combine big data, algorithms, and efficient user interfaces. As I noted in an earlier post, our interest stems from both consumer apps and tools that democratize data analysis. It’s no surprise that one of the areas where “cognitive augmentation” is playing out is in data preparation and curation. Data scientists continue to spend a lot of their time on data wrangling, and the increasing number of (public and internal) data sources paves the way for tools that can increase productivity in this critical area.
At Strata + Hadoop World in New York, two presentations from academic spinoff start-ups — Mike Stonebraker of Tamr and Joe Hellerstein and Sean Kandel of Trifacta — focused on data preparation and curation. While data wrangling is just one component of a data science pipeline, and granted we’re still in the early days of productivity tools in data science, some of the lessons these companies have learned extend beyond data preparation.
Scalability ~ data variety and size
Not only are enterprises faced with many data stores and spreadsheets; data scientists also have many more (public and internal) data sources they want to incorporate. In the absence of a global data model, integrating data silos and data sources requires tools for consolidating schemas.
Random samples are great for working through the initial phases, particularly while you’re still familiarizing yourself with a new data set. Trifacta lets users work with samples while they’re developing data wrangling “scripts” that can be used on full data sets.
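The sample-then-scale pattern is easy to sketch in plain Python (the data set here is hypothetical; Trifacta’s actual sampling machinery is more sophisticated): develop your cleaning logic against a random sample, then run the same “script” over the full data set.

```python
import random

# Hypothetical full data set: 10,000 order records with messy price strings.
full_data = [{"order_id": i, "price": f"${i % 100}.50"} for i in range(10_000)]

# Work out the wrangling logic on a small random sample first.
random.seed(0)
sample = random.sample(full_data, 100)

def clean_price(record):
    # The "script" developed against the sample: strip the "$" and parse.
    return {**record, "price": float(record["price"].lstrip("$"))}

# Once it behaves well on the sample, apply the same script to everything.
cleaned_sample = [clean_price(r) for r in sample]
cleaned_full = [clean_price(r) for r in full_data]
```

The point is that the transformation logic is written once, against a data set small enough to inspect by eye, and only then promoted to the full data.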
Empower domain experts
In many instances, you need subject area experts to explain specific data sets that you’re not familiar with. These experts can place data in context and are usually critical in helping you clean and consolidate variables. Trifacta has tools that enable non-programmers to take on data wrangling tasks that used to require a fair amount of scripting.
Consider DSLs and visual interfaces
“Programs written in a [domain specific language] (DSL) also have one other important characteristic: they can often be written by non-programmers…a user immersed in a domain already knows the domain semantics. All the DSL designer needs to do is provide a notation to express that semantics.”
I’ve often used regular expressions for data wrangling, only to come back later unable to read the code I wrote (Joe Hellerstein describes regex as “meant for writing & never reading again”). Programs written in DSLs are concise, easier to maintain, and can often be written by non-programmers.
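To make the contrast concrete, here is a toy example (not Trifacta’s DSL — just an illustration of the idea) that parses the same contact string two ways: once with a regex, and once as small named steps that read like the domain.

```python
import re

line = "Smith, Jane <jane.smith@example.com>"

# Regex version: compact, but hard to revisit months later.
m = re.match(r"^(\w+),\s*(\w+)\s*<([^>]+)>$", line)
last, first, email = m.groups()

# A DSL-ish alternative: each step has a name and an obvious meaning.
def split_name(text):
    name, _, rest = text.partition("<")
    last, _, first = name.partition(",")
    return last.strip(), first.strip(), rest.rstrip(">").strip()

# Both approaches extract the same fields.
assert split_name(line) == (last, first, email)
```

The second version is longer, but a non-programmer domain expert has a fighting chance of reading and modifying it.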
Trifacta designed a “readable” DSL for data wrangling but goes one step further: their users “live in visualizations, not code.” Their elegant visual interface is designed to accomplish most data wrangling tasks, but it also lets users access and modify accompanying scripts written in their DSL (power users can also use regular expressions).
These ideas go beyond data wrangling. Combining DSLs with visual interfaces can open up other aspects of data analysis to non-programmers.
Intelligence and automation
If you’re dealing with thousands of data sources, then you’ll need tools that can automate routine steps. Tamr’s next-generation extract, transform, load (ETL) platform uses machine learning in a variety of ways, including schema consolidation and expert (crowd) sourcing.
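Schema consolidation at this scale has to start with automated candidate matches. As a toy illustration (Tamr’s actual models are far richer, combining string similarity with data-value distributions and expert feedback), even simple string similarity can propose column mappings across sources; the schemas below are hypothetical.

```python
from difflib import SequenceMatcher

# Two hypothetical source schemas with differently named columns.
schema_a = ["customer_name", "phone_number", "zip_code"]
schema_b = ["cust_name", "phone_no", "zipcode"]

def best_match(column, candidates):
    # Score by string similarity; a production system would also use
    # value distributions, machine-learned models, and expert review.
    return max(candidates, key=lambda c: SequenceMatcher(None, column, c).ratio())

mapping = {col: best_match(col, schema_b) for col in schema_a}
```

In a Tamr-style workflow, low-confidence matches like these would be routed to domain experts for confirmation rather than applied blindly.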
Many data analysis tasks involve a handful of data sources that require painstaking data wrangling along the way. Scripts to automate data preparation are needed for replication and maintenance. Trifacta looks at user behavior and context to produce “utterances” of its DSL, which users can then edit or modify.
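A stripped-down caricature of this idea (not Trifacta’s algorithm) is programming by example: watch one edit the user makes, propose a transform that explains it, and let the user accept or modify the suggestion.

```python
def infer_swap(example_in, example_out):
    # Toy inference step: if the output is the input's two words reversed
    # around a comma, propose that rule as the candidate transform.
    first, last = example_in.split()
    if example_out == f"{last}, {first}":
        return lambda s: ", ".join(reversed(s.split()))
    return None

# The user demonstrates one edit; the system generalizes it.
transform = infer_swap("John Smith", "Smith, John")
suggestion = transform("Ada Lovelace")
```

The generated transform plays the role of the DSL “utterance”: the user can inspect it, tweak it, or reject it before it runs on the full data.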
Don’t forget about replication
If you believe the adage that data wrangling consumes a lot of time and resources, then it goes without saying that tools like Tamr and Trifacta should produce reusable scripts and track lineage. Other aspects of data science — for example, model building, deployment, and maintenance — need tools with similar capabilities.
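A minimal sketch of what “reusable scripts with lineage” means in practice (the pipeline below is hypothetical, not either vendor’s API): each named step is recorded as it runs, so the same script can be re-applied to new data and audited later.

```python
def run_pipeline(records, steps):
    # Apply each named step in order, recording the lineage as we go.
    lineage = []
    for name, fn in steps:
        records = [fn(r) for r in records]
        lineage.append(name)
    return records, lineage

steps = [
    ("strip_whitespace", lambda r: r.strip()),
    ("lowercase", lambda r: r.lower()),
]
cleaned, lineage = run_pipeline(["  Foo ", " BAR"], steps)
```

Because the steps are data, the pipeline is trivially replayable on next month’s files, and the lineage record answers “how did this value get here?” — exactly the capabilities model building, deployment, and maintenance tools need as well.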