Chapter 2. Diving into Data Programming with Snorkel

Before the advent of deep learning, data scientists would spend most of their time on feature engineering: working to craft features that enabled models to perform better on some metric of choice. Deep learning, with its ability to discover features from data, has freed data scientists from feature engineering and has shifted their efforts toward other tasks like understanding the selected features, hyperparameter tuning, and robustness checks.

In the “deep learning era,” data engineering has rapidly become the most expensive task in terms of both time and money. It is particularly time-consuming when the data is not labeled. Enterprises gather a lot of data, but a good part of it is unlabeled. Some enterprise scenarios, like advertising, naturally enable gathering raw data and its labels at the same time. To measure whether the advertising presented to the user was a success or not, for instance, a system can log data about the user, the advertisement shown, and whether the user clicked on the link presented, all as one record of the dataset. Associating the user profile with this data record creates a ready-to-use labeled dataset.
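The advertising scenario can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the book or from Snorkel: the `ImpressionRecord` class, its field names, and the `to_labeled_example` helper are all hypothetical, chosen to show how a single logged record already carries its own label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpressionRecord:
    """One logged ad impression (hypothetical schema, for illustration only)."""
    user_id: str   # assumed identifier for the user who saw the ad
    ad_id: str     # assumed identifier for the advertisement shown
    clicked: bool  # whether the user clicked; this is the label


def to_labeled_example(record):
    """Split one logged record into a (features, label) pair."""
    features = {"user_id": record.user_id, "ad_id": record.ad_id}
    label = 1 if record.clicked else 0
    return features, label


# Two made-up log entries; the label is captured at logging time.
log = [
    ImpressionRecord("u1", "ad42", True),
    ImpressionRecord("u2", "ad42", False),
]
dataset = [to_labeled_example(r) for r in log]
```

Because the click outcome is recorded together with the user and the advertisement, no separate labeling step is needed in this kind of scenario.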

However, applications such as email, social media, and conversation platforms produce data that cannot easily be associated with labels at the time the dataset is created. To make this data usable, machine learning practitioners must first ...
