Data in Software Systems – Text, Images, Code, and Their Annotations

Machine learning (ML) systems are data-hungry applications, and they work best when their data is well prepared for training and inference. It may sound obvious, but scrutinizing the properties of the data matters more than selecting the algorithm that processes it. Data, however, comes in many formats and from many sources. We can consider data in its raw format, such as a text document or an image file. We can also consider data in a format specific to the task at hand, such as tokenized text (where the text is divided into tokens) or an image with bounding boxes (where objects are identified and enclosed in rectangles).
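To make the difference between raw and task-specific data concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `tokenize` function, the `BoundingBox` class, the sample sentence, the file name `street.jpg`, and the box coordinates are all hypothetical and are not taken from the book. The tokenizer is a naive regular-expression split, not the scheme used by any particular ML library.

```python
import re
from dataclasses import dataclass


def tokenize(text: str) -> list[str]:
    """Naive tokenizer (illustrative only): lowercase the text, then
    split it into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


@dataclass
class BoundingBox:
    """A rectangle enclosing one object in an image, in pixel coordinates.
    The field names are assumptions chosen for this sketch."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        # Width times height of the rectangle.
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


# Raw format: a plain string, as it might be read from a text document.
raw_text = "Data comes first, algorithms second."

# Task-specific format: the same text as a list of tokens.
tokens = tokenize(raw_text)

# Raw format: an image file, referenced here only by a hypothetical name.
# Task-specific format: that image plus its bounding-box annotations.
annotated_image = {
    "file": "street.jpg",
    "boxes": [BoundingBox("car", 40, 60, 200, 180)],
}
```

Here `tokens` holds seven items (five words, a comma, and a full stop), and the single box covers 160 by 120 pixels. In both cases the raw data is unchanged; the annotation is an extra layer that a specific task needs.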

When considering ...
