4

Data Acquisition, Data Quality, and Noise

Data for machine learning systems can come directly from humans and from software systems, which are usually called source systems. Where the data comes from affects what it looks like, what quality it has, and how it needs to be processed.

Data that originates from humans is usually noisier than data that originates from software systems. Humans introduce small inconsistencies when they record information, and they also interpret the same thing in different ways. For example, two people reporting the same defect can describe it very differently; the same is true for requirements, designs, and source code.
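To make the point about inconsistent descriptions concrete, here is a minimal sketch that compares how much wording two records of the same defect share. The defect reports, the log lines, and the helper names `tokens` and `jaccard` are all hypothetical and invented for illustration; word-level Jaccard similarity is only one simple way to quantify overlap, not a method taken from the text.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the word sets of two texts (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


# Hypothetical reports of the same defect, written by two different people.
human_a = "App crashes when I click Save on the settings page"
human_b = "Settings screen freezes and closes after pressing the save button"

# Hypothetical records of the same defect emitted twice by a software system.
system_a = "ERROR SettingsController.save NullPointerException at line 42"
system_b = "ERROR SettingsController.save NullPointerException at line 42"

human_similarity = jaccard(human_a, human_b)
system_similarity = jaccard(system_a, system_b)

print(f"human reports:  {human_similarity:.2f}")
print(f"system records: {system_similarity:.2f}")
```

With these made-up strings, the two human reports share only 3 of their 17 distinct words, while the two system records are identical. Real data will vary, but a low overlap between descriptions of the same item is one symptom of the noise described above.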

The data that originates from software systems is often more consistent and contains less noise ...
