Chapter 4. Working with Data and Feature Stores
Machine learning takes data and turns it into predictive logic. Data is essential to the process, and it must be processed before it is usable. Therefore, data management and processing are the most critical components of machine learning. Data can originate from many different sources:
- Files: Data stored in local or cloud files
- Data warehouses: Databases hosting historical data transactions
- Online databases: SQL, NoSQL, graph, or time series databases hosting up-to-date transactional or application data
- Data streams: Intermediate storage hosting real-time events and messages (for passing data reliably between services)
- Online services: Any cloud service that can provide valuable data (this can include social, financial, government, and news services)
- Incoming messages: Asynchronous messages and notifications, which can arrive through emails or any other messaging services (Slack, WhatsApp, Teams)
Source data is processed and stored as features for use in model training and model flows. In many cases, features are stored in two storage systems: one for batch access (training, batch prediction, and so on) and one for online retrieval (for real-time or online serving). As a result, there may be two separate data processing pipelines, one using batch processing and the other using real-time (stream) processing.
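The two-store layout described above can be illustrated with a minimal in-memory sketch. Everything here is hypothetical and chosen only for illustration: the `Transaction` record, the feature names, and the `DualFeatureStore` class do not come from any particular feature store product. For brevity, the sketch uses a single feature function that writes to both stores; a real deployment may instead run separate batch and stream pipelines, as noted above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Transaction:
    """Hypothetical source record; the field names are assumptions."""
    user_id: str
    amount: float
    timestamp: int  # seconds since epoch


def compute_features(history):
    """Derive a small, made-up feature set from one user's transactions."""
    amounts = [t.amount for t in history]
    return {
        "txn_count": len(amounts),
        "txn_total": sum(amounts),
        "txn_max": max(amounts),
    }


class DualFeatureStore:
    """Toy stand-in for an offline (batch) store and an online (serving) store."""

    def __init__(self):
        self.offline = []                  # append-only rows, for training and batch prediction
        self.online = {}                   # latest row per user, for real-time lookups
        self._history = defaultdict(list)  # raw transactions seen so far, per user

    def ingest(self, txn):
        """Process one source record and write the resulting features to both stores."""
        self._history[txn.user_id].append(txn)
        row = {
            "user_id": txn.user_id,
            "timestamp": txn.timestamp,
            **compute_features(self._history[txn.user_id]),
        }
        self.offline.append(row)           # keep every version of the features
        self.online[txn.user_id] = row     # keep only the most recent version
        return row

    def get_online(self, user_id):
        """Return the latest feature row for a user, or None if the user is unknown."""
        return self.online.get(user_id)


store = DualFeatureStore()
for txn in [
    Transaction("u1", 20.0, 100),
    Transaction("u2", 5.0, 110),
    Transaction("u1", 30.0, 120),
]:
    store.ingest(txn)

latest_u1 = store.get_online("u1")
```

After the three transactions are ingested, the offline store holds one row per event (useful for building training sets), while the online store holds only the latest row for each user (useful for low-latency serving).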
The data sources and processing logic will likely change over time, resulting in changes to the processed ...