Chapter 9. Techniques for Predictive Analytics in Production
An overarching theme throughout this book has been the accessibility of machine learning. Many powerful, well-understood techniques have been around for decades. What has changed in the past few years is that parallel advances in software and hardware have led to the rise of distributed data processing systems.
Real-Time Event Processing
The definition of real time, in terms of specifying a time window, varies dramatically by industry and application. However, a few design principles can improve predictive analytics performance in a wide variety of applications.
Designing a data processing system is a process of deciding when and where computation will happen. In general, all data requires some degree of processing before it can be analyzed. System architects must decide what processing happens at which stage of the data pipeline. At a high level, the choice is between preprocessing data, which takes more time at the outset but makes the data easier to query, and simply capturing and storing data in its arrival format, deferring additional processing to query time.
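The trade-off between processing at ingest and processing at query time can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the event fields (`user_id`, `page_id`, `ts`), the sample records, and all function names are hypothetical, and in-memory lists stand in for whatever storage layer a real system would use.

```python
import json
from collections import Counter

# Hypothetical raw events in their arrival format (JSON strings).
RAW_EVENTS = [
    '{"user_id": 1, "page_id": "home", "ts": "2016-01-01T00:00:00"}',
    '{"user_id": 2, "page_id": "cart", "ts": "2016-01-01T00:00:05"}',
    '{"user_id": 1, "page_id": "cart", "ts": "2016-01-01T00:00:09"}',
]


def ingest_preprocessed(raw_events):
    """Parse every event at ingest: more work up front, cheaper queries."""
    return [json.loads(line) for line in raw_events]


def ingest_raw(raw_events):
    """Store events as they arrive: cheap ingest, parsing deferred."""
    return list(raw_events)


def views_per_page_preprocessed(rows):
    """Query already-structured rows; no parsing needed here."""
    return Counter(row["page_id"] for row in rows)


def views_per_page_raw(lines):
    """Query raw lines; the parsing cost is paid on every query."""
    return Counter(json.loads(line)["page_id"] for line in lines)


structured_store = ingest_preprocessed(RAW_EVENTS)
raw_store = ingest_raw(RAW_EVENTS)

counts_a = views_per_page_preprocessed(structured_store)
counts_b = views_per_page_raw(raw_store)

# Both strategies give the same answer; they differ in when the work happens.
print(counts_a == counts_b)
```

Both paths return identical counts. What differs is where the parsing cost lands: once per event at ingest, or once per event on every query that touches it.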
Structuring Semi-Structured Data
For example, suppose that you are tracking user behavior on an ecommerce website. Most of the information, such as event data, will arrive in a semi-structured format with fields like user ID, page ID, and timestamp. In fact, there are probably several different types of events: product page view, product search, customer ...