Chapter 12. Scheduling and Monitoring

In this chapter, we take a closer look at the tools and techniques for running Kettle jobs and transformations in a production environment.

In virtually any realistic production environment, ETL and data integration tasks are run repeatedly at fixed intervals to maintain a steady feed of data to a data warehouse or application. The process of automating the periodic execution of tasks is referred to as scheduling. Programs that are used to define and manage scheduled tasks are called schedulers. Scheduling and schedulers are the subject of the first part of this chapter.

Scheduling is just one aspect of running ETL tasks in a production environment. Additional measures must be taken to allow system administrators to quickly verify and, if necessary, diagnose and repair the data integration process. For example, there must be some form of notification to confirm whether automated execution has taken place. In addition, data must be gathered to measure how well the processes are executed. We refer to these activities as monitoring. We discuss different ways to monitor Kettle job and transformation execution in the second part of this chapter.

Scheduling

In this chapter, we examine two different types of schedulers for scheduling Kettle jobs and transformations:

  • Operating system–level schedulers: Scheduling is not unique to ETL. It is such a general requirement that operating systems provide standard schedulers, such as cron on UNIX-like ...
