Chapter 16. Parallelization, Clustering, and Partitioning

When you have a lot of data to process, it's important to be able to use all the computing resources available to you. Whether you have a single personal computer or hundreds of large servers at your disposal, you want Kettle to use all available resources to deliver results in an acceptable timeframe.

In this chapter, we unravel the secrets behind making your transformations and jobs scale up and out. Scaling up means making the most of a single server with multiple CPU cores; scaling out means using the resources of multiple machines and having them operate in parallel. Both approaches are part of ETL subsystem #31, the Parallelizing/Pipelining System.

The first part of this chapter deals with the parallelism inside a transformation and the various ways to exploit it to scale up. Then we explain how to make your transformations scale out across a cluster of slave servers.

Finally, we cover the finer points of Kettle partitioning and how it can help you parallelize your work even further.

Multi-Threading

In Chapter 2, we explained that the basic building block of a transformation is the step. We also explained that each step is executed in parallel. Now we'll go a bit deeper into this subject by explaining how Kettle's multi-threading capabilities allow you to take full advantage of all the processing resources in your machine to scale up a transformation.

By default, each step in a transformation is executed in parallel, in its own thread.
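To make that execution model concrete, the following Java sketch mimics it outside of Kettle: three "steps" (input, calculator, output) each run in their own thread and exchange rows through bounded buffers, much like the row sets that connect steps in a transformation. The class, the step names, and the buffer size of 10,000 rows are illustrative assumptions, not actual Kettle code or its API.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * A minimal sketch of the pipelining model: each "step" runs in its own
 * thread, and rows travel between steps through bounded buffers, similar
 * to Kettle's row sets. Illustrative only; this is not the Kettle API.
 */
public class PipelineSketch {

    // Sentinel object that signals "no more rows" to the next step.
    private static final Object END_OF_DATA = new Object();

    public static void main(String[] args) throws InterruptedException {
        // Bounded buffers between steps (10,000 rows is assumed here;
        // in Kettle the row set size is a transformation setting).
        BlockingQueue<Object> inputToCalc = new ArrayBlockingQueue<>(10_000);
        BlockingQueue<Object> calcToOutput = new ArrayBlockingQueue<>(10_000);

        // "Input step": generates rows and pushes them downstream.
        Thread input = new Thread(() -> {
            try {
                for (int i = 1; i <= 1_000; i++) {
                    inputToCalc.put(i);
                }
                inputToCalc.put(END_OF_DATA);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "input-step");

        // "Calculator step": transforms each row as soon as it arrives.
        Thread calculator = new Thread(() -> {
            try {
                Object row;
                while ((row = inputToCalc.take()) != END_OF_DATA) {
                    calcToOutput.put((Integer) row * 2);
                }
                calcToOutput.put(END_OF_DATA);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "calculator-step");

        // "Output step": consumes the transformed rows.
        Thread output = new Thread(() -> {
            try {
                Object row;
                long sum = 0;
                while ((row = calcToOutput.take()) != END_OF_DATA) {
                    sum += (Integer) row;
                }
                System.out.println("Sum of output rows: " + sum);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "output-step");

        // All three step threads start at once and run concurrently,
        // just as a transformation starts all of its steps together.
        input.start();
        calculator.start();
        output.start();

        input.join();
        calculator.join();
        output.join();
    }
}

Because the buffers are bounded, a fast upstream step simply blocks when a slow downstream step falls behind, which keeps memory use in check while letting every CPU core work on a different step at the same time.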
