Chapter 6. Data Extraction

The first step in an ETL process is getting data from one or more data sources. As we discussed in Chapters 1 and 5, this is a demanding task because of the complexity and variety of these different data sources. In a traditional data warehouse environment, data is usually extracted from an organization's transaction systems, such as financial applications or ERP systems. Most of these systems store their data in a relational database such as MySQL, Oracle, or SQL Server. As challenging as this may be from a functional point of view (we'll take a closer look at ERP systems later in this chapter), technically it's pretty straightforward to connect to a MySQL database using a JDBC driver and extract data from it. It gets more interesting when the database isn't relational and there's also no driver available to connect to it. In those situations you often end up having the data delivered in a flat file format such as a comma-separated ASCII file. An even trickier variation to this topic is data that is owned by someone else and is stored outside the corporate firewall, perhaps by a client or vendor company. In that case, a direct connection is usually not feasible so getting flat files might be the only option. In the case of data stored on the Internet, even flat files are not an option. (Imagine yourself calling Google, asking the company to FTP you some data set on a regular basis.) As you'll see later in this chapter, Kettle provides several ways to ...

Get Pentaho® Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.