Reading semi-structured files

The simplest files to read are those in which every row follows the same pattern: each row has a fixed number of columns, and each column holds the same kind of data in every row. However, it is common to find files that do not follow that format and have little or no regular structure. Such files are often called "semi-structured". Suppose you have a file with roller coaster descriptions that looks like the following:

JOURNEY TO ATLANTIS
SeaWorld Orlando
Journey to Atlantis is a unique thrill ride since it is ...
Roller Coaster Stats
Drop: 60 feet
Trains: 8 passenger boats
Train Mfg: Mack
KRAKEN
SeaWorld Orlando
Named after a legendary sea monster, Kraken is a ...
Kraken ...
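To make the idea concrete, here is a minimal sketch in Python of how such a file could be turned into one record per roller coaster. It is not the Pentaho Data Integration approach described in this recipe. It is only an illustration under stated assumptions: each item in the excerpt sits on its own line, a line written entirely in capitals starts a new roller coaster, the line right after it names the park, and statistics appear as "Key: value" lines. The names SAMPLE, STAT_LINE, and parse_coasters are hypothetical.

```python
import re

# Hypothetical sample: assumes each item in the excerpt sits on its own line.
SAMPLE = """\
JOURNEY TO ATLANTIS
SeaWorld Orlando
Journey to Atlantis is a unique thrill ride since it is ...
Roller Coaster Stats
Drop: 60 feet
Trains: 8 passenger boats
Train Mfg: Mack
KRAKEN
SeaWorld Orlando
Named after a legendary sea monster, Kraken is a ...
"""

# Assumption: statistics are written as "Key: value".
STAT_LINE = re.compile(r"^([^:]+):\s*(.+)$")


def parse_coasters(text):
    """Group the lines of the text into one record per roller coaster."""
    records = []
    current = None
    expect_park = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.isupper():
            # Assumption: an all-capitals line starts a new record.
            current = {"name": line, "park": None, "stats": {}}
            records.append(current)
            expect_park = True
            continue
        if current is None:
            continue  # ignore text before the first heading
        if expect_park:
            # Assumption: the line after the name is the park.
            current["park"] = line
            expect_park = False
            continue
        match = STAT_LINE.match(line)
        if match:
            current["stats"][match.group(1).strip()] = match.group(2).strip()
        # Any other line (description, section title) is ignored here.
    return records


if __name__ == "__main__":
    for record in parse_coasters(SAMPLE):
        print(record["name"], record["park"], record["stats"])
```

The sketch keeps only the name, the park, and the statistics, and it would misread a description line that happens to contain a colon. A real file would need rules tailored to its actual layout.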
