Chapter 4. Loading from Fixed-Width Files
Fixed-width data is one of the two overarching classifications of data that you deal with when using SQL*Loader. Fixed-width data files consist of columnar data—each field has a specific length and position that never varies from one record to the next. The other major classification, delimited data, discussed in Chapter 5, consists of variable-length fields separated by delimiter characters. Describing the fields in a fixed-width record is mostly a matter of specifying their positions and their datatypes. Occasionally, things are actually that simple. More often, you’ll encounter issues such as these:
How do you represent null values?
What do you do with records that are shorter than they should be?
How do you handle trailing whitespace at the end of a record?
The issue of null values is probably the trickiest of these. Null refers to the absence of a value, yet in a fixed-width data file you always have some value for each field. Other issues come up less frequently. At times, you may be faced with the need to concatenate multiple physical records into one long logical record. You may also find yourself faced with fixed-width fields that contain delimited data.
This chapter shows you how to define fixed-width data, and it describes the tools SQL*Loader provides that allow you to deal with the issues listed here. You’ll learn how to define nulls, deal with short records and whitespace, concatenate multiple records together, and pull apart delimited, fixed-width fields.
Common Datatypes Encountered
COBOL programs almost universally generate fixed-width data, and they may in fact represent the greatest source of such data. We have certainly written enough COBOL programs over the years that transfer data between systems using flat files. If you are loading data that comes from a COBOL source, you’ll most likely be working with the following SQL*Loader datatypes:
CHAR
DATE
DECIMAL
ZONED
Tip
If you need a refresher on SQL*Loader datatypes, refer back to Chapter 3.
Of course, COBOL doesn’t have a lock on fixed-width data. Binary files containing data in hardware-specific numeric formats represent another source of fixed-width data. Such files could be generated by a C program, and might include the following SQL*Loader datatypes:
INTEGER
SMALLINT
FLOAT
DOUBLE
BYTEINT
Character data can appear within a binary file, and in the case of C-style strings will be delimited by X’00’ characters. This can lead to some interesting complications in terms of blending fixed-width and delimited data.
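For instance, a C-style string embedded in a binary record can be described using a TERMINATED BY clause in the field specification. The field name and position in this sketch are hypothetical, purely for illustration:

```
...
product_name POSITION(9) CHAR(30) TERMINATED BY X'00',
...
```

SQL*Loader reads up to the maximum length of 30 bytes, but stops at the first X'00' byte it encounters, which is exactly the convention C uses to terminate strings.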
Specifying Field Positions
When dealing with columnar data, each field has a starting position, an ending position, and a length. As in Chapter 3, you can specify these values for each field, or you can specify some values and allow SQL*Loader to calculate others. For each field, you can choose to specify one of the following combinations of values:
Starting and ending position
Starting position and length
Field type and length
Typically it’s best to be consistent and use one approach for all fields. This is especially true when you are supplying the field type and length for each field. When you do that, you’re using relative positioning—the beginning position of each field depends on the ending position of the previous field, and it rarely makes sense to mix relative positioning with absolute positioning.
We are going to show some examples of each method, but before we do that, let’s look at some representative columnar data:
MI Tie Hill Lookout Tower                             tower     Alger
MI Slowfoot Lake                                      lake      Alger
MI Stella, Lake                                   790 lake      Alger
MI Dam Lake                                           lake      Alger
MI Camp K Lake                                        lake      Alger
MI Tie Lake                                       790 lake      Alger
MI Mathias Township Park                          862 park      Alger
MI Trenary Cemetery                               878 cemetery  Alger
MI Temple Lake                                        lake      Alger
MI Mantila Camp                                   968 locale    Alger
MI Forest Lake Lookout Tower                     1036 tower     Alger
...
This is a list of geographic feature names for Alger County, Michigan. The columns are as follows:
State code
Feature name
Elevation above sea level
Feature type
County name
This is typical columnar data, and the next three sections will illustrate three different techniques for specifying the location of each column in the record. The destination table, used for all the examples in this chapter, is created as follows:
CREATE TABLE michigan_features (
   feature_name VARCHAR2(44),
   elevation NUMBER,
   feature_type VARCHAR2(10),
   county VARCHAR2(5),
   short_feature_name VARCHAR2(10),
   latitude VARCHAR2(15),
   longitude VARCHAR2(15)
);
The short feature name that you see in the table doesn’t correspond directly to a field in the input record. It’s not used for the first example, but later you’ll see how it’s derived from the potentially longer feature name field.
Starting and Ending Position
Probably the most obvious method of describing columnar data is to simply specify the starting and ending position of each field in the record. That’s exactly what the following LOAD statement does:
LOAD DATA
INFILE 'alger_county.dat'
REPLACE
INTO TABLE michigan_features (
   feature_name POSITION(4:47) CHAR,
   elevation POSITION(49:52) INTEGER EXTERNAL,
   feature_type POSITION(54:62) CHAR,
   county POSITION(64:68) CHAR
)
Here you see the POSITION clause used to specify the exact starting and ending location for each field. We didn’t want to load the state code field, so we simply didn’t list it in our INTO TABLE clause. This is one advantage of dealing with columnar data—you can simply ignore fields that you don’t want to load.
Starting Position and Length
Another approach to describing columnar data is to specify the starting position and length of each field. Specify the starting position in the POSITION clause, and the length along with the datatype. For example:
LOAD DATA
INFILE 'alger_county.dat'
REPLACE
INTO TABLE michigan_features (
   feature_name POSITION(4) CHAR(44),
   elevation POSITION(49) INTEGER EXTERNAL(4),
   feature_type POSITION(54) CHAR(9),
   county POSITION(64) CHAR(5)
)
Whether you use this method or the previous method should depend on the documentation and tools available to you. If you have an editor that makes it easy to determine starting and ending column positions for each field, you might choose to use that method. On the other hand, if your documentation specifies starting position and length, go with that. Use whichever method is most convenient.
Field Type and Length
The previous two methods specify the location of each field in absolute terms—the location of one field is never dependent on another. By specifying field type and length, on the other hand, you can describe columnar data in relative terms, with the starting position of each field depending on the ending position of the previous field. For example:
LOAD DATA
INFILE 'alger_county.dat'
REPLACE
INTO TABLE michigan_features (
   state_code FILLER CHAR(2),
   blank_space_1 FILLER CHAR(1),
   feature_name CHAR(44),
   blank_space_2 FILLER CHAR(1),
   elevation INTEGER EXTERNAL(4),
   blank_space_3 FILLER CHAR(1),
   feature_type CHAR(9),
   blank_space_4 FILLER CHAR(1),
   county CHAR(5)
)
Two things are worth noticing about this LOAD statement:
The starting position is never specified for any field.
There are four fields named blank_space_1 through blank_space_4.
What’s going on here? First off, because each field specification contains only a datatype and a length, you know that relative positioning is being used. SQL*Loader will start with the first byte in the record, and will expect to find a two-character state_code field. SQL*Loader then expects to find the second field immediately following the first, and so forth. This leads to the need for several FILLER fields.
The feature name data used for this chapter has one blank between each field in the record. That blank provides visual separation of the data, but is not part of any field, and is nothing that you want to load into your database. If you were writing a COBOL record description, you would use the special name of FILLER for those blank fields. For example:
01 FEATURE-NAME-REC.
   05 STATE-CODE PIC X(2).
   05 FILLER PIC X.
   05 FEATURE-NAME PIC X(44).
   05 FILLER PIC X.
   05 ELEVATION PIC 9(4).
   05 FILLER PIC X.
   05 FEATURE-TYPE PIC X(9).
   05 FILLER PIC X.
   05 COUNTY PIC X(5).
As mentioned earlier, SQL*Loader borrows the keyword FILLER from COBOL, but rather than have you use it as a field name, it expects you to use it as a field qualifier. SQL*Loader does not load FILLER fields into the database. The reason we needed to specify FILLER fields in this example was because we were describing the position of the data in relative terms; the position of each field is relative to the position of the preceding field. Thus, it’s necessary to account for every byte in the record—even those that we don’t want to load.
Tip
The FILLER keyword was introduced in Oracle8i. You can’t use it with prior releases of SQL*Loader.
Relative positioning is useful if you’ve been handed a data file along with a COBOL record format. You can easily use the COBOL PICTURE clauses as the basis for writing a SQL*Loader field list. Chapter 3 discusses this, and also discusses some of the advantages and disadvantages of relative positioning versus absolute positioning.
Handling Anomalous Data
When you’re dealing with columnar data, there are some anomalies and ambiguous situations that you need to deal with. Take a close look at the following data:
MI Sitka Lake                                     902 lake      Alger
MI Sevenmile Lake                                     lake
MI Little Beaver Creek                                stream    Alger
MI Traunik                                        925 ppl
Now, given this data, ask yourself these questions:
Should the missing elevation values be represented in the database as zeros or as nulls?
What should be done with records, such as those missing the county name, that are too short?
What should be done about trailing spaces in fields? Is the feature name “Traunik” (7 characters, no trailing spaces), or is it “Traunik” followed by 37 trailing spaces to fill out the 44-character field?
The answers to these questions are not always as obvious as you might think. While you almost certainly want a missing elevation to be represented as a null, what if you are dealing with a missing dollar amount? Should that become a null or a zero? Application considerations sometimes trump logic. Representing missing elevations as zero may not make intuitive sense, but you may be dealing with an application that requires missing numbers to be represented as zeros. With character fields, you have the issue of trailing spaces. You would almost certainly trim trailing spaces from feature names, but what about code fields? We’ve seen programs that expect alphanumeric fields such as the feature_type field to be constant-width, even if that means including trailing spaces.
SQL*Loader has default behavior with respect to each one of these issues, and that default behavior is often reasonable and sufficient. However, SQL*Loader provides you with the flexibility to override the default behavior when necessary.
Trimming Whitespace
SQL*Loader’s default behavior when you are loading fixed-width, columnar data is to trim trailing whitespace from each field in the input record. Whitespace in this context is defined as space and tab characters. This trimming of whitespace is actually one of the first things SQL*Loader does after reading each record of data, and it’s consistent with how most people intuitively look at character strings when they are arranged in a column. SQL*Loader does, however, provide you with a way to preserve trailing whitespace in your fields if you need to do that.
The PRESERVE BLANKS clause
You can use the PRESERVE BLANKS clause in your control file to prevent SQL*Loader from trimming trailing whitespace from the fields that you are loading. The PRESERVE BLANKS clause immediately precedes your first INTO TABLE clause, and it applies to all fields that you are loading.
Here is some sample data where PRESERVE BLANKS can make a difference:
MI Chatham                                            ppl       Alger
MI Shingleton                                     821 ppl       Alger
MI Rumely                                             ppl       Alger
MI Sundell                                       1049           Alger
The feature names here are the names of four towns. SQL*Loader’s default behavior is to trim trailing whitespace. However, the following LOAD statement uses PRESERVE BLANKS to override that default behavior:
LOAD DATA
INFILE 'alger_nulltest.dat'
REPLACE
PRESERVE BLANKS
INTO TABLE michigan_features (
   feature_name POSITION(4) CHAR(44),
   short_feature_name POSITION(4) CHAR(10),
   elevation POSITION(49) INTEGER EXTERNAL(4)
      DEFAULTIF elevation = BLANKS,
   feature_type POSITION(54) CHAR(9),
   county POSITION(64) CHAR(5)
)
The following example is taken from SQL*Plus, and shows the results of loading these feature names using the PRESERVE BLANKS clause:
SQL> SELECT feature_name, LENGTH(feature_name)
  2  FROM michigan_features;

FEATURE_NAME                                 LENGTH(FEATURE_NAME)
-------------------------------------------- --------------------
Chatham                                                        44
Shingleton                                                     44
Rumely                                                         44
Sundell                                                        44
All the names are 44 bytes in length. Most of those bytes, of course, are the trailing spaces that weren’t trimmed as a result of using PRESERVE BLANKS.
Tip
When using PRESERVE BLANKS, you’ll need to use DEFAULTIF or NULLIF with any INTEGER EXTERNAL, FLOAT EXTERNAL, DECIMAL EXTERNAL, or ZONED EXTERNAL fields.
Use of PRESERVE BLANKS causes an unfortunate side effect if you are loading NUMERIC EXTERNAL fields. A string of blanks cannot be translated to a number. When SQL*Loader attempts to convert a NUMERIC EXTERNAL field consisting only of blanks to a numeric value, an error will occur, and SQL*Loader will write the record to the bad file. Blank trimming normally prevents this from being an issue, but when you disable blank trimming, you suddenly have this problem to worry about. One solution is to make use of the DEFAULTIF and NULLIF clauses described later in this chapter. DEFAULTIF was used in this example to drive the numeric elevation value to zero when that field contained all spaces.
Selective trimming of whitespace
It’s unfortunate that Oracle doesn’t provide any mechanism to preserve blanks on a field-by-field basis. It’s all or nothing. If you are loading data where you need to preserve blanks for some fields and not for others, you may be able to use built-in SQL functions to achieve the desired results. The following example shows a LOAD statement that will preserve blanks only for the feature_type field:
LOAD DATA
INFILE 'alger_nulltest.dat'
REPLACE
PRESERVE BLANKS
INTO TABLE michigan_features (
   feature_name POSITION(4) CHAR(44)
      "RTRIM(:feature_name)",
   short_feature_name POSITION(4) CHAR(10)
      "RTRIM(:short_feature_name)",
   elevation POSITION(49) INTEGER EXTERNAL(4)
      DEFAULTIF elevation=BLANKS,
   feature_type POSITION(54) CHAR(9),
   county POSITION(64) CHAR(5)
      "RTRIM(:county)"
)
With respect to whitespace, this LOAD statement works in a three-fold manner:

The double-quoted strings at the end of each field specification represent SQL expressions that are made part of the INSERT statement that SQL*Loader uses to insert the data into the database. You’ll read more about this feature in Chapter 8. Here, RTRIM has been applied to all the fields except two.

The elevation field is numeric. Whitespace doesn’t matter if the field actually contains some digits—the trailing spaces won’t matter when the data is converted to a number, and the DEFAULTIF clause handles the case where the field consists entirely of blanks.

The feature_type field is a character field. Because RTRIM has not been applied to feature_type, each occurrence of that field will be exactly nine bytes in length.
This use of RTRIM to selectively trim fields effectively pushes the work of trimming trailing spaces from SQL*Loader into the database. It has some other ramifications as well. RTRIM will only trim trailing space characters, while SQL*Loader trims both trailing space and trailing tab characters. That subtle difference may be important to you if your data contains tab characters. The use of SQL expressions also precludes your doing a direct path load. Direct path loads are often advantageous when loading large amounts of data, and are discussed in Chapter 10.
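If trailing tabs do matter to you, one possible workaround, sketched here rather than taken from the SQL*Loader documentation, is to pass RTRIM a second argument naming both characters to trim:

```
...
feature_name POSITION(4) CHAR(44)
   "RTRIM(:feature_name, ' ' || CHR(9))",
...
```

Oracle’s RTRIM accepts an optional second argument listing the set of characters to remove; here that set is a space plus CHR(9), the tab character. Like any SQL expression, this too precludes a direct path load.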
Dealing with Nulls
In a database, a null represents data that is missing or nonexistent. In a columnar data file, however, you usually have some value for every column. You end up with an impedance mismatch wherein you either don’t allow for null values at all, or you must designate a specific field value to represent a null. If you designate a specific field value to represent a null, what then do you do when you actually need to represent that particular value in the column?
Nulls are not fun, but SQL*Loader does provide some features to help you deal with them. These features include:
- The NULLIF clause
Allows you to specify one or more conditions under which a field should be interpreted as a null.
- The DEFAULTIF clause
Allows you to specify one or more conditions under which a field should take on either a null or a zero value depending on whether the field is a character field or a numeric field.
The NULLIF and DEFAULTIF clauses sound simple and straightforward on the surface, but Oracle’s implementation of these clauses is confusing and defies logic. Many a DBA, including ourselves, has ripped his hair out trying to make DEFAULTIF work, because it just doesn’t work the way you would expect after reading the Oracle manuals. Before getting into all of this, though, it’s important that you understand SQL*Loader’s default behavior with respect to whitespace and null values.
SQL*Loader’s default behavior
When you are loading columnar, character-based data, SQL*Loader’s default behavior is to recognize whitespace as representing a null value. Take a look at the following two rows of data:
MI Shingleton                                     821 ppl       Alger
MI Shingleton                                                   Alger
The first row contains five values. In the second row, two of those values will be interpreted as null because two fields are filled entirely with whitespace. Here is the process SQL*Loader follows by default for character-based fields:
1. SQL*Loader extracts a field from the record.
2. Trailing spaces are trimmed from the field.
3. SQL*Loader checks to see if the resulting field length is zero.
If a character field consists entirely of spaces, then after those spaces have been trimmed off, the resulting field length will be zero. SQL*Loader interprets zero-length character fields as nulls.
Tip
As far as SQL*Loader is concerned, NUMERIC EXTERNAL fields such as INTEGER EXTERNAL and DECIMAL EXTERNAL are simply another type of character field. Trailing whitespace is trimmed, and if the resulting field length is zero, the corresponding database column is set to null.
None of what we’ve said here applies when you are loading binary data, or when you are using nonportable datatypes. SQL*Loader only trims whitespace on character fields. Nonportable datatypes such as INTEGER or FLOAT represent binary data—whitespace won’t be trimmed, and the length will never be zero. The only way to derive null values from such fields is to use the NULLIF and DEFAULTIF clauses.
The effect of PRESERVE BLANKS
The use of PRESERVE BLANKS has some interesting, and possibly confusing, ramifications with respect to fields that consist entirely of whitespace. When you use PRESERVE BLANKS, SQL*Loader doesn’t trim trailing whitespace from fields. Because trailing whitespace is not trimmed, even a completely blank field will not have a length of zero. Consequently, blank fields will no longer be treated as nulls. Instead, all those blanks will be loaded into your database. That may or may not be the behavior you are after.
Another complication with the use of PRESERVE BLANKS is the manner in which NUMERIC EXTERNAL fields are converted to numeric values. Fields with digits, even when trailing spaces are present, will still convert properly to numbers. However, NUMERIC EXTERNAL fields that are completely blank will no longer convert to nulls; instead, SQL*Loader will attempt to convert the string of blanks to a number, the conversion will fail, and the entire record will be written to the bad file.
When you use PRESERVE BLANKS, you can work around these complications using the NULLIF and DEFAULTIF clauses.
NULLIF
The NULLIF clause allows you to specify a condition under which a field should be interpreted as a null. NULLIF is part of the field specification clause. Consider the following data:
MI Chatham                                          0 ppl       Alger
MI Shingleton                                     821 ppl       Alger
MI Rumely                                           0 ppl       Alger
MI Sundell                                       1049           Alger
The third field, the elevation field, sometimes contains a zero. You can use the NULLIF clause, as shown in the following example, to translate a zero elevation value to a null:
LOAD DATA
INFILE 'data04.dat'
REPLACE
INTO TABLE michigan_features (
   feature_name POSITION(4) CHAR(44),
   short_feature_name POSITION(4) CHAR(10),
   elevation POSITION(49) INTEGER EXTERNAL(4)
      NULLIF elevation = '   0',
   feature_type POSITION(54) CHAR(9),
   county POSITION(64) CHAR(5)
)
Notice the three leading spaces in the constant used in the NULLIF clause. Those are present because SQL*Loader considers an INTEGER EXTERNAL value to be a character datatype. In the data file, you have a four-character long field with three leading spaces followed by a zero. That character string is what the NULLIF clause works against.
NULLIF can also be helpful if you are using PRESERVE BLANKS and you have data such as the following:
MI Chatham                                            ppl       Alger
MI Shingleton                                     821 ppl       Alger
MI Rumely                                             ppl       Alger
MI Sundell                                       1049           Alger
The elevation values for Chatham and Rumely are blank. Normally, these blank spaces are trimmed, resulting in an empty string, further resulting in a null value being stored in the destination database column. However, when you use PRESERVE BLANKS, SQL*Loader feeds all the spaces from the blank elevation fields into the TO_NUMBER function, which returns an error, causing the record to be rejected and written to the bad file. You can work around this by using NULLIF to specify that blank field values should be interpreted as nulls. For example:
LOAD DATA
INFILE 'data04.dat'
REPLACE
INTO TABLE michigan_features (
   feature_name POSITION(4) CHAR(44),
   short_feature_name POSITION(4) CHAR(10),
   elevation POSITION(49) INTEGER EXTERNAL(4)
      NULLIF elevation = '    ',
   feature_type POSITION(54) CHAR(9),
   county POSITION(64) CHAR(5)
)
The field condition used with a NULLIF clause isn’t limited to only looking at the one field in question. You may reference any other field defined in the field list, so the nullability of one field may be dependent on another. In the following example, NULLIF is used to set the elevation to null whenever the feature type is “falls”:
...
elevation POSITION(49) INTEGER EXTERNAL(4)
   NULLIF feature_type='falls',
feature_type POSITION(54) CHAR(9),
...
Tip
Include trailing spaces in your constants if you are using the PRESERVE BLANKS clause. For example, when PRESERVE BLANKS is being used, you would use “falls    ” (four trailing spaces) rather than “falls” (no trailing spaces).
This ability to reference other fields from a NULLIF condition gives you the ability to use indicator fields. Assuming that you are in control of the data file format, you can design it such that each field that you want to load has a corresponding indicator field to indicate nullness. This is especially helpful when loading binary data. It’s also helpful when you load character-based data using PRESERVE BLANKS. The following example uses an indicator field to indicate when the FIPS state code value should be null:
...
gfn_fips_state_code POSITION(151) INTEGER
   NULLIF gfn_fips_state_code_ind=0,
gfn_fips_state_code_ind FILLER POSITION(155) INTEGER,
...
Such a scheme frees you from having to designate a special value in the field you are loading to represent null. With respect to this example, any integer value may be loaded into the state code field. If a null is desired, you simply need to set the indicator to zero.
With respect to character data, SQL*Loader evaluates the NULLIF clause after trimming whitespace from the field, and just prior to checking the field’s length. This has ramifications for how you refer to fields in your NULLIF conditions. Refer to a character field by name, and you get the trimmed value—unless you are also using PRESERVE BLANKS. Refer to a field by position, and you get the entire value, including any trailing spaces. Here are two examples that will function identically. Both assume that you are not using PRESERVE BLANKS:
...
elevation POSITION(49) INTEGER EXTERNAL(4)
   NULLIF feature_type='falls',
feature_type POSITION(54) CHAR(9),
...

...
elevation POSITION(49) INTEGER EXTERNAL(4)
   NULLIF (54:62)='falls    ',
feature_type POSITION(54) CHAR(9),
...
In the first case, the NULLIF clause refers to the feature_type field by name. The result is the trimmed value, so it is compared to “falls”. In the second case, the NULLIF clause refers to the feature_type field by specifying its position within the record. The result is the untrimmed value, which includes trailing spaces, and which will be 9 bytes long. That value must be compared to a 9-byte string, which also must contain trailing spaces. Hence, “falls    ” (with four trailing spaces) is used in the second case.
DEFAULTIF
The DEFAULTIF clause is primarily for use with numeric fields, and it allows you to specify a condition under which SQL*Loader will set a numeric field’s value to zero. DEFAULTIF can be used with nonportable numeric datatypes such as INTEGER and FLOAT. It can also be used with any of the NUMERIC EXTERNAL types, such as INTEGER EXTERNAL and ZONED EXTERNAL.
DEFAULTIF is one of the most frustrating and unintuitive SQL*Loader features that we’ve ever worked with. An obvious use for it, you would think, would be to have NUMERIC EXTERNAL fields set to zero when they are all blanks. Normally those fields are set to null when they are blank. Many people, in fact, try to do that by writing a clause such as the following:
...
elevation POSITION(49) INTEGER EXTERNAL(4)
   DEFAULTIF elevation=BLANKS,
...
This certainly seems logical enough. If the elevation field is composed of blanks, then store a zero in the corresponding database column. The problem is that this completely logical and intuitive approach simply does not work. The reason it doesn’t work is that SQL*Loader determines the nullity of the field before checking the DEFAULTIF condition, and only checks the DEFAULTIF condition if the field is not null. Here’s the process SQL*Loader follows:
1. SQL*Loader extracts the field from the record.
2. SQL*Loader trims any trailing blanks (unless PRESERVE BLANKS was specified).
3. SQL*Loader checks the field length, and if it is zero, sets the field to null.
4. SQL*Loader checks the NULLIF clause, if one is present.
5. SQL*Loader checks the DEFAULTIF clause, if one is present.
The problem is that SQL*Loader may stop after step 3 or 4. If the field length is zero, SQL*Loader will set the field value to null and stop after step 3. If you specify a NULLIF condition, and if that condition evaluates to TRUE, SQL*Loader stops after step 4, also setting the field value to null. The DEFAULTIF condition is checked last, and only if the field contains a non-null value. So in the case where you are most likely to need its functionality, the DEFAULTIF clause is completely useless.
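If you genuinely need a blank NUMERIC EXTERNAL field stored as zero, one workaround we can sketch, building on the SQL expression feature shown earlier in this chapter, is to let SQL*Loader evaluate the blank field as a null and then convert that null at insert time:

```
...
elevation POSITION(49) INTEGER EXTERNAL(4)
   "NVL(:elevation, 0)",
...
```

Because the blank field has already become null by the time the SQL expression runs, NVL supplies the zero. Bear in mind that, as with any SQL expression, this approach rules out a direct path load.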
Like NULLIF, the DEFAULTIF condition can reference other fields in the record. If you’re using indicator fields, you could use an indicator field to determine whether or not the field you were loading should be set to zero. For example:
...
gfn_fips_state_code POSITION(151) INTEGER
   DEFAULTIF gfn_fips_state_code_ind=0,
gfn_fips_state_code_ind FILLER POSITION(155) INTEGER,
...
This solution is viable if you have control over the program creating the data file that you are loading. You can modify the program to generate the needed indicator values. On the other hand, if you can modify the program, you can presumably just modify it to write out zeros when necessary, and you won’t have to bother using indicators at all.
Dealing with “Short” Records
Short records are those that contain fewer bytes than you would expect from looking at a LOAD statement. The records used for examples in this chapter have all been 68 bytes long. You can derive that value by looking at the LOAD statements and working out the math on field positions and lengths. A short record, then, with respect to this chapter, is any record containing fewer than 68 bytes.
When you are loading columnar data, there are two cases you need to be aware of with respect to short records:
The record may end in the middle of a field.
One or more fields may have starting positions beyond the end of the record.
If the record ends in the middle of a field, that field’s value is based on whatever characters are present in the record for that field. The field’s length also corresponds to whatever characters are present. This has an interesting implication when you are loading character fields and using the PRESERVE BLANKS clause. When you use PRESERVE BLANKS, trailing blanks are normally preserved, and the length of the values loaded into the database is equal to the maximum field size. Load a CHAR(5) field using PRESERVE BLANKS, and every row loaded into your database will have a 5-byte long value. The one exception is—you guessed it—when the record is truncated in the midst of such a field. SQL*Loader won’t add the necessary spaces to bring the field length up to the maximum size.
The ramifications of the second case are a bit easier to describe than those of the first. If a field has a defined position and size (columnar data), and its starting position is beyond the end of the record, then SQL*Loader will set the field’s value to null.
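If a downstream application requires constant-width values even when input records are truncated, one approach, again a sketch using the SQL expression feature, is to pad the value back out at insert time:

```
...
county POSITION(64) CHAR(5)
   "RPAD(:county, 5)",
...
```

Note that RPAD returns null when given a null, so a field whose starting position falls entirely beyond the end of the record still loads as a null rather than as five spaces.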
Database Column Defaults
When you create database tables, you can define default values for any or all of the columns. For example, the following CREATE TABLE statement defines a default value for the elevation column:
CREATE TABLE michigan_features (
   feature_name VARCHAR2(44),
   elevation NUMBER DEFAULT 0,
   feature_type VARCHAR2(10),
   county VARCHAR2(5),
   short_feature_name VARCHAR2(10),
   latitude VARCHAR2(15),
   longitude VARCHAR2(15)
);
You might think that you could take advantage of such column defaults, especially when it comes to loading empty NUMERIC EXTERNAL columns. Simply allow SQL*Loader to evaluate them as nulls, and the database default value will be used, right? Unfortunately, you can’t do that. SQL*Loader always explicitly sets a value for each field that it loads, even if that value is null. In other words, SQL*Loader will do the equivalent of:
INSERT INTO michigan_features
   (feature_name, elevation, feature_type, county)
VALUES ('Harbor of Refuge', NULL, 'bay', 'Alger');
Because NULL is explicitly specified, in this case for the elevation, it will override any default. In short, SQL*Loader always overrides database default values for columns that you are loading. In addition, SQL*Loader’s DEFAULTIF clause has nothing whatsoever to do with default values you specify at the table level.
Concatenating Records
SQL*Loader has the capability to read multiple records from a data file and concatenate those records into one longer record. When you do this, the distinction between physical and logical records becomes important. The term physical record is always used to refer to the records read from the input data file. The term logical record refers to a record that SQL*Loader loads into the database. Usually, there’s a one-to-one correspondence between physical and logical records. When you concatenate records together, you are combining two or more physical records into one logical record. SQL*Loader supports two main ways of doing this:
You can specify a fixed number of physical records to concatenate into one logical record.
You can specify a continuation flag in the physical record that indicates whether or not a given physical record should be concatenated to another.
The distinction between physical and logical records is important when it comes to specifying the positions of fields (using the POSITION clause) in the input record. The POSITION clause always refers to byte positions in the logical record. In general, SQL*Loader always deals with the logical record. The only time you specify positions with respect to the physical record is when you are defining continuation fields.
CONCATENATE
Use the CONCATENATE clause if you always want to concatenate a specific number of physical records into one logical record. Consider the following data, which splits the information for each geographical feature across two physical records:
MI Halfmoon
 Lake                                 798lake      Alger
MI Cleveland
 Cliffs Basin                        779reservoir Alger
MI Doe Lake
                                      798lake      Alger
MI Limestone
                                     912ppl       Alger
MI Traunik
                                       925ppl       Alger
In this case, you always want SQL*Loader to read two physical records, combine them into one logical record, and then load the data from that logical record. This is a job for CONCATENATE.
CONCATENATE syntax
The CONCATENATE clause immediately precedes the PRESERVE BLANKS clause. If you are not using PRESERVE BLANKS, then CONCATENATE precedes your INTO TABLE clause. The syntax is as follows:
CONCATENATE [(]record_count[)]
Replace record_count in the syntax with an integer indicating the number of records that you want to concatenate together. The use of parentheses is optional.
CONCATENATE example
The following LOAD statement can be used to concatenate and load the data shown earlier. Each group of two physical records will be combined into one logical record. Notice that the field positions are with respect to the logical record that results after the concatenation takes place. Also notice that this example assumes that there are no trailing spaces at the end of the first record—the feature_type field immediately follows the elevation:
LOAD DATA
INFILE 'alger_concatenate.dat'
REPLACE
CONCATENATE 2
INTO TABLE michigan_features
(
   feature_name POSITION(4:47) CHAR,
   elevation POSITION(49:52) INTEGER EXTERNAL,
   feature_type POSITION(53:61) CHAR,
   county POSITION(63:67) CHAR
)
CONCATENATE works well if you always have a fixed number of records to combine together. If the number of records to concatenate is variable, you should use CONTINUEIF instead.
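If it helps to see the mechanics spelled out, the following short Python sketch (our illustration; SQL*Loader does all of this internally) mimics what CONCATENATE does: it takes physical records in fixed-size groups and joins each group into one logical record.

```python
# Illustration only: mimic SQL*Loader's CONCATENATE n by joining every
# n physical records from the data file into one logical record.

def concatenate(physical_records, n):
    """Join each group of n physical records into one logical record."""
    if len(physical_records) % n != 0:
        raise ValueError("record count is not a multiple of %d" % n)
    return ["".join(physical_records[i:i + n])
            for i in range(0, len(physical_records), n)]

# With CONCATENATE 2, every pair of input lines forms one logical record.
print(concatenate(["MI Halfmoon", " Lake  798"], 2))
# → ['MI Halfmoon Lake  798']
```

Note that the physical records are joined end to end with nothing added or removed, which is why trailing spaces (or the lack of them) in each physical record directly affect field positions in the logical record.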
Warning
When concatenating character data such as that shown in this chapter, watch out for trailing spaces. Many text editors trim them automatically when a file is saved. If you edit your data file with a text editor, this trimming can result in physical records of non-uniform length, which will throw off the field positions when the records are concatenated together.
CONTINUEIF
The CONTINUEIF clause allows you to designate a field in the input record that indicates whether two or more physical records should be concatenated together. You specify this indicator field by its position in the physical record. It can be one byte in length, or it can span multiple bytes. The indicator may be one of these two types:
- Before-the-fact
A before-the-fact indicator (our term, not Oracle’s) is one that tells SQL*Loader to concatenate the next physical record to the current physical record.
- After-the-fact
An after-the-fact indicator is one that tells SQL*Loader to concatenate the current physical record with the previous one.
There’s actually a third possibility—a variation on before-the-fact—that allows you to specify a continuation indicator that always occurs at the end of each physical record. This is more useful when loading variable-length and delimited data, and we discuss it in Chapter 5.
CONTINUEIF syntax
The CONTINUEIF clause appears in the same location as CONCATENATE. The syntax is as follows:
CONTINUEIF {THIS | NEXT} [PRESERVE] [(]start[{: | -}end][)]
   {= | <>} {'string' | X'hex_digits'}
The elements in the syntax are as follows:
- THIS
Specifies that the continuation indicator is before-the-fact. Each physical record contains a flag indicating whether or not the next physical record should be concatenated to it.
- NEXT
Specifies that the continuation indicator is after-the-fact. Each physical record contains a flag indicating whether or not it should be concatenated to the previous physical record.
- PRESERVE
Causes SQL*Loader to preserve the continuation string as part of the record. This is only available in Oracle release 8.1.7 and higher.
- start
Specifies the starting byte position of the continuation indicator in the physical record.
- end
Specifies the ending byte position of the continuation indicator in the physical record.
- : | -
Separates the start value from the end value. You can use either a colon (:) or a hyphen (-).
- = | <>
The equal and not-equal operators. These are your two choices. When the condition you specify is true, concatenation occurs. When the condition is false, SQL*Loader stops concatenating physical records, and processes the logical record that it has.
- string
The text string that you want SQL*Loader to look for in the indicator field. You may enclose this value in either single or double quotes.
- hex_digits
A string of hexadecimal digits representing the value that you want SQL*Loader to look for in the indicator field.
One issue to be aware of is the manner in which SQL*Loader computes the length of the indicator field. If you specify both a starting and ending position, the length is determined from that. If you specify only a starting position, the indicator field length is based on the length of the value (string or hex_digits) that you ask SQL*Loader to look for. In either case, if the length of the indicator field does not match the length of the value that you specify, SQL*Loader pads the shorter of the two values in the following manner:
If you specify a text string, SQL*Loader pads the shorter of the two values with spaces.
If you specify a string of hexadecimal digits, SQL*Loader pads the shorter of the two values with X’00’ characters.
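The padding rule can be expressed compactly. The following Python sketch (ours, not Oracle's code) shows the comparison that SQL*Loader effectively performs after padding the shorter of the two values:

```python
# Illustration of the CONTINUEIF comparison: pad the shorter of the
# indicator field and the comparison value, then compare. Text strings
# are padded with spaces; hex strings with X'00' bytes.

def indicator_matches(field, value, hex_string=False):
    pad = "\x00" if hex_string else " "
    width = max(len(field), len(value))
    return field.ljust(width, pad) == value.ljust(width, pad)

print(indicator_matches("1", "1   "))  # → True: "1" is space-padded to match
print(indicator_matches("1", "0"))     # → False
```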
Another very important thing to be aware of is that SQL*Loader removes continuation characters from each physical record that it reads. This affects how you specify field positions in the logical record, and it also means that you can't load the continuation data into your destination database tables. The examples in the following sections illustrate the continuation strings being removed.
CONTINUEIF THIS
Use CONTINUEIF THIS if the continuation indicator in one physical record indicates whether the logical record continues on to the next physical record or ends at the current record. The data in the following example has the indicator in the first byte of the record. A value of “1” indicates that the record is continued. A value of “0” marks the final physical record making up a logical record:
1MI Halfmoon
0 Lake                                 798 lake      Alger
1MI Cleveland
1 Cliffs Basin                        779
0 reservoir Alger
0MI Doe Lake                                      798 lake      Alger
1MI Limeston
0e                                     912 ppl       Alger
0MI Traunik                                        925 ppl       Alger
You can see in this example that some logical records span multiple physical records, while others don’t. The logical record for Cleveland Cliffs Basin spans three physical records. The logical records for Doe Lake and Traunik each consist of just one physical record. The remaining logical records each consist of two physical records. To load this data, you could use the following control file:
LOAD DATA
INFILE 'alger_this.dat'
REPLACE
CONTINUEIF THIS (1) = '1'
INTO TABLE michigan_features
(
   feature_name POSITION(4:47) CHAR,
   elevation POSITION(49:52) INTEGER EXTERNAL,
   feature_type POSITION(54:62) CHAR,
   county POSITION(64:68) CHAR
)
The CONTINUEIF clause in this case specifies that physical records should be concatenated together as long as the first byte contains a “1”. Concatenation stops when that’s no longer the case, so the “0” records in this example end up marking the end of each logical record.
SQL*Loader looks only for the specific value that you supply. In this example, SQL*Loader looks for a “1”. Any other value marks the final record. We used a “0” uniformly for that purpose, but any value other than a “1” would serve equally well to terminate the logical record.
Remember that SQL*Loader removes the continuation characters from each physical record when assembling the logical record. Thus, the logical record for Cleveland Cliffs Basin would be assembled to appear as follows:
MI Cleveland Cliffs Basin                        779 reservoir Alger
Notice that all the “1”s and “0”s used for continuation are gone. Be sure to take this into account when computing field positions for the fields that you are loading.
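The assembly logic is easy to state precisely. Here is a Python sketch (our illustration, not Oracle's code) of CONTINUEIF THIS (1) = '1' with a one-byte indicator: the indicator byte is stripped from every physical record, and a “1” means the next physical record belongs to the same logical record.

```python
# Illustration of CONTINUEIF THIS (1) = '1': byte 1 of each physical
# record is the indicator, the indicator is stripped from every record,
# and a '1' means the next physical record continues the logical record.

def continueif_this(physical_records, flag="1"):
    logical, current = [], ""
    for rec in physical_records:
        indicator, data = rec[0], rec[1:]   # continuation byte is removed
        current += data
        if indicator != flag:               # anything else ends the record
            logical.append(current)
            current = ""
    return logical

physical = ["1MI Cleveland", "1 Cliffs Basin  779", "0 reservoir Alger"]
print(continueif_this(physical))
# → ['MI Cleveland Cliffs Basin  779 reservoir Alger']
```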
CONTINUEIF NEXT
Use CONTINUEIF NEXT if your continuation indicator is in the second and subsequent records. In the following example, the first physical record for each logical record begins with four spaces. Continuation records are flagged with the text CONT:
    MI Halfmoon
CONT Lake                                 798 lake      Alger
    MI Cleveland
CONT Cliffs Basin                        779
CONT reservoir Alger
    MI Doe Lake                                      798 lake      Alger
    MI Limeston
CONTe                                     912 ppl       Alger
    MI Traunik                                        925 ppl       Alger
The following control file could be used to load this data:
LOAD DATA
INFILE 'alger_next.dat'
REPLACE
CONTINUEIF NEXT (1:4) = 'CONT'
INTO TABLE michigan_features
(
   feature_name POSITION(4:47) CHAR,
   elevation POSITION(49:52) INTEGER EXTERNAL,
   feature_type POSITION(54:62) CHAR,
   county POSITION(64:68) CHAR
)
Because CONTINUEIF NEXT is used, each time SQL*Loader reads a physical record beginning with CONT, it concatenates that record to the current logical record. A physical record beginning with anything other than CONT (spaces in this example) marks the beginning of a new logical record.
The continuation indicator in this example is a string of four characters, and those are still removed when SQL*Loader builds the logical record. The logical record for Cleveland Cliffs Basin still appears as follows:
MI Cleveland Cliffs Basin                        779 reservoir Alger
Using not-equals
The use of not-equals (<>) in a CONTINUEIF expression can be somewhat confusing and unintuitive. We always need to think things through two or three times whenever we use it. Here’s an example of some data that calls for the use of not-equals:
aMI Halfmoon
0 Lake                                 798 lake      Alger
bMI Cleveland
c Cliffs Basin                        779
0 reservoir Alger
0MI Doe Lake                                      798 lake      Alger
fMI Limeston
0e                                     912 ppl       Alger
0MI Traunik                                        925 ppl       Alger
You’ll notice here that a “0” in the first position marks the last physical record for each logical record, but that the first character otherwise takes on a number of different values. What you want to do here is to concatenate records so long as the indicator value is not “0”. The CONTINUEIF clause in the following LOAD statement does this:
LOAD DATA
INFILE 'alger_not_equal.dat'
REPLACE
CONTINUEIF THIS (1) <> '0'
INTO TABLE michigan_features
(
   feature_name POSITION(4:47) CHAR,
   elevation POSITION(49:52) INTEGER EXTERNAL,
   feature_type POSITION(54:62) CHAR,
   county POSITION(64:68) CHAR
)
In actual practice, it’s almost always possible to use the equality operator with CONTINUEIF. Only one or two times in our careers have we ever been handed data that forced us to use not-equals.
Nesting Delimited Fields
With columnar data, it’s possible to take an individual field and treat it as if it were a series of delimited fields. In essence, depending on the representation used, you may be able to extract subfields. There are also some ways to deal with records that begin with fixed-width columns only to have a variable-length column thrown into the middle.
Extracting Subfields
If you’ve ever worked on a MUMPS system,[1] you are familiar with the concept of nested delimiters, and of using the $PIECE function to extract specific pieces of information from a delimited string. The following columnar data is a doctored version of the feature name data that you’ve seen so far in this chapter. It’s not from a MUMPS system, but the records do contain one column with nested, delimited data:
MI Werners Creek           "46° 11' 07'' N,86° 58' 48'' W " stream Alger
MI Chapel Falls            "46° 14' 44'' N,86° 26' 40'' W " falls  Alger
MI Chapel Lake             "46° 32' N,86° 26' 40'' W     " lake   Alger
MI Birch Lakes             "46° 33' 13'' N,86° 9' 26'' W " lake   Alger
MI Chapel Beach Campground "46° 32' 51'' N,86° 26' 29'' W " locale Alger
MI Legion Lake             "46° 31' 41'' N,186° 21' 47'' W" lake   Alger
MI Chapel Creek            "46° 32' 54'' N,86° 26' 21'' W " stream Alger
MI Nita, Lake              "46° 33' N,86° 3' 53'' W      " lake   Alger
MI West Branch Lakes       "46° 30' 47'' N,86° 6' 04'' W " lake   Alger
MI Centerline Lake         "46° 30' 23'' N,86° 3' 50'' W " lake   Alger
In this example, the latitude and longitude values for each feature are enclosed in a 30-character wide quoted string. A comma ( , ) serves to separate the two values, and the conventional degree, minute, and second notation is used. For example:
degree° minute' second''
The next two examples show you how to do two things. First, you’ll see how to store the latitude and longitude into separate database columns. Then you’ll see how to extract each degree, minute, and second value individually.
Separating the latitude and longitude
The key to separating the latitude and longitude in the data shown earlier is the comma used to delimit the two values. You can recognize the end of the latitude value by the comma. You can’t, however, recognize the end of the longitude value in the same way—there’s no second comma. Instead, you can take advantage of the trailing double quote. Use the following LOAD statement:
LOAD DATA
INFILE 'alger_nested.dat'
REPLACE
INTO TABLE michigan_features
(
   feature_name POSITION(4:26) CHAR,
   latitude POSITION(29) CHAR TERMINATED BY ',',
   longitude CHAR TERMINATED BY '"' "RTRIM(:longitude)",
   feature_type POSITION(61:66) CHAR,
   county POSITION(68:72) CHAR
)
Here are some points you should note about this LOAD statement:
The latitude field always begins at byte 29, but the end of the field is determined by the location of the comma.
No POSITION clause is used for the longitude field, so it will begin immediately after the comma that marks the end of the previous field.
The longitude field is terminated by a double quote, which leaves open the possibility that it may contain trailing spaces. Oracle’s built-in RTRIM function is used to remove those.
If you’re not familiar with the TERMINATED BY clause, read Chapter 5. The use of built-in functions such as RTRIM is discussed in Chapter 8.
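To make the scanning behavior concrete, here is a small Python sketch (our illustration, not SQL*Loader itself) of what the field list above does: the latitude runs from byte 29 to the comma, and the longitude runs from there to the closing double quote, with trailing spaces removed (the job RTRIM does in the control file).

```python
# Illustration of the delimiter scan performed by the control file above.

def split_coordinates(record):
    rest = record[28:]                          # byte 29, counting from 1
    latitude, rest = rest.split(",", 1)         # TERMINATED BY ','
    longitude = rest.split('"', 1)[0].rstrip()  # TERMINATED BY '"' + RTRIM
    return latitude, longitude

rec = ("MI Werners Creek" + " " * 11
       + "\"46° 11' 07'' N,86° 58' 48'' W \" stream Alger")
print(split_coordinates(rec))
# → ("46° 11' 07'' N", "86° 58' 48'' W")
```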
Separating the individual latitude and longitude components
The previous example demonstrated one way to separate the latitude and longitude values into two separate, character-based database columns. Consider, for a moment, the problem of loading that same data into the following table:
CREATE TABLE michigan_features (
   feature_name VARCHAR2(44),
   elevation NUMBER,
   feature_type VARCHAR2(10),
   county VARCHAR2(5),
   short_feature_name VARCHAR2(10),
   lat_degrees NUMBER,
   lat_minutes NUMBER,
   lat_seconds NUMBER,
   lat_direction CHAR,
   lon_degrees NUMBER,
   lon_minutes NUMBER,
   lon_seconds NUMBER,
   lon_direction CHAR
);
This table requires that each individual latitude and longitude component be split out into a separate column. You can actually do this with SQL*Loader by creatively making use of the degree, minute, and second symbols as delimiters. The following, much more complex LOAD statement extracts the individual degree, minute, second, and direction components into the separate database columns:
LOAD DATA
INFILE 'alger_nested.dat'
REPLACE
INTO TABLE michigan_features
(
   feature_name POSITION(4:26) CHAR,
   lat_degrees POSITION(29) INTEGER EXTERNAL TERMINATED BY '°',
   lat_minutes INTEGER EXTERNAL TERMINATED BY "'",
   lat_seconds INTEGER EXTERNAL TERMINATED BY "''",
   blank_space_1 FILLER CHAR(1),
   lat_direction CHAR(1) TERMINATED BY ',',
   lon_degrees INTEGER EXTERNAL TERMINATED BY '°',
   lon_minutes INTEGER EXTERNAL TERMINATED BY "'",
   lon_seconds INTEGER EXTERNAL TERMINATED BY "''",
   blank_space_2 FILLER CHAR(1),
   lon_direction CHAR(1),
   feature_type POSITION(61:66) CHAR,
   county POSITION(68:72) CHAR
)
In this example the degree, minute, and second values are treated as INTEGER EXTERNAL. You would want to do that if your destination database columns for these values were numeric. Here are some other things you should note about this solution:
POSITION(29) is used to position SQL*Loader to the lat_degrees field. From that point forward, SQL*Loader progresses from one delimiter to the next.
The numeric degree, minute, and second values are delimited by the degree symbol ( ° ), the minute symbol ( ' ), and the second symbol ( '' ), respectively. Because the minute and second symbols are composed of one and two apostrophes, respectively, they are enclosed within double quotes in the control file.
The blank_space_1 and blank_space_2 fields are single character filler fields that absorb the space between the value for seconds and the letter indicating the compass direction.
The lat_direction field is terminated by a comma ( , ), which causes SQL*Loader to look past the comma for the start of the next field.
The lon_direction field is simply CHAR(1), with no terminating character. That’s sufficient to pick up the one letter.
The POSITION clause is used to specify the start location for the feature_type field that follows all the delimited latitude and longitude data.
The manner in which we’ve handled the two direction fields deserves some more explanation. Instead of preceding each by a “blank space” column, we could have written:
lat_direction CHAR(2) TERMINATED BY ',' "LTRIM(:lat_direction)",
...
lon_direction CHAR(2) "LTRIM(:lon_direction)",
In this way, Oracle’s built-in LTRIM function would have eliminated the leading space character. The disadvantage of using LTRIM is that using any SQL function precludes you from doing a direct path load (except in Oracle9i ). In many situations, but not all, direct path loads confer significant performance advantages.
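The delimiter-to-delimiter walk in the LOAD statement above can be sketched in Python (our illustration, not SQL*Loader's code). Stepping through one coordinate string with °, ', and '' as terminators yields the four components; the sketch assumes all four are present.

```python
# Illustration of the nested-delimiter walk: step through one coordinate
# string using °, ', and '' as terminators, the way the control file
# above does with its TERMINATED BY clauses.

def parse_coordinate(text):
    degrees, rest = text.split("° ")        # TERMINATED BY '°'
    minutes, rest = rest.split("' ", 1)     # TERMINATED BY "'"
    seconds, direction = rest.split("'' ")  # TERMINATED BY "''"
    return int(degrees), int(minutes), int(seconds), direction

print(parse_coordinate("46° 11' 07'' N"))
# → (46, 11, 7, 'N')
```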
Variable-Length Records
SQL*Loader can even deal with records that begin with fixed-width data in a columnar format, but that later include one or more variable-length fields. The following LOAD statement defines the feature_name field as a C-style string terminated by an X’00’ character:
LOAD DATA
INFILE 'alger.dat'
REPLACE
INTO TABLE michigan_features
(
   elevation POSITION(4) INTEGER EXTERNAL(4),
   feature_name POSITION(8) CHAR(44) TERMINATED BY X'00',
   feature_type CHAR(9),
   county CHAR(5)
)
The starting positions of the first two columns are fixed, so the POSITION clause is used for each. The feature name, however, is terminated by an X’00’ character, and may or may not be a full 44 characters in length. Because of that, the definitions for the fields that follow all omit the POSITION clause. SQL*Loader will look for them immediately following the X’00’ character used to terminate the feature name, and thus their positions will tend to change with each record that SQL*Loader processes. In this example, only the feature_name field is variable-length. The feature_type and county fields are still fixed in size. All the fields can be variable-length delimited fields, and the next chapter shows you how to deal with those.
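The mixed fixed/variable parse can be sketched in a few lines of Python (our illustration, with a made-up sample record; SQL*Loader does this internally): the elevation sits at a fixed offset, the feature name ends at the first X'00' byte, and the two fields after it begin wherever the terminator left off.

```python
# Illustration of the mixed fixed/variable record layout above.

def parse_record(rec: bytes):
    elevation = int(rec[3:7].decode())        # POSITION(4) INTEGER EXTERNAL(4)
    end = rec.index(b"\x00", 7)               # name: byte 8 up to X'00'
    feature_name = rec[7:end].decode()
    feature_type = rec[end + 1:end + 10].decode().rstrip()  # next 9 bytes
    county = rec[end + 10:end + 15].decode().rstrip()       # next 5 bytes
    return elevation, feature_name, feature_type, county

rec = b"MI  798Halfmoon Lake\x00lake     Alger"
print(parse_record(rec))
# → (798, 'Halfmoon Lake', 'lake', 'Alger')
```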
[1] MUMPS is a programming language sometimes used in medical and banking systems. In MUMPS, all data is character data, and fields in a record are separated by delimiters.