BUY THIS BOOK
Add to Cart

Print Book $44.99


Add to Cart

PDF $31.99

Safari Books Online

What is this?

Add to UK Cart

Print Book £31.99

What is this?

Looking to Reprint or License this content?

The Art of SQL
The Art of SQL

By Stéphane Faroult, Peter Robson
Book Price: $44.99 USD
£31.99 GBP
PDF Price: $31.99

Cover | Table of Contents


Table of Contents

Chapter 1: Laying Plans
Designing Databases for Performance
Joseph de Maistre (1754-1821)
Lettre du 27 Juillet 1812
à Monsieur le Comte de Front
C'est le premier pas qui, dans toutes les guerres, décèle le génie.
It is the first step that reveals genius in all wars.
The great nineteenth century German strategist, Clausewitz, famously remarked that war is the continuation of politics by other means. Likewise, any computer program is, in one way or another, the continuation of the general activity within an organization, allowing it to do more, faster, better, or cheaper. The main purpose of a computer program is not simply to extract data from a database and then to process it, but to extract and process data for some particular goal. The means are not the end.
A reminder that the goal of a given computer program is first of all to meet some business requirement may come across as a platitude. In practice, the excitement of technological challenges often slowly causes attention to drift from the end to the means, from upholding the quality of the data that records business activity to writing programs that perform as intended and in an acceptable amount of time. Like a general in command of his army at the beginning of a campaign, we must know clearly what our objectives are—and we must stick to them, even if unexpected difficulties or opportunities make us alter the original plan. Whenever the SQL language is involved, we are fighting to keep a faithful and consistent record of business activity over time. Both faithfulness and consistency are primarily associated with the quality of the database model. The database model that SQL was initially designed to support is the relational model . One cannot overemphasize the importance of having a good model and a proper database design, because this is the very foundation of any information system.
A database is nothing but a model of a small part of a real-life situation. As any representation, a database is always an imperfect model, and a very narrow depiction of a rich and complex reality. There is rarely a single way to represent some business activity, but rather several variants that in a technical sense will be semantically correct. However, for a given set of processes to apply, there is usually one representation that best meets the business requirement.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Relational View of Data
A database is nothing but a model of a small part of a real-life situation. As any representation, a database is always an imperfect model, and a very narrow depiction of a rich and complex reality. There is rarely a single way to represent some business activity, but rather several variants that in a technical sense will be semantically correct. However, for a given set of processes to apply, there is usually one representation that best meets the business requirement.
The relational model is thus named, not because you can relate tables to one another (a popular misconception), but as a reference to the relationships between the columns in a table. These are the relationships that give the model its name; in other words, relational means that if several values belong to the same row in a table, they are related. The way columns are related to each other defines a relation, and a relation is a table (more exactly, a table represents one relation).
The business requirements determine the scope of the real-world situation that is to be modeled. Once you have defined the scope, you can proceed to identify the data that you need to properly record business activity. If we say that you are a used car dealer and want to model the cars you have for sale (for instance to advertise them on a web site), items such as make, model, version, style (sedan, coupe, convertible...), year, mileage, and price may be the very first pieces of information that come to mind. But potential buyers may want to learn about many more characteristics to be able to make an informed choice before settling for one particular car. For instance:
  • General state of the vehicle (even if we don't expect anything but "excellent")
  • Safety equipment
  • Manual or automatic transmission
  • Color (body and interiors), metallic paintwork or not, upholstery, hard or soft top, perhaps a picture of the car
  • Seating capacity, trunk capacity, number of doors
  • Power steering, air conditioning, audio equipment
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Importance of Being Normal
Normalization, and especially that which progresses to the third normal form (3NF), is a part of relational theory that most students in computer science have been told about. It is like so many things learned at school (classical literature springs to mind), often remembered as dusty, boring, and totally disconnected from today's reality. Many years later, it is rediscovered with fresh eyes and in light of experience, with an understanding that the essence of both principles and classicism is timelessness.
The principle of normalization is the application of logical rigor to the assemblage of items of data—which may then become structured information. This rigor is expressed in the definition of various normal forms, most typically three, although purists argue that one should analyze data beyond 3NF to what is known in the trade as Boyce-Codd normal form (BCNF), or even to fifth normal form (5NF). Don't panic. We will discuss only the first three forms. In the vast majority of cases, a database modeled in 3NF will also be in BCNF and 5NF.
You may wonder why normalization matters. Normalization is applying order to chaos. After the battle, mistakes may appear obvious, and successful moves sometimes look like nothing other than common sense. Likewise, after normalization the structures of the various tables in the database may look natural, and the normalization rules are sometimes dismissively considered as glorified common sense. We all want to believe we have an ample supply of common sense; but it's easy to get confused when dealing with complex data. The three first normal forms are based on the application of strict logic and are a useful sanity checklist.
The odds that our creating un-normalized tables will increase our risk of being struck by divine lightning and reduced to a little mound of ashes are indeed very low (or so I believe; it's an untested theory). Data inconsistency, the difficulty of coding data-entry controls, and error management in what become bloated application programs are real risks, as well as poor performance and the inability to make the model evolve. These risks have a very high probability of occurring if we don't adhere to normal form, and I will soon show why.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
To Be or Not to Be, or to Be Null
A very common modeling mistake is to associate large numbers of possible characteristics within a relation, which may result in a table with a large number of columns. Some scientific disciplines may require a very detailed characterization of objects under study, and thus require a large number of attributes, but this is rarely the case in business applications. In any case, a sure sign that a database design is flawed is when columns of some prominent tables mostly contain null values , and especially when two columns cannot possibly contain a value at the same time; if one is defined, the other must be null, and vice versa. This condition would undoubtedly indicate a violation of either 2NF or 3NF.
If we admit that a row in a table represents a statement about the characteristics of a given "thing," indicating that "we don't know" for most characteristics seriously downgrades the table as a source of reliable information. This may be a minor inconvenience if the data is stored for informative purpose only. It becomes a major issue if the unknown values are supposed to help us define a result set, and this state of affairs is indicative of a flawed model. All columns in a row should ultimately contain a value, even if business processes are such that various pieces of information are entered from more than one source and/or at different points in time. A stamp collector might likewise keep some room in an album for a series temporarily absent from the collection. But even so, there is a risk of wasting storage if it is actually reserved because one always tailors for the maximum size. There is also a risk of very serious performance problems if only placeholders are used and data goes to some remote overflow area when it is entered at last.
The existence of null values also raises an important point with regard to relational modeling, which is the main foundation for the query optimizer. The completeness of a relational model is founded on the application of
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Qualifying Boolean Columns
Even though the Boolean type doesn't exist in SQL, many people feel a need to implement flags to indicate a Boolean true/false status (for instance order_completed). You should aim for increasing the density of your data--order_completed may be useful information to know, but then perhaps other information would be nice to store too: when was it completed? Who completed it? So that means that instead of having a single "Y/N" column, we can have a completion_date column, and perhaps a completed_by column, both of which will tell us more (although we may not necessarily want to see a null value as long as the order isn't completed; a solution may be to use a distinct table to track the various stages of every order from creation to completion). As before, examine the dependencies in the context of your business requirements, and only include those additional columns where the successful operation of the business requires it.
Alternatively, a series of essentially Boolean attributes can sometimes be advantageously combined into a unique status attribute. For instance, if you have four attributes that can be either true or false, you can assign a numerical value between 0 and 15 to each of the possible combinations and define the "status" as being represented by this value. But beware—this technique may offend the basic rule of atomicity, so if you must use this approach, do so with considerable caution.

Important

Data for data's sake is a path to disaster.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Understanding Subtypes
Another reason for the appearance of unnecessarily wide tables (as in having too many attributes) is a lack of understanding of the true relationship between data items. Consider the example of subtypes . A company may have a mix of employees, some of whom are permanent, others who are contractors. They all have several properties in common (name, year of birth, department, room, phone number, and so forth), but there are also properties that are unique to each type of employee (for instance, hire date and salary for permanent employees, rate and contract reference for contractors). The manner in which the common attributes can be shared, while ensuring that the distinctive features are kept separate, introduces the topic of subtypes.
We can model this situation by defining three tables. First, the employee table contains all information that is common to every employee, regardless of their status. However, an attribute tells the status of each employee. It has as many distinct values as there are distinct employee types, for example "P" (for permanent employee), and "C" (for contract employee). This table uses an employee number as the primary key.
Next, we create additional tables, one for each employee type. In this case, there are two tables. Tables permanent and contract represent subtypes of the table employee, for example. Each permanent or contract employee inherits certain characteristics from the employee table, in addition to possessing unique characteristics, as defined in their own tables.
Now let's examine the creation of the primary keys between these two types of tables, as it's the primary key construct that implements the subtype relationships . The unique key for all tables is the unique identifier for each member of staff—the employee number. The set of primary keys of employee is the union of the primary keys of the various subtype tables, and the intersection of the primary keys of all subtype tables is by construction empty, because each employee belongs to just one, in this case, of the two categories. The primary keys of subtype tables are also foreign keys, referencing the primary key of
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Stating the Obvious
It is always an unsound situation in which there are implicit constraints on your data—for instance "if the business line is such, then the identifier is numeric (although defined as a string of characters to accommodate other business lines)," or "if the model is T, then the color is necessarily black." Sometimes, such general knowledge information can prove extremely efficient when filtering data. However, if it remains human knowledge, the DBMS engine, unaware of it, will be unable to take advantage of it, and the optimizer will not possess the necessary information to affect the most efficient database access. In the worst case, implicit constraints can even lead to a runtime failure. For instance, you might inadvertently require the database engine to apply an arithmetic process to a character string. This can happen when a character-defined column is used only for storing numeric data, and a non-numeric character slips in.
As an aside, the example of a string identifier that sometimes contains character data and sometimes numerical data illustrates a confusion over domain definitions in the initial database design. It is quite clear that the nature of such a field varies according to circumstances—which is totally unacceptable in a properly designed database. If we need to store, for instance, configuration parameters of various natures (numerical, Boolean, character, and so on), we should not store them in a single table configuration(parameter_name, parameter_value), but rather use a generic table configuration(parameter_id, parameter_name, parameter_type) and have as many subtypes as we have parameter types. If we use, for instance, configuration_numeric(parameter_id, parameter_value), where parameter_value is a numeric column, any mistyping of the letter "O" instead of zero will be detected by the DBMS when the configuration is changed, instead of resulting in a runtime error when the parameter is used.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Dangers of Excess Flexibility
As always, pushing a line of reasoning to the limits (and often past them) can result in a monument to human madness. A great favorite with third-party software editors is the "more-flexible-than-thou" construct, in which most data of interest is stored in some general purpose table, with equally general purpose attributes such as: entity_id, attribute_id, attribute_value. In this "design," everything is stored as a character string into attribute_value. The design certainly avoids the risk of having null values. However, the proponents of this type of design usually store the mandatory attributes in attribute_value as well. Their mantra, by the way, is usually that this design approach makes it easy to add new attributes whenever they are needed. Without commenting on the quality of a design that makes it necessary to anticipate the necessarily haphazard addition of attributes, let's just remark that it's all very nice to store data, but usually, somehow, one day you will have to retrieve and process that same data (if data retrieval is not being planned, there is something seriously wrong somewhere). Adding a column to a table really pales into insignificance when compared to writing a program to do something useful with the new bits of information that you are given to manage (as enthusiasts that praise the flexibility of the Extensible Markup Language [XML] are bound to understand).
The database cost of such pseudoflexibility rockets sky-high. Your database integrity is totally sacrificed, because you can hardly have a weaker way of typing your data. You cannot have any referential integrity. You cannot, in fact, have any type of declarative constraints. The simplest query becomes a monstrous join, in which the "value table" is joined 10, 15, or 20 times to the very same entity, depending on the number of attributes one wants to select. Needless to say, even the cleverest optimizer is at a loss on such a query, and performance is what one should expect—dismal. (You can try to improve the performance of such a query as described in Chapter 11, but the SQL code is not a pretty sight.) By comparison, the most inept campaign of military history looks like a masterpiece of strategic planning.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Difficulties of Historical Data
Working with historical data is an extremely common condition—the process of valuation , or specifying the price of goods or a service at a particular point in time, is based on historical data—but one of the really difficult issues of relational design is the handling of data that is associated with some period (as opposed to point) of time.
There are several ways to model historical data. Let's assume that we want to record the successive prices of goods identified by some article_id. An obvious way to do so is to store the following items:
    (article_id, effective_from_date, price)
where effective_from_date is the date when the new price takes effect, and the primary key of the historical table is (article_id, effective_from_date).
Logically correct, this type of model is rather clumsy to use when working with current data, which in many cases will be our main concern. How are we going to identify the current value? It's the one associated with the highest effective_from_date, and it will be retrieved by running a query looking like:
    select a.article_name, h.price
    from articles a,
         price_history h
    where a.article_name = some_name
      and h.article_id = a.article_id
      and h.effective_from_date =
         (select max(b.effective_from_date)
          from price_history b
          where b.article_id = h.article_id)
Executing this query requires two passes over the same data: one in the inner query to identify which is the most recent date we have for a given article, and one in the outer query to return the price from a row that we have necessarily hit in the inner query (Chapter 6 talks about special functions implemented by some DBMS systems that can avoid, to some extent, multiple passes). Executing repeated queries following this pattern can prove very costly.
However, the choice of how to register the validity period for a price is arbitrary. Instead of storing the effective date from which the price applies, why not store the "end date" (e.g., the last date on which the current price prevails), identifying the time intervals by their upper bound instead of by their lower bound?
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Design and Performance
It is flattering (and a bit frightening too) to performance specialists to see the faith in their talents devotedly manifested by some developers. But, at the risk of repeating myself, I must once again stress what I said in the introduction to this book: tuning is about getting the best possible performance, now. When we develop, we must have a different mindset and not think "let's code it, and then have a specialist tune it later in production." The impact of tuning on the structure of programs is usually nil, and on queries, often minimal once the big mistakes have been corrected. There are indeed two aspects to this matter:
  • One aspect of tuning is the improvement of the overall condition of the system, by setting some parameters in accordance with the current resources in terms of CPU power, memory available, and I/O subsystems, and sometimes taking advantage of the physical implementation of the DBMS. This is a highly technical task, which may indeed improve the performance of some processes by a significant factor, but rarely by more than 20 or 30 percent unless big mistakes were made.
  • The other aspect of tuning is the modification of specific queries, a practice that may, unfortunately, expose the limitations of the query optimizer and changes of behavior between successive DBMS releases.
That is all there is to it.
In my view, adding indexes doesn't really belong to the tuning of production databases (even if some tuning engagements are sometimes a matter of reviewing and correcting the indexing scheme for a database). Most indexes can and must be correctly defined from the outset as part of the designing process, and performance tests should resolve any ambiguous cases.
Performance is no more a question of making a couple of queries faster than war is a question of winning a couple of battles. You can win a battle and lose the war. You can tune your queries and nevertheless have an application with dismal performance that nobody will want to use, except at gunpoint. Your database and programs, as well as your SQL queries, must all be properly designed.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Processing Flow
Besides all the questions addressed earlier in this chapter, the operating mode is also a matter that may have significant impact on the working system. What I mean by operating mode is whether data should be processed asynchronously (as is the case with batch programs ) or synchronously (as in a typical transactional program).
Batch programs are the historical ancestors of all data processing and are still very much in use today even if no longer very fashionable; synchronous processing is rarely as necessary as one might think. However, the improvement of networks and the increase in bandwidth has led to the "global reach" of an increasing number of applications. As a result, shutting down your online transaction processing (OLTP) application running in the American Midwest may become difficult because of East Asian users connected during one part of the Midwestern night and European users connected during the other part. Batch programs can no longer assume that they are running on empty machines. Moreover, ever-increasing volumes of data may require that incoming data is processed immediately rather than being allowed to accumulate into unmanageably large data sets. Processing streams of data may simply be the most efficient way to manage such quantities.
The way you process data is not without influence on the way you "think" of your system, especially in terms of physical structures—which I talk about more in Chapter 5. When you have massive batch programs, you are mostly interested in throughput—raw efficiency, using as much of the hardware resources as possible. In terms of data processing, a batch program is in the realm of brute force. When you are processing data on the fly, most activity will be small queries that are going to be repeatedly executed a tremendous number of times. For such queries, performing moderately well is not good enough—they have to perform at the maximum possible efficiency. With an asynchronous program, it is easy to notice that something is wrong (if not always easy to fix): it just takes too long to complete. With synchronous processing, the situation is much more subtle, because performance problems usually show up at the worst moment, when there are surges of activity. If you are not able to spot weaknesses early enough, your system is likely to let you down when your business reaches maximum demand levels—the very worst time to fail.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Centralizing Your Data
For all the talk about grids, clustered servers, and the like, spreading data across many servers means adding a considerable amount of complexity to a system. The more complicated a structure—any type of structure—the less robust it is. Technological advance does indeed slowly push up the threshold of acceptability. In the eighteenth century, clocks indicating the minutes were considered much less reliable than those indicating only the hour, and much more reliable than those showing the day in the month or the phases of the moon. But nevertheless, try to keep the theater of operations limited to that which is strictly required.
Transparent references to remote data are performance killers, for two reasons. First, however "transparent" it may look, crossing more software layers and a network has a heavy cost. To convince yourself, just run a procedure that inserts a few thousands rows into a local table, and another one doing the very same thing across—for instance, an Oracle database link, even on the same database—you can expect performance to be in the neighborhood of five times slower, if not worse, as you see demonstrated in Chapter 8.
Second, combining data from several sources is extremely difficult. When comparing data from source A to data from source B, you have no choice other than literally copying the data from A to B or the reverse. Transfer is one significant overhead. Data drawn from its own carefully constructed environment no longer benefits from the planning which went into establishing that environment (carefully thought-out physical layout, indexes, and so forth). Instead, that data lands in some temporary storage—in memory if the amount of data transferred is modest, otherwise on disk. The management of temporary storage is another major overhead. In a case where nested loops would be, in theory, the most efficient way to proceed when querying local data, an optimizer is left with two unattractive possibilities when some of the data is remotely located:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
System Complexity
Other points to keep in mind when designing are what will happen if some piece of hardware breaks (for example, a disk controller) or if some mistake is made (for instance, the same batch program is applied twice). Even if your administrators are wizards who are doing night shifts to bring everything back on course by dawn, transfer rates are limited; the recovery of a huge database always takes a lot of time. "Spare" backup databases maintained in synch (or with some slight delay) may help. But backup databases will not be of any use in the case of a program inadvertently run twice, especially if the synchronization delay is shorter than the execution time of the program. What is already complicated with one database becomes a nightmare with several related databases, because you must be perfectly certain that all the databases are correctly synchronized after any recovery, to avoid any risk of data corruption.
This particular point of recovery is often a bone of contention between developers and database administrators, because developers tend to consider, not unreasonably, that backups and recoveries belong to administrators, while administrators point out, logically, that if they can guarantee that the container is in working order, they have no idea about the status of the contents. Indeed, any functional check in case of recovery should not be forgotten by developers. The more complicated the overall design, the more important it is for developers to keep in mind the constraints of operations.

Important

Database systems are joint ventures; they need the active and cooperative participation of users, administrators, and developers.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Completed Plans
We have reviewed the basic foundations for laying plans in constructing a database system. We have reviewed the fundamentals of data modeling , and in particular the broad steps involved in normalizing data to third normal form. We have then proceeded to review a number of scenarios, in which a faulty design can be identified as the road to disaster.
Most examples in this chapter come directly from or are inspired by cases I have encountered in some big companies. And it is always striking to consider how much energy and intelligence can be wasted trying to solve performance problems that are born from the ignorance of elementary design principles. Such performance issues need not be present, yet they are quite common and often made worse by further denormalization of what is already a questionable design, on the unassailable grounds of "performance improvement." One query may, in fact, run much faster, but unfortunately, the nightly batch program now takes twice as long. In this way, and almost without being noticed, a full information system is built on a foundation of sand.

Important

Successful data modeling is the disciplined application of what are, fundamentally, simple design principles.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 2: Waging War
Accessing Databases Efficiently
Général Antoine-Henri de Jomini (1779-1869)
Précis de l'Art de la Guerre
Il existe un petit nombre de principes fondamentaux de la guerre, dont on ne saurait s'écarter sans danger, et dont l'application au contraire a été presque en tous temps couronnée par le succès.
There exist a small number of fundamental principles of war, which it is dangerous to ignore: indeed, following these principles has almost invariably led to success.
Anybody who has ever been involved in the switch from development to production of a critical system knows how much it can feel like the noise and tumult of battle. Very often, a few weeks before D-Day, performance tests will show that the new system is going to fall short of expectations. Experts are brought in, SQL statements are fine-tuned, and database and system administrators are called to contribute to a succession of crisis meetings. Finally, performance vaguely comparable to the previous system is obtained on hardware that is now twice as expensive as the original installation.
Tactics are often used as a substitute for a strategic approach. The latter demands the adoption of a sound overall architecture and design. As in war, the basic principles here are also few, but too often ignored. Architectural mistakes can prove extremely costly, and the SQL programmer must enter the battle fully prepared, knowing where to go and how to get there. In this chapter, we are going to review the key goals that will increase our chances of success in writing programs that access databases efficiently.
For centuries, the only means that a general had to check the progress of his troops during the heat of battle was to observe the position of his units as indicated by the color of the soldiers' uniforms and the flags they were carrying. When some process in the database environment is consuming an inordinate amount of CPU, it is often possible to identify which piece of SQL code is actually running. But it is very often much more difficult, especially in a large and complicated system that includes dynamically built queries, to identify which precise part of a given application issued that statement and needs reviewing. Despite the fact that many products have good monitoring facilities, it is sometimes surprisingly difficult to relate an SQL statement to its broader environment. Therefore, you should adopt the habit of identifying your programs and critical modules whenever possible by inserting comments into your SQL to help identify where in the programs a given query is used. For instance:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Query Identification
For centuries, the only means that a general had to check the progress of his troops during the heat of battle was to observe the position of his units as indicated by the color of the soldiers' uniforms and the flags they were carrying. When some process in the database environment is consuming an inordinate amount of CPU, it is often possible to identify which piece of SQL code is actually running. But it is very often much more difficult, especially in a large and complicated system that includes dynamically built queries, to identify which precise part of a given application issued that statement and needs reviewing. Despite the fact that many products have good monitoring facilities, it is sometimes surprisingly difficult to relate an SQL statement to its broader environment. Therefore, you should adopt the habit of identifying your programs and critical modules whenever possible by inserting comments into your SQL to help identify where in the programs a given query is used. For instance:
    /* CUSTOMER REGISTRATION */ select blah ...
These identifying comments can be important and helpful in subsequently tracking down any erroneous code. They can also be helpful when trying to determine how much load is put on a server by a single application, especially when some localized increase in activity is expected and when you are trying to assess whether the current hardware can absorb the surge.
Some products have special registration facilities that can spare you the admittedly tedious step of commenting each and every statement. Oracle's dbms_application_info package allows you to register a program using a 48-character module name, a 32-character action name, and a 64-character client information field. The content of those fields is left to your discretion. In an Oracle environment, you can use this package to keep track not only of which application is running, but also what that application is doing at any given time. This is because you can easily query the information that your application passes to the package through the Oracle V$ dynamic views that show what is currently happening in memory.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Stable Database Connections
A new database connection can be created quickly and easily, but this ease can disguise the high cost of making repeated connections. You must manage the use of database connections with great care. The consequences of allowing multiple connections to occur, perhaps hidden within an application, can be substantial, as the next example illustrates.
Some time ago I came across an application in which numerous small files of up to an arbitrary maximum of 100 lines were being processed. Each line in these small text files contained both data and the identification of the database instance into which that data had to be loaded. In this particular case, there was a single server, but the principle being illustrated is exactly the same as if there were a hundred database instances.
The process for each file was coded as follows:
    Open the file
    Until the end of file is reached
          Read a row
          Connect to the server specified by the row
          Insert the data
          Disconnect
    Close the file
This process worked quite satisfactorily, except for the occasional circumstance in which a large number of small files would arrive in a very short space of time, and at a rate greater than the ability of the application to process them. This resulted in a substantial backlog, which took considerable time to clear.
I explained the problem of performance degradation as a consequence of frequent connection and disconnection to the customer with the help of a simple program (written in C) emulating the current application. Table 2-1 gives the results from that demonstration.
The program generating the results in Table 2-1 used a conventional insert statement. I mentioned in passing to the customer the existence of direct-loading techniques that are even faster.
Table 2-1: Result of connect/disconnect performance tests
Test
Results
Connect/disconnect for each line in turn
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Strategy Before Tactics
Strategy defines the tactics, not the other way round. A skillful developer doesn't think of a process in terms of little steps, but in terms of the final result. The most efficient way to obtain that result may not be to proceed in the order specified in the business rules, but rather to follow a less obvious approach. The following example will show how paying too much attention to the procedural processes within a business can distract ones' attention from the most efficient solution.
Some years ago I was given a stored procedure to try to optimize; "try" is the operative word here. Two attempts at optimization had already been made, once by the original authors, and secondly by a self-styled Oracle expert. Despite these efforts, this procedure was still taking 20 minutes to run, which was unacceptable to the users.
The purpose of the procedure was to compute quantities of raw materials to be ordered by a central factory unit, based on existing stocks and on orders that were coming from a number of different sources. Basically, the data from several identical tables for each data source had to be aggregated inside one master table. The procedure consisted of a succession of similar statements simplified as follows. First, all data from each distinct source table were inserted into the single master table. Second, an aggregate/update was applied to each instance of raw material in that master table. Finally, the spurious data not relevant to the aggregate result was deleted from the table. These stages were repeated in sequence inside the procedure for every distinct source table. None of the SQL statements were particularly complex, and none of them could be described as being particularly inefficient.
It took the better half of a day to understand the process, which eventually prompted the question: why was this process being done in multiple steps? A subquery in a from clause with a union operator would allow the aggregation of all the various sources. A single
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Problem Definition Before Solution
A little knowledge can be a dangerous thing. Frequently, people may have read or heard about new or unusual techniques—which in some cases can indeed be quite interesting—and then they will try to fit their problem to one of these new solutions. Ordinary developers and architects often jump quickly on to such "solutions," which often turn out to be at the root of many subsequent problems.
At the top of the list of ready-made solutions, we usually meet denormalization. Blissfully unaware of the update nightmare that it turns out to be in practice, denormalization advocates often suggest it at an early stage in the hunt for "performance"--and in fact often at a point in the development cycle when better design (or learning how to use joins) is still an option. A particular type of denormalization, the materialized view, is also often seen as being something of a panacea. (Materialized views are sometimes referred to as snapshots , a less impressive term, but one that is closer to the sad reality: copies of data at one point in time.) This is not to say that sometimes, as a last resort option, theoretically questionable techniques cannot be used. To quote Franz Kafka: "Logic is doubtless unshakable, but it cannot withstand a man who wants to go on living."
But the immense majority of problems can be solved using fairly traditional techniques in an intelligent manner. Learn first how to get the best of simple, traditional techniques. It's only when you can fully master them that you will be able to appreciate their limitations, and then to truly be able to judge the potential advantage (if any) of new technical solutions.
All technological solutions are merely means to an end; the great danger for the inexperienced developer is that the attractions of the latest technology become an end in themselves. And the danger is all the greater for enthusiastic, curious, and technically minded individuals!

Important

Foundations before Fashion: learn your craft before playing with the latest tools.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Stable Database Schema
The use of data definition language (DDL) to create, alter, or drop database objects inside an application is a very bad practice that in most cases should be banned. There is no reason to dynamically create, alter, or drop objects, with the possible exception of partitions—which I describe in Chapter 5--and temporary tables that are known to the DBMS to be temporary tables. (We shall also meet another major exception to this rule in Chapter 10.)
The use of DDL is fundamentally based on the core database data dictionary. Since this dictionary is also central to all database operations, any activity on it introduces global locks that can have massive performance consequences. The only acceptable DDL operation is truncate table, which is a very fast way of emptying a table of all rows (without the protection of rollback recovery, remember!).

Important

Creating, altering, or dropping database objects belong to application design, not to regular operations.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Operations Against Actual Data
Many developers like to create temporary work tables into which they extract lists of data for subsequent processing, before they begin with the serious stuff. This approach is often questionable and may reflect an inability to think beyond the details of the business processes. You must remember that temporary tables cannot offer storage options of the same degree of sophistication as permanent tables (you see some of these options in Chapter 5). Their indexing, if they are indexed, may be less than optimal. As a result, queries that use temporary tables may perform less efficiently than well-written statements against permanent tables, with the additional overhead of having to fill temporary tables as a prerequisite to any query.
Even when the use of temporary tables is justified, they should never be implemented as permanent tables masquerading as work tables if the number of rows to be stored in them is or can be large. One of the problems lies in the automated collection of statistics: when statistics are not collected in real time, they are typically gathered by the DBMS at a time of zero or low activity. The nature of work tables is that they will probably be empty at such slack times, thus giving a wholly erroneous indicator to the optimizer. The result of this incorrect, and biased, statistical data can be totally inappropriate execution plans that not surprisingly lead to dismal performance. If you really have to use temporary storage, use tables that the database can recognize as being temporary.

Important

Temporary work tables mean more byte-pushing to less suitable storage.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Set Processing in SQL
SQL processes data in complete sets. For most update or delete operations against a database —and assuming one is not operating against the entire table contents—one has to define precisely the set of rows in that table that will be affected by the process. This defines the granularity of the impending process, which may be described as coarse if a large number of rows will be affected or as fine if only few rows will be involved.
Any attempt to process a large amount of data in small chunks is usually a very bad idea and can be massively inefficient. This approach can be defended only where very extensive changes will be made to the database which can, first, consume an enormous amount of space for storing prior values in case of a transaction rollback, and second, take a very long time to rollback if any attempted change should fail. Many people would argue that where very considerable changes are to be made, regular commit statements should be scattered throughout the data manipulation language (DML) code. However, regular commit statements may not help when resuming a file upload that has failed. From a strictly practical standpoint, it is often much easier, simpler, and faster to resume a process from the start rather than try to locate where and when the failure occurred and then to skip over what has already been committed.
Concerning the size of the log required to rollback transactions in case of failure, it can also be argued that the physical database layout has to accommodate processes, and not that processes have to make do with a given physical implementation. If the amount of undo storage that is required is really enormous, perhaps the question should be raised as to the frequency with which changes are applied. It may be that switching from massive monthly updates to not-so-massive weekly ones or even smaller daily ones may provide an effective solution.

Important

Thousands of statements in a cursor loop for endless batch processing, multiple statements applied to the same data for users doomed to wait, one swoop statement to outperform them all.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Action-Packed SQL Statements
SQL is not a procedural language. Although procedural logic can be applied to SQL, such approaches should be used with caution. The confusion between procedural and declarative processing is most frequently seen when data is required to be extracted from the database, processed, and then re-inserted back into the database. When a program—or a function within a program—is provided with some input value, it is all too common to see that input value used to retrieve one or several other values from the database, followed by a loop or some conditional logic (usually if...then...else) being applied to yet other statements applied to the database. In most cases, this behavior is the result of deeply ingrained bad habits or a poor knowledge of SQL, combined with a slavish obsession with functional specifications. Many relatively complex operations can be accomplished in a single SQL statement. If the user provides some value, try to get the result set that is of interest without decomposing the process into multiple statements fetching intermediate results of only minimal relevance to the final output.
There are two main reasons for shunning procedural logic in SQL:
Any access to the database means crossing quite a number of software layers, some of which may include network accesses.
Even when no network is involved, there will be interprocess communications; more accesses mean more function calls, more bandwidth, and more time waiting for the answer. As soon as those calls are repeated a fair number of times, the impact on process performance can become distinctly perceptible.
Procedural means that performance and future maintenance burdens fall to your program.
Most database systems incorporate sophisticated algorithms for executing operations such as joins, and for transforming queries so as to execute them in a more efficient way. Cost-based optimizers (CBOs) are complex pieces of software that have sometimes grown from being totally unusable when originally introduced to becoming mature products, capable of giving excellent results in most cases. A good CBO can be extremely efficient in choosing the most suitable execution plan. However, the scope of operation of the CBO is the SQL statement, nothing more. By doing as much as possible in a single statement, you shift the burden of achieving the best possible performance from your program to the DBMS kernel. You enable your program to take advantage of any improvement to the DBMS code, and therefore you are indirectly shifting a large part of the future maintenance of your program to the DBMS vendor.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Profitable Database Accesses
When you plan a visit to several shops, the first step is to decide what purchases have to be made at each shop. From this point, a trip is planned that will ensure minimum repetitive walking backward and forward between different shops. The first shop is then visited, the purchase completed, and then the next closest shop is visited. This is only common sense, and yet the principle underlying this obvious approach is not seen in the practical implementation of many database programs.
When several pieces of information are required from a single table—even if it appears as if they are "unrelated" (which in fact is unlikely to be the case)--it is highly inefficient to retrieve this data in several separate visits to the database. For example, do not fetch row values column by column if multiple columns are required: do the work in one operation.
Unfortunately, good object-oriented (OO) practice makes a virtue out of defining one method for returning each attribute. But do not confuse OO methods with relational database processing. It is a fatal mistake to mix relational and object-oriented concepts and to consider tables to be classes with columns as the attributes.

Important

Maximize each visit to the database to complete as much work as can reasonably be achieved for every visit.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Closeness to the DBMS Kernel
The nearer to the DBMS kernel your code can execute, the faster it will run. This is where the true strength of the database lies. For example, several database management products allow you to extend them by adding new functions, which can sometimes be written in comparatively low-level languages such as C. The snag with a low-level language that manipulates pointers is that if you mishandle a pointer, you can end up corrupting memory. It would be bad enough if you were the only user affected. But the trouble with a database server is that, as the name implies, it can serve a large number of users: if you corrupt the server memory, you can corrupt the data handled by another, totally innocent program. As a consequence, responsible DBMS kernels run code in a kind of sandbox, where it can crash without taking everything with it in its downfall. For instance, Oracle implements a complicated communication mechanism between external functions and itself. In some ways, this process is similar to that which controls database links, by which communication between two (or more) database instances on separate servers is managed. If the overall gain achieved by running tightly tailored C functions rather than stored PL/SQL procedures is greater than the costs of setting up an external environment and context-switching, use external functions. But do not use them if you intend to call a function for every row of a very large table. It is a question of balance, of knowing the full implications of the alternative strategies available to solve any given problem.
If functions are to be used, try to always use those that are provided by the DBMS. It is not merely a matter of not reinventing the wheel: built-in functions always execute much closer to the database kernel than any code a third-party programmer can construct, and are accordingly far more efficient.
Here is a simple example using Oracle's SQL that will demonstrate the efficiencies to be gained by using Oracle functions. Let's assume we have some text data that has been manually input and that contains multiple instances of adjacent "space" characters. We require a function that will replace any sequence of two or more spaces by a single space. Ignoring the regular expressions available since Oracle Database 10
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Doing Only What Is Required
Developers often use count(*) for no purpose other than to implement an existence test. This usually happens as a result of a specification such as:
    If there are rows meeting a certain condition
    Then do something to them
which immediately becomes:
    select count(*)
    into counter
    from table_name
    where <certain_condition>

    if (counter > 0) then
Of course in 90% of the cases the count(*) is totally unnecessary and superfluous, as in the above example. If an action is required to operate on a number of rows, just do it. If no row is affected, so what? No harm is done. Moreover, if the process to be applied to those hypothetical rows is complex, the very first operation will tell you how many of them were affected, either in a system variable (@@ROWCOUNT with Transact-SQL, SQL%ROWCOUNT with PL/SQL, and so forth), in a special field of the SQL Communication Area (SQLCA) when using embedded SQL, or through special APIs such as mysql_affected_rows( ) in PHP. The number of processed rows is also sometimes directly returned by the function, which interacts with the database, such