A subquery is a SELECT statement embedded within another SQL statement. You can use subqueries in place of column references, in place of table references, to generate values for comparison in a WHERE or HAVING condition, and to generate values for inserts and updates.
So far in this chapter, you’ve selected data from tables. A SELECT statement is executed, and a set of rows comes back as the result. Imagine if you could further treat that set of rows as a table against which you issue another SELECT statement or perhaps an UPDATE or a DELETE. Using a subquery, you can do just what I’ve described, and that can lead to some interesting and elegant solutions to SQL problems.
One use for a subquery in the FROM clause is to aggregate data that has already been aggregated. For example, you might be faced with the following business problem:
Find all employees who have worked on projects 1001 and 1002 during the year 2004. Sum the number of hours each of those employees has worked on each project during that year. Report the ranges. Show the lowest number of hours that any employee worked on each project during 2004, as well as the highest number of hours.
This is an interesting problem to solve because you must sum the hours for employee/project combinations using the SUM aggregate function, and you must apply MIN and MAX to your sums. Example 4-32 shows one way to approach this problem using a subquery.
SELECT phs.project_id, MIN(phs.hours), MAX(phs.hours)
FROM (SELECT employee_id, project_id, SUM(hours_logged) hours
WHERE project_id IN (1001, 1002)
AND TRUNC(time_log_date,'Year') = TO_DATE('1-Jan-2004', 'dd-Mon-yyyy')
GROUP BY employee_id, project_id) phs
GROUP BY project_id;PROJECT_ID MIN(PHS.HOURS) MAX(PHS.HOURS) ---------- -------------- -------------- 1001 4 20 1002 8 24
The subquery appears in the FROM clause enclosed in parentheses.
The outer query treats the rows from the subquery in the same manner
as it would treat rows from a table. Tables have names, and so should
subqueries. Example 4-32
gives the name, or alias,
the subquery. The outer query uses that name to refer to the columns
from the subquery. The alias
is given to the column represented by
SUM(hours_logged), making it easy to refer
to that column from the outer query.
The call to
TRUNC(time_log_date,'Year') converts each
date to January 1st of its year. Any date
in 2004 will be truncated to 1-Jan-2004. It’s easy, then, to
eliminate time log entries for other years.
Because subqueries in the FROM clause are treated the same as
tables, it stands to reason they can take part in joins. Example 4-33 expands on Example 4-32 by adding another
level of subquery and a join to the
project table in order to include project
names in the query output.
SELECT minmax_hours.project_id, p.project_name, minmax_hours.min_hours, minmax_hours.max_hours FROM ( SELECT phs.project_id, MIN(phs.hours) min_hours, MAX(phs.hours) max_hours FROM (SELECT employee_id, project_id, SUM(hours_logged) hours FROM project_hours WHERE project_id IN (1001, 1002) AND TRUNC(time_log_date,'Year') = TO_DATE('1-Jan-2004', 'dd-Mon-yyyy') GROUP BY employee_id, project_id) phs GROUP BY project_id) minmax_hours JOIN project p ON minmax_hours.project_id = p.project_id; PROJECT_ID PROJECT_NAME MIN_HOURS MAX_HOURS ---------- ---------------------------------------- ---------- ---------- 1001 Corporate Web Site 4 20 1002 Enterprise Resource Planning System 8 24
There are different ways you can approach the join to
project shown in Example 4-33. The join could
occur in the innermost subquery or in the middle subquery, and either
of those alternatives would eliminate the need for a third subquery.
Sometimes it’s a judgment call as to which approach is best. Sometimes
you need to make that call based on readability or on a desire not to
mess with a working query. By adding an outer SELECT statement to an
already working query, I avoided the need to tamper with a SELECT
statement, the middle one in Example 4-33, which I knew
There’s a performance issue, too, that isn’t obvious in the
amount of sample data used for the examples in this book. Joining to
project in the innermost query
would force the join to take place before any aggregation at all.
Given a large enough number of rows in
project_hours, a join that early would
significantly increase the expenditure of I/O and CPU resources by the
query, as all those detail rows would need to be joined. Having the
join occur where it does in Example 4-33 means that only
two rows, the two returned by the middle query, need to be joined to
Subqueries in the FROM clause are sometimes referred to as inline views, and such a subquery can be considered as a dynamically created view, for the purpose of the one query. Any subquery in the FROM clause can be replaced by an equivalent view, but then you have the problem of creating that view, which is a permanent database object.
You can see another interesting use of inline views, this time to generate rows of data that don’t already exist in the database, in the article Creating Pivot Tables at http://gennick.com/pivot.html.
You can use subqueries to see whether a row is representative of a set. The query in Example 4-32 contains the following WHERE condition:
WHERE project_id IN (1001, 1002)
This condition tests whether a row from the
project_hours table is associated with
project 1001 or 1002. Imagine a more complicated scenario. Imagine
that you’re interested in all projects having budgets of $1,000,000 or
more. You don’t know which projects have such large budgets.
Furthermore, budgets change from time to time, and you don’t want to
have to modify the project ID numbers in your query each time your
budgets change. Instead of hard-coding the
project_id list for the IN predicate, you
can generate that list using a subquery in the IN predicate, as shown
in Example 4-34.
SELECT phs.project_id, MIN(phs.hours), MAX(phs.hours) FROM (SELECT employee_id, project_id, SUM(hours_logged) hours FROM project_hours WHERE project_id IN (SELECT project_id FROM project WHERE project_budget >= 1000000) AND TRUNC(time_log_date,'Year') = TO_DATE('1-Jan-2004','dd-Mon-yyyy') GROUP BY employee_id, project_id) phs GROUP BY project_id;
When you execute the SELECT statement shown in Example 4-34, the IN
predicate’s subquery is executed first to generate a list of project
IDs encompassing all projects having budgets of $1,000,000 or more.
Another approach to this problem is to issue a subquery for each
project_hour row to see whether the
associated project has the required budget. This approach is shown in
SELECT phs.project_id, MIN(phs.hours), MAX(phs.hours) FROM (SELECT employee_id, project_id, SUM(hours_logged) hours FROM project_hours ph WHERE EXISTS (SELECT * FROM project pb WHERE pb.project_budget >= 1000000 AND pb.project_id = ph.project_id) AND TRUNC(time_log_date,'Year') = TO_DATE('1-Jan-2004','dd-Mon-yyyy') GROUP BY employee_id, project_id) phs GROUP BY project_id;
The subquery in Example
4-35 is known as a correlated subquery,
meaning that the subquery references a value from the parent
statement. Aliases are important when writing such subqueries. The
is aliased as
ph. The subquery
references the project ID from each candidate row from the main query
via the reference to
The subquery is executed once for each row that can potentially be
returned by the main query.
Subqueries prior to Example 4-35 have all been non-correlated, which means that you can execute them independently of their enclosing statement. Non-correlated subqueries are executed just once per execution of their containing query. Correlated subqueries are executed once per row. Either subquery may be written using IN or EXISTS, but typically you’ll find that IN subqueries are non-correlated and EXISTS subqueries are correlated.
It’s pretty much impossible to provide a general rule as to when
to use IN and EXISTS predicates when either can be used to solve a
particular problem. Given the low number of rows in the
project table as compared to the
project_hours table, you might think that
the SELECT in Example
4-34 would be more efficient than the one in Example 4-35. After all, the
IN predicate’s subquery must execute only once, and return only two
values. However, in my database, with the release of Oracle that I’m
running and the sample data I’m using for this book, the SELECT in
Example 4-35 requires
far less input/output than Example 4-34. A good practice
is to test both approaches, perhaps using SET AUTOTRACE ON as
described in Chapter 12, and
then use the one that performs best for your particular query.
Subqueries are useful in generating values for INSERT and UPDATE statements. Examples Example 4-36 and Example 4-37 show two ways of creating a reporting table of project billing data summarized by week. You might generate, and periodically refresh, such a table to make it easier for end users to check on project status using ad hoc query tools.
Example 4-36 uses a CREATE TABLE AS SELECT FROM statement, which lets you create and populate the reporting table in one step. Example 4-37 creates the reporting table separately, and then populates the table using an INSERT...SELECT FROM statement. In each case, a subquery generates the data to be inserted into the new table.
DROP TABLE project_time; CREATE TABLE project_time AS SELECT EXTRACT (YEAR FROM time_log_date) year, TO_NUMBER(TO_CHAR(time_log_date,'ww')) week_number, p.project_id, p.project_name, SUM(ph.hours_logged) hours_logged FROM project p JOIN project_hours ph ON p.project_id = ph.project_id GROUP BY EXTRACT (YEAR FROM time_log_date), TO_NUMBER(TO_CHAR(time_log_date,'ww')), p.project_id, p.project_name;
DROP TABLE project_time; CREATE TABLE project_time ( year NUMBER, week_number NUMBER, project_id NUMBER(4), project_name VARCHAR2(40), hours_logged NUMBER); INSERT INTO project_time SELECT EXTRACT (YEAR FROM time_log_date), TO_NUMBER(TO_CHAR(time_log_date,'ww')), p.project_id, p.project_name, SUM(ph.hours_logged) FROM project p JOIN project_hours ph ON p.project_id = ph.project_id GROUP BY EXTRACT (YEAR FROM time_log_date), TO_NUMBER(TO_CHAR(time_log_date,'ww')), p.project_id, p.project_name;
Having created the
project_time table with its redundant
project_name column, it’s a given
that someone, someday, will come along and change a project name on
you, leaving you to sort out the resulting mess by somehow propagating
the new project name to all the summary rows in
project_time. That kind of update is easily
done using a subquery in the SET clause of an UPDATE statement, as in
UPDATE project_time pt SET pt.project_name = (SELECT p.project_name FROM project p WHERE p.project_id = pt.project_id) WHERE pt.project_name <> (SELECT p.project_name FROM project p WHERE p.project_id = pt.project_id)
The UPDATE in Example
4-38 reads each row in the
project_time table and updates those
project_time rows subject to name