Chapter 4. Hierarchies in SQL
Hierarchical structures have a sort of nondeterministic nature in that the exact structure is determined only when you populate the hierarchy with data. This makes them appealing for various sorts of applications. An employment hierarchy is a classical example of such a structure. A company will have employees. Supervisors will be assigned to lead groups of employees. Those supervisors, in turn, will report to managers. Low-level managers will report to higher-level managers. Eventually, you get to the top where you often find a chief executive officer (CEO). If you sketch it out, the typical company organization chart will look like an upside down tree, with some branches in the tree being wider and narrower than others. Hierarchical structures are widely used in procedural languages, such as C, but are rather underutilized in SQL, because they conflict with the inherent table-shaped data layout of relational databases.
Almost any type of complex application can make use of a hierarchical model to some degree or another. The recipes in this chapter show you how to manipulate hierarchical data using Transact-SQL. This chapter will discuss three major topics:
Some vendors, Oracle Corporation being among the most notable, have extended their SQL syntax with additional functionality to support querying hierarchical data. Unfortunately, neither Sybase nor Microsoft have chosen to implement such extensions in Transact-SQL. Although the Transact-SQL language hasn’t been designed to support dynamic structures such as trees, programmers have come up with some reasonably efficient design patterns that address the common operations that you need to perform on hierarchical data. As an added bonus, the lack of vendor-specific functionality tends to make these solutions more easily portable across database platforms.
Types of Hierarchies
There are two types of hierarchies that we recognize in this chapter: specialized hierarchies and general hierarchies. In a specialized hierarchy, you are limited in the maximum depth of the hierarchy. As you’ll see in the recipes, SQL is quite unique since you can implement proper general hierarchies with unlimited depth. In a general hierarchy, you cannot define a maximum depth.
In the first part of this chapter, we are going to show you how specialized hierarchies can be implemented in a SQL Server database. We’ll also show you how you can make use of their special property. Our recipe solutions are based on a real-world example of a hierarchical structure used to manage permissions in an online trading system.
Next, our discussion will turn to general hierarchies. Because general hierarchies allow nodes to be nested to any depth, manipulating them requires using generalized solutions. Often, these solutions involve recursion. We’ll begin by showing you what you can do using a pure hierarchical model, and we’ll show both recursive and nonrecursive algorithms. Then, we’ll show you how you can improve the efficiency of a hierarchical model by adding what we term service information to it.
Before going further into the chapter, we’d like to review some of the terminology that we use when we discuss hierarchies. Figure 4-1 illustrates a hierarchical structure, and many of the elements are labeled with the terms that we use to refer to them.
The following list contains our definitions for the terms shown in Figure 4-1, as well as for some other terms that we use in this chapter:
Refers to a specific manifestation of a hierarchical structure. A specific organizational chart would be an example of a tree. As Figure 4-1 shows, hierarchically structured data may consist of more than one tree.
- Leaf node
Refers to a node that has other nodes (child nodes) underneath it.
In a hierarchy, child nodes are subordinate to, and belong to, their parents. If you delete a node, or move a node, that operation affects not only the node on which you are operating, but also all of its children. For example, if you delete node 2, as shown in Figure 4-1, you are also implicitly deleting nodes 3 and 4.
Specialized hierarchies can be found in almost any real-world application. However, they aren’t always manifested in an obvious way, and programmers often don’t even realize that what they are actually doing is programming a hierarchy. When programming for a specialized hierarchy, you can take advantage of limitations such as a limited number of nodes, a limited number of levels, the fact that only leaves hold data, or some other specialized aspect of the hierarchy in question.
As a basis for the recipes in this chapter, we’ve used a permission-hierarchy example that is taken from an online trading system that is actually in production use. We’ve simplified our example somewhat for demonstration purposes. The purpose of the mechanism is to implement support for an order-routing system in which traders have different permissions relating to trades that they can make. An automatic system processes orders according to given permission rules. If a trader sends an order through a system for which he is not authorized, the system rejects the order before submitting it to the market. The following data is an example of some of the rules that the system must enforce:
Trader Type Limit --------------------------- Alex Shares 100 Alex Futures 10 Alex Bond 10 Cindy Shares 1000 Cindy Bonds 1000 Michael Shares 200 ...
As you can see, there are several types of permissions granted to each trader. Alex, for example, can trade shares of stock in blocks of no more than 100 shares. Cindy, on the other hand, is allowed to trade in blocks of up to 1,000 shares. If you were designing a database structure to hold rules like these, the simplistic approach would be simply to define permissions for each specific trader using the system. However, it’s likely to be more practical to group those permissions so that supervisors can easily grant a set of permissions to a particular trader. For example, you could have permission groups for commodity traders, future traders, and spot market traders. The resulting hierarchy would be the limited hierarchy shown in Figure 4-2.
Some might look at this scheme and see two distinct entities: groups and permissions. That’s probably a valid view, but we would argue that, at a higher level of abstraction, this structure represents a limited hierarchy of only two levels. The highest level is the group level, and the level below that is the permission level. When writing code to query and manage this hierarchy, you can take advantage of the fact that it’s limited to only two levels of depth. You can use solutions that are specific for that case that are more efficient than those that you would need to navigate a more generalized and unrestricted hierarchy. The recipes we have in this chapter highlight the considerations that you need to take into account, as well as the efficiencies that you gain, when working with a specialized hierarchy such as that shown in Figure 4-2.
The standard model for representing general hierarchies is to set up a parent/child relationship. The implementation in Transact-SQL, however, is different from tree-structure implementations in other programming languages. Hierarchical structures implemented in Transact-SQL have two characteristics that distinguish them from hierarchical structures implemented in a programming language such as C:
A lack of pointers
An unlimited number of children
Programmers using C, or other procedural languages, to implement hierarchies usually deal with fixed-size structures in which the parent and child nodes are linked with pointers. Moving from one node to another is fairly inexpensive in terms of the resources that are consumed, because such a move involves only the swapping of a couple of pointers in memory. With SQL, on the other hand, a move between nodes usually involves disk I/O, because each node ends up being a row in a table. Disk I/O is expensive in terms of the time and resources consumed, so SQL programmers need to be very careful about the number of visits they make to each node. An advantage of SQL, though, is that it’s easy to allow for an unlimited number of children. Unlike the fixed structures often used in procedural languages, database tables have no arbitrary limit on the number of rows they may contain.
The design of a hierarchy is a good example of a case in which how you model your data is much more important than the actual algorithms that you use to manipulate your data. The following CREATE TABLE statement shows the structure most often used to represent general hierarchies in a relational database:
CREATE TABLE Hierarchy( VertexId INTEGER, Parent INTEGER, ... PRIMARY KEY(VertexId) )
In this structure, each row in the Hierarchy table represents one node. The VertexId field contains a unique address for each node. Each node then points back to its parent by storing the parent’s VertexId value in the Parent field. Using such a structure, you can create an unlimited number of children for each parent.
SQL programmers have been using this type of structure for some time, and it looks great from a design point of view. However, SQL has strangely weak support for hierarchies. If you aren’t careful, the code you write to manipulate a hierarchical structure, such as the one shown here, can be inefficient and lead to a host of performance problems.
We have several recipes in this chapter that show reasonably efficient solutions for common operations that you’ll need to perform on generalized hierarchical data. These solutions are based on the example of a project-tracking database that you manage. In this project-tracking database, you store data about projects and their subprojects. Any given project can be broken up into an arbitrary number of subprojects for planning purposes. Those subprojects can be further broken up into even smaller subprojects. This process can continue for any number of levels of depth. Eventually, you get down to the bottom-level in which the subprojects have no children. A subproject with no children is referred to as a task. Your job is to build SQL queries, procedures, and functions in support of a project-tracking application.
Data about projects and subprojects is stored in the Projects table, which is defined as follows:
CREATE TABLE Projects( Name VARCHAR(20), Cost INTEGER, Parent INTEGER, VertexId INTEGER, Primary key(VertexId) )
The first two columns — the Name and Cost columns — represent the actual project and subproject data. In a real application, you would have more information than that, but these two columns will suffice for the purposes of example. The remaining two columns in the table are used to define the hierarchy. The VertexId column contains a unique identifier for each project, subproject, and task. The Parent column in any given row points to the parent project or subproject for that row. Figure 4-3 provides a visual illustration of this hierarchy.
The scripts named ch04.*.objects.sql (downloadable from this book’s web site) will create and populate all the tables used in this chapter’s recipes. When you run the script, you’ll get the following data in the Projects table:
Name Cost Parent VertexId -------------------- ----------- ----------- ----------- New SW 0 0 1 Specifications 0 1 2 Interviews 5 2 3 Drafts 10 2 4 Consolidations 2 2 5 Final document 15 2 6 Presentation 1 6 7 Prototype 0 1 8 UI Design 10 8 9 Calculations 10 8 10 Correctness Testing 3 10 11 Database 10 8 12 Development 30 1 13 UI Implementation 10 13 14 Coding 20 13 15 Initial testing 40 13 16 Beta testing 40 1 17 Final adjustments 5 17 18 Production testing 20 1 19
As you can see, the Parent column is used to maintain the hierarchical structure of the project data. Each child has only one parent, so only one Parent column is needed for each row in the table. Consequently, an unlimited number of children can be created for any given project or subproject. Figure 4-4 shows a graphical view of our example project.
Creating a Permission Hierarchy
You need to create a system for maintaining a simple permission hierarchy for a trading system. This hierarchy needs to be similar to the one described earlier in Section 4.1.1. Permissions should be manageable in groups so a set of permissions can be granted to a trader easily. In addition, your solution needs to allow for permission types to be added dynamically while the system is in production.
Define two tables. In the first table, define groups of permissions; in the second table, link those groups to particular traders. Each trader can belong to more than one permission group. The table defining the permission groups can be defined as follows:
CREATE TABLE GroupPermissions( GroupId VARCHAR(20) NOT NULL, ProductType VARCHAR(10) NOT NULL, Status CHAR(1) CHECK(status in ('V','S')) DEFAULT('V'), Limit NUMERIC(10,2) NULL, PRIMARY KEY(GroupId,ProductType) )
In our GroupPermissions table, there are
attributes describing trading permissions for each product group. One
attribute is a
status attribute. The status of
each trading permission can be set to valid (
or suspended (
'S'). The second attribute is an
order limit that defines the maximum permitted size for a trade
involving the product group. If a trader is authorized to trade bonds
and has a trading limit of 1,000, that trader is not allowed to
initiate a trade involving more than 1,000 bonds. The following is a
representative sample of the type of data that you could store in
GroupId ProductType Status Limit -------------------- ----------- ------ ------------ Debt Bill V 10000.00 Debt Bond V 10000.00 Derivatives Future V 200.00 Derivatives Option V 100.00 Equities Share V 1000.00
The second table needs to keep track of the permission groups assigned to each trader. Its definition can be as follows:
CREATE TABLE GroupMembership( AccountId VARCHAR(20) NOT NULL, GroupId VARCHAR(20) NOT NULL PRIMARY KEY(AccountId,GroupId) )
This table simply links accounts with groups. Whenever a trader is assigned to a new group or removed from an existing group, the table must be updated to reflect that fact. The following sample shows the group assignments for three traders:
AccountId GroupId -------------------- -------------------- Alex0001 Debt Alex0001 Derivatives Alex0001 Equities Betty0002 Derivatives Charles0003 Debt
You can see from this example that a single trader can belong to multiple groups. In this case, Alex0001 belongs to three groups: Debt, Derivatives, and Equities. In the next section, we’ll describe one way to handle cases where a trader’s permissions from one group conflict with his permissions from another group.
The main goal of our solution is to separate the definition of permissions and groups from the assignment of groups to traders. Please note that the GroupPermissions table is not fully normalized. To normalize it, you would have to separate the definition of groups from the definition of products and then use an intersection table to define the Limit and Status attributes. The lack of normalization here simplifies the example somewhat, but it doesn’t affect the overall behavior of our design nor does it affect the queries that you’ll see in this section.
The goal of normalization is to provide specific benefits to database design — reduced redundancy being the foremost benefit. However, there are many situations in which reduced redundancy is less valuable than the performance gains you get from a denormalized structure.
Checking permissions for a trader
Queries regarding trading authorizations can easily be performed on the model that we’ve designed. For example, to list the trading permissions for the account Alex0001, all you need is the following simple join:
SELECT m.AccountId, g.ProductType, MIN(g.Limit) Limit FROM GroupMembership m JOIN GroupPermissions g ON m.groupId=g.groupId WHERE Status='V' AND AccountId='Alex0001' GROUP BY m.AccountId, g.ProductType AccountId ProductType Limit -------------------- ----------- ------------ Alex0001 Bill 10000.00 Alex0001 Bond 10000.00 Alex0001 Future 200.00 Alex0001 Option 100.00 Alex0001 Share 1000.00
While the results of this query are straightforward, the use of the GROUP BY clause together with the aggregate function MIN deserves some additional explanation. The grouping allows us to deal with cases where more than one of a trader’s permission groups defines a limit for the same product. In our example data, both the Debt and Equities groups define a trading limit for bonds. Alex0001 is a member of both groups, so an ungrouped query would return the following two records:
AccountId ProductType Limit -------------------- ----------- ------------ Alex0001 Bond 10000.00 Alex0001 Bond 2000.00
With two limits for the same product, the question is which limit takes precedence. In our example query, we used the MIN function for the lower limit to take precedence over the higher limit. If you wanted the higher limit to take precedence, you would make that happen simply by using MAX instead of MIN. The GROUP BY clause is a requirement when using aggregate functions in this manner and ensures that only one permission is ultimately returned for each product that the trader is authorized to trade.
Revoking permissions for a group
The Status column in our design allows you to quickly and efficiently revoke a specific permission from a group or to suspend all permissions for a given group. For example, to suspend all trading permissions for the Debt group, you could execute the following statement:
UPDATE GroupPermissions SET Status='S' WHERE GroupId='Debt'
The previously shown query for checking permissions only looks at
valid permissions (
'V'); those you’ve just suspended
'S') will automatically be excluded. Their
definitions, however, will still exist in the database, so you can
easily enable them again when you want to do so.
The solution shown in this recipe is a general solution and can be used for almost any type of permission hierarchy with a limited number of levels. It can be used for security hierarchies where different levels of security authorization are defined, for storing information regarding types of users and their authorizations in a web-based authorization system, or for any other type of permission hierarchy where a large number of users or accounts is used.
Changing Individual Permissions
You’ve implemented the permission solution described in the previous section, and you are happy with the ability to assign permissions to groups of users, but you also want the ability to grant specific exceptions directly to specific traders.
The solution is twofold. First, create an additional table to record exceptional permissions granted directly to a trader. Second, modify the query that you use to obtain the permissions that are currently valid for a given trader. The additional table could look like the following AccountPermissions table:
CREATE TABLE AccountPermissions( AccountId VARCHAR(20) NOT NULL, ProductType VARCHAR(20) NOT NULL, Status CHAR(1) CHECK(Status in ('V','S')) DEFAULT('V'), Limit NUMERIC(10,2) NULL, PRIMARY KEY(AccountId, ProductType) )
In this table, insert records to record all the permissions that you want to assign to traders that are in addition to the permissions those users inherit from their groups. For example, suppose that Alex0001 had a share limit of 1,000 because he belonged to the equities group. Further suppose that you wanted to make Alex0001 an exception to the rule for that group and raise his share limit to 5,000. To do this, insert the following row into the AccountPermissions table:
AccountId ProductType Status Limit -------------------- -------------------- ------ ------------ Alex0001 Share V 5000.00
With the AccountPermissions table in place, you need to modify the query that you use to retrieve permissions for a given user. That query now needs to take into account this new table. The query in the following example, which returns permissions for Alex0001, will do that:
SELECT m.AccountId, g.ProductType, CASE WHEN isnull(MIN(a.Limit),0) > MIN(g.Limit) THEN MIN(a.Limit) ELSE MIN(g.Limit) END Limit FROM GroupMembership m JOIN GroupPermissions g ON (m.GroupId=g.GroupId) LEFT OUTER JOIN AccountPermissions a ON (m.AccountId=a.AccountId AND g.ProductType=a.ProductType) WHERE m.AccountId='Alex0001' GROUP BY m.AccountId, g.ProductType HAVING MIN(g.status)='V' AND isnull(MIN(a.status),MIN(g.status))='V' AccountId ProductType Limit -------------------- ----------- ------------ Alex0001 Bill 10000.00 Alex0001 Bond 2000.00 Alex0001 Future 200.00 Alex0001 Option 100.00
Alex0001 Share 5000.00
Note the emphasized line in the output. That line represents the permission to trade 5,000 shares that came from the AccountPermissions table.
The idea behind the permissions model described in this chapter is to have permissions set through group memberships whenever possible. However, in practice, you’ll often find that it is useful to have a tuning facility of sorts that allows you to directly change a specific permission for a specific user. To that end, we created the additional table named AccountPermissions in which such exceptions are recorded. Our new query joins the GroupPermissions table with the GroupMembership table to return permissions set at the group level. Those results are then joined with the AccountPermissions table, which adds account-specific exceptions to the result set. An outer join is used for this third table to make the account-specific exceptions optional.
The GROUP BY clause is again used in this new version of the query, for the same reason as before — we only want one permission to be returned for each product type. There is one difference, though: this time, the GROUP BY function includes more columns in its column list:
GROUP BY m.AccountId, g.ProductType
The CASE statement that you see in the query is used to decide which value to take if both individual and group permissions for the same account and product are present. It checks both values and reports just one:
CASE WHEN isnull(MIN(a.Limit),0) > MIN(g.Limit) THEN MIN(a.Limit) ELSE MIN(g.Limit) END Limit
In our case, our authorization policy is that account-specific permissions only take precedence over permissions granted at the group level when the account-specific limit is greater than the group-specific limit. The isnull( ) function takes care of cases where individual permissions are not set. It does this by supplying a zero value for those cases. Using a CASE statement like this is a very flexible approach, because you can easily implement different authorization policies. It would be trivial, for example, to change the CASE statement such that account-specific permissions always took precedence over group-level permissions, regardless of whether the account-specific permission specified a higher or lower limit.
Unlike the query shown previously in this chapter in which the Status flag was checked in the WHERE clause, this query takes the different approach of checking the Status flag in the HAVING clause. In fact, the query checks to be sure that both flags — for the group-level permission and for the account-specific permission — are set:
HAVING MIN(g.status)='V' AND isnull(MIN(a.status),MIN(g.status))='V'
In this case, if only one of the two applicable flags is set to
'S', the permission is revoked.
The solution shown in this recipe is very useful when you need to make exceptions for existing permissions set at the group level. However, it has one significant problem: you cannot define an account-specific permission for a trader in cases where a permission for the same product has not also been granted at the group level.
Adding New Individual Permissions
The previous recipe’s solution allows you to define individual exceptions to permissions that a trader already has at the group level. However, you wish to grant permissions to specific traders without regard to whether permissions for the same products have been granted at the group level.
In the previous recipe, you could define an exception to a permission that a trader had already been granted at the group level. For example, you could create the following two rows in AccountPermissions:
AccountId ProductType Status Limit -------------------- -------------------- ------ ------------ Alex0001 Share V 5000.00 Betty0002 Share V 8000.00
The query in the previous recipe, however, will only respect these account-specific permissions in cases where the traders in question have also been granted permission to trade shares at the group level. This limitation, if you wish to call it that, comes about as a result of the three-way join used in the query.
You may wish account-specific permissions to take effect all the time. To do that, you can use the same data model as shown in the previous recipe, but with a different query. The query shown in the following example is a union query and correctly returns both group-level and account-specific permissions for Betty0002:
SELECT m.AccountId, g.ProductType, MIN(g.Limit) Limit FROM GroupMembership m JOIN GroupPermissions g ON m.groupId=g.groupId WHERE Status='V' AND AccountId='Betty0002' AND NOT EXISTS(SELECT * FROM AccountPermissions a WHERE m.AccountId=a.AccountId AND g.ProductType=a.ProductType) GROUP BY m.AccountId, g.ProductType UNION SELECT a.AccountId, a.ProductType,a.Limit FROM AccountPermissions a WHERE a.AccountId='Betty0002' AND a.Status='V' AccountId ProductType Limit -------------------- -------------------- ------------ Betty0002 Share 8000.00 Betty0002 Future 200.00 Betty0002 Option 100.00
As you can see, even though Betty0002 has not been granted permission to trade shares at the group level, this query still picked up the account-level share permissions. The query in the previous recipe won’t do that.
The key to this query is that it is a union query. The first query in the union reports all those permissions that are defined only at the group level. The subquery in that query’s WHERE clause ensures that group-level permissions for products are excluded when account-specific permissions exist for those same products. The second query in the union then returns account-specific permissions. The results from the two queries are combined as a result of the UNION clause.
There are two drawbacks to the solution shown in this recipe. One is that this solution is less efficient than the one shown in the previous recipe. This is because there are three SELECT statements involved instead of just one. Another drawback is that this solution is inflexible in terms of which permission to use when permission for the same product is granted at both the group and the account level. In such cases, the account-specific permission always takes precedence over the group-level permission. A way to overcome this latter limitation is to create a more general UNION query that includes both group- and account-level permissions, embed that query in a view, and then manipulate the view with an appropriate query. The following statement creates such a view:
CREATE VIEW Permissions AS SELECT m.AccountId, g.ProductType, MIN(g.Limit) Limit FROM GroupMembership m JOIN GroupPermissions g ON m.groupId=g.groupId WHERE Status='V' GROUP BY m.AccountId, g.ProductType UNION SELECT a.AccountId, a.ProductType,a.Limit FROM AccountPermissions a WHERE a.Status='V'
This VIEW returns the group permissions expanded for each account and also includes any account-specific permissions that may exist. To list all permissions for a particular account, query the view and apply your interpretation policy in the query. For example:
SELECT ProductType, MIN(Limit) Limit FROM permissions WHERE AccountId='Alex0001' GROUP BY ProductType ProductType Limit -------------------- ------------ Bill 10000.00 Bond 2000.00 Future 200.00 Option 100.00 Share 1000.00
This query lists permissions for Alex0001. The MIN function resolves cases where multiple permissions exist for the same product type. When such cases occur, MIN ensures that only the lowest applicable limit is returned. To always return the highest-applicable limit, you could use the MAX function.
Centralizing Authorization Logic
You wish to centralize the logic for your authorization policy. You want to be able to query for an individual trader’s permissions, but you don’t want that query to contain the logic that handles conflicting permissions and individual exceptions. Your goal is to be able to make reasonable changes to your permissions policies and model without forcing a change to queries already running in production programs.
Create a view into which you embed an authorization query. The benefit of this is that you can easily change the underlying query of the view without having to change the code in all the programs that query the view. The following view implements the authorization logic shown in the first recipe titled “Creating a Permission Hierarchy”:
CREATE VIEW orderAuthorization AS SELECT AccountId, ProductType, MIN(Limit) Limit FROM GroupMembership m JOIN GroupPermissions g ON m.groupId=g.groupId WHERE Status='V' GROUP BY AccountId, ProductType
Using this view, you can now obtain the permissions for a specific trader by executing a simple query such as the one shown in the following example:
SELECT AccountId, ProductType, Limit FROM OrderAuthorization WHERE AccountId = 'Alex0001' AccountId ProductType Limit -------------------- ----------- ------------ Alex0001 Bill 10000.00 Alex0001 Bond 2000.00 Alex0001 Future 200.00 Alex0001 Option 100.00 Alex0001 Share 1000.00
You can also issue a query against this view to check for a specific permission. For example, suppose an order to trade 3,000 shares for the account Alex0001 comes into the system. You can use the simple query shown in the following example to retrieve the maximum limit for this type of order:
SELECT Limit FROM orderAuthorization WHERE AccountId='Alex0001' AND productType='Share' Limit ------------ 1000.00
By keeping your queries simple and writing them against a view such as this, you can easily modify your permission model and logic. In the event that you make such modifications, the view will insulate your existing queries from the changes.
This solution demonstrates a good separation of business logic from the data model. Imagine that these permissions can be any type of permissions for a large amount of accounts. Having the interpretation of the data model embedded within a view is very likely to pay off for you. Changes in real-life systems are frequent, and often unexpected, so it is usually not wise to hand-code the interpretation (embedding the query) of a data model into your application system. With the view-based solution shown in this recipe, you are free to change the underlying permission model and the interpretation policy during production without going through the additional trouble of recompiling the system or changing the code.
Implementing General Hierarchies
Use the general parent/child model described earlier in the project-tracking example in Section 4.1 as the basis for the project-tracking system. In this model, a project is composed of subprojects or tasks, tasks may be broken into subtasks, and subtasks may themselves be broken up into subtasks. There is no limit to the level of nesting that may occur, and it all occurs within one database table named Projects. Conceptually, a project is simply the highest-level task. The Projects table is defined as follows:
CREATE TABLE Projects( Name VARCHAR(20), Cost INTEGER, Parent INTEGER, VertexId INTEGER, Primary key(VertexId) )
With this solution in mind, let’s look at a few problems that are common to hierarchical structures, such as the one used here to define projects:
List all leaf nodes
List all nodes that are not leaf nodes
Find the most expensive vertices
List the immediate children of a node
Find the root
The next few sections present SQL Server solutions to each of these problems.
List all leaf nodes
SELECT Name FROM Projects p WHERE NOT EXISTS( SELECT * FROM Projects WHERE Parent=p.VertexId)
The result of executing this query is a list of all subprojects (or tasks if you prefer) that have no children:
Name -------------------- Interviews Drafts Consolidations Presentation UI Design Correctness Testing Database UI Implementation Coding Initial Testing Final Adjustments Production Testing
List all nodes that are not leaf nodes
SELECT Name FROM Projects p WHERE EXISTS( SELECT * FROM Projects WHERE Parent=p.VertexId)
This query lists all nodes with children:
Name -------------------- New SW Specifications Final Document Prototype Calculating Development Beta Testing
Find the most expensive vertices
Each task in the project has a cost associated with it. Each task is also a vertex. To find the N most expensive tasks, simply query the table, order the results by cost, and use the TOP operator to limit the results to the top N rows. For example:
SELECT TOP 5 Name, Cost FROM Projects ORDER BY cost DESC Name Cost -------------------- ----------- Inital testing 40 Beta testing 40 Development 30 Coding 20 Production testing 20
This is a good example of a case where it is more productive to issue straight SQL queries, rather than attempt to navigate the hierarchy.
Find the immediate children of a node
SELECT Name FROM Projects WHERE Parent=( SELECT VertexId FROM Projects WHERE Name='Specifications' ) Name -------------------- Interviews Drafts Consolidations Final document
Only immediate children are returned. The parent record for each of the four records shown here is the record for the “Specifications” task. These records may, themselves, have children, but those children won’t show up in the results for this query.
Find the root
The root is the vertex with no parents. The query to return it is simple, because all you need is to return the one node with no parent pointer. For example:
SELECT Name FROM Projects WHERE Parent=0 Name -------------------- New SW
In our example, we use a value of zero to indicate that a node does
not have a parent. You could just as easily use a NULL to indicate
the same thing, in which case you would specify
NULL as your WHERE clause condition.
As you can see, these queries work on a single level where, at most, they refer to one level above (to parents) or below (to children). For these types of queries, you don’t need any additional algorithms or data structures to retrieve information. The queries are pretty straightforward, and you just have to remember the definition of your hierarchy and its elements to write them.
Traversing Hierarchies Recursively
One solution is to use a recursive algorithm to traverse the tree. To do that, encapsulate the algorithm in a stored procedure. The following code creates a stored procedure named TraversProjectsRecursive. The stored procedure takes one argument, specifying from which node to begin. The procedure then works its way down through that node’s children, grandchildren, and so forth.
CREATE PROCEDURE TraverseProjectsRecursive @VertexId INTEGER AS /* to change action on each vertex, change these lines */ DECLARE @Name VARCHAR(20) SELECT @Name=(SELECT Name FROM Projects WHERE VertexId=@VertexId) PRINT SPACE(@@NESTLEVEL*2)+STR(@VertexId)+' '+@Name /* ****** */ DECLARE subprojects CURSOR LOCAL FOR SELECT VertexId FROM Projects WHERE Parent=@VertexId OPEN subprojects FETCH NEXT FROM subprojects INTO @VertexId WHILE @@FETCH_STATUS=0 BEGIN EXEC TraverseProjectsRecursive @VertexId FETCH NEXT FROM subprojects INTO @VertexId END CLOSE subprojects DEALLOCATE subprojects
When calling the procedure, you need to specify the initial node from which you want the procedure to start traversing. If you want to print out the entire hierarchy, then specify the root node (in our case 1). For example:
TraverseProjectsRecursive 1 1 New SW 2 Specifications 3 Interviews 4 Drafts 5 Consolidations 6 Final document 7 Presentation 8 Prototype 9 UI Design 10 Calculations 11 Correctness Testing 12 Database 13 Development 14 UI Implementation 15 Coding 16 Inital testing 17 Beta testing 18 Final adjustments 19 Production testing
The algorithm used by the TraverseProjectsRecursive procedure is a classical tree-traversal method adapted to the specifics of SQL. You can easily adapt this procedure for use with other hierarchical structures besides the projects structure shown in this chapter. To adapt this procedure, change only the SELECT and PRINT statements.
When you create a recursive procedure such as the one shown here, MS SQL Server will warn you with a message such as the following: “Cannot add rows to sysdepends for the current stored procedure because it depends on the missing object ‘TraverseProjectsRecursive.’ The stored procedure will still be created.'' Do not worry about this message. Recursive procedures are a special case, and you can safely ignore the warning.
The first SELECT statement in the procedure retrieves the name associated with the vertex that is passed in as a parameter. That name is placed into the @Name variable, which is then used in the subsequent PRINT statement:
PRINT SPACE(@@NESTLEVEL*2)+STR(@VertexId)+' '+@Name
In this PRINT statement, we use the @@NESTLEVEL variable. That variable is maintained automatically by SQL Server and returns the current nesting level. This information is a great advantage when working with stored procedures. In our case, we want to indent every project proportionally to its depth. To do that, we multiply the nesting level by two to indent each child two spaces underneath its parent.
After we print the name of the current vertex, we need to go forward to its children. A vertex can have more than one child, so we define a cursor to select these:
DECLARE subprojects CURSOR LOCAL FOR SELECT VertexId FROM Projects WHERE Parent=@VertexId
This cursor definition uses the LOCAL clause, ensuring that the cursor is defined for the current procedure only. By default, cursors are global, which means that any cursor defined is visible from all code executed within the current connection. Since this procedure calls itself repeatedly within the context of one connection, we must specify that this cursor definition be local to each invocation of the procedure.
After opening the subprojects cursor, we simply step through the list of subprojects and recursively invoke the TraverseProjectsRecursive procedure on each one:
FETCH NEXT FROM subprojects INTO @VertexId WHILE @@FETCH_STATUS=0 BEGIN EXEC TraverseProjectsRecursive @VertexId FETCH NEXT FROM subprojects INTO @VertexId END
Even though the recursive mechanisms are very elegant, they are not very efficient. Furthermore, in SQL Server they have a limit of only 32 nesting levels. If you query your hierarchy infrequently, then the overhead of recursion may be acceptable, and you need not bother with other mechanisms. However, if you do this type of thing often, some further optimizations might be necessary to give you adequate performance. Sometimes it’s useful to use recursion for prototyping, because recursive solutions are simple to code. Later, you can optimize your code with a nonrecursive solution before you go to production.
Manipulating Hierarchies Recursively
You want to add and delete vertices and their subtrees. When you hire a new employee or open a new project in your company, you need to add this information into your system. To maintain the hierarchical structure of your employee information, special procedures must be followed.
To add a vertex to a hierarchy, you simply insert the row into the table with the proper parent pointer. No other action is necessary. If the vertex you just added has children, you insert those children in the same manner. No additional manipulations are needed.
Things become more complex when you need to delete a vertex, because you not only need to delete the row for the vertex, you need to delete any rows that represent children as well. For this purpose, you can use a modified version of the traversing algorithm shown in the previous recipe. The following RemoveProjectsRecursive procedure will delete a specified vertex and all vertices that fall underneath it:
CREATE PROCEDURE RemoveProjectsRecursive @VertexId INTEGER AS SET NOCOUNT ON DECLARE @LocalVertex INTEGER SELECT @LocalVertex=@VertexId DECLARE subprojects CURSOR LOCAL FOR SELECT VertexId FROM Projects WHERE Parent=@VertexId OPEN subprojects FETCH NEXT FROM subprojects INTO @LocalVertex WHILE @@FETCH_STATUS=0 BEGIN EXEC RemoveProjectsRecursive @LocalVertex FETCH NEXT FROM subprojects INTO @LocalVertex END CLOSE subprojects DEALLOCATE subprojects DELETE FROM Projects WHERE VertexId=@VertexId PRINT 'Vertex ' +CONVERT(VARCHAR(8),@VertexId) + ' removed!'
To delete a vertex, simply invoke the RemoveProjectsRecursive procedure and pass in the vertex ID as a parameter. In the following example, the procedure is used to delete the vertex for Specifications, which happens to be vertex 2:
RemoveProjectsRecursive 2 Vertex 3 removed! Vertex 4 removed! Vertex 5 removed! Vertex 7 removed! Vertex 6 removed! Vertex 2 removed!
Six rows were deleted as a result of deleting the Specifications vertex. Four of the rows are for the four children, one row is for a child of a child, and the sixth row is for the Specifications vertex, itself. This procedure always deletes the parent nodes last to avoid any foreign-key violations.
Obviously, adding new members to a hierarchy is fairly easy since a hierarchical structure does not have any limitations on the number of children or levels of hierarchy that you can have. Inserting a new row under a parent is easy.
For deleting, we used the traversing algorithm from the previous recipe. When we adapted that algorithm for deleting a vertex, we were careful to have it remove all children before finally removing a parent node. If a parent is removed before its children, the table would contain nodes with no parents. Such nodes would need to be collected and eliminated using a specially coded garbage-collection mechanism. While such a mechanism is possible, you have the problem of how to differentiate valid parent-free nodes (project roots) from the orphan nodes that remain because you deleted their parents.
The algorithm shown in this recipe is fairly efficient; it removes leaf nodes and parent nodes with the same procedure, and it does not result in significant overhead if you are removing just a leaf vertex.
You want to aggregate hierarchical data. For example, you wish to sum the cost of a project or task, beginning from a specific vertex and working your way down through all levels of the hierarchy underneath that vertex.
With respect to the projects example that we have been using for these general hierarchy recipes, you can use the following stored procedure to summarize the costs of all subprojects for a specific project. You specify the project by passing its vertex ID as a parameter to the procedure.
CREATE PROCEDURE AggregateProjects @VertexId INTEGER AS SET NOCOUNT ON DECLARE @lvl INTEGER DECLARE @Name VARCHAR(20) DECLARE @Cost INTEGER DECLARE @Sum INTEGER CREATE TABLE #stack ( VertexId INTEGER, Name VARCHAR(20), Cost INTEGER, Lvl INTEGER ) SELECT @Sum=0 SELECT @Lvl = 1 INSERT INTO #stack SELECT VertexId, Name, Cost, 1 FROM Projects WHERE VertexID=@VertexId WHILE @Lvl > 0 BEGIN IF EXISTS (SELECT * FROM #stack WHERE lvl = @lvl) BEGIN SELECT TOP 1 @VertexId = VertexId, @Name=Name, @Cost=Cost FROM #stack WHERE lvl = @lvl ORDER BY VertexId /* to change action each vertex, change this line */ SELECT @Sum=@Sum+@Cost /* ******* */ DELETE FROM #stack WHERE vertexId = @VertexId INSERT #stack SELECT VertexId, Name, Cost, @lvl + 1 FROM Projects WHERE parent = @VertexId IF @@ROWCOUNT > 0 SELECT @lvl = @lvl + 1 END ELSE SELECT @lvl = @lvl - 1 END PRINT 'Sum of project costs is '+STR(@Sum)+'.' DROP TABLE #stack SET NOCOUNT OFF
The following examples show the results of executing this procedure for all projects (VertexId=1) and for the Specifications project (VertexId=2).
AggregateProjects 1 Sum of project costs is 231. AggregateProjects 2 Sum of project costs is 33.
The algorithm shown here is adapted from the SQL Server Books Online recommendation for expanding hierarchies. Those familiar with algorithms and data structures will notice that it is a nonrecursive implementation of a recursive traversing algorithm. The code uses an internal stack implemented as a temporary table. The stack holds interesting information from the hierarchical Projects table, plus an additional column named lvl. The lvl column records the level that each entry holds in the hierarchy. The stack definition is as follows:
CREATE TABLE #stack ( VertexId INTEGER, Name VARCHAR(20), Cost INTEGER, Lvl INTEGER )
After the #stack table is created, two variables are initialized. The @lvl variable tracks the current level on which the code is operating, while the @sum variable accumulates the sum of all costs. Next, an INSERT statement is used to store the root onto the stack. Note that the root in our case is the vertex identified by @VertexId.
The code then loops through each element on the stack. The loop begins by popping one element from the stack. The retrieved data contains a cost, which is added to the total cost being accumulated in @sum:
SELECT TOP 1 @VertexId = VertexId, @Name=Name, @Cost=Cost FROM #stack WHERE lvl = @lvl ORDER BY VertexId SELECT @Sum=@Sum+@Cost
After accumulating the cost for the vertex just pulled from the stack, the code deletes that read vertex and adds all of its children to the stack:
DELETE FROM #stack WHERE vertexId = @VertexId INSERT #stack SELECT VertexId, Name, Cost, @lvl + 1 FROM Projects WHERE parent = @VertexId IF @@ROWCOUNT > 0 SELECT @lvl = @lvl + 1
The IF statement that you see in this code ensures that if the vertex just deleted from the stack has any children, the @lvl value is increased to indicate that the code is moving one level down in the hierarchy.
The IF EXISTS clause at the beginning of the loop ensures that so long as there are some candidates available on the current level, the loop repeats and browses through them all. The SET NOCOUNT ON directive at the beginning of the procedure just limits the number of messages displayed from the procedure. It does not affect the logic of the algorithm. Without SET NOCOUNT ON, you’ll see a steady stream of “(1 row(s) affected)” messages as the code executes.
This algorithm is general enough that it can be used for any kind of operation for which you prefer traversing the hierarchy in a nonrecursive manner. If you want to change the operation that is performed on each node, change the code where current cost is added to the total sum.
The code demonstrates why the general model for hierarchies has a limited application in SQL. When you need to traverse over more than one level efficiently, the code starts to expand to an almost unreadable size.
Preparing Multilevel Operations
One solution is to store additional accessibility information into a service table, and then use that information when querying your hierarchical table. For example, the following ProjectPaths table records the path to, and the depth of, each vertex in the Projects table:
CREATE TABLE ProjectPaths( VertexId INTEGER, Depth INTEGER, Path VARCHAR(300) )
After creating the ProjectPaths table, you can use the following procedure to fill the table with the depth and path information for each vertex in the Projects table:
CREATE PROCEDURE BuildProjectPathsRecursive @VertexId INTEGER AS SET NOCOUNT ON DECLARE @Path VARCHAR(300) DECLARE @Depth INTEGER SELECT @Depth=a.Depth,@Path=a.Path FROM ProjectPaths a JOIN Projects p ON p.parent=a.vertexId WHERE @vertexId=p.vertexId DELETE FROM ProjectPaths WHERE VertexId=@VertexId INSERT INTO ProjectPaths VALUES( @VertexId, isnull(@Depth,0)+1, isnull(@Path,'.')+CAST(@VertexId AS VARCHAR(15))+'.') DECLARE subprojects CURSOR LOCAL FOR SELECT VertexId FROM Projects p WHERE Parent=@VertexId OPEN subprojects FETCH NEXT FROM subprojects INTO @VertexId WHILE @@FETCH_STATUS=0 BEGIN EXEC BuildProjectPathsRecursive @VertexId FETCH NEXT FROM subprojects INTO @VertexId END CLOSE subprojects DEALLOCATE subprojects SET NOCOUNT OFF
This procedure takes one parameter, which tells the procedure with which node to start. The procedure then works its way down the hierarchy. To process all nodes in the Projects table, invoke this procedure and pass a value of 1 as follows:
The procedure fills the ProjectPaths table with additional information for every vertex. The Depth column records the depth of each vertex. The Path column records the path to each vertex. In the path, the vertex numbers are separated by dots. The ProjectPaths table that will be built contains the following rows:
VertexId Depth Path ----------- ----------- ----------- 1 1 .1. 2 2 .1.2. 3 3 .1.2.3. 4 3 .1.2.4. 5 3 .1.2.5. 6 3 .1.2.6. 7 4 .18.104.22.168. 8 2 .1.8. 9 3 .1.8.9. 10 3 .1.8.10. 11 4 .22.214.171.124. 12 3 .1.8.12. 13 2 .1.13. 14 3 .1.13.14. 15 3 .1.13.15. 16 3 .1.13.16. 17 2 .1.17. 18 3 .1.17.18. 19 2 .1.19.
The idea for this recipe has been taken from an article published by Itzik Ben-Gan (SQL Server Magazine, June, 2000). His development of this technique is a recent achievement resulting from his search for an ultimate support structure to improve the efficiency of the classical hierarchical model. Although it was originally promoted as an add-on to an existing hierarchy table, we see no reason why you shouldn’t normalize properly and separate the hierarchy and its data from the support structure.
The path leading to every vertex is stored in the ProjectPaths table. This represents the work of traversing the hierarchy, and because it is stored in the ProjectPaths table, it only needs to be done once. Please note that the length of the Path field can be changed according to your needs. It does, however, make sense to keep it reasonably small, especially if you want to index it.
The stored procedure named BuildProjectPathsRecursive fills the ProjectPaths table with the paths to each vertex in the subtree. It uses the recursive traversal algorithm introduced earlier in this chapter and runs the following code for each vertex:
SELECT @Depth=a.Depth,@Path=a.Path FROM ProjectPaths a JOIN Projects p ON p.parent=a.vertexId WHERE @vertexId=p.vertexId DELETE FROM ProjectPaths WHERE VertexId=@VertexId INSERT INTO ProjectPaths VALUES( @VertexId, isnull(@Depth,0)+1, isnull(@Path,'.')+CAST(@VertexId AS VARCHAR(15))+'.')
The SELECT statement reads the depth and path data from the parent. Next, any old information for the vertex is deleted from the ProjectPaths table, and new data is inserted. If the @Depth or @Path variables are null, indicating that no access path for the parent exists, then an initial value of 0 is set for the depth, and an initial value of a dot (.) is set for the path. Regardless of how the depth gets set, it is increased by one. That’s because the @Depth variable represents the depth of the current node’s parent. You have to increment that depth by 1 to get the current node’s depth. Similarly, the @Path variable contains the path to the parent. The current vertex ID is appended onto that path to yield the path to the current node. These new depth and path values are then inserted into the ProjectPaths table.
If you prefer nonrecursive algorithms, you can rewrite the recursive BuildProjectPathsRecursive procedure as a nonrecursive procedure. This code is as follows and uses the stack-based technique shown earlier in the recipe titled Section 4.9:
CREATE PROCEDURE BuildProjectsPaths @VertexId INTEGER AS SET NOCOUNT ON DECLARE @lvl INTEGER CREATE TABLE #stack ( VertexId INTEGER, Lvl INTEGER ) SELECT @Lvl = 1 INSERT INTO #stack SELECT VertexId,1 FROM Projects WHERE VertexId=@VertexID WHILE @Lvl > 0 BEGIN IF EXISTS (SELECT * FROM #stack WHERE lvl = @lvl) BEGIN SELECT TOP 1 @VertexId = VertexId FROM #stack WHERE lvl = @lvl ORDER BY VertexId DELETE FROM ProjectPaths WHERE VertexId=@VertexId INSERT INTO ProjectPaths SELECT p.vertexId, isnull(a.Depth,0)+1, isnull(a.Path,'.')+CAST(p.VertexId AS VARCHAR(15))+'.' FROM ProjectPaths a,Projects p WHERE @vertexId=p.vertexId AND p.parent*=a.vertexId DELETE FROM #stack WHERE vertexId = @VertexId INSERT #stack SELECT VertexId, @lvl + 1 FROM Projects WHERE parent = @VertexId IF @@ROWCOUNT > 0 SELECT @lvl = @lvl + 1 END ELSE SELECT @lvl = @lvl - 1 END SET NOCOUNT OFF
Aggregating Hierarchies Revised
You want to perform aggregation on your hierarchy. As before, you wish to sum the cost of a project or task by beginning from a specific vertex and working your way through all levels of the hierarchy. This time, though, you wish to enhance the efficiency of your aggregation by using the ProjectPaths service table created in the previous recipe. In addition to summarizing the cost, you also wish to list the hierarchy in an indented format.
Recall that the previous aggregation procedure was fairly complex and made use of a temporary table named #stack. With the ProjectsPaths table, that same aggregation process becomes simple enough that you can perform it with the SQL query shown in the following example:
SELECT SUM(cost) Total FROM ProjectPaths a JOIN Projects p ON a.VertexId=p.VertexId WHERE Path LIKE (SELECT Path FROM ProjectPaths WHERE VertexId=1)+'%' Total ----------- 231
The query in this example summarizes the costs of all projects and tasks under VertexId 1. As you can see, the result of 231 was obtained without the need for recursion and without the need for a temporary stack table. It was obtained with only a four-line SELECT statement, as opposed to the 48 lines of procedural code required for the earlier version of the aggregation solution.
You can also make use of the ProjectPaths table to list the project in a hierarchical manner:
SELECT Space(Depth*2)+Name Project FROM ProjectPaths a JOIN Projects p ON a.VertexId=p.VertexId WHERE Path LIKE (SELECT Path FROM ProjectPaths WHERE VertexId=1)+'%' ORDER BY a.Path Project --------------------------------- New SW Development UI Implementation Coding Initial testing Beta testing Final adjustments Production testing Specifications Interviews Drafts Consolidations Final document Presentation Prototype Calculations Correctness Testing Database UI Design
Again, the ProjectPaths table enabled the desired result to be generated using only a short SQL query, as opposed to the rather long procedure that would otherwise be required. Please note that the order in which tasks are listed in the result might not be the same as you get with the TraverseProjectsRecursive procedure. However, the hierarchical structure of the information is still preserved.
The first query joins the ProjectPaths and Projects tables. This is a one-to-one join, since both tables have an equal number of rows. The secret to the query lies in the second part of the WHERE clause:
Path LIKE (SELECT Path FROM ProjectPaths WHERE VertexId=1)+'%'
The WHERE clause gathers all vertices for which the beginning of the path string is equal to the root vertex (in our case, it is .1.). The summation of all costs is then just a simple matter of applying the SUM function to those rows.
Multilevel operations can now be performed efficiently using the ProjectPaths table. Once you know the path to the parent node, you know the paths to all of that node’s children. Had you wished to summarize the cost for the Specifications subproject, you could modify the second part of the WHERE clause as follows:
Path LIKE (SELECT Path FROM ProjectPaths WHERE VertexId=2)+'%'
When writing queries using a table like the ProjectPaths table, you
need to remember two rules. First, if you wish to perform an
operation on a parent vertex together with all its children, you
should use the
% pattern match operator at the end
of the search string in your LIKE predicate. Second, if you wish to
exclude the parent from the result set, you should use the
_% pattern. The additional underscore in the
pattern match string requires that a character
be present. Thus, if the parent’s path is .1., it
will not match a pattern of .1._%. Any children, however, will have a
character following the second dot, so they will
match the pattern.
The Depth column in the ProjectPaths table allows you to zero in easily on vertices of a given depth. For example, the following query will return a list of all level two projects in the Projects table:
SELECT SUM(cost) Total FROM ProjectPaths a JOIN Projects p ON a.VertexId=p.VertexId WHERE a.Depth=2
The Depth column can also be used to compute indention, as you saw earlier in this recipe’s second query:
SELECT Space(Depth*2)+Name Project