Continuing the series:
What is better to store hierarchical data: nested sets model or adjacency list (parent-child) model?
For detailed explanations of the terms, see the first article in the series:
Today is Oracle time.
Adjacent sets require recursion, and Oracle has a native way to implement recursion in SQL statements. To do this it uses a special clause, CONNECT BY
.
Unlike other systems, Oracle always supported recursion in SQL statements since version 2 (this is the name the first version of Oracle was released under).
This was done because by that time there already were several hierarchical databases on the market which are quite good for storing hierarchical data. If not for this clause, transition from a hierarchical DBMS to the relational DBMS would be too hard.
Oracle's way to query self-relations is somewhat different from recursive CTE's that other systems use:
- Recursive CTEs can use an arbitrary set in recursive steps. The recursion base and recursion stack are not visible in further steps. The recursive operation is a set operation (usually a
JOIN
) on the recursion parameter (a set) and the result of the previous operation:
CONNECT BY
queries can only use the value of a single row in recursive steps.
The recursion base and recursion stack are visible, the values outside the recursion stack are not.
The recursive operation is a set operation on each row of the current row and the base rowset.
Unlike recursive CTE, CONNECT BY
operates row-wise, that is each row of the result spans its own recursion branch with its own stack:
The main difference it that the CONNECT BY
operation cannot produce anything that was not in the original rowset, while a recursive CTE can.
However, CONNECT BY
has numerous benefits too: it's very fast, allows to track the hierarchy path easily (using a built-in funtion named SYS_CONNECT_BY_PATH
) and implicitly sorts the recordset in tree order (with additional clause ORDER SIBLINGS BY
to sort the siblings).
This method, of course, allows traversing adjacency lists easily.
To compare efficiency of adjacency lists and that of nested sets, let's create a sample table:
Table creation details
CREATE TABLE t_hierarchy (id NOT NULL, parent NOT NULL, lft NOT NULL, rgt NOT NULL, data NOT NULL, stuffing NOT NULL)
AS
WITH nums AS
(
SELECT level AS num, 100 AS value
FROM dual
CONNECT BY
level <= 5
),
digits AS
(
SELECT SYS_CONNECT_BY_PATH(num, '/') AS digit, level AS len
FROM nums
CONNECT BY
level <= 9
)
SELECT id,
FLOOR((id - 1) / 5),
lft,
lft + FLOOR(2929687 / POWER(5, len)) - 1,
CAST('Value ' || id AS VARCHAR2(100)),
CAST(RPAD('*', 100, '*') AS VARCHAR2(100))
FROM (
SELECT digit, len,
(
SELECT SUM(SUBSTR(digit, level * 2, 1) * POWER(5, len - level))
FROM dual
CONNECT BY
level <= len
) AS id,
(
SELECT SUM((SUBSTR(digit, level * 2, 1) - 1) * FLOOR(2929687 / POWER(5, level)) + 1)
FROM dual
CONNECT BY
level <= len
) AS lft
FROM digits
)
/
ALTER TABLE t_hierarchy ADD CONSTRAINT pk_hierarchy_id PRIMARY KEY (id)
/
CREATE INDEX ix_hierarchy_parent ON t_hierarchy (parent)
/
CREATE INDEX ix_hierarchy_lft ON t_hierarchy (lft)
/
CREATE INDEX ix_hierarchy_rgt ON t_hierarchy (rgt)
/
The table contains both adjacency list data and nested sets data, with 8 levels of nesting, 5 children of each parent node and 2,441,405 records.
As in the previous articles, we'll test performance of the three most used queries:
- Find all descendants of a given node
- Find all ancestors of a given node
- Find all descendants of a given node up to a certain depth
All descendants
Nested sets
SELECT SUM(LENGTH(hc.stuffing))
FROM t_hierarchy hp
JOIN t_hierarchy hc
ON hc.lft BETWEEN hp.lft AND hp.rgt
WHERE hp.id = 42
View query details
SUM(LENGTH(HC.STUFFING)) |
1953100 |
1 row fetched in 0.0001s (0.0144s) |
SELECT STATEMENT
SORT AGGREGATE
NESTED LOOPS
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX UNIQUE SCAN, 20090928_nested.PK_HIERARCHY_ID
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX RANGE SCAN, 20090928_nested.IX_HIERARCHY_LFT
This is super fast, faster than in any system before that: only 10 ms. The plan is quite predictable: a UNIQUE
scan to find the parent record and a range scan to find all children.
Adjacency list
SELECT SUM(LENGTH(stuffing))
FROM t_hierarchy
START WITH
id = 42
CONNECT BY
parent = PRIOR id
View query details
SUM(LENGTH(STUFFING)) |
1953100 |
1 row fetched in 0.0001s (0.1479s) |
SELECT STATEMENT
SORT AGGREGATE
CONNECT BY WITH FILTERING
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX UNIQUE SCAN, 20090928_nested.PK_HIERARCHY_ID
NESTED LOOPS
BUFFER SORT
CONNECT BY PUMP
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX RANGE SCAN, 20090928_nested.IX_HIERARCHY_PARENT
TABLE ACCESS FULL, 20090928_nested.T_HIERARCHY
This is quite fast too: 140 ms. The plan loops over the parent
ranges for each id
.
Note the TABLE ACCESS FULL
in the end of the plan. Oracle reserves a right to do a FULL SCAN
instead of a RANGE SCAN
if it considers a certain parent
range too large to use the index. However, this is recursion, not a single operation, and it's impossible to tell in advance which id
values will be returned on each recursion step. That's why Oracle shows both methods, just in case, and chooses the most efficient method in runtime.
In our table, at most 5 rows share a parent
(of more then 2 million), so it's very unlikely this plan branch will be ever chosen.
Find all ancestors
Nested sets
SELECT hp.id, hp.parent, hp.lft, hp.rgt, hp.data
FROM t_hierarchy hc
JOIN t_hierarchy hp
ON hc.lft BETWEEN hp.lft AND hp.rgt
WHERE hc.id = 1000000
ORDER BY
hp.lft
View query details
ID |
PARENT |
LFT |
RGT |
DATA |
2 |
0 |
585938 |
1171874 |
Value 2 |
12 |
2 |
703126 |
820312 |
Value 12 |
63 |
12 |
750001 |
773437 |
Value 63 |
319 |
63 |
764063 |
768749 |
Value 319 |
1599 |
319 |
766875 |
767811 |
Value 1599 |
7999 |
1599 |
767437 |
767623 |
Value 7999 |
39999 |
7999 |
767549 |
767585 |
Value 39999 |
199999 |
39999 |
767571 |
767577 |
Value 199999 |
1000000 |
199999 |
767576 |
767576 |
Value 1000000 |
9 rows fetched in 0.0005s (3.2186s) |
SELECT STATEMENT
NESTED LOOPS
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX UNIQUE SCAN, 20090928_nested.PK_HIERARCHY_ID
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX RANGE SCAN, 20090928_nested.IX_HIERARCHY_LFT
The plan used is exactly the same as for the query to choose all descendants.
However, the range on lft
is too broad and the query works for more than 3 seconds.
Adjacency list
SELECT id, parent, lft, rgt, data, level
FROM t_hierarchy
START WITH
id = 1000000
CONNECT BY
id = PRIOR parent
ORDER BY
level DESC
View query details
ID |
PARENT |
LFT |
RGT |
DATA |
LEVEL |
2 |
0 |
585938 |
1171874 |
Value 2 |
9 |
12 |
2 |
703126 |
820312 |
Value 12 |
8 |
63 |
12 |
750001 |
773437 |
Value 63 |
7 |
319 |
63 |
764063 |
768749 |
Value 319 |
6 |
1599 |
319 |
766875 |
767811 |
Value 1599 |
5 |
7999 |
1599 |
767437 |
767623 |
Value 7999 |
4 |
39999 |
7999 |
767549 |
767585 |
Value 39999 |
3 |
199999 |
39999 |
767571 |
767577 |
Value 199999 |
2 |
1000000 |
199999 |
767576 |
767576 |
Value 1000000 |
1 |
9 rows fetched in 0.0005s (0.0009s) |
SELECT STATEMENT
SORT ORDER BY
CONNECT BY WITH FILTERING
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX UNIQUE SCAN, 20090928_nested.PK_HIERARCHY_ID
NESTED LOOPS
BUFFER SORT
CONNECT BY PUMP
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX UNIQUE SCAN, 20090928_nested.PK_HIERARCHY_ID
TABLE ACCESS FULL, 20090928_nested.T_HIERARCHY
This query completes in less than 1 ms, which is instant.
Note that unlike recursive CTE's, Oracle supplies a built-in preudocolumn, level
, that returns the current recursion level. This comes handy in cases like this.
Descendants up to a given level
Nested sets
SELECT hc.id, hc.parent, hc.lft, hc.rgt, hc.data
FROM t_hierarchy hp
JOIN t_hierarchy hc
ON hc.lft BETWEEN hp.lft AND hp.rgt
WHERE hp.id = ?
AND
(
SELECT COUNT(*)
FROM t_hierarchy hn
WHERE hc.lft BETWEEN hn.lft AND hn.rgt
AND hn.lft BETWEEN hp.lft AND hp.rgt
) <= 3
View the query for node 42
SELECT hc.id, hc.parent, hc.lft, hc.rgt, hc.data
FROM t_hierarchy hp
JOIN t_hierarchy hc
ON hc.lft BETWEEN hp.lft AND hp.rgt
WHERE hp.id = 42
AND
(
SELECT COUNT(*)
FROM t_hierarchy hn
WHERE hc.lft BETWEEN hn.lft AND hn.rgt
AND hn.lft BETWEEN hp.lft AND hp.rgt
) <= 3
ID |
PARENT |
LFT |
RGT |
DATA |
42 |
8 |
257814 |
281250 |
Value 42 |
211 |
42 |
257815 |
262501 |
Value 211 |
1056 |
211 |
257816 |
258752 |
Value 1056 |
1057 |
211 |
258753 |
259689 |
Value 1057 |
1058 |
211 |
259690 |
260626 |
Value 1058 |
1059 |
211 |
260627 |
261563 |
Value 1059 |
1060 |
211 |
261564 |
262500 |
Value 1060 |
212 |
42 |
262502 |
267188 |
Value 212 |
1061 |
212 |
262503 |
263439 |
Value 1061 |
1062 |
212 |
263440 |
264376 |
Value 1062 |
1063 |
212 |
264377 |
265313 |
Value 1063 |
1064 |
212 |
265314 |
266250 |
Value 1064 |
1065 |
212 |
266251 |
267187 |
Value 1065 |
213 |
42 |
267189 |
271875 |
Value 213 |
1066 |
213 |
267190 |
268126 |
Value 1066 |
1067 |
213 |
268127 |
269063 |
Value 1067 |
1068 |
213 |
269064 |
270000 |
Value 1068 |
1069 |
213 |
270001 |
270937 |
Value 1069 |
1070 |
213 |
270938 |
271874 |
Value 1070 |
214 |
42 |
271876 |
276562 |
Value 214 |
1071 |
214 |
271877 |
272813 |
Value 1071 |
1072 |
214 |
272814 |
273750 |
Value 1072 |
1073 |
214 |
273751 |
274687 |
Value 1073 |
1074 |
214 |
274688 |
275624 |
Value 1074 |
1075 |
214 |
275625 |
276561 |
Value 1075 |
215 |
42 |
276563 |
281249 |
Value 215 |
1076 |
215 |
276564 |
277500 |
Value 1076 |
1077 |
215 |
277501 |
278437 |
Value 1077 |
1078 |
215 |
278438 |
279374 |
Value 1078 |
1079 |
215 |
279375 |
280311 |
Value 1079 |
1080 |
215 |
280312 |
281248 |
Value 1080 |
31 rows fetched in 0.0000s (112.5589s) |
SELECT STATEMENT
NESTED LOOPS
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX UNIQUE SCAN, 20090928_nested.PK_HIERARCHY_ID
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX RANGE SCAN, 20090928_nested.IX_HIERARCHY_LFT
SORT AGGREGATE
FILTER
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX RANGE SCAN, 20090928_nested.IX_HIERARCHY_LFT
View the query for node 31,415
SELECT hc.id, hc.parent, hc.lft, hc.rgt, hc.data
FROM t_hierarchy hp
JOIN t_hierarchy hc
ON hc.lft BETWEEN hp.lft AND hp.rgt
WHERE hp.id = 31415
AND
(
SELECT COUNT(*)
FROM t_hierarchy hn
WHERE hc.lft BETWEEN hn.lft AND hn.rgt
AND hn.lft BETWEEN hp.lft AND hp.rgt
) <= 3
ID |
PARENT |
LFT |
RGT |
DATA |
31415 |
6282 |
445651 |
445687 |
Value 31415 |
157076 |
31415 |
445652 |
445658 |
Value 157076 |
785381 |
157076 |
445653 |
445653 |
Value 785381 |
785382 |
157076 |
445654 |
445654 |
Value 785382 |
785383 |
157076 |
445655 |
445655 |
Value 785383 |
785384 |
157076 |
445656 |
445656 |
Value 785384 |
785385 |
157076 |
445657 |
445657 |
Value 785385 |
157077 |
31415 |
445659 |
445665 |
Value 157077 |
785386 |
157077 |
445660 |
445660 |
Value 785386 |
785387 |
157077 |
445661 |
445661 |
Value 785387 |
785388 |
157077 |
445662 |
445662 |
Value 785388 |
785389 |
157077 |
445663 |
445663 |
Value 785389 |
785390 |
157077 |
445664 |
445664 |
Value 785390 |
157078 |
31415 |
445666 |
445672 |
Value 157078 |
785391 |
157078 |
445667 |
445667 |
Value 785391 |
785392 |
157078 |
445668 |
445668 |
Value 785392 |
785393 |
157078 |
445669 |
445669 |
Value 785393 |
785394 |
157078 |
445670 |
445670 |
Value 785394 |
785395 |
157078 |
445671 |
445671 |
Value 785395 |
157079 |
31415 |
445673 |
445679 |
Value 157079 |
785396 |
157079 |
445674 |
445674 |
Value 785396 |
785397 |
157079 |
445675 |
445675 |
Value 785397 |
785398 |
157079 |
445676 |
445676 |
Value 785398 |
785399 |
157079 |
445677 |
445677 |
Value 785399 |
785400 |
157079 |
445678 |
445678 |
Value 785400 |
157080 |
31415 |
445680 |
445686 |
Value 157080 |
785401 |
157080 |
445681 |
445681 |
Value 785401 |
785402 |
157080 |
445682 |
445682 |
Value 785402 |
785403 |
157080 |
445683 |
445683 |
Value 785403 |
785404 |
157080 |
445684 |
445684 |
Value 785404 |
785405 |
157080 |
445685 |
445685 |
Value 785405 |
31 rows fetched in 0.0016s (0.0015s) |
SELECT STATEMENT
NESTED LOOPS
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX UNIQUE SCAN, 20090928_nested.PK_HIERARCHY_ID
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX RANGE SCAN, 20090928_nested.IX_HIERARCHY_LFT
SORT AGGREGATE
FILTER
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX RANGE SCAN, 20090928_nested.IX_HIERARCHY_LFT
As in other systems, performance heavily depends on how many descendants does the record have.
For the node 42 which is close to the root node and has lots of descendants, the query runs for 122 seconds. For the node 31,415, the query takes 1.5 ms (next to instant).
Adjacency list
SELECT id, parent, lft, rgt, data
FROM t_hierarchy
START WITH
id = ?
CONNECT BY
parent = PRIOR id
AND level <= 3
Limiting the recursion depth query is very simple in Oracle: we just add the additional condition to CONNECT BY clause constraining the limit
.
View the query for node 42
SELECT id, parent, lft, rgt, data
FROM t_hierarchy
START WITH
id = 42
CONNECT BY
parent = PRIOR id
AND level <= 3
ID |
PARENT |
LFT |
RGT |
DATA |
42 |
8 |
257814 |
281250 |
Value 42 |
211 |
42 |
257815 |
262501 |
Value 211 |
1056 |
211 |
257816 |
258752 |
Value 1056 |
1057 |
211 |
258753 |
259689 |
Value 1057 |
1058 |
211 |
259690 |
260626 |
Value 1058 |
1059 |
211 |
260627 |
261563 |
Value 1059 |
1060 |
211 |
261564 |
262500 |
Value 1060 |
212 |
42 |
262502 |
267188 |
Value 212 |
1061 |
212 |
262503 |
263439 |
Value 1061 |
1062 |
212 |
263440 |
264376 |
Value 1062 |
1063 |
212 |
264377 |
265313 |
Value 1063 |
1064 |
212 |
265314 |
266250 |
Value 1064 |
1065 |
212 |
266251 |
267187 |
Value 1065 |
213 |
42 |
267189 |
271875 |
Value 213 |
1066 |
213 |
267190 |
268126 |
Value 1066 |
1067 |
213 |
268127 |
269063 |
Value 1067 |
1068 |
213 |
269064 |
270000 |
Value 1068 |
1069 |
213 |
270001 |
270937 |
Value 1069 |
1070 |
213 |
270938 |
271874 |
Value 1070 |
214 |
42 |
271876 |
276562 |
Value 214 |
1071 |
214 |
271877 |
272813 |
Value 1071 |
1072 |
214 |
272814 |
273750 |
Value 1072 |
1073 |
214 |
273751 |
274687 |
Value 1073 |
1074 |
214 |
274688 |
275624 |
Value 1074 |
1075 |
214 |
275625 |
276561 |
Value 1075 |
215 |
42 |
276563 |
281249 |
Value 215 |
1076 |
215 |
276564 |
277500 |
Value 1076 |
1077 |
215 |
277501 |
278437 |
Value 1077 |
1078 |
215 |
278438 |
279374 |
Value 1078 |
1079 |
215 |
279375 |
280311 |
Value 1079 |
1080 |
215 |
280312 |
281248 |
Value 1080 |
31 rows fetched in 0.0015s (0.0011s) |
SELECT STATEMENT
CONNECT BY WITH FILTERING
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX UNIQUE SCAN, 20090928_nested.PK_HIERARCHY_ID
NESTED LOOPS
BUFFER SORT
CONNECT BY PUMP
FILTER
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX RANGE SCAN, 20090928_nested.IX_HIERARCHY_PARENT
TABLE ACCESS FULL, 20090928_nested.T_HIERARCHY
View the query for node 31,415
SELECT id, parent, lft, rgt, data
FROM t_hierarchy
START WITH
id = 31415
CONNECT BY
parent = PRIOR id
AND level <= 3
ID |
PARENT |
LFT |
RGT |
DATA |
31415 |
6282 |
445651 |
445687 |
Value 31415 |
157076 |
31415 |
445652 |
445658 |
Value 157076 |
785381 |
157076 |
445653 |
445653 |
Value 785381 |
785382 |
157076 |
445654 |
445654 |
Value 785382 |
785383 |
157076 |
445655 |
445655 |
Value 785383 |
785384 |
157076 |
445656 |
445656 |
Value 785384 |
785385 |
157076 |
445657 |
445657 |
Value 785385 |
157077 |
31415 |
445659 |
445665 |
Value 157077 |
785386 |
157077 |
445660 |
445660 |
Value 785386 |
785387 |
157077 |
445661 |
445661 |
Value 785387 |
785388 |
157077 |
445662 |
445662 |
Value 785388 |
785389 |
157077 |
445663 |
445663 |
Value 785389 |
785390 |
157077 |
445664 |
445664 |
Value 785390 |
157078 |
31415 |
445666 |
445672 |
Value 157078 |
785391 |
157078 |
445667 |
445667 |
Value 785391 |
785392 |
157078 |
445668 |
445668 |
Value 785392 |
785393 |
157078 |
445669 |
445669 |
Value 785393 |
785394 |
157078 |
445670 |
445670 |
Value 785394 |
785395 |
157078 |
445671 |
445671 |
Value 785395 |
157079 |
31415 |
445673 |
445679 |
Value 157079 |
785396 |
157079 |
445674 |
445674 |
Value 785396 |
785397 |
157079 |
445675 |
445675 |
Value 785397 |
785398 |
157079 |
445676 |
445676 |
Value 785398 |
785399 |
157079 |
445677 |
445677 |
Value 785399 |
785400 |
157079 |
445678 |
445678 |
Value 785400 |
157080 |
31415 |
445680 |
445686 |
Value 157080 |
785401 |
157080 |
445681 |
445681 |
Value 785401 |
785402 |
157080 |
445682 |
445682 |
Value 785402 |
785403 |
157080 |
445683 |
445683 |
Value 785403 |
785404 |
157080 |
445684 |
445684 |
Value 785404 |
785405 |
157080 |
445685 |
445685 |
Value 785405 |
31 rows fetched in 0.0015s (0.0011s) |
SELECT STATEMENT
CONNECT BY WITH FILTERING
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX UNIQUE SCAN, 20090928_nested.PK_HIERARCHY_ID
NESTED LOOPS
BUFFER SORT
CONNECT BY PUMP
FILTER
TABLE ACCESS BY INDEX ROWID, 20090928_nested.T_HIERARCHY
INDEX RANGE SCAN, 20090928_nested.IX_HIERARCHY_PARENT
TABLE ACCESS FULL, 20090928_nested.T_HIERARCHY
Both queries complete in 1 ms, that is instant.
Summary
Oracle implements a special clause, CONNECT BY
, that makes traversing adjacency lists in general and parent-child relationships in particular very easy, since this is exactly what this clause was intended for.
Except for finding all descendants of a single node (without finding out or limiting the depth), nested sets model is less efficient in Oracle than adjacency lists.
Adjacency lists are not only more efficient but the queries using them are more elegant and easy to maintain due to the native support.
That's why adjacency list model should be used instead of nested sets model in Oracle too, as well as in SQL Server and PostgreSQL 8.4.