EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Shared Plan and Algorithm Network Cache (SPANC). You're welcome to read and comment on it.

Hierarchical queries in MySQL

There is no need in explaining how convenient hierarchical queries are.

A quick reminder: hierarchical data is a parent-child relationship contained in one table.

A typical task is to return values from the table in the following way:

  1. Resultset should be sorted like a tree, that is lexical sort by ancestry chains
  2. Depth level should be returned along with each row

It may sound confusing, but it’s very simple in fact, like shown on this Oracle query:

SELECT  LPAD(' ', level * 4, ' ') || id, parent, level
FROM    t_hierarchy
START WITH
        parent = 0
CONNECT BY
        parent = PRIOR id
id parent level
1 0 1
2 1 2
7 2 3
8 2 3
9 2 3
10 2 3
11 2 3
3 1 2
12 3 3
13 3 3
14 3 3
15 3 3
16 3 3
4 1 2
17 4 3
18 4 3
19 4 3
20 4 3
21 4 3
5 1 2
22 5 3
23 5 3
24 5 3
25 5 3
26 5 3
6 1 2
27 6 3
28 6 3
29 6 3
30 6 3
31 6 3

We have a nice tree sorted as a tree, with rows indented according to the depth level.

In the query above, START WITH defines the root of the tree, and CONNECT BY defines join condition between parent and child rows. Parent columns are defined by adding PRIOR keyword to them.

In MySQL there is no such construct, but it can be emulated.

We may try construct a sorting function somehow, but it will be too complex to implement and probably inefficient.

Instead, we will develop our function in the following way:

  1. The function will be called for each row in a rowset
  2. We will not use the actual values of the rowset, but instead keep all intermediary data in the session variables. Rowset will serve just as a dummy source of rows for subsequent joins
  3. We will not use recursion, as it’s impossible to keep the recursion state between function calls. Instead, we will just calculate next id given only current id

How do we find the next id given a current one?

  1. Initially, id is set to the id of the root we start with.
  2. If there is a row in the table with the same parent but greater id, we return that greater id, i. e. we return the next sibling of our row.
  3. If there current row is last of its parent, the parent becomes the current id, and we try repeat the step above.
  4. If the our row is the last child of the root, we return NULL for this call and all subsequent calls.

This is implemented as following:

Table creation details

CREATE FUNCTION hierarchy_connect_by_parent_eq_prior_id(value INT) RETURNS INT
NOT DETERMINISTIC
READS SQL DATA
BEGIN
        DECLARE _id INT;
        DECLARE _parent INT;
        DECLARE _next INT;
        DECLARE CONTINUE HANDLER FOR NOT FOUND SET @id = NULL;

        SET _parent = @id;
        SET _id = -1;

        IF @id IS NULL THEN
                RETURN NULL;
        END IF;

        LOOP
                SELECT  MIN(id)
                INTO    @id
                FROM    t_hierarchy
                WHERE   parent = _parent
                        AND id > _id;
                IF @id IS NOT NULL OR _parent = @start_with THEN
                        SET @level = @level + 1;
                        RETURN @id;
                END IF;
                SET @level := @level - 1;
                SELECT  id, parent
                INTO    _id, _parent
                FROM    t_hierarchy
                WHERE   id = _parent;
        END LOOP;
END

As you can see, this function takes a value as an input agrument, but doesn’t actually use it. This is because MySQL ignores NON DETERMINISTIC clause in the function definition, caches it’s return value and doesn’t actually call it if it’s called with the same arguments.

To avoid this, we need some non-repeating value, that is id.

We use this function in a query as following:

SELECT  CONCAT(REPEAT('    ', level - 1), CAST(hi.id AS CHAR)) AS treeitem, parent, level
FROM    (
        SELECT  hierarchy_connect_by_parent_eq_prior_id(id) AS id, @level AS level
        FROM    (
                SELECT  @start_with := 0,
                        @id := @start_with,
                        @level := 0
                ) vars, t_hierarchy
        WHERE   @id IS NOT NULL
        ) ho
JOIN    t_hierarchy hi
ON      hi.id = ho.id

In the first subquery, we initialize the variables and set our root item.

Note that it’s unadvisable to use NULL as a root, because it’s not comparable with an equality condition and therefore cannot be easily searched for without syntax change.

We use a 0 instead, because real id‘s traditionally start with 1.

We check for @id IS NOT NULL in the WHERE clause to optimize the query. Our function is written so that @id, being once set to NULL, will always remain NULL, and we can skip remaining rows.

For each row of our rowset (which is used only as an iterator loop), the function returns next id in a tree. We then join another instance of t_hierarchy on this id.

As a result of the query, we get:

treeitem parent level
1 0 1
2 1 2
7 2 3
32 7 4
157 32 5
782 157 6
3907 782 7
19532 3907 8
19533 3907 8
19534 3907 8
19535 3907 8
19536 3907 8
3908 782 7
19537 3908 8
19538 3908 8
97651 19530 8
19531 3906 7
97652 19531 8
97653 19531 8
97654 19531 8
97655 19531 8
97656 19531 8
97656 rows fetched in 1,1108s (13,7909s)

To be continued.

Written by Quassnoi

March 17th, 2009 at 11:00 pm

Posted in MySQL

Comments are closed.