EXPLAIN EXTENDED

How to create fast database queries

Hierarchical query in MySQL: limiting parents

with 5 comments

Answering questions asked on the site.

James asks:

Your series on hierarchical queries in MySQL is tremendous! I'm using it to create a series of threaded conversations.

I'm wondering if there is a way to paginate these results.

Specifically, let's say I want to limit the conversations to return 10 root nodes (parent=0) and all of their children in a query.

I can't just limit the final query, because that will clip off children. I've tried to add LIMITs to your stored functions, but I'm not getting the magic just right.

How would you go about doing this?

A quick reminder: MySQL does not support recursion (either CONNECT BY style or recursive CTE style), so using an adjacency list model is a somewhat complicated task.

However, it is still possible. The main idea is storing the recursion state in a session variable and call a user-defined function repeatedly to iterate over the tree, thus emulating recursion. The article mentioned in the question shows how to do that.

Normally, reading and assigning session variables in the same query is discouraged in MySQL, since the order of evaluation is not guaranteed. However, in the case we only use the table as a dummy recordset and no values of the records are actually used in the function, so the actual values returned by the function are completely defined by the function itself. The table is only used to ensure that the function is called enough times, and to present its results in form of a native resultset (which can be returned or joined with).

To do something with the logic of the function (like, imposing a limit on the parent nodes without doing the same on the child nodes), we, therefore, should tweak the function code, not the query that calls the functions. The only thing that matters in such a query is the number of records returned and we don't know it in design time.

Limiting the parent nodes is quite simple: we just use another session variable to track the number of parent branches yet to be returned and stop processing as soon as the limit is hit, that is the variable becomes zero.

Let's create a sample table and see how to do this:

Table creation details

There are 100,000 hierarchical records in multiple trees (8 records in each tree).

To limit the number of trees returned, we create the function similar to the one created in the earlier posts and add a little condition that would decrease the session variable, @parent_limit, each time a parent entry is returned. When this variable hits zero, it's a signal to stop processing the records:

CREATE FUNCTION hierarchy_connect_by_parent_eq_prior_id(value INT) RETURNS INT
NOT DETERMINISTIC
READS SQL DATA
BEGIN
DECLARE _id INT;
DECLARE _parent INT;
DECLARE _next INT;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET @id = NULL;

SET _parent = @id;
SET _id = -1;

IF @id IS NULL THEN
RETURN NULL;
END IF;

LOOP
SELECT  MIN(id)
INTO    @id
FROM    t_hierarchy
WHERE   parent = _parent
AND id > _id;
IF @id IS NOT NULL OR _parent = @start_with THEN
SET @level = @level + 1;
IF _parent = @start_with AND @parent_limit > 0 THEN
SET @parent_limit = @parent_limit - 1;
END IF;
IF @parent_limit = 0 THEN
SET @id = NULL;
END IF;
RETURN @id;
END IF;
SET @level := @level - 1;
SELECT  id, parent
INTO    _id, _parent
FROM    t_hierarchy
WHERE   id = _parent;
END LOOP;
END;

Let's check it:

SELECT  CONCAT(REPEAT('    ', level - 1), CAST(hi.id AS CHAR)) AS treeitem, parent, level, lmt
FROM    (
SELECT  hierarchy_connect_by_parent_eq_prior_id(id) AS id, @level AS level, @parent_limit as lmt
FROM    (
SELECT  @start_with := 0,
@parent_limit := 4,
@id := @start_with,
@level := 0
) vars, t_hierarchy
WHERE   @id IS NOT NULL
) ho
JOIN    t_hierarchy hi
ON      hi.id = ho.id

treeitem parent level lmt
1 0 1 3
2 1 2 3
4 2 3 3
8 4 4 3
5 2 3 3
3 1 2 3
6 3 3 3
7 3 3 3
9 0 1 2
10 9 2 2
12 10 3 2
16 12 4 2
13 10 3 2
11 9 2 2
14 11 3 2
15 11 3 2
17 0 1 1
18 17 2 1
20 18 3 1
24 20 4 1
21 18 3 1
19 17 2 1
22 19 3 1
23 19 3 1
24 rows fetched in 0.0008s (0.0711s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 25 100.00
1 PRIMARY hi eq_ref PRIMARY PRIMARY 4 ho.id 1 100.00 Using where
2 DERIVED <derived3> system 1 100.00
2 DERIVED t_hierarchy index PRIMARY 4 100650 100.00 Using where; Using index
3 DERIVED No tables used
select concat(repeat('    ',(`ho`.`level` - 1)),cast(`20100418_limit`.`hi`.`id` as char charset latin1)) AS `treeitem`,`20100418_limit`.`hi`.`parent` AS `parent`,`ho`.`level` AS `level`,`ho`.`lmt` AS `lmt` from (select `hierarchy_connect_by_parent_eq_prior_id`(`20100418_limit`.`t_hierarchy`.`id`) AS `id`,(@level) AS `level`,(@parent_limit) AS `lmt` from (select (@start_with:=0) AS `@start_with := 0`,(@parent_limit:=4) AS `@parent_limit := 4`,(@id:=(@start_with)) AS `@id := @start_with`,(@level:=0) AS `@level := 0`) `vars` join `20100418_limit`.`t_hierarchy` where ((@id) is not null)) `ho` join `20100418_limit`.`t_hierarchy` `hi` where (`20100418_limit`.`hi`.`id` = `ho`.`id`)

The first 3 branches in a proper hierarchy, almost instantly. Note that we need set @parent_limit to 4, i. e. one greater than the value we need.

Note that it is possible to achieve the similar behavior without using any recursive functions at all.

On most forums the filtering is performed on the topic starters, so it's often a good idea to store the id of the topic starter along with each reply. In the sample table a did that as well.

With this, the filtering required becomes very simple:

SELECT  h.*
FROM    (
SELECT  id
FROM    t_hierarchy
WHERE   parent = 0
ORDER BY
id
LIMIT 3
) q
JOIN    t_hierarchy h
ON      h.root = q.id

id parent root
1 0 1
2 1 1
3 1 1
4 2 1
5 2 1
6 3 1
7 3 1
8 4 1
9 0 9
10 9 9
11 9 9
12 10 9
13 10 9
14 11 9
15 11 9
16 12 9
17 0 17
18 17 17
19 17 17
20 18 17
21 18 17
22 19 17
23 19 17
24 20 17
24 rows fetched in 0.0007s (0.0024s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 3 100.00
1 PRIMARY h ref ix_hierarchy_root ix_hierarchy_root 4 q.id 4 100.00 Using where
2 DERIVED t_hierarchy ref ix_hierarchy_parent ix_hierarchy_parent 4 16830 100.00 Using where; Using index
select `20100418_limit`.`h`.`id` AS `id`,`20100418_limit`.`h`.`parent` AS `parent`,`20100418_limit`.`h`.`root` AS `root` from (select `20100418_limit`.`t_hierarchy`.`id` AS `id` from `20100418_limit`.`t_hierarchy` where (`20100418_limit`.`t_hierarchy`.`parent` = 0) order by `20100418_limit`.`t_hierarchy`.`id` limit 3) `q` join `20100418_limit`.`t_hierarchy` `h` where (`20100418_limit`.`h`.`root` = `q`.`id`)

This solution (that had been used on numerous forum engines for ages) is more efficient, since no function calls are involved, and more simple too.

However, it does not preserve the hierarchical order and does not allow sorting on anything but the topic starter, so if you need anything of these, the recursive function is still a way to go.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

April 18th, 2010 at 11:00 pm

Posted in MySQL

5 Responses to 'Hierarchical query in MySQL: limiting parents'

Subscribe to comments with RSS

  1. Is there also a way to limit the result with just a few selected tree items?

    Like in the example above: all parents with id = 0 and only tree items with id IN (1,9)?

    ps, I’m really impressed by your work/ site, great!

    janwillem

    29 Jan 13 at 09:42

  2. Thank you! The second solution with simple JOIN query is exactly what I was looking for, for almost 2 years!

    One question though, why is it when I add ORDER BY h.root ASC for example, the query performance drops by a magnitude of at least a 1000. I found that if I wanted to ORDER the results I would need to wrap the whole set into SELECT * FROM (your solution here) ORDER BY xxxx.

    Any explanation to this?

    Denis Pshenov

    11 Feb 13 at 03:39

  3. @Denis: could you please post the whole queries (the slow one and the fast one) on http://sqlfiddle.com ?

    Quassnoi

    11 Feb 13 at 09:07

  4. Here is the slow query:
    http://sqlfiddle.com/#!2/600ad/4

    And here is the quick one: (notice no ORDER BY)
    http://sqlfiddle.com/#!2/600ad/5

    This is running against the wordpress wp_comments table. I see you run wordpress yourself so you can try it too.

    To solve the slow query problem and still be able to ORDER BY I did the following:
    http://sqlfiddle.com/#!2/600ad/6

    (Same behaviour is reproduced in all queries when ORDERing BY comment_date_gmt or any other field too)

    I wonder if you had an explanation why ORDER BY on a JOINED table does this.

    Warm Regards.

    Denis Pshenov

    11 Feb 13 at 21:22

  5. I must say that I forgot to add “AND comment_parent = 0” to the SELECT query that makes up the ‘q’ set in these examples. Please be advised about it when you check them.

    Denis Pshenov

    12 Feb 13 at 00:47

Leave a Reply