Tags in nested sets: efficient indexing at EXPLAIN EXTENDED

Tags in nested sets: efficient indexing

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Travis asks:

I just read your excellent post: Hierarchical data in MySQL: parents and children in one query.

I am currently trying to do something very similar to this. The main difference is, I am working with data that is in a nested set.

I would like to construct a query that will return a resultset of nodes matched with LIKE, the ancestors of each of these nodes, and the immediate children of each of these nodes.

This is a very interesting question: it allows to demonstrate the usage of three types of indexes usable by MyISAM tables: BTREE, SPATIAL and FULLTEXT.

Nested sets is one of two models that are most often used to store hierarchical data in a relational table (the other one being adjacency list model). Unlike adjacency list, nested sets does not require ability to write recursive queries, which SQL originally lacked and still lacks now in MySQL (albeit it can be emulated to some extent). It is widely used in MySQL world.

I described both methods and their comparison in the article I wrote some time ago:

Adjacency list vs. nested sets: MySQL

The main problem with the nested sets model is that though it is extremely fast with selecting descendants of a given node, it is very slow in selecting ancestors.

This is because of the way the B-Tree indexes work. They can be used to query for values of a column within the range of two constants, but not for the values of two columns holding a single constant between them. One needs the first condition to select the children (found between the lft and rgt of the given node), and the second condition to select the ancestor (with lft and rgt containing the lft and rgt of the given node).

That's why selecting the children is fast and selecting the ancestors is slow.

To work around this, the sets that form the hierarchy can be described as geometrical objects, with the larger sets containing the smaller sets. These sets can be indexed with a SPATIAL index which is designed specially for this purpose and both children and ancestors can be queried for very efficiently.

Unfortunately, finding the depth level is quite a hard task for the nested sets model even with the SPATIAL indexes.

It would be quite an easy task is MySQL supported recursion: we could just run a query to find the siblings of each record by skipping their whole domains recursively.

However, MySQL's recursion support is very limited and it relies on the session variables, which are not recommended to use in the complex queries.

To cope with this, we need to mix the nested sets and the adjacency list models. Hierarchy will be stored in two seemingly redundant ways: the unique parent and the LineString representing the nested sets.

This will help us to use the R-Tree index to find all ancestors of a given node and also use B-Tree index to find its immediate children.

Finally, the question mentions using LIKE to find the initial nodes. LIKE predicate with the leading wildcards is not sargable in MySQL. However, it seems that the leading wildcards are only used to split the words. In this case, a FULLTEXT index and the MATCH query would be much more efficient, since FULLTEXT index allows indexing a single record with several keys (each one corresponding to a single word in the column's text), so a search for the word in the space separated or a comma separated list uses the index and is much faster than scanning the whole table.

Hence, the query would use all three main types of indexes: BTREE, SPATIAL and FULLTEXT.

To illustrate everything said above, let's create a sample table:

Table creation details

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_hierarchy (
id INT NOT NULL PRIMARY KEY,
parent INT NOT NULL,
lft INT NOT NULL,
rgt INT NOT NULL,
sets LineString NOT NULL,
data VARCHAR(100) NOT NULL,
tags TEXT NOT NULL
) ENGINE=MyISAM;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END;

CREATE PROCEDURE prc_hierarchy(width INT)
main:BEGIN
        DECLARE last INT;
        DECLARE level INT;
        SET last = 0;
        SET level = 0;
        WHILE width &gt;= 1 DO
INSERT
INTO    t_hierarchy
SELECT  COALESCE(h.id, 0) * 5 + f.id,
COALESCE(h.id, 0),
COALESCE(h.lft, 0) + 1 + (f.id - 1) * width,
COALESCE(h.lft, 0) + f.id * width,
LineString(
Point(-1, COALESCE(h.lft, 0) + 1 + (f.id - 1) * width),
Point(1, COALESCE(h.lft, 0) + f.id * width)
),
CONCAT('Value ', COALESCE(h.id, 0) * 5 + f.id),
(
SELECT  GROUP_CONCAT(CONCAT('tag', FLOOR(RAND(20100216 + width * 1000) * 300000)) SEPARATOR ' ')
FROM    (
SELECT  1
UNION ALL
SELECT  1
UNION ALL
SELECT  1
) q
)
FROM    filler f
LEFT JOIN
t_hierarchy h
ON      h.id &gt;= last;
SET width = width / 5;
SET last = last + POWER(5, level);
SET level = level + 1;
END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(5);
CALL prc_hierarchy(117187);
COMMIT;

CREATE INDEX ix_hierarchy_parent ON t_hierarchy (parent);
CREATE UNIQUE INDEX ix_hierarchy_lft ON t_hierarchy (lft);
CREATE UNIQUE INDEX ix_hierarchy_rgt ON t_hierarchy (rgt);
CREATE SPATIAL INDEX sx_hierarchy_sets ON t_hierarchy (sets);
CREATE FULLTEXT INDEX fx_hierarchy_tags ON t_hierarchy (tags);

This table represents a 7 level hierarchy, with 5 children to each non-leaf item and combines nested sets and adjacency list models.

The sets are stored in the plain lft and rgt columns as well as in the combined column of type LineString which represents a diagonal of a box horizontally spanning the interval from lft to rgt.

Selecting nodes using FULLTEXT

First, let's select the nodes tagged with a given tag.

RLIKE can be used for that but it is not very efficient:

SELECT  id, tags
FROM    t_hierarchy
WHERE   tags RLIKE '[[:&lt;:]]tag13480[[:&gt;:]]'

View query results

id	tags
3	tag13480 tag33087 tag124996
440166	tag248489 tag271789 tag13480
212847	tag104605 tag13480 tag53585
161378	tag181040 tag231320 tag13480
324394	tag13480 tag181947 tag269297
400074	tag13480 tag176642 tag242772
6 rows fetched in 0.0006s (1.2500s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	t_hierarchy	ALL					488280	100.00	Using where

select `20100216_index`.`t_hierarchy`.`id` AS `id`,`20100216_index`.`t_hierarchy`.`tags` AS `tags` from `20100216_index`.`t_hierarchy` where (`20100216_index`.`t_hierarchy`.`tags` regexp '[[:<:]]tag13480[[:>:]]')

This query uses full table scan and runs for 1.25 second.

To improve the query, we should rewrite the WHERE condition using MATCH predicate (which in its turn allows a FULLTEXT index to be used for the search):

SELECT  id, tags
FROM    t_hierarchy
WHERE   MATCH(tags) AGAINST('+tag13480' IN BOOLEAN MODE)

id	tags
3	tag13480 tag33087 tag124996
440166	tag248489 tag271789 tag13480
212847	tag104605 tag13480 tag53585
161378	tag181040 tag231320 tag13480
324394	tag13480 tag181947 tag269297
400074	tag13480 tag176642 tag242772
6 rows fetched in 0.0005s (0.0043s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	t_hierarchy	fulltext	fx_hierarchy_tags	fx_hierarchy_tags	0		1	100.00	Using where

select `20100216_index`.`t_hierarchy`.`id` AS `id`,`20100216_index`.`t_hierarchy`.`tags` AS `tags` from `20100216_index`.`t_hierarchy` where (match `20100216_index`.`t_hierarchy`.`tags` against ('+tag13480' in boolean mode))

This query returns the same results but does it much faster, in only 4 ms.

Selecting ancestors using SPATIAL

Now, when we have a list of nodes, we should build the list of ancestors for each node.

Since our model combines adjacency list and nested sets, it is possible to use either representation to build a query. However, the adjacency list model requires recursion, and, while it is possible to emulate it, it only works for a single-node query.

With nested sets, selecting the list of ancestors is much more simple: we should just selecting all records whose sets contains the sets of the node. This can be done using MBRContains (which is capable of using the SPATIAL index).

The query, however, will return us the ancestors in a plain list. To find out the level of each ancestor, we should put some more effort. Since the sets are nested and the lft and rgt fields naturally maintain the hierarchical order, it is enough just to enumerate the ancestors in that order. It would be very simple to do — if only MySQL supported ROW_NUMBER(). But it doesn't, of course, so to enumerate the ancestors we should self-join them and just count the number of each ancestor's ancestors:

SELECT  hc.id, ha.id, COUNT(*) AS cnt
FROM    t_hierarchy hc
STRAIGHT_JOIN
t_hierarchy ha
ON      MBRContains(ha.sets, hc.sets)
STRAIGHT_JOIN
t_hierarchy hcnt
ON      MBRContains(hcnt.sets, ha.sets)
WHERE   MATCH(hc.tags) AGAINST('+tag13480' IN BOOLEAN MODE)
GROUP BY
hc.id, ha.id
ORDER BY
hc.id, cnt

View query results

id	id	cnt
3	3	1
161378	1	1
161378	10	2
161378	51	3
161378	257	4
161378	1290	5
161378	6454	6
161378	32275	7
161378	161378	8
212847	2	1
212847	13	2
212847	67	3
212847	340	4
212847	1702	5
212847	8513	6
212847	42569	7
212847	212847	8
324394	3	1
324394	20	2
324394	103	3
324394	518	4
324394	2594	5
324394	12975	6
324394	64878	7
324394	324394	8
400074	4	1
400074	25	2
400074	127	3
400074	639	4
400074	3200	5
400074	16002	6
400074	80014	7
400074	400074	8
440166	5	1
440166	27	2
440166	140	3
440166	704	4
440166	3521	5
440166	17606	6
440166	88033	7
440166	440166	8
41 rows fetched in 0.0036s (0.0258s)

id	select_type	table	type	possible_keys	key	key_len	rows	filtered	Extra
1	SIMPLE	hc	fulltext	sx_hierarchy_sets,fx_hierarchy_tags	fx_hierarchy_tags	0	1	100.00	Using where; Using temporary; Using filesort
1	SIMPLE	ha	ALL	sx_hierarchy_sets			488280	100.00	Range checked for each record (index map: 0x10)
1	SIMPLE	hcnt	ALL	sx_hierarchy_sets			488280	100.00	Range checked for each record (index map: 0x10)

select `20100216_index`.`hc`.`id` AS `id`,`20100216_index`.`ha`.`id` AS `id`,count(0) AS `cnt` from `20100216_index`.`t_hierarchy` `hc` straight_join `20100216_index`.`t_hierarchy` `ha` straight_join `20100216_index`.`t_hierarchy` `hcnt` where ((match `20100216_index`.`hc`.`tags` against ('+tag13480' in boolean mode)) and contains(`20100216_index`.`hcnt`.`sets`,`20100216_index`.`ha`.`sets`) and contains(`20100216_index`.`ha`.`sets`,`20100216_index`.`hc`.`sets`)) group by `20100216_index`.`hc`.`id`,`20100216_index`.`ha`.`id` order by `20100216_index`.`hc`.`id`,count(0)

The algorithm that calculates the level of each ancestor is not very efficient, however, it does its job quite well, and the query completes in only 20 ms.

Selecting immediate children using BTREE

Selecting the immediate children is the easiest task: we just need an equijoin on parent. The level of each of the node's children will be that of the node plus 1, so calculating it is quite simple too:

SELECT  h.id, hc.id, cnt + 1
FROM    (
SELECT  hn.id, COUNT(*) AS cnt
FROM    t_hierarchy hn
STRAIGHT_JOIN
t_hierarchy hcnt
ON      MBRContains(hcnt.sets, hn.sets)
WHERE   MATCH(hn.tags) AGAINST('+tag13480' IN BOOLEAN MODE)
GROUP BY
hn.id
) h
JOIN    t_hierarchy hc
ON      hc.parent = h.id

View query results

id	id	cnt + 1
3	16	2
3	17	2
3	18	2
3	19	2
3	20	2
5 rows fetched in 0.0006s (0.0093s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					6	100.00
1	PRIMARY	hc	ref	ix_hierarchy_parent	ix_hierarchy_parent	4	h.id	9207	100.01
2	DERIVED	hn	fulltext	sx_hierarchy_sets,fx_hierarchy_tags	fx_hierarchy_tags	0		1	100.00	Using where; Using temporary; Using filesort
2	DERIVED	hcnt	range	sx_hierarchy_sets	sx_hierarchy_sets	34		1	48828000.00	Range checked for each record (index map: 0x10)

select `h`.`id` AS `id`,`20100216_index`.`hc`.`id` AS `id`,(`h`.`cnt` + 1) AS `cnt + 1` from (select `20100216_index`.`hn`.`id` AS `id`,count(0) AS `cnt` from `20100216_index`.`t_hierarchy` `hn` straight_join `20100216_index`.`t_hierarchy` `hcnt` where ((match `20100216_index`.`hn`.`tags` against ('+tag13480' in boolean mode)) and contains(`20100216_index`.`hcnt`.`sets`,`20100216_index`.`hn`.`sets`)) group by `20100216_index`.`hn`.`id`) `h` join `20100216_index`.`t_hierarchy` `hc` where (`20100216_index`.`hc`.`parent` = `h`.`id`)

Putting it together

Now we should just combine the two queries and apply nice formatting to them:

SELECT  h.id,
CONCAT(LPAD('', (level - 1) * 2, ' '), h.data) AS name,
CASE WHEN hq.id = hq.node THEN '*' ELSE '' END AS hit
FROM    (
SELECT  hc.id AS node, ha.id AS id, COUNT(*) AS level
FROM    t_hierarchy hc
STRAIGHT_JOIN
t_hierarchy ha
ON      MBRContains(ha.sets, hc.sets)
STRAIGHT_JOIN
t_hierarchy hcnt
ON      MBRContains(hcnt.sets, ha.sets)
WHERE   MATCH(hc.tags) AGAINST('+tag13480' IN BOOLEAN MODE)
GROUP BY
hc.id, ha.id
UNION ALL
SELECT  h.id, hc.id, cnt + 1 AS level
FROM    (
SELECT  hn.id, COUNT(*) AS cnt
FROM    t_hierarchy hn
STRAIGHT_JOIN
t_hierarchy hcnt
ON      MBRContains(hcnt.sets, hn.sets)
WHERE   MATCH(hn.tags) AGAINST('+tag13480' IN BOOLEAN MODE)
GROUP BY
hn.id
) h
JOIN    t_hierarchy hc
ON      hc.parent = h.id
) hq
JOIN    t_hierarchy h
ON      h.id = hq.id
ORDER BY
node, level, lft

id	name	hit
3	Value 3	*
16	Value 16
17	Value 17
18	Value 18
19	Value 19
20	Value 20
1	Value 1
10	Value 10
51	Value 51
257	Value 257
1290	Value 1290
6454	Value 6454
32275	Value 32275
161378	Value 161378	*
2	Value 2
13	Value 13
67	Value 67
340	Value 340
1702	Value 1702
8513	Value 8513
42569	Value 42569
212847	Value 212847	*
3	Value 3
20	Value 20
103	Value 103
518	Value 518
2594	Value 2594
12975	Value 12975
64878	Value 64878
324394	Value 324394	*
4	Value 4
25	Value 25
127	Value 127
639	Value 639
3200	Value 3200
16002	Value 16002
80014	Value 80014
400074	Value 400074	*
5	Value 5
27	Value 27
140	Value 140
704	Value 704
3521	Value 3521
17606	Value 17606
88033	Value 88033
440166	Value 440166	*
46 rows fetched in 0.0039s (0.0466s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					46	100.00	Using temporary; Using filesort
1	PRIMARY	h	eq_ref	PRIMARY	PRIMARY	4	hq.id	1	100.00
2	DERIVED	hc	fulltext	sx_hierarchy_sets,fx_hierarchy_tags	fx_hierarchy_tags	0		1	100.00	Using where; Using temporary; Using filesort
2	DERIVED	ha	range	sx_hierarchy_sets	sx_hierarchy_sets	34		1	48828000.00	Range checked for each record (index map: 0x10)
2	DERIVED	hcnt	range	sx_hierarchy_sets	sx_hierarchy_sets	34		1	48828000.00	Range checked for each record (index map: 0x10)
3	UNION	<derived4>	ALL					6	100.00
3	UNION	hc	ref	ix_hierarchy_parent	ix_hierarchy_parent	4	h.id	9207	100.01
4	DERIVED	hn	fulltext	sx_hierarchy_sets,fx_hierarchy_tags	fx_hierarchy_tags	0		1	100.00	Using where; Using temporary; Using filesort
4	DERIVED	hcnt	range	sx_hierarchy_sets	sx_hierarchy_sets	34		1	48828000.00	Range checked for each record (index map: 0x10)
	UNION RESULT	<union2,3>	ALL

select `20100216_index`.`h`.`id` AS `id`,concat(convert(lpad('',((`hq`.`level` - 1) * 2),' ') using utf8),`20100216_index`.`h`.`data`) AS `name`,(case when (`hq`.`id` = `hq`.`node`) then '*' else '' end) AS `hit` from (select `20100216_index`.`hc`.`id` AS `node`,`20100216_index`.`ha`.`id` AS `id`,count(0) AS `level` from `20100216_index`.`t_hierarchy` `hc` straight_join `20100216_index`.`t_hierarchy` `ha` straight_join `20100216_index`.`t_hierarchy` `hcnt` where (match `20100216_index`.`hc`.`tags` against ('+tag13480' in boolean mode)) group by `20100216_index`.`hc`.`id`,`20100216_index`.`ha`.`id` union all select `h`.`id` AS `id`,`20100216_index`.`hc`.`id` AS `id`,(`h`.`cnt` + 1) AS `level` from (select `20100216_index`.`hn`.`id` AS `id`,count(0) AS `cnt` from `20100216_index`.`t_hierarchy` `hn` straight_join `20100216_index`.`t_hierarchy` `hcnt` where ((match `20100216_index`.`hn`.`tags` against ('+tag13480' in boolean mode)) and contains(`20100216_index`.`hcnt`.`sets`,`20100216_index`.`hn`.`sets`)) group by `20100216_index`.`hn`.`id`) `h` join `20100216_index`.`t_hierarchy` `hc`) `hq` join `20100216_index`.`t_hierarchy` `h` where (`20100216_index`.`h`.`id` = `hq`.`id`) order by `hq`.`node`,`hq`.`level`,`20100216_index`.`h`.`lft`

For each matching node, the query returns the node itself, its ancestors and its immediate children. The matching nodes are marked with the asterisk in the field hit.

The query efficiently combines the FULLTEXT, SPATIAL and BTREE indexes and completes in only 40 ms.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

February 16th, 2010 at 11:00 pm

Posted in MySQL

EXPLAIN EXTENDED