Tags in nested sets: efficient indexing
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Travis asks:
I just read your excellent post: Hierarchical data in MySQL: parents and children in one query.
I am currently trying to do something very similar to this. The main difference is, I am working with data that is in a nested set.
I would like to construct a query that will return a resultset of nodes matched with
LIKE
, the ancestors of each of these nodes, and the immediate children of each of these nodes.
This is a very interesting question: it allows to demonstrate the usage of three types of indexes usable by MyISAM tables: BTREE
, SPATIAL
and FULLTEXT
.
Nested sets is one of two models that are most often used to store hierarchical data in a relational table (the other one being adjacency list model). Unlike adjacency list, nested sets does not require ability to write recursive queries, which SQL originally lacked and still lacks now in MySQL (albeit it can be emulated to some extent). It is widely used in MySQL world.
I described both methods and their comparison in the article I wrote some time ago:
The main problem with the nested sets model is that though it is extremely fast with selecting descendants of a given node, it is very slow in selecting ancestors.
This is because of the way the B-Tree indexes work. They can be used to query for values of a column within the range of two constants, but not for the values of two columns holding a single constant between them. One needs the first condition to select the children (found between the lft
and rgt
of the given node), and the second condition to select the ancestor (with lft
and rgt
containing the lft
and rgt
of the given node).
That's why selecting the children is fast and selecting the ancestors is slow.
To work around this, the sets that form the hierarchy can be described as geometrical objects, with the larger sets containing the smaller sets. These sets can be indexed with a SPATIAL
index which is designed specially for this purpose and both children and ancestors can be queried for very efficiently.
Unfortunately, finding the depth level is quite a hard task for the nested sets model even with the SPATIAL
indexes.
It would be quite an easy task is MySQL supported recursion: we could just run a query to find the siblings of each record by skipping their whole domains recursively.
However, MySQL's recursion support is very limited and it relies on the session variables, which are not recommended to use in the complex queries.
To cope with this, we need to mix the nested sets and the adjacency list models. Hierarchy will be stored in two seemingly redundant ways: the unique parent
and the LineString
representing the nested sets.
This will help us to use the R-Tree index to find all ancestors of a given node and also use B-Tree index to find its immediate children.
Finally, the question mentions using LIKE
to find the initial nodes. LIKE
predicate with the leading wildcards is not sargable in MySQL. However, it seems that the leading wildcards are only used to split the words. In this case, a FULLTEXT
index and the MATCH
query would be much more efficient, since FULLTEXT
index allows indexing a single record with several keys (each one corresponding to a single word in the column's text), so a search for the word in the space separated or a comma separated list uses the index and is much faster than scanning the whole table.
Hence, the query would use all three main types of indexes: BTREE
, SPATIAL
and FULLTEXT
.
To illustrate everything said above, let's create a sample table:
Table creation details
This table represents a 7 level hierarchy, with 5 children to each non-leaf item and combines nested sets and adjacency list models.
The sets are stored in the plain lft
and rgt
columns as well as in the combined column of type LineString
which represents a diagonal of a box horizontally spanning the interval from lft
to rgt
.
Selecting nodes using FULLTEXT
First, let's select the nodes tagged with a given tag.
RLIKE
can be used for that but it is not very efficient:
SELECT id, tags FROM t_hierarchy WHERE tags RLIKE '[[:<:]]tag13480[[:>:]]'
This query uses full table scan and runs for 1.25 second.
To improve the query, we should rewrite the WHERE
condition using MATCH
predicate (which in its turn allows a FULLTEXT
index to be used for the search):
SELECT id, tags FROM t_hierarchy WHERE MATCH(tags) AGAINST('+tag13480' IN BOOLEAN MODE)
id | tags | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | tag13480 tag33087 tag124996 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
440166 | tag248489 tag271789 tag13480 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
212847 | tag104605 tag13480 tag53585 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
161378 | tag181040 tag231320 tag13480 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
324394 | tag13480 tag181947 tag269297 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
400074 | tag13480 tag176642 tag242772 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6 rows fetched in 0.0005s (0.0043s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | t_hierarchy | fulltext | fx_hierarchy_tags | fx_hierarchy_tags | 0 | 1 | 100.00 | Using where |
select `20100216_index`.`t_hierarchy`.`id` AS `id`,`20100216_index`.`t_hierarchy`.`tags` AS `tags` from `20100216_index`.`t_hierarchy` where (match `20100216_index`.`t_hierarchy`.`tags` against ('+tag13480' in boolean mode))
This query returns the same results but does it much faster, in only 4 ms.
Selecting ancestors using SPATIAL
Now, when we have a list of nodes, we should build the list of ancestors for each node.
Since our model combines adjacency list and nested sets, it is possible to use either representation to build a query. However, the adjacency list model requires recursion, and, while it is possible to emulate it, it only works for a single-node query.
With nested sets, selecting the list of ancestors is much more simple: we should just selecting all records whose sets
contains the sets
of the node. This can be done using MBRContains
(which is capable of using the SPATIAL
index).
The query, however, will return us the ancestors in a plain list. To find out the level of each ancestor, we should put some more effort. Since the sets are nested and the lft
and rgt
fields naturally maintain the hierarchical order, it is enough just to enumerate the ancestors in that order. It would be very simple to do — if only MySQL supported ROW_NUMBER()
. But it doesn't, of course, so to enumerate the ancestors we should self-join them and just count the number of each ancestor's ancestors:
SELECT hc.id, ha.id, COUNT(*) AS cnt FROM t_hierarchy hc STRAIGHT_JOIN t_hierarchy ha ON MBRContains(ha.sets, hc.sets) STRAIGHT_JOIN t_hierarchy hcnt ON MBRContains(hcnt.sets, ha.sets) WHERE MATCH(hc.tags) AGAINST('+tag13480' IN BOOLEAN MODE) GROUP BY hc.id, ha.id ORDER BY hc.id, cnt
The algorithm that calculates the level of each ancestor is not very efficient, however, it does its job quite well, and the query completes in only 20 ms.
Selecting immediate children using BTREE
Selecting the immediate children is the easiest task: we just need an equijoin on parent
. The level of each of the node's children will be that of the node plus 1, so calculating it is quite simple too:
SELECT h.id, hc.id, cnt + 1 FROM ( SELECT hn.id, COUNT(*) AS cnt FROM t_hierarchy hn STRAIGHT_JOIN t_hierarchy hcnt ON MBRContains(hcnt.sets, hn.sets) WHERE MATCH(hn.tags) AGAINST('+tag13480' IN BOOLEAN MODE) GROUP BY hn.id ) h JOIN t_hierarchy hc ON hc.parent = h.id
Putting it together
Now we should just combine the two queries and apply nice formatting to them:
SELECT h.id, CONCAT(LPAD('', (level - 1) * 2, ' '), h.data) AS name, CASE WHEN hq.id = hq.node THEN '*' ELSE '' END AS hit FROM ( SELECT hc.id AS node, ha.id AS id, COUNT(*) AS level FROM t_hierarchy hc STRAIGHT_JOIN t_hierarchy ha ON MBRContains(ha.sets, hc.sets) STRAIGHT_JOIN t_hierarchy hcnt ON MBRContains(hcnt.sets, ha.sets) WHERE MATCH(hc.tags) AGAINST('+tag13480' IN BOOLEAN MODE) GROUP BY hc.id, ha.id UNION ALL SELECT h.id, hc.id, cnt + 1 AS level FROM ( SELECT hn.id, COUNT(*) AS cnt FROM t_hierarchy hn STRAIGHT_JOIN t_hierarchy hcnt ON MBRContains(hcnt.sets, hn.sets) WHERE MATCH(hn.tags) AGAINST('+tag13480' IN BOOLEAN MODE) GROUP BY hn.id ) h JOIN t_hierarchy hc ON hc.parent = h.id ) hq JOIN t_hierarchy h ON h.id = hq.id ORDER BY node, level, lft
id | name | hit | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Value 3 | * | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
16 | Value 16 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
17 | Value 17 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
18 | Value 18 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
19 | Value 19 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
20 | Value 20 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 | Value 1 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10 | Value 10 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
51 | Value 51 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
257 | Value 257 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1290 | Value 1290 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6454 | Value 6454 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
32275 | Value 32275 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
161378 | Value 161378 | * | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | Value 2 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
13 | Value 13 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
67 | Value 67 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
340 | Value 340 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1702 | Value 1702 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8513 | Value 8513 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
42569 | Value 42569 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
212847 | Value 212847 | * | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | Value 3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
20 | Value 20 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
103 | Value 103 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
518 | Value 518 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2594 | Value 2594 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
12975 | Value 12975 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
64878 | Value 64878 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
324394 | Value 324394 | * | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | Value 4 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
25 | Value 25 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
127 | Value 127 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
639 | Value 639 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3200 | Value 3200 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
16002 | Value 16002 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
80014 | Value 80014 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
400074 | Value 400074 | * | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 | Value 5 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
27 | Value 27 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
140 | Value 140 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
704 | Value 704 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3521 | Value 3521 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
17606 | Value 17606 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
88033 | Value 88033 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
440166 | Value 440166 | * | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
46 rows fetched in 0.0039s (0.0466s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 46 | 100.00 | Using temporary; Using filesort | ||||
1 | PRIMARY | h | eq_ref | PRIMARY | PRIMARY | 4 | hq.id | 1 | 100.00 | |
2 | DERIVED | hc | fulltext | sx_hierarchy_sets,fx_hierarchy_tags | fx_hierarchy_tags | 0 | 1 | 100.00 | Using where; Using temporary; Using filesort | |
2 | DERIVED | ha | range | sx_hierarchy_sets | sx_hierarchy_sets | 34 | 1 | 48828000.00 | Range checked for each record (index map: 0x10) | |
2 | DERIVED | hcnt | range | sx_hierarchy_sets | sx_hierarchy_sets | 34 | 1 | 48828000.00 | Range checked for each record (index map: 0x10) | |
3 | UNION | <derived4> | ALL | 6 | 100.00 | |||||
3 | UNION | hc | ref | ix_hierarchy_parent | ix_hierarchy_parent | 4 | h.id | 9207 | 100.01 | |
4 | DERIVED | hn | fulltext | sx_hierarchy_sets,fx_hierarchy_tags | fx_hierarchy_tags | 0 | 1 | 100.00 | Using where; Using temporary; Using filesort | |
4 | DERIVED | hcnt | range | sx_hierarchy_sets | sx_hierarchy_sets | 34 | 1 | 48828000.00 | Range checked for each record (index map: 0x10) | |
UNION RESULT | <union2,3> | ALL |
select `20100216_index`.`h`.`id` AS `id`,concat(convert(lpad('',((`hq`.`level` - 1) * 2),' ') using utf8),`20100216_index`.`h`.`data`) AS `name`,(case when (`hq`.`id` = `hq`.`node`) then '*' else '' end) AS `hit` from (select `20100216_index`.`hc`.`id` AS `node`,`20100216_index`.`ha`.`id` AS `id`,count(0) AS `level` from `20100216_index`.`t_hierarchy` `hc` straight_join `20100216_index`.`t_hierarchy` `ha` straight_join `20100216_index`.`t_hierarchy` `hcnt` where (match `20100216_index`.`hc`.`tags` against ('+tag13480' in boolean mode)) group by `20100216_index`.`hc`.`id`,`20100216_index`.`ha`.`id` union all select `h`.`id` AS `id`,`20100216_index`.`hc`.`id` AS `id`,(`h`.`cnt` + 1) AS `level` from (select `20100216_index`.`hn`.`id` AS `id`,count(0) AS `cnt` from `20100216_index`.`t_hierarchy` `hn` straight_join `20100216_index`.`t_hierarchy` `hcnt` where ((match `20100216_index`.`hn`.`tags` against ('+tag13480' in boolean mode)) and contains(`20100216_index`.`hcnt`.`sets`,`20100216_index`.`hn`.`sets`)) group by `20100216_index`.`hn`.`id`) `h` join `20100216_index`.`t_hierarchy` `hc`) `hq` join `20100216_index`.`t_hierarchy` `h` where (`20100216_index`.`h`.`id` = `hq`.`id`) order by `hq`.`node`,`hq`.`level`,`20100216_index`.`h`.`lft`
For each matching node, the query returns the node itself, its ancestors and its immediate children. The matching nodes are marked with the asterisk in the field hit
.
The query efficiently combines the FULLTEXT
, SPATIAL
and BTREE
indexes and completes in only 40 ms.
Hope that helps.
I'm always glad to answer the questions regarding database queries.