Hierarchical queries in MySQL: finding loops
This is a series of articles on hierarchical queries in MySQL:
- Hierarchical queries in MySQL
- Hierarchical queries in MySQL: adding level
- Hierarchical queries in MySQL: adding ancestry chains.
- Hierarchical queries in MySQL: finding leaves
- Hierarchical queries in MySQL: finding loops
See also:
Today, we will check our structure for loops.
Loops in hierarchical queries occur when a row contains itself in its ancestry chain.
In general, it's a sign of a database logic flaw and should be avioded, but sometimes it's deliberately used.
The simplest and most easily detectable case is having id
equal to parent
(meaning that the row is both its father and child). This may be tested by using a simple comparison condition.
But a loop of course can be more complex: a row can be its own grandfather, grand-grandfather etc.
To detect such loops, we will improve our functions a little.
We'll add a loop to our data:
SELECT * FROM t_hierarchy WHERE id >= 97657
id | parent | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
97657 | 97658 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
97658 | 97659 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
97659 | 97657 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
97660 | 97657 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 rows fetched in 0.0001s (0.0016s) |
Our old function will be fooled by this loop and it will return the values over and over again each time it will be called.
To avoid this, we will rewrite our functions a little.
First, we need CONNECT BY
to break when a loop is detected. On each iteration, we test if the current @id
is equal to @start_with
. If yes, then we have a loop in our data and it's just the right time to stop iterating.
It's quite straightforward:
CREATE FUNCTION hierarchy_connect_by_parent_eq_prior_id_with_level_and_loop(value INT, maxlevel INT) RETURNS INT NOT DETERMINISTIC READS SQL DATA BEGIN DECLARE _id INT; DECLARE _parent INT; DECLARE _next INT; DECLARE _i INT; DECLARE CONTINUE HANDLER FOR NOT FOUND SET @id = NULL; SET _parent = @id; SET _id = -1; SET _i = 0; IF @id IS NULL THEN RETURN NULL; END IF; LOOP SELECT MIN(id) INTO @id FROM t_hierarchy WHERE parent = _parent AND id > _id -- Checking for @start_with in descendants AND id <> @start_with AND COALESCE(@level < maxlevel, TRUE); IF @id IS NOT NULL OR _parent = @start_with THEN SET @level = @level + 1; RETURN @id; END IF; SET @level := @level - 1; SELECT id, parent INTO _id, _parent FROM t_hierarchy WHERE id = _parent; SET _i = _i + 1; END LOOP; RETURN NULL; END
Second, we need to check if a current row is inside a hierarchy loop. The check should return true if the row itself is among its descendants. We probably could iterate all descendants and see if we encounter the row's id
.
But a row can have more than one descendant, so this algorithm would be quite inefficient.
We can exploit the fact that if a row is among its descendants, then it should also be among its ancestors too.
Ancestors, unlike descendants, are unique for any given generation (i. e. a row can have exactly one father, exactly one grandfather etc). That means that if we encounter any ancestor more than once then we have a loop somewhere in the ancestry chain.
But having a loop in the ancestry chain does not necessarily mean being inside the loop. Rows, as we can remember, can have more than one child. If a row has two children, it may have itself somewhere among the first child's descendants, but not among the second child's ones.
In the sample data above, rows 97658 and 97660 both have 97557 as a parent, but the first one is inside the cycle (it has its parent among its descendants) while the second one is not.
To check for the cycle, we will do the following:
- We iterate up the ancestry chain until we find an orphan row or encounter
@start_with
twice. - Each time we find the current row in it's own ancestry chain, we increment a counter.
- Having an orphan row in the ancestry chain mean not having a cycle, in this case we just return 0.
- Encountering
@start_with
twice means completing a cycle. If the current row is contained in the cycle, then we should have encountered it exactly one time. If it is not in the cycle, we shouldn't have encountered it at all. That's why we just return the counter value as a boolean result.
The function will look like this:
CREATE FUNCTION hierarchy_connect_by_iscycle(node INT) RETURNS INT NOT DETERMINISTIC READS SQL DATA BEGIN DECLARE _id INT; DECLARE _loop INT; DECLARE _node INT; DECLARE EXIT HANDLER FOR NOT FOUND RETURN 0; SET _id = COALESCE(node, @id); SET _loop = 0; SET _node = 0; LOOP SELECT parent INTO _id FROM t_hierarchy WHERE id = _id; IF _id = @start_with THEN SET _loop := _loop + 1; END IF; IF _id = COALESCE(node, @id) THEN SET _node = _node + 1; END IF; IF _loop >= 2 THEN RETURN _node; END IF; END LOOP; END
Finally, we change the way we check the rows for being leaves, namely, we not just select the first child of a row, but we select the first child not equal to @start_with
. If @start_with
is the only child of a row, the rows is counted as a leaf, as @start_with
will never be selected as a child.
Combining all said above, we come out with the following query:
SELECT CONCAT(REPEAT(' ', lvl - 1), hi.id) AS treeitem, hierarchy_sys_connect_by_path('/', hi.id) AS path, parent, lvl, CASE WHEN lvl >= @maxlevel THEN 1 ELSE COALESCE( ( SELECT 0 FROM t_hierarchy hl WHERE hl.parent = ho.id AND hl.id <> @start_with LIMIT 1 ), 1) END AS is_leaf, hierarchy_connect_by_iscycle(hi.id) AS is_cycle FROM ( SELECT hierarchy_connect_by_parent_eq_prior_id_with_level_and_loop(id, @maxlevel) AS id, CAST(@level AS SIGNED) AS lvl FROM ( SELECT @start_with := 97657, @id := @start_with, @level := 0, @maxlevel := NULL ) vars, t_hierarchy WHERE @id IS NOT NULL ) ho JOIN t_hierarchy hi ON hi.id = ho.id
treeitem | path | parent | lvl | is_leaf | is_cycle | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
97657 | 97658/97657 | 97658 | 1 | 0 | 1 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
97659 | 97658/97657/97659 | 97657 | 2 | 1 | 1 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
97660 | 97658/97657/97660 | 97657 | 2 | 1 | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 rows fetched in 0.0003s (0.0590s) |
Note row 97659 appearing as a leaf despite being parent for 97658. That's because we use 97658 in a @start_with
condition.
I know this is an old thread, just wanted to thank you for this series, they are still awesome!
Niklas B
2 Feb 13 at 17:32
This is an old post but still usefull !
thanks a lot for the serie !
crozet
3 Aug 15 at 17:45
Thank you for this great article. It really helps me to save some time. I am going to learn your another articles also.
Denis I
6 Sep 17 at 15:17
I wonder if there is a way to manage a multiple root node forest, when “start with” clause extract a set of records and not a single one.
In Oracle, for example, we can specify a “start with” clause like “start with parent is null” to find all the root nodes of a forest.
falco
27 Feb 18 at 12:35