Finding loops in a tree at EXPLAIN EXTENDED

I have a view which matches up the needs and the wants, and discards the requests that cannot be fulfilled by any user, that looks like this:

Item Need Have

1 Bob George

2 George Herbie

3 Herbie Bob

4 Larry Wally

5 Wally Larry

6 John George

I am now trying to write a query where I can calculate the number of permutations of each user having their requested needs fulfilled.

For instance, in the example above, Bob, George, and Herbie can together fulfill each other's needs, as can Larry and Wally, but John cannot, so the total count would be 2.

Item	Need	Have
1	Bob	George
2	George	Herbie
3	Herbie	Bob
4	Larry	Wally
5	Wally	Larry
6	John	George

This task is actually about finding loops.

Whenever there is a set of users that fulfill each other's needs, there is a loop in the tree.

We can assume that either of the columns (need or have) is unique so there there is always a single path in the corresponding direction which ends with at most one loop.

It's not stated in the task, but otherwise it would be an equivalent of Clique problem which is NP complete and hardly solvable with SQL.

Now, let's create a sample table:

CREATE TABLE t_needs (
id NUMBER(10) NOT NULL,
name VARCHAR2(50) NOT NULL,
needs NUMBER(10) NOT NULL
)
/

ALTER TABLE t_needs ADD CONSTRAINT pk_needs_id PRIMARY KEY (id)
/

CREATE INDEX ix_needs_needs ON t_needs (needs)
/

BEGIN
DBMS_RANDOM.seed(20090518);
END;
/

INSERT
INTO    t_needs (id, name, needs)
SELECT  level, 'User ' || level, DBMS_RANDOM.value() * 999 + 1
FROM    dual
CONNECT BY
level &lt;= 1000
/

COMMIT
/

BEGIN
        DBMS_STATS.gather_schema_stats('"20090518_loops"');
END;
/

We have a tree table with 1,000 records and randomly filled needs (each needs corresponds to an existing user).

Now, we want to find all the loops in the table and count their lengths.

First, we should filter out the users that are not a part of any loop.

To do this, we will use Oracle's ability to detect loops in a path. We will select all paths possible (starting with each of given users) and select all records which are a part of a loop.

These records are very easy to detect: on the next step they would yield a records that had already been evaluated in the branch and therefore resides in the current recursion stack.

Oracle just examines the recursion stack on each step and sets a special condition if this step would return something that already had been returned.

Plain CONNECT BY query would be a little help to us, since any loop in this query would result in an immediate error.

But Oracle has a nice clause called CONNECT BY NOCYCLE, which does not throws an exception in case of a loop, but instead returns it in a special pseudocolumn called CONNECT_BY_ISCYCLE.

If we use it to filter the results of our query, we will get:

SELECT  id
FROM    "20090518_loops".t_needs n
WHERE   CONNECT_BY_ISCYCLE = 1
CONNECT BY NOCYCLE
needs = PRIOR id

ID
13
16
32

957
974
987
65 rows fetched in 0.0012s (0.3101s)

Now, we have a list 65 users that are self-sufficient within their loops, i. e. form a closed cycle of what they need and what they provide to each other.

But we still don't know which groups do they belong to, nor how many users are there in each group.

We can't use a simple filtering, since loops have neither beginning nor an end, and from any given row we cannot say which loop is it a part of, or how long the loop is: 2 records or 100.

But each loop, of course, has its minimal value. And we can consider this value to be a beginning of a loop. Since it's a loop we are talking about, any value is just as good as all other values.

To figure out the minimal id of each loop, we need to do the following:

For each value that is a part of a loop, we should unwind the loop (traverse it in the opposite direction) and find all values that constitute the loop.
For each loop, find the minimal id, and consider this minimal id the start of the loop.

This needs a little explanation. Imagine that values 2, 3, 4 and 5 form a loop (in this order).

If we start unwinding the loop from value 2, we will get the following results: 2, 5, 4, 3.

If we start unwinding it from value 3, we will get 3, 2, 5, 4, etc.

So whatever record of a loop we take as a START WITH condition, we get the same results in different order.

Since any record can be found in a loop at most once, the minimal value of the chain uniquely identifies the loop.

So this query:

SELECT  DISTINCT MIN(id)
FROM    (
SELECT  id, CONNECT_BY_ROOT(id) AS root
FROM    t_needs
START WITH
id IN
(
SELECT  id
FROM    t_needs n
WHERE   CONNECT_BY_ISCYCLE = 1
CONNECT BY NOCYCLE
needs = PRIOR id
)
CONNECT BY NOCYCLE
id = PRIOR needs
)
GROUP BY
root

MIN(ID)
13
341
16
3 rows fetched in 0.0002s (0.3280s)

gives us the minimal id's in each loop.

Now all that is left to do is to repeat the query but now use the minimal id's from each group in a START WITH condition:

SELECT  CONNECT_BY_ROOT(id), level
FROM    t_needs n
WHERE   CONNECT_BY_ISCYCLE = 1
START WITH
id IN
(
SELECT  MIN(id)
FROM    (
SELECT  id, CONNECT_BY_ROOT(id) AS root
FROM    t_needs
START WITH
id IN
(
SELECT  id
FROM    t_needs n
WHERE   CONNECT_BY_ISCYCLE = 1
CONNECT BY NOCYCLE
needs = PRIOR id
)
CONNECT BY NOCYCLE
id = PRIOR needs
)
GROUP BY
root
)
CONNECT BY NOCYCLE
id = PRIOR needs

CONNECT_BY_ROOT(ID)	LEVEL
13	54
16	8
341	3
3 rows fetched in 0.0002s (0.3296s)

That will give us the starting id and the number of records in each loop.

If we just want to browse the records (but this time knowing the loop id and the position in the loop for each record), we just get rid of WHERE condition:

SELECT  n.*, CONNECT_BY_ROOT(id), level
FROM    t_needs n
START WITH
id IN
(
SELECT  MIN(id)
FROM    (
SELECT  id, CONNECT_BY_ROOT(id) AS root
FROM    t_needs
START WITH
id IN
(
SELECT  id
FROM    t_needs n
WHERE   CONNECT_BY_ISCYCLE = 1
CONNECT BY NOCYCLE
needs = PRIOR id
)
CONNECT BY NOCYCLE
id = PRIOR needs
)
GROUP BY
root
)
CONNECT BY NOCYCLE
id = PRIOR needs

ID	NAME	NEEDS	CONNECT_BY_ROOT(ID)	LEVEL
13	User 13	81	13	1
81	User 81	421	13	2
421	User 421	531	13	3
531	User 531	154	13	4
154	User 154	957	13	5
957	User 957	110	13	6
110	User 110	586	13	7
586	User 586	375	13	8
375	User 375	146	13	9
146	User 146	465	13	10
465	User 465	591	13	11
591	User 591	337	13	12
337	User 337	172	13	13
172	User 172	40	13	14
40	User 40	797	13	15
797	User 797	60	13	16
60	User 60	974	13	17
974	User 974	615	13	18
615	User 615	93	13	19
93	User 93	496	13	20
496	User 496	262	13	21
262	User 262	795	13	22
795	User 795	336	13	23
336	User 336	576	13	24
576	User 576	851	13	25
851	User 851	89	13	26
89	User 89	79	13	27
79	User 79	628	13	28
628	User 628	125	13	29
125	User 125	419	13	30
419	User 419	32	13	31
32	User 32	728	13	32
728	User 728	130	13	33
130	User 130	718	13	34
718	User 718	342	13	35
342	User 342	686	13	36
686	User 686	833	13	37
833	User 833	610	13	38
610	User 610	567	13	39
567	User 567	807	13	40
807	User 807	915	13	41
915	User 915	270	13	42
270	User 270	604	13	43
604	User 604	338	13	44
338	User 338	726	13	45
726	User 726	515	13	46
515	User 515	445	13	47
445	User 445	354	13	48
354	User 354	187	13	49
187	User 187	410	13	50
410	User 410	609	13	51
609	User 609	987	13	52
987	User 987	268	13	53
268	User 268	13	13	54
16	User 16	158	16	1
158	User 158	799	16	2
799	User 799	748	16	3
748	User 748	884	16	4
884	User 884	629	16	5
629	User 629	472	16	6
472	User 472	222	16	7
222	User 222	16	16	8
341	User 341	461	341	1
461	User 461	513	341	2
513	User 513	341	341	3
65 rows fetched in 0.0031s (0.3330s)

Now we have the loops completely identified.

Written by Quassnoi

May 18th, 2009 at 11:00 pm

Posted in Oracle

4 Responses to 'Finding loops in a tree'

Subscribe to comments with RSS

Helped me a lot :)

Still i am having performance issue, connect queires are really slow, do you have any materials regarding Optimization ?

MS

15 Jan 14 at 19:22
@MS: yes, my whole blog is about optimization.

Quassnoi

16 Jan 14 at 11:05
Hi!

I have a structure similar to yours. The number of records in which I’m looking for loops is 457. The relevant columns are:

tree_instance_id | node_id | input | actiondata1

The connection is between the node_id and actiondata1 columns: actiondata1 column contains a node_id. One node can be connected to several others through different inputs, e.g:

tree_instance_id | node_id | input | actiondata1
————————————————-
150006 | 150001 | input1 | 150002
150006 | 150001 | input2 | 150003
150006 | 150002 | input1 | 150004

I tried your query, however it took extremely long, I cancelled it after 20 minutes. I tried the following query first:

SELECT node_id
FROM inputhandler
WHERE CONNECT_BY_ISCYCLE = 1
AND tree_instance_id = 150006
CONNECT BY NOCYCLE actiondata1 = PRIOR node_id;

And it kept returning the same node_ids over and over, several thousand times. I added a DISTINCT to the node_id, and then it returned 94 node_ids, but it took almost 5 minutes. I inserted those node_ids into your second query, so that the inner query is replaced by static values, like this:

SELECT DISTINCT MIN(node_id)
FROM (
SELECT node_id, CONNECT_BY_ROOT(node_id) AS root
FROM inputhandler
START WITH node_id IN (150001, 150149, 150020, …)
CONNECT BY NOCYCLE node_id = PRIOR actiondata1
)
GROUP BY root;

It’s running for close to 30 minutes now, and no result yet.

My understanding is that the DISTINCT makes it slow, but without that it returns the same node_ids several times. What could be wrong here?

MG

7 Nov 16 at 13:36
It’s me again. I just created the same table as in this post, and filled it. 1000 rows, first query took 0.09 seconds, and returned 55 rows. Second query took 77 seconds, and returned 5 rows. Third query is running for 10 minutes now, and no result yet. It’s strange, because the post says that queries took only a fraction of a second.

I managed to make it faster by using the same idea as with my own queries: executing them one-by-one, and using the results as static “input” for the other queries. So, for instance, this query only took 0.026 seconds:

SELECT n.*, CONNECT_BY_ROOT(id), level
FROM t_needs n
START WITH id IN (
83, 14, 36, 7, 169
)
CONNECT BY NOCYCLE id = PRIOR needs;

But if I run the query as it is in your post, with all the sub-queries included, it takes forever to execute. Why could that be?

MG

7 Nov 16 at 15:06

EXPLAIN EXTENDED