EXPLAIN EXTENDED

How to create fast database queries

Finding loops in a tree

with 4 comments

From Stack Overflow:

I have a view which matches up the needs and the wants, and discards the requests that cannot be fulfilled by any user, that looks like this:

Item Need Have
1 Bob George
2 George Herbie
3 Herbie Bob
4 Larry Wally
5 Wally Larry
6 John George

I am now trying to write a query where I can calculate the number of permutations of each user having their requested needs fulfilled.

For instance, in the example above, Bob, George, and Herbie can together fulfill each other's needs, as can Larry and Wally, but John cannot, so the total count would be 2.

This task is actually about finding loops.

Whenever there is a set of users that fulfill each other's needs, there is a loop in the tree.

We can assume that either of the columns (need or have) is unique so there there is always a single path in the corresponding direction which ends with at most one loop.

It's not stated in the task, but otherwise it would be an equivalent of Clique problem which is NP complete and hardly solvable with SQL.

Now, let's create a sample table:

CREATE TABLE t_needs (
id NUMBER(10) NOT NULL,
name VARCHAR2(50) NOT NULL,
needs NUMBER(10) NOT NULL
)
/

ALTER TABLE t_needs ADD CONSTRAINT pk_needs_id PRIMARY KEY (id)
/

CREATE INDEX ix_needs_needs ON t_needs (needs)
/

BEGIN
DBMS_RANDOM.seed(20090518);
END;
/

INSERT
INTO    t_needs (id, name, needs)
SELECT  level, 'User ' || level, DBMS_RANDOM.value() * 999 + 1
FROM    dual
CONNECT BY
level <= 1000
/

COMMIT
/

BEGIN
        DBMS_STATS.gather_schema_stats('"20090518_loops"');
END;
/

We have a tree table with 1,000 records and randomly filled needs (each needs corresponds to an existing user).

Now, we want to find all the loops in the table and count their lengths.

First, we should filter out the users that are not a part of any loop.

To do this, we will use Oracle's ability to detect loops in a path. We will select all paths possible (starting with each of given users) and select all records which are a part of a loop.

These records are very easy to detect: on the next step they would yield a records that had already been evaluated in the branch and therefore resides in the current recursion stack.

Oracle just examines the recursion stack on each step and sets a special condition if this step would return something that already had been returned.

Plain CONNECT BY query would be a little help to us, since any loop in this query would result in an immediate error.

But Oracle has a nice clause called CONNECT BY NOCYCLE, which does not throws an exception in case of a loop, but instead returns it in a special pseudocolumn called CONNECT_BY_ISCYCLE.

If we use it to filter the results of our query, we will get:

SELECT  id
FROM    "20090518_loops".t_needs n
WHERE   CONNECT_BY_ISCYCLE = 1
CONNECT BY NOCYCLE
needs = PRIOR id

ID
13
16
32
957
974
987
65 rows fetched in 0.0012s (0.3101s)

Now, we have a list 65 users that are self-sufficient within their loops, i. e. form a closed cycle of what they need and what they provide to each other.

But we still don't know which groups do they belong to, nor how many users are there in each group.

We can't use a simple filtering, since loops have neither beginning nor an end, and from any given row we cannot say which loop is it a part of, or how long the loop is: 2 records or 100.

But each loop, of course, has its minimal value. And we can consider this value to be a beginning of a loop. Since it's a loop we are talking about, any value is just as good as all other values.

To figure out the minimal id of each loop, we need to do the following:

  1. For each value that is a part of a loop, we should unwind the loop (traverse it in the opposite direction) and find all values that constitute the loop.
  2. For each loop, find the minimal id, and consider this minimal id the start of the loop.

This needs a little explanation. Imagine that values 2, 3, 4 and 5 form a loop (in this order).

If we start unwinding the loop from value 2, we will get the following results: 2, 5, 4, 3.

If we start unwinding it from value 3, we will get 3, 2, 5, 4, etc.

So whatever record of a loop we take as a START WITH condition, we get the same results in different order.

Since any record can be found in a loop at most once, the minimal value of the chain uniquely identifies the loop.

So this query:

SELECT  DISTINCT MIN(id)
FROM    (
SELECT  id, CONNECT_BY_ROOT(id) AS root
FROM    t_needs
START WITH
id IN
(
SELECT  id
FROM    t_needs n
WHERE   CONNECT_BY_ISCYCLE = 1
CONNECT BY NOCYCLE
needs = PRIOR id
)
CONNECT BY NOCYCLE
id = PRIOR needs
)
GROUP BY
root

MIN(ID)
13
341
16
3 rows fetched in 0.0002s (0.3280s)

gives us the minimal id's in each loop.

Now all that is left to do is to repeat the query but now use the minimal id's from each group in a START WITH condition:

SELECT  CONNECT_BY_ROOT(id), level
FROM    t_needs n
WHERE   CONNECT_BY_ISCYCLE = 1
START WITH
id IN
(
SELECT  MIN(id)
FROM    (
SELECT  id, CONNECT_BY_ROOT(id) AS root
FROM    t_needs
START WITH
id IN
(
SELECT  id
FROM    t_needs n
WHERE   CONNECT_BY_ISCYCLE = 1
CONNECT BY NOCYCLE
needs = PRIOR id
)
CONNECT BY NOCYCLE
id = PRIOR needs
)
GROUP BY
root
)
CONNECT BY NOCYCLE
id = PRIOR needs
CONNECT_BY_ROOT(ID) LEVEL
13 54
16 8
341 3
3 rows fetched in 0.0002s (0.3296s)

That will give us the starting id and the number of records in each loop.

If we just want to browse the records (but this time knowing the loop id and the position in the loop for each record), we just get rid of WHERE condition:

SELECT  n.*, CONNECT_BY_ROOT(id), level
FROM    t_needs n
START WITH
id IN
(
SELECT  MIN(id)
FROM    (
SELECT  id, CONNECT_BY_ROOT(id) AS root
FROM    t_needs
START WITH
id IN
(
SELECT  id
FROM    t_needs n
WHERE   CONNECT_BY_ISCYCLE = 1
CONNECT BY NOCYCLE
needs = PRIOR id
)
CONNECT BY NOCYCLE
id = PRIOR needs
)
GROUP BY
root
)
CONNECT BY NOCYCLE
id = PRIOR needs
ID NAME NEEDS CONNECT_BY_ROOT(ID) LEVEL
13 User 13 81 13 1
81 User 81 421 13 2
421 User 421 531 13 3
531 User 531 154 13 4
154 User 154 957 13 5
957 User 957 110 13 6
110 User 110 586 13 7
586 User 586 375 13 8
375 User 375 146 13 9
146 User 146 465 13 10
465 User 465 591 13 11
591 User 591 337 13 12
337 User 337 172 13 13
172 User 172 40 13 14
40 User 40 797 13 15
797 User 797 60 13 16
60 User 60 974 13 17
974 User 974 615 13 18
615 User 615 93 13 19
93 User 93 496 13 20
496 User 496 262 13 21
262 User 262 795 13 22
795 User 795 336 13 23
336 User 336 576 13 24
576 User 576 851 13 25
851 User 851 89 13 26
89 User 89 79 13 27
79 User 79 628 13 28
628 User 628 125 13 29
125 User 125 419 13 30
419 User 419 32 13 31
32 User 32 728 13 32
728 User 728 130 13 33
130 User 130 718 13 34
718 User 718 342 13 35
342 User 342 686 13 36
686 User 686 833 13 37
833 User 833 610 13 38
610 User 610 567 13 39
567 User 567 807 13 40
807 User 807 915 13 41
915 User 915 270 13 42
270 User 270 604 13 43
604 User 604 338 13 44
338 User 338 726 13 45
726 User 726 515 13 46
515 User 515 445 13 47
445 User 445 354 13 48
354 User 354 187 13 49
187 User 187 410 13 50
410 User 410 609 13 51
609 User 609 987 13 52
987 User 987 268 13 53
268 User 268 13 13 54
16 User 16 158 16 1
158 User 158 799 16 2
799 User 799 748 16 3
748 User 748 884 16 4
884 User 884 629 16 5
629 User 629 472 16 6
472 User 472 222 16 7
222 User 222 16 16 8
341 User 341 461 341 1
461 User 461 513 341 2
513 User 513 341 341 3
65 rows fetched in 0.0031s (0.3330s)

Now we have the loops completely identified.

Written by Quassnoi

May 18th, 2009 at 11:00 pm

Posted in Oracle

4 Responses to 'Finding loops in a tree'

Subscribe to comments with RSS

  1. Helped me a lot :)

    Still i am having performance issue, connect queires are really slow, do you have any materials regarding Optimization ?

    MS

    15 Jan 14 at 19:22

  2. @MS: yes, my whole blog is about optimization.

    Quassnoi

    16 Jan 14 at 11:05

  3. Hi!

    I have a structure similar to yours. The number of records in which I’m looking for loops is 457. The relevant columns are:

    tree_instance_id | node_id | input | actiondata1

    The connection is between the node_id and actiondata1 columns: actiondata1 column contains a node_id. One node can be connected to several others through different inputs, e.g:

    tree_instance_id | node_id | input | actiondata1
    ————————————————-
    150006 | 150001 | input1 | 150002
    150006 | 150001 | input2 | 150003
    150006 | 150002 | input1 | 150004

    I tried your query, however it took extremely long, I cancelled it after 20 minutes. I tried the following query first:

    SELECT node_id
    FROM inputhandler
    WHERE CONNECT_BY_ISCYCLE = 1
    AND tree_instance_id = 150006
    CONNECT BY NOCYCLE actiondata1 = PRIOR node_id;

    And it kept returning the same node_ids over and over, several thousand times. I added a DISTINCT to the node_id, and then it returned 94 node_ids, but it took almost 5 minutes. I inserted those node_ids into your second query, so that the inner query is replaced by static values, like this:

    SELECT DISTINCT MIN(node_id)
    FROM (
    SELECT node_id, CONNECT_BY_ROOT(node_id) AS root
    FROM inputhandler
    START WITH node_id IN (150001, 150149, 150020, …)
    CONNECT BY NOCYCLE node_id = PRIOR actiondata1
    )
    GROUP BY root;

    It’s running for close to 30 minutes now, and no result yet.

    My understanding is that the DISTINCT makes it slow, but without that it returns the same node_ids several times. What could be wrong here?

    MG

    7 Nov 16 at 13:36

  4. It’s me again. I just created the same table as in this post, and filled it. 1000 rows, first query took 0.09 seconds, and returned 55 rows. Second query took 77 seconds, and returned 5 rows. Third query is running for 10 minutes now, and no result yet. It’s strange, because the post says that queries took only a fraction of a second.

    I managed to make it faster by using the same idea as with my own queries: executing them one-by-one, and using the results as static “input” for the other queries. So, for instance, this query only took 0.026 seconds:

    SELECT n.*, CONNECT_BY_ROOT(id), level
    FROM t_needs n
    START WITH id IN (
    83, 14, 36, 7, 169
    )
    CONNECT BY NOCYCLE id = PRIOR needs;

    But if I run the query as it is in your post, with all the sub-queries included, it takes forever to execute. Why could that be?

    MG

    7 Nov 16 at 15:06

Leave a Reply