EXPLAIN EXTENDED

How to create fast database queries

Oracle: Selecting records holding group-wise maximum

with one comment

It was quite a time since I posted my last article, and I have a bunch of unanswered questions that people asked me.

Sorry for not posting and answering for so long, had a few urgent things to do, and hope these breaks won't be so long in the future :)

Now, continuing the series on selecting records holding group-wise maximums:

How do I select the whole records, grouped on grouper and holding a group-wise maximum (or minimum) on other column?

Finally, Oracle.

We will create the very same table as we did for the previous systems:

Table creation details

There are 1,000,000 records with the following fields:

  • id is the PRIMARY KEY
  • orderer is filled with random values from 1 to 10
  • glow is a low cardinality grouping field (10 distinct values)
  • ghigh is a high cardinality grouping field (10,000 distinct values)
  • stuffing is an asterisk-filled VARCHAR(200) column added to emulate payload of the actual tables

Analytic functions

Of course Oracle support analytic functions just like SQL Server and PostgreSQL (and actually, that's SQL Server and PostgreSQL that support analytics functions like Oracle).

To select records holding the group-wise maximums, we can apply the very same ROW_NUMBER and DENSE_RANK solutions we used for SQL Server and PostgreSQL before:

SELECT  id, orderer, glow, ghigh
FROM    (
SELECT  d.*, ROW_NUMBER() OVER (PARTITION BY glow ORDER BY id) AS rn
FROM    t_distinct d
) q
WHERE   rn = 1

View query details

SELECT  COUNT(*) AS cnt, SUM(LENGTH(stuffing)) AS psum
FROM    (
SELECT  d.*, DENSE_RANK() OVER (PARTITION BY glow ORDER BY orderer) AS dr
FROM    t_distinct d
) dd
WHERE   dr = 1

View query details

This takes a little more than 6 seconds.

CONNECT BY and subqueries

Unfortunately in Oracle we cannot use the decisions we used to improve the query time in the other systems.

Oracle does not support loose index scan on composite indexes like MySQL.

It does support a similar solution, INDEX SKIP SCAN, but can use it only for the queries that search for the exact value of the orderer, not for MIN or MAX.

Oracle does not support CROSS APPLY so we cannot apply a multirecord or a multicolumn query to each result of the rowset.

More than that: Oracle does not event support deep correlation of the subqueries. You cannot nest a reference to the outer table in the subquery, and using ROW_NUMBER or ROWNUM to filter the first result requires nesting. So making an analog of MySQL's LIMIT in a subquery won't work either.

Oracle 10g (which is what I'm writing about here) also does not support recursive CTE's, so it's not possible to use this method too.

However, there still is a way to improve the query.

Instead of jumping over the index values to build the list of distinct groupers, we can build a whole list of the possible groupers and join it with the original table.

The possible groupers are or course those between MIN and MAX.

To build such a list we can use a CONNECT BY query over the dual table, (which is a standard way to generate random rowsets in Oracle), and just take MIN(glow) and MAX(glow) as the start and stop conditions.

We then can find the MIN(orderer) or MIN(id) for each grouper from the generated list and join the table back on this value. This will use an efficient INDEX RANGE SCAN (MIN / MAX). There can be gaps: not all groupers will be present in the list, but this is not a problem, since the subquery will return a NULL and the join will just yield no rows.

Here's the query to find the records holding the MIN(id):

WITH    q AS
(
SELECT  level - 1 +
(
SELECT  MIN(glow)
FROM    t_distinct
) AS d
FROM    dual
CONNECT BY
level <=
                (
                SELECT  MAX(glow)
                FROM    t_distinct
                ) - 
                (
                SELECT  MIN(glow)
                FROM    t_distinct
                ) + 1
        )
SELECT  d.id, d.orderer, d.glow, d.glow
FROM    q
JOIN    t_distinct d
ON      d.id =
        (
        SELECT  MAX(id)
        FROM    t_distinct di
        WHERE   di.glow = q.d
        )
ID ORDERER GLOW GLOW
999991 1 1 1
999992 6 2 2
999993 8 3 3
999994 5 4 4
999995 10 5 5
999996 9 6 6
999997 4 7 7
999998 3 8 8
999999 1 9 9
1000000 4 10 10
10 rows fetched in 0.0005s (0.0009s)
SELECT STATEMENT 
 NESTED LOOPS 
  VIEW 
   CONNECT BY WITHOUT FILTERING
    FAST DUAL 
  TABLE ACCESS BY INDEX ROWID, 20091216_groupwise.T_DISTINCT
   INDEX UNIQUE SCAN, 20091216_groupwise.PK_DISTINCT_ID
    SORT AGGREGATE
     FIRST ROW 
      INDEX RANGE SCAN (MIN/MAX), 20091216_groupwise.IX_DISTINCT_GLOW_ID

, and this is a query to find the records holding MAX(id) within the MIN(orderer):

WITH    q AS
(
SELECT  level - 1 +
(
SELECT  MIN(glow)
FROM    t_distinct
) AS d
FROM    dual
CONNECT BY
level <=
                (
                SELECT  MAX(glow)
                FROM    t_distinct
                ) - 
                (
                SELECT  MIN(glow)
                FROM    t_distinct
                ) + 1
        )
SELECT  d.id, d.orderer, d.glow, d.glow
FROM    (
        SELECT  d,
                (
                SELECT  MIN(orderer)
                FROM    t_distinct d1
                WHERE   d1.glow = q.d
                ) AS mo
        FROM    q
        ) q2
JOIN    t_distinct d
ON      d.id = 
        (
        SELECT  MAX(id)
        FROM    t_distinct d2
        WHERE   d2.glow = q2.d
                AND d2.orderer = q2.mo
        )
ID ORDERER GLOW GLOW
999991 1 1 1
999852 1 2 2
999543 1 3 3
999974 1 4 4
999825 1 5 5
999816 1 6 6
999707 1 7 7
999818 1 8 8
999999 1 9 9
999960 1 10 10
10 rows fetched in 0.0005s (0.0010s)
SELECT STATEMENT 
 NESTED LOOPS 
  VIEW 
   VIEW 
    CONNECT BY WITHOUT FILTERING
     FAST DUAL 
  TABLE ACCESS BY INDEX ROWID, 20091216_groupwise.T_DISTINCT
   INDEX UNIQUE SCAN, 20091216_groupwise.PK_DISTINCT_ID
    SORT AGGREGATE
     FIRST ROW 
      INDEX RANGE SCAN (MIN/MAX), 20091216_groupwise.IX_DISTINCT_GLOW_ORDERER_ID

The latter query is a little bit more complex since we need two subqueries here: one to find the MIN(orderer), another one to find the MAX(id). The principle is the same, however.

Both subqueries are very fast and complete within 1 millisecond (read instantly).

This query depends not only on the selectivity of the grouper but also on the actual values of the bounds (the higher is the difference between MIN and MAX on the groupers, the less efficient the query is).

For this solution to work, the groupers should have a well-ordered datatype. This means you can build a list of all possible values between MIN and MAX. Integers and dates (without time portion) are well-ordered; real numbers, strings and timestamps are not.

For this solution to be efficient, the groupers should be well-ordered, dense and have low cardinality. Though these limitations can be quite strong, the solutions is still extremely useful for very many many real-world scenarios which groups by integers (like categories in the blogging system or department codes in an accounting application); or dates without time portion (like sales reports).

Written by Quassnoi

December 16th, 2009 at 11:00 pm

Posted in Oracle

One Response to 'Oracle: Selecting records holding group-wise maximum'

Subscribe to comments with RSS

  1. Excellent article!

    There is a way to overcome the need to generate the range for grouping column.
    Let’s say we add column glow_txt to t_distinct_table. Column is generated as following: ‘A’ || to_char(glow);
    And let’s build another index:
    CREATE INDEX ix_distinct_glow_txt_id ON t_distinct (glow_txt, id)
    /
    Now we can determine distinct grouping values with following query:
    with T (glow_txt) as
    (select /*+ distinct_glow_txt_minmax */
    min (glow_txt) glow_txt from T_distinct
    union all
    select (select min (T1.glow_txt)
    from T_distinct T1
    where T1.glow_txt > T.glow_txt)
    from T
    where T.glow_txt is not null)
    select *
    from T
    where t.glow_txt is not null;
    First time this technique mentioned in http://orasql.org/2012/09/21/distinct-values-by-index-topn/
    And we can use it in your example:
    SELECT /*+ rows_with_min_id_for_every_glow_txt_minmax */
    d.id, d.orderer, d.glow, d.ghigh
    FROM (with T (glow_txt) as
    (select
    min (glow_txt) glow_txt from T_distinct
    union all
    select (select min (T1.glow_txt)
    from T_distinct T1
    where T1.glow_txt > T.glow_txt)
    from T
    where T.glow_txt is not null)
    select *
    from T
    where t.glow_txt is not null) q
    JOIN t_distinct d
    ON d.id =
    (
    SELECT MIN(id)
    FROM t_distinct di
    WHERE di.glow_txt = q.glow_txt
    )
    ;

    Please note that all mentioned above works in 11gR2.

    Yakov Vasylchenko

    31 Oct 13 at 19:15

Leave a Reply