Oracle: Selecting records holding group-wise maximum
It was quite a time since I posted my last article, and I have a bunch of unanswered questions that people asked me.
Sorry for not posting and answering for so long, had a few urgent things to do, and hope these breaks won't be so long in the future :)
Now, continuing the series on selecting records holding group-wise maximums:
How do I select the whole records, grouped on
grouper
and holding a group-wise maximum (or minimum) on other column?
Finally, Oracle.
We will create the very same table as we did for the previous systems:
Table creation details
There are 1,000,000 records with the following fields:
id
is thePRIMARY KEY
orderer
is filled with random values from 1 to 10glow
is a low cardinality grouping field (10 distinct values)ghigh
is a high cardinality grouping field (10,000 distinct values)stuffing
is an asterisk-filledVARCHAR(200)
column added to emulate payload of the actual tables
Analytic functions
Of course Oracle support analytic functions just like SQL Server and PostgreSQL (and actually, that's SQL Server and PostgreSQL that support analytics functions like Oracle).
To select records holding the group-wise maximums, we can apply the very same ROW_NUMBER
and DENSE_RANK
solutions we used for SQL Server and PostgreSQL before:
SELECT id, orderer, glow, ghigh FROM ( SELECT d.*, ROW_NUMBER() OVER (PARTITION BY glow ORDER BY id) AS rn FROM t_distinct d ) q WHERE rn = 1
SELECT COUNT(*) AS cnt, SUM(LENGTH(stuffing)) AS psum FROM ( SELECT d.*, DENSE_RANK() OVER (PARTITION BY glow ORDER BY orderer) AS dr FROM t_distinct d ) dd WHERE dr = 1
This takes a little more than 6 seconds.
CONNECT BY
and subqueries
Unfortunately in Oracle we cannot use the decisions we used to improve the query time in the other systems.
Oracle does not support loose index scan on composite indexes like MySQL.
It does support a similar solution, INDEX SKIP SCAN
, but can use it only for the queries that search for the exact value of the orderer
, not for MIN
or MAX
.
Oracle does not support CROSS APPLY
so we cannot apply a multirecord or a multicolumn query to each result of the rowset.
More than that: Oracle does not event support deep correlation of the subqueries. You cannot nest a reference to the outer table in the subquery, and using ROW_NUMBER
or ROWNUM
to filter the first result requires nesting. So making an analog of MySQL's LIMIT
in a subquery won't work either.
Oracle 10g (which is what I'm writing about here) also does not support recursive CTE's, so it's not possible to use this method too.
However, there still is a way to improve the query.
Instead of jumping over the index values to build the list of distinct groupers, we can build a whole list of the possible groupers and join it with the original table.
The possible
groupers are or course those between MIN
and MAX
.
To build such a list we can use a CONNECT BY
query over the dual
table, (which is a standard way to generate random rowsets in Oracle), and just take MIN(glow)
and MAX(glow)
as the start and stop conditions.
We then can find the MIN(orderer)
or MIN(id)
for each grouper from the generated list and join the table back on this value. This will use an efficient INDEX RANGE SCAN (MIN / MAX)
. There can be gaps: not all groupers will be present in the list, but this is not a problem, since the subquery will return a NULL
and the join will just yield no rows.
Here's the query to find the records holding the MIN(id)
:
WITH q AS ( SELECT level - 1 + ( SELECT MIN(glow) FROM t_distinct ) AS d FROM dual CONNECT BY level <= ( SELECT MAX(glow) FROM t_distinct ) - ( SELECT MIN(glow) FROM t_distinct ) + 1 ) SELECT d.id, d.orderer, d.glow, d.glow FROM q JOIN t_distinct d ON d.id = ( SELECT MAX(id) FROM t_distinct di WHERE di.glow = q.d )
ID | ORDERER | GLOW | GLOW | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
999991 | 1 | 1 | 1 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999992 | 6 | 2 | 2 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999993 | 8 | 3 | 3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999994 | 5 | 4 | 4 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999995 | 10 | 5 | 5 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999996 | 9 | 6 | 6 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999997 | 4 | 7 | 7 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999998 | 3 | 8 | 8 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999999 | 1 | 9 | 9 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1000000 | 4 | 10 | 10 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10 rows fetched in 0.0005s (0.0009s) |
SELECT STATEMENT NESTED LOOPS VIEW CONNECT BY WITHOUT FILTERING FAST DUAL TABLE ACCESS BY INDEX ROWID, 20091216_groupwise.T_DISTINCT INDEX UNIQUE SCAN, 20091216_groupwise.PK_DISTINCT_ID SORT AGGREGATE FIRST ROW INDEX RANGE SCAN (MIN/MAX), 20091216_groupwise.IX_DISTINCT_GLOW_ID
, and this is a query to find the records holding MAX(id)
within the MIN(orderer)
:
WITH q AS ( SELECT level - 1 + ( SELECT MIN(glow) FROM t_distinct ) AS d FROM dual CONNECT BY level <= ( SELECT MAX(glow) FROM t_distinct ) - ( SELECT MIN(glow) FROM t_distinct ) + 1 ) SELECT d.id, d.orderer, d.glow, d.glow FROM ( SELECT d, ( SELECT MIN(orderer) FROM t_distinct d1 WHERE d1.glow = q.d ) AS mo FROM q ) q2 JOIN t_distinct d ON d.id = ( SELECT MAX(id) FROM t_distinct d2 WHERE d2.glow = q2.d AND d2.orderer = q2.mo )
ID | ORDERER | GLOW | GLOW | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
999991 | 1 | 1 | 1 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999852 | 1 | 2 | 2 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999543 | 1 | 3 | 3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999974 | 1 | 4 | 4 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999825 | 1 | 5 | 5 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999816 | 1 | 6 | 6 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999707 | 1 | 7 | 7 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999818 | 1 | 8 | 8 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999999 | 1 | 9 | 9 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
999960 | 1 | 10 | 10 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10 rows fetched in 0.0005s (0.0010s) |
SELECT STATEMENT NESTED LOOPS VIEW VIEW CONNECT BY WITHOUT FILTERING FAST DUAL TABLE ACCESS BY INDEX ROWID, 20091216_groupwise.T_DISTINCT INDEX UNIQUE SCAN, 20091216_groupwise.PK_DISTINCT_ID SORT AGGREGATE FIRST ROW INDEX RANGE SCAN (MIN/MAX), 20091216_groupwise.IX_DISTINCT_GLOW_ORDERER_ID
The latter query is a little bit more complex since we need two subqueries here: one to find the MIN(orderer)
, another one to find the MAX(id)
. The principle is the same, however.
Both subqueries are very fast and complete within 1 millisecond (read instantly).
This query depends not only on the selectivity of the grouper but also on the actual values of the bounds (the higher is the difference between MIN
and MAX
on the groupers, the less efficient the query is).
For this solution to work, the groupers should have a well-ordered datatype. This means you can build a list of all possible values between MIN
and MAX
. Integers and dates (without time portion) are well-ordered; real numbers, strings and timestamps are not.
For this solution to be efficient, the groupers should be well-ordered, dense and have low cardinality. Though these limitations can be quite strong, the solutions is still extremely useful for very many many real-world scenarios which groups by integers (like categories in the blogging system or department codes in an accounting application); or dates without time portion (like sales reports).
Excellent article!
There is a way to overcome the need to generate the range for grouping column.
Let’s say we add column glow_txt to t_distinct_table. Column is generated as following: ‘A’ || to_char(glow);
And let’s build another index:
CREATE INDEX ix_distinct_glow_txt_id ON t_distinct (glow_txt, id)
/
Now we can determine distinct grouping values with following query:
with T (glow_txt) as
(select /*+ distinct_glow_txt_minmax */
min (glow_txt) glow_txt from T_distinct
union all
select (select min (T1.glow_txt)
from T_distinct T1
where T1.glow_txt > T.glow_txt)
from T
where T.glow_txt is not null)
select *
from T
where t.glow_txt is not null;
First time this technique mentioned in http://orasql.org/2012/09/21/distinct-values-by-index-topn/
And we can use it in your example:
SELECT /*+ rows_with_min_id_for_every_glow_txt_minmax */
d.id, d.orderer, d.glow, d.ghigh
FROM (with T (glow_txt) as
(select
min (glow_txt) glow_txt from T_distinct
union all
select (select min (T1.glow_txt)
from T_distinct T1
where T1.glow_txt > T.glow_txt)
from T
where T.glow_txt is not null)
select *
from T
where t.glow_txt is not null) q
JOIN t_distinct d
ON d.id =
(
SELECT MIN(id)
FROM t_distinct di
WHERE di.glow_txt = q.glow_txt
)
;
Please note that all mentioned above works in 11gR2.
Yakov Vasylchenko
31 Oct 13 at 19:15