It was quite a time since I posted my last article, and I have a bunch of unanswered questions that people asked me.
Sorry for not posting and answering for so long, had a few urgent things to do, and hope these breaks won’t be so long in the future :)
Now, continuing the series on selecting records holding group-wise maximums:
How do I select the whole records, grouped on
grouperand holding a group-wise maximum (or minimum) on other column?
We will create the very same table as we did for the previous systems:
Table creation details
CREATE TABLE t_distinct ( id NOT NULL, orderer NOT NULL, glow NOT NULL, ghigh NOT NULL, stuffing NOT NULL ) AS SELECT level, TRUNC(DBMS_RANDOM.value * 10) + 1, MOD(level - 1, 10) + 1, MOD(level - 1, 10000) + 1, CAST(RPAD('*', 200, '*') AS VARCHAR2(200)) FROM dual CONNECT BY level <= 1000000 / ALTER TABLE t_distinct ADD CONSTRAINT pk_distinct_id PRIMARY KEY (id) / CREATE INDEX ix_distinct_glow_id ON t_distinct (glow, id) / CREATE INDEX ix_distinct_ghigh_id ON t_distinct (ghigh, id) / CREATE INDEX ix_distinct_glow_orderer_id ON t_distinct (glow, orderer, id) / CREATE INDEX ix_distinct_ghigh_orderer_id ON t_distinct (ghigh, orderer, id) /
There are 1,000,000 records with the following fields:
ordereris filled with random values from 1 to 10
glowis a low cardinality grouping field (10 distinct values)
ghighis a high cardinality grouping field (10,000 distinct values)
stuffingis an asterisk-filled
VARCHAR(200)column added to emulate payload of the actual tables
Of course Oracle support analytic functions just like SQL Server and PostgreSQL (and actually, that’s SQL Server and PostgreSQL that support analytics functions like Oracle).
To select records holding the group-wise maximums, we can apply the very same
DENSE_RANK solutions we used for SQL Server and PostgreSQL before:
SELECT id, orderer, glow, ghigh FROM ( SELECT d.*, ROW_NUMBER() OVER (PARTITION BY glow ORDER BY id) AS rn FROM t_distinct d ) q WHERE rn = 1
|10 rows fetched in 0.0005s (6.9528s)|
SELECT STATEMENT VIEW WINDOW SORT PUSHED RANK TABLE ACCESS FULL, 20091216_groupwise.T_DISTINCT
SELECT COUNT(*) AS cnt, SUM(LENGTH(stuffing)) AS psum FROM ( SELECT d.*, DENSE_RANK() OVER (PARTITION BY glow ORDER BY orderer) AS dr FROM t_distinct d ) dd WHERE dr = 1
|1 row fetched in 0.0001s (6.7498s)|
SELECT STATEMENT SORT AGGREGATE VIEW WINDOW SORT PUSHED RANK TABLE ACCESS FULL, 20091216_groupwise.T_DISTINCT
This takes a little more than 6 seconds.
CONNECT BY and subqueries
Unfortunately in Oracle we cannot use the decisions we used to improve the query time in the other systems.
Oracle does not support loose index scan on composite indexes like MySQL.
It does support a similar solution,
INDEX SKIP SCAN, but can use it only for the queries that search for the exact value of the
orderer, not for
Oracle does not support
CROSS APPLY so we cannot apply a multirecord or a multicolumn query to each result of the rowset.
More than that: Oracle does not event support deep correlation of the subqueries. You cannot nest a reference to the outer table in the subquery, and using
ROWNUM to filter the first result requires nesting. So making an analog of MySQL‘s
LIMIT in a subquery won’t work either.
Oracle 10g (which is what I’m writing about here) also does not support recursive CTE‘s, so it’s not possible to use this method too.
However, there still is a way to improve the query.
Instead of jumping over the index values to build the list of distinct groupers, we can build a whole list of the possible groupers and join it with the original table.
possible groupers are or course those between
To build such a list we can use a
CONNECT BY query over the
dual table, (which is a standard way to generate random rowsets in Oracle), and just take
MAX(glow) as the start and stop conditions.
We then can find the
MIN(id) for each grouper from the generated list and join the table back on this value. This will use an efficient
INDEX RANGE SCAN (MIN / MAX). There can be gaps: not all groupers will be present in the list, but this is not a problem, since the subquery will return a
NULL and the join will just yield no rows.
Here’s the query to find the records holding the
WITH q AS ( SELECT level - 1 + ( SELECT MIN(glow) FROM t_distinct ) AS d FROM dual CONNECT BY level <= ( SELECT MAX(glow) FROM t_distinct ) - ( SELECT MIN(glow) FROM t_distinct ) + 1 ) SELECT d.id, d.orderer, d.glow, d.glow FROM q JOIN t_distinct d ON d.id = ( SELECT MAX(id) FROM t_distinct di WHERE di.glow = q.d )
|10 rows fetched in 0.0005s (0.0009s)|
SELECT STATEMENT NESTED LOOPS VIEW CONNECT BY WITHOUT FILTERING FAST DUAL TABLE ACCESS BY INDEX ROWID, 20091216_groupwise.T_DISTINCT INDEX UNIQUE SCAN, 20091216_groupwise.PK_DISTINCT_ID SORT AGGREGATE FIRST ROW INDEX RANGE SCAN (MIN/MAX), 20091216_groupwise.IX_DISTINCT_GLOW_ID
, and this is a query to find the records holding
MAX(id) within the
WITH q AS ( SELECT level - 1 + ( SELECT MIN(glow) FROM t_distinct ) AS d FROM dual CONNECT BY level <= ( SELECT MAX(glow) FROM t_distinct ) - ( SELECT MIN(glow) FROM t_distinct ) + 1 ) SELECT d.id, d.orderer, d.glow, d.glow FROM ( SELECT d, ( SELECT MIN(orderer) FROM t_distinct d1 WHERE d1.glow = q.d ) AS mo FROM q ) q2 JOIN t_distinct d ON d.id = ( SELECT MAX(id) FROM t_distinct d2 WHERE d2.glow = q2.d AND d2.orderer = q2.mo )
|10 rows fetched in 0.0005s (0.0010s)|
SELECT STATEMENT NESTED LOOPS VIEW VIEW CONNECT BY WITHOUT FILTERING FAST DUAL TABLE ACCESS BY INDEX ROWID, 20091216_groupwise.T_DISTINCT INDEX UNIQUE SCAN, 20091216_groupwise.PK_DISTINCT_ID SORT AGGREGATE FIRST ROW INDEX RANGE SCAN (MIN/MAX), 20091216_groupwise.IX_DISTINCT_GLOW_ORDERER_ID
The latter query is a little bit more complex since we need two subqueries here: one to find the
MIN(orderer), another one to find the
MAX(id). The principle is the same, however.
Both subqueries are very fast and complete within 1 millisecond (read instantly).
This query depends not only on the selectivity of the grouper but also on the actual values of the bounds (the higher is the difference between
MAX on the groupers, the less efficient the query is).
For this solution to work, the groupers should have a well-ordered datatype. This means you can build a list of all possible values between
MAX. Integers and dates (without time portion) are well-ordered; real numbers, strings and timestamps are not.
For this solution to be efficient, the groupers should be well-ordered, dense and have low cardinality. Though these limitations can be quite strong, the solutions is still extremely useful for very many many real-world scenarios which groups by integers (like categories in the blogging system or department codes in an accounting application); or dates without time portion (like sales reports).