EXPLAIN EXTENDED

How to create fast database queries

Calculating mode

Comments enabled. I *really* need your comment

From Stack Overflow:

I have this query:

WITH    CTE AS (
        SELECT  e_id, 
                scale, 
                ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER(PARTITION BY e_id) AS cn
        FROM    ScoreMaster
        WHERE   scale IS NOT NULL
        )
SELECT  e_id,
        AVG(scale) AS [AVG],
        STDEV(scale) AS [StdDev],
        AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian]
        MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian]
FROM    CTE
GROUP BY
        e_id

How do I add Mode to this query?

A quick reminder: in statistics, mode is the value that occurs most frequently in a data set.

In other words, for each e_id, mode is the (exact) value of scale shared by most records with this e_id.

Unlike other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, 10 records have scale = 1 and 10 other records have scale = 2 (and all other values of scale are shared by less than 10 records), then there are two modes in this set (and the set, hence, is called bimodal). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.

This means that we should define a way on how to choose this mode.

There can be three approaches to this:

  1. Return every modal value
  2. Return a single modal value
  3. Return an aggregate of all modal values

To check all queries, we will generate a simple trimodal dataset:

Table creation details

The basic idea is simple: we should find a value that is held by the maximum number or records. To do this, we need to calculate the number of records sharing a given value of scale. We do this by adding a COUNT as an analytical function into the CTE:

WITH    cte AS
        (
        SELECT  e_id, 
                scale, 
                ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER (PARTITION BY e_id) AS cn,
                COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
        FROM    [20100118_mode].ScoreMaster
        WHERE   scale IS NOT NULL
        )

Single value

In this case, we just pick a minimal modal value.

To do this, we group the results returned by e_id (just like the original query does) and select the TOP 1 scale ordered by sn DESC, scale in a subquery:

WITH    cte AS
        (
        SELECT  e_id, 
                scale, 
                ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER (PARTITION BY e_id) AS cn,
                COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
        FROM    [20100118_mode].ScoreMaster
        WHERE   scale IS NOT NULL
        )
SELECT  e_id,
        (
        SELECT  TOP 1 scale
        FROM    cte ci
        WHERE   ci.e_id = co.e_id
        ORDER BY
                sn DESC, scale
        )
FROM    cte co
GROUP BY
        e_id
e_id
1 .2000

To select the maximal modal value instead of the minimal one, we would need to add a DESC to the score in the ORDER BY clause.

By picking a right final part of the ORDER BY clause, any other value can be returned.

However, the first part of the ORDER BY (sn DESC) should always remain the same, since it's what makes a modal value to be selected first.

Average of all modal values

To select the average of all modal values, we need to calculate it in the subquery as well.

We can use a very interesting predicate here:

WITH    cte AS
        (
        SELECT  e_id, 
                scale, 
                ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER (PARTITION BY e_id) AS cn,
                COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
        FROM    [20100118_mode].ScoreMaster
        WHERE   scale IS NOT NULL
        )
SELECT  e_id,
        (
        SELECT  AVG(scale)
        FROM    cte ci
        WHERE   ci.e_id = co.e_id
                AND ci.scale = MAX(co.scale)
        )
FROM    cte co
GROUP BY
        e_id
e_id
1 .500000

Note this line in the WHERE clause:

                AND ci.scale = MAX(co.scale)

This is a naive approach most beginner developers try to use to select records holding a maximal value of the column.

This of course never works for this purpose (because WHERE clause is evaluated before the aggregation). In our case, though, we use MAX as a reference value in the SELECT-level correlated subquery, with the aggregation already performed. So the value of the MAX (taken from results of the other query) can be used in the WHERE clause alright.

AVG in the subquery can be replaced by any other aggregate.

If replaced by MIN or MAX, this aggregate variant becomes a synonym for the first query (which selects a single value). However, the first approach is more efficient for that.

Returning all modes

To return all modal values, we should use CROSS APPLY clause, and return all DISTINCT scales held by the maximal number of records.

Since it's performed before GROUP BY, we will need to wrap the original query into the subquery (so that GROUP BY is performed first). This way, we can also calculate the MAX(sn) to use it in the CROSS APPLY later:

WITH    cte AS
        (
        SELECT  e_id, 
                scale, 
                ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER (PARTITION BY e_id) AS cn,
                COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
        FROM    [20100118_mode].ScoreMaster
        WHERE   scale IS NOT NULL
        )
SELECT  e_id, scale
FROM    (
        SELECT  e_id, MAX(sn) AS msn
        FROM    cte
        GROUP BY
                e_id
        ) co
CROSS APPLY
        (
        SELECT  DISTINCT scale
        FROM    cte ci
        WHERE   ci.e_id = co.e_id
                AND ci.sn = co.msn
        ) ci

e_id scale
1 .2000
1 .3000
1 .4000

This query, unlike previous two, returns several records per e_id, each holding a modal value of scale.

Written by Quassnoi

January 18th, 2010 at 11:00 pm

Posted in SQL Server

Leave a Reply