EXPLAIN EXTENDED

How to create fast database queries

Archive for January 18th, 2010

Calculating mode

Comments enabled. I *really* need your comment

From Stack Overflow:

I have this query:

WITH    CTE AS (
        SELECT  e_id, 
                scale, 
                ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER(PARTITION BY e_id) AS cn
        FROM    ScoreMaster
        WHERE   scale IS NOT NULL
        )
SELECT  e_id,
        AVG(scale) AS [AVG],
        STDEV(scale) AS [StdDev],
        AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian]
        MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian]
FROM    CTE
GROUP BY
        e_id

How do I add Mode to this query?

A quick reminder: in statistics, mode is the value that occurs most frequently in a data set.

In other words, for each e_id, mode is the (exact) value of scale shared by most records with this e_id.

Unlike other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, 10 records have scale = 1 and 10 other records have scale = 2 (and all other values of scale are shared by less than 10 records), then there are two modes in this set (and the set, hence, is called bimodal). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.

This means that we should define a way on how to choose this mode.

There can be three approaches to this:

  1. Return every modal value
  2. Return a single modal value
  3. Return an aggregate of all modal values

To check all queries, we will generate a simple trimodal dataset:
Read the rest of this entry »

Written by Quassnoi

January 18th, 2010 at 11:00 pm

Posted in SQL Server