Archive for January 18th, 2010
Calculating mode
Comments enabled. I *really* need your comment
From Stack Overflow:
I have this query:
WITH CTE AS ( SELECT e_id, scale, ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn, COUNT(scale) OVER(PARTITION BY e_id) AS cn FROM ScoreMaster WHERE scale IS NOT NULL ) SELECT e_id, AVG(scale) AS [AVG], STDEV(scale) AS [StdDev], AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian] MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian] FROM CTE GROUP BY e_idHow do I add Mode to this query?
A quick reminder: in statistics, mode is the value that occurs most frequently in a data set.
In other words, for each e_id
, mode is the (exact) value of scale
shared by most records with this e_id
.
Unlike other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, 10 records have scale = 1
and 10 other records have scale = 2
(and all other values of scale
are shared by less than 10 records), then there are two modes in this set (and the set, hence, is called bimodal). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.
This means that we should define a way on how to choose this mode.
There can be three approaches to this:
- Return every modal value
- Return a single modal value
- Return an aggregate of all modal values
To check all queries, we will generate a simple trimodal dataset:
Read the rest of this entry »