Calculating mode
Comments enabled. I *really* need your comment
From Stack Overflow:
I have this query:
WITH CTE AS ( SELECT e_id, scale, ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn, COUNT(scale) OVER(PARTITION BY e_id) AS cn FROM ScoreMaster WHERE scale IS NOT NULL ) SELECT e_id, AVG(scale) AS [AVG], STDEV(scale) AS [StdDev], AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian] MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian] FROM CTE GROUP BY e_idHow do I add Mode to this query?
A quick reminder: in statistics, mode is the value that occurs most frequently in a data set.
In other words, for each e_id
, mode is the (exact) value of scale
shared by most records with this e_id
.
Unlike other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, 10 records have scale = 1
and 10 other records have scale = 2
(and all other values of scale
are shared by less than 10 records), then there are two modes in this set (and the set, hence, is called bimodal). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.
This means that we should define a way on how to choose this mode.
There can be three approaches to this:
- Return every modal value
- Return a single modal value
- Return an aggregate of all modal values
To check all queries, we will generate a simple trimodal dataset:
Table creation details
The basic idea is simple: we should find a value that is held by the maximum number or records. To do this, we need to calculate the number of records sharing a given value of scale
. We do this by adding a COUNT
as an analytical function into the CTE:
WITH cte AS ( SELECT e_id, scale, ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn, COUNT(scale) OVER (PARTITION BY e_id) AS cn, COUNT(*) OVER (PARTITION BY e_id, scale) AS sn FROM [20100118_mode].ScoreMaster WHERE scale IS NOT NULL )
Single value
In this case, we just pick a minimal modal value.
To do this, we group the results returned by e_id
(just like the original query does) and select the TOP 1 scale
ordered by sn DESC, scale
in a subquery:
WITH cte AS ( SELECT e_id, scale, ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn, COUNT(scale) OVER (PARTITION BY e_id) AS cn, COUNT(*) OVER (PARTITION BY e_id, scale) AS sn FROM [20100118_mode].ScoreMaster WHERE scale IS NOT NULL ) SELECT e_id, ( SELECT TOP 1 scale FROM cte ci WHERE ci.e_id = co.e_id ORDER BY sn DESC, scale ) FROM cte co GROUP BY e_id
e_id | |
---|---|
1 | .2000 |
To select the maximal modal value instead of the minimal one, we would need to add a DESC
to the score
in the ORDER BY
clause.
By picking a right final part of the ORDER BY
clause, any other value can be returned.
However, the first part of the ORDER BY
(sn DESC
) should always remain the same, since it's what makes a modal value to be selected first.
Average of all modal values
To select the average of all modal values, we need to calculate it in the subquery as well.
We can use a very interesting predicate here:
WITH cte AS ( SELECT e_id, scale, ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn, COUNT(scale) OVER (PARTITION BY e_id) AS cn, COUNT(*) OVER (PARTITION BY e_id, scale) AS sn FROM [20100118_mode].ScoreMaster WHERE scale IS NOT NULL ) SELECT e_id, ( SELECT AVG(scale) FROM cte ci WHERE ci.e_id = co.e_id AND ci.scale = MAX(co.scale) ) FROM cte co GROUP BY e_id
e_id | |
---|---|
1 | .500000 |
Note this line in the WHERE
clause:
AND ci.scale = MAX(co.scale)
This is a naive approach most beginner developers try to use to select records holding a maximal value of the column.
This of course never works for this purpose (because WHERE
clause is evaluated before the aggregation). In our case, though, we use MAX
as a reference value in the SELECT
-level correlated subquery, with the aggregation already performed. So the value of the MAX
(taken from results of the other query) can be used in the WHERE
clause alright.
AVG
in the subquery can be replaced by any other aggregate.
If replaced by MIN
or MAX
, this aggregate variant becomes a synonym for the first query (which selects a single value). However, the first approach is more efficient for that.
Returning all modes
To return all modal values, we should use CROSS APPLY
clause, and return all DISTINCT
scales held by the maximal number of records.
Since it's performed before GROUP BY
, we will need to wrap the original query into the subquery (so that GROUP BY
is performed first). This way, we can also calculate the MAX(sn)
to use it in the CROSS APPLY
later:
WITH cte AS ( SELECT e_id, scale, ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn, COUNT(scale) OVER (PARTITION BY e_id) AS cn, COUNT(*) OVER (PARTITION BY e_id, scale) AS sn FROM [20100118_mode].ScoreMaster WHERE scale IS NOT NULL ) SELECT e_id, scale FROM ( SELECT e_id, MAX(sn) AS msn FROM cte GROUP BY e_id ) co CROSS APPLY ( SELECT DISTINCT scale FROM cte ci WHERE ci.e_id = co.e_id AND ci.sn = co.msn ) ci
e_id | scale |
---|---|
1 | .2000 |
1 | .3000 |
1 | .4000 |
This query, unlike previous two, returns several records per e_id
, each holding a modal value of scale
.