Calculating mode
Comments enabled. I *really* need your comment
From Stack Overflow:
I have this query:
WITH CTE AS ( SELECT e_id, scale, ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn, COUNT(scale) OVER(PARTITION BY e_id) AS cn FROM ScoreMaster WHERE scale IS NOT NULL ) SELECT e_id, AVG(scale) AS [AVG], STDEV(scale) AS [StdDev], AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian] MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian] FROM CTE GROUP BY e_idHow do I add Mode to this query?
A quick reminder: in statistics, mode is the value that occurs most frequently in a data set.
In other words, for each e_id, mode is the (exact) value of scale shared by most records with this e_id.
Unlike other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, 10 records have scale = 1 and 10 other records have scale = 2 (and all other values of scale are shared by less than 10 records), then there are two modes in this set (and the set, hence, is called bimodal). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.
This means that we should define a way on how to choose this mode.
There can be three approaches to this:
- Return every modal value
- Return a single modal value
- Return an aggregate of all modal values
To check all queries, we will generate a simple trimodal dataset:
Table creation details
The basic idea is simple: we should find a value that is held by the maximum number or records. To do this, we need to calculate the number of records sharing a given value of scale. We do this by adding a COUNT as an analytical function into the CTE:
WITH cte AS
(
SELECT e_id,
scale,
ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
COUNT(scale) OVER (PARTITION BY e_id) AS cn,
COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
FROM [20100118_mode].ScoreMaster
WHERE scale IS NOT NULL
)
Single value
In this case, we just pick a minimal modal value.
To do this, we group the results returned by e_id (just like the original query does) and select the TOP 1 scale ordered by sn DESC, scale in a subquery:
WITH cte AS
(
SELECT e_id,
scale,
ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
COUNT(scale) OVER (PARTITION BY e_id) AS cn,
COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
FROM [20100118_mode].ScoreMaster
WHERE scale IS NOT NULL
)
SELECT e_id,
(
SELECT TOP 1 scale
FROM cte ci
WHERE ci.e_id = co.e_id
ORDER BY
sn DESC, scale
)
FROM cte co
GROUP BY
e_id
| e_id | |
|---|---|
| 1 | .2000 |
To select the maximal modal value instead of the minimal one, we would need to add a DESC to the score in the ORDER BY clause.
By picking a right final part of the ORDER BY clause, any other value can be returned.
However, the first part of the ORDER BY (sn DESC) should always remain the same, since it's what makes a modal value to be selected first.
Average of all modal values
To select the average of all modal values, we need to calculate it in the subquery as well.
We can use a very interesting predicate here:
WITH cte AS
(
SELECT e_id,
scale,
ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
COUNT(scale) OVER (PARTITION BY e_id) AS cn,
COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
FROM [20100118_mode].ScoreMaster
WHERE scale IS NOT NULL
)
SELECT e_id,
(
SELECT AVG(scale)
FROM cte ci
WHERE ci.e_id = co.e_id
AND ci.scale = MAX(co.scale)
)
FROM cte co
GROUP BY
e_id
| e_id | |
|---|---|
| 1 | .500000 |
Note this line in the WHERE clause:
AND ci.scale = MAX(co.scale)
This is a naive approach most beginner developers try to use to select records holding a maximal value of the column.
This of course never works for this purpose (because WHERE clause is evaluated before the aggregation). In our case, though, we use MAX as a reference value in the SELECT-level correlated subquery, with the aggregation already performed. So the value of the MAX (taken from results of the other query) can be used in the WHERE clause alright.
AVG in the subquery can be replaced by any other aggregate.
If replaced by MIN or MAX, this aggregate variant becomes a synonym for the first query (which selects a single value). However, the first approach is more efficient for that.
Returning all modes
To return all modal values, we should use CROSS APPLY clause, and return all DISTINCT scales held by the maximal number of records.
Since it's performed before GROUP BY, we will need to wrap the original query into the subquery (so that GROUP BY is performed first). This way, we can also calculate the MAX(sn) to use it in the CROSS APPLY later:
WITH cte AS
(
SELECT e_id,
scale,
ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
COUNT(scale) OVER (PARTITION BY e_id) AS cn,
COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
FROM [20100118_mode].ScoreMaster
WHERE scale IS NOT NULL
)
SELECT e_id, scale
FROM (
SELECT e_id, MAX(sn) AS msn
FROM cte
GROUP BY
e_id
) co
CROSS APPLY
(
SELECT DISTINCT scale
FROM cte ci
WHERE ci.e_id = co.e_id
AND ci.sn = co.msn
) ci
| e_id | scale |
|---|---|
| 1 | .2000 |
| 1 | .3000 |
| 1 | .4000 |
This query, unlike previous two, returns several records per e_id, each holding a modal value of scale.
Subscribe in a reader