# EXPLAIN EXTENDED

How to create fast database queries

## Calculating mode

From Stack Overflow:

I have this query:

```WITH    CTE AS (
SELECT  e_id,
scale,
ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn,
COUNT(scale) OVER(PARTITION BY e_id) AS cn
FROM    ScoreMaster
WHERE   scale IS NOT NULL
)
SELECT  e_id,
AVG(scale) AS [AVG],
STDEV(scale) AS [StdDev],
AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian]
MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian]
FROM    CTE
GROUP BY
e_id
```

How do I add Mode to this query?

A quick reminder: in statistics, mode is the value that occurs most frequently in a data set.

In other words, for each `e_id`, mode is the (exact) value of `scale` shared by most records with this `e_id`.

Unlike other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, 10 records have `scale = 1` and 10 other records have `scale = 2` (and all other values of `scale` are shared by less than 10 records), then there are two modes in this set (and the set, hence, is called bimodal). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.

This means that we should define a way on how to choose this mode.

There can be three approaches to this:

1. Return every modal value
2. Return a single modal value
3. Return an aggregate of all modal values

To check all queries, we will generate a simple trimodal dataset:

Table creation details

The basic idea is simple: we should find a value that is held by the maximum number or records. To do this, we need to calculate the number of records sharing a given value of `scale`. We do this by adding a `COUNT` as an analytical function into the CTE:

```WITH    cte AS
(
SELECT  e_id,
scale,
ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
COUNT(scale) OVER (PARTITION BY e_id) AS cn,
COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
FROM    [20100118_mode].ScoreMaster
WHERE   scale IS NOT NULL
)
```

### Single value

In this case, we just pick a minimal modal value.

To do this, we group the results returned by `e_id` (just like the original query does) and select the `TOP 1 scale` ordered by `sn DESC, scale` in a subquery:

```WITH    cte AS
(
SELECT  e_id,
scale,
ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
COUNT(scale) OVER (PARTITION BY e_id) AS cn,
COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
FROM    [20100118_mode].ScoreMaster
WHERE   scale IS NOT NULL
)
SELECT  e_id,
(
SELECT  TOP 1 scale
FROM    cte ci
WHERE   ci.e_id = co.e_id
ORDER BY
sn DESC, scale
)
FROM    cte co
GROUP BY
e_id
```
e_id
1 .2000

To select the maximal modal value instead of the minimal one, we would need to add a `DESC` to the `score` in the `ORDER BY` clause.

By picking a right final part of the `ORDER BY` clause, any other value can be returned.

However, the first part of the `ORDER BY` (`sn DESC`) should always remain the same, since it's what makes a modal value to be selected first.

### Average of all modal values

To select the average of all modal values, we need to calculate it in the subquery as well.

We can use a very interesting predicate here:

```WITH    cte AS
(
SELECT  e_id,
scale,
ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
COUNT(scale) OVER (PARTITION BY e_id) AS cn,
COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
FROM    [20100118_mode].ScoreMaster
WHERE   scale IS NOT NULL
)
SELECT  e_id,
(
SELECT  AVG(scale)
FROM    cte ci
WHERE   ci.e_id = co.e_id
AND ci.scale = MAX(co.scale)
)
FROM    cte co
GROUP BY
e_id
```
e_id
1 .500000

Note this line in the `WHERE` clause:

```                AND ci.scale = MAX(co.scale)
```

This is a naive approach most beginner developers try to use to select records holding a maximal value of the column.

This of course never works for this purpose (because `WHERE` clause is evaluated before the aggregation). In our case, though, we use `MAX` as a reference value in the `SELECT`-level correlated subquery, with the aggregation already performed. So the value of the `MAX` (taken from results of the other query) can be used in the `WHERE` clause alright.

`AVG` in the subquery can be replaced by any other aggregate.

If replaced by `MIN` or `MAX`, this aggregate variant becomes a synonym for the first query (which selects a single value). However, the first approach is more efficient for that.

### Returning all modes

To return all modal values, we should use `CROSS APPLY` clause, and return all `DISTINCT` scales held by the maximal number of records.

Since it's performed before `GROUP BY`, we will need to wrap the original query into the subquery (so that `GROUP BY` is performed first). This way, we can also calculate the `MAX(sn)` to use it in the `CROSS APPLY` later:

```WITH    cte AS
(
SELECT  e_id,
scale,
ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
COUNT(scale) OVER (PARTITION BY e_id) AS cn,
COUNT(*) OVER (PARTITION BY e_id, scale) AS sn
FROM    [20100118_mode].ScoreMaster
WHERE   scale IS NOT NULL
)
SELECT  e_id, scale
FROM    (
SELECT  e_id, MAX(sn) AS msn
FROM    cte
GROUP BY
e_id
) co
CROSS APPLY
(
SELECT  DISTINCT scale
FROM    cte ci
WHERE   ci.e_id = co.e_id
AND ci.sn = co.msn
) ci

```
e_id scale
1 .2000
1 .3000
1 .4000

This query, unlike previous two, returns several records per `e_id`, each holding a modal value of `scale`.

Written by Quassnoi

January 18th, 2010 at 11:00 pm

Posted in SQL Server