EXPLAIN EXTENDED

How to create fast database queries

Calculating mode

Comments enabled. I *really* need your comment

From Stack Overflow:

I have this query:

WITH    CTE AS (
        SELECT  e_id, 
                scale, 
                ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER(PARTITION BY e_id) AS cn
        FROM    ScoreMaster
        WHERE   scale IS NOT NULL
        )
SELECT  e_id,
        AVG(scale) AS [AVG],
        STDEV(scale) AS [StdDev],
        AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian]
        MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian]
FROM    CTE
GROUP BY
        e_id

How do I add Mode to this query?

A quick reminder: in statistics, mode is the value that occurs most frequently in a data set.

In other words, for each e_id, mode is the (exact) value of scale shared by most records with this e_id.

Unlike other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, 10 records have scale = 1 and 10 other records have scale = 2 (and all other values of scale are shared by less than 10 records), then there are two modes in this set (and the set, hence, is called bimodal). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.

This means that we should define a way on how to choose this mode.

There can be three approaches to this:

  1. Return every modal value
  2. Return a single modal value
  3. Return an aggregate of all modal values

To check all queries, we will generate a simple trimodal dataset:
Read the rest of this entry »

Written by Quassnoi

January 18th, 2010 at 11:00 pm

Posted in SQL Server

Latest visitors

Comments enabled. I *really* need your comment

From Stack Overflow:

Say you want to display the latest visitors on a users profile page.

How would you structure this?

Perhaps a table called uservisitors:

uservisitors
userid visitorid time

And how would you select this with MySQL without any duplicates?

What I mean is if user 1 visits user 2's profile, then 5 minutes later visits it again, I don't want to show both entries: only the latest.

There are two approaches to this.

First one would be just aggregating the visits, finding the max time and ordering on it. Something like this:

SELECT  visitorid, MAX(time) AS lastvisit
FROM    uservisitors
WHERE   userid = 1
GROUP BY
userid, visitorid
ORDER BY
lastvisit DESC
LIMIT 5

However, there is a little problem with this solution.

Despite the fact that MySQL (with proper indexing) uses INDEX FOR GROUP-BY optimization for this query, it will still have to sort on MAX(time) to find 5 latest records.

This will require sorting the whole resultset which will be huge if the service is heavily loaded.

Let's test it on a sample table:
Read the rest of this entry »

Written by Quassnoi

January 12th, 2010 at 11:00 pm

Posted in MySQL

Happy New Year!

with 5 comments

Oracle New Year tree

WITH    tree AS
(
SELECT  /*+ MATERIALIZE */
'  ,,,"''' AS needles, 'Oo%$&' AS decorations
FROM    dual
),
branches AS
(
SELECT  level AS id, level - TRUNC((level - 2) / 6) * 2 AS s
FROM    dual
CONNECT BY
level <= 24
        )
SELECT  RPAD(' ', 18, ' ') || '*' AS tree
FROM    dual
UNION ALL
SELECT  RPAD(' ', 18 - s, ' ') || '/' ||
        (
        SELECT  REPLACE(
                SYS_CONNECT_BY_PATH
                (
                CASE
                WHEN DBMS_RANDOM.value < 0.1 THEN
                        SUBSTR(decorations, TRUNC(DBMS_RANDOM.value * LENGTH(decorations)) + 1, 1)
                ELSE
                        SUBSTR(needles, TRUNC(DBMS_RANDOM.value * LENGTH(needles)) + 1, 1)
                END , '/'
                ),
                '/', ''
                )
        FROM    tree
        WHERE   CONNECT_BY_ISLEAF = 1
        CONNECT BY
                level <= s * 2 - 1
        )
        || '\'
FROM    branches
TREE
*
/,\
/,, \
/$," ,\
/, ',, \
/ ,'o'' ,\
/,, ,'",,,, \
/,',,$, '' ,,\
/,, ,'",,,, \
/",, ,,"' , ",\
/ $' ,, " ,' '\
/, , '"O,,'o"," "\
/,,,," , ',,',,',',\
/,',,"' ' ' ,,,,&"o,\
/,' ,,," ,',%" ,,,\
/,',,"' ' ' ,,,,&"o,\
/,$,,," ,,$'%",",,%,,''"\
/ ' ,, ," ,,& , ,, ,, \
/,",O' ,',, , ,$, "',",,,',\
/,," ",, ,,, ,',', " ,""',\
/ ,O ',,,,," o , '','', ,&,\
/,," ",, ,,, ,',', " ,""',\
/,"'',,""$"," ,$,%,' "" ,,,',",\
/, '" "O,,o' , ,'% $',,, ,""" , \
/, O' ,,, &,',,' , ',$, ,' ,,, ,,'\

SQL Server bottle of champagne

WITH    lines AS
(
SELECT  1 AS n
UNION ALL
SELECT  n + 1
FROM    lines
WHERE   n < 24
        ),
        bottle_parts (part, lft, body, rgt, w) AS
        (
        SELECT  'mouth', '.', '_', '.', 2
        UNION ALL
        SELECT  'neck', '|', ' ', '|', 2
        UNION ALL
        SELECT  'bell', '/', ' ', '\', NULL
        UNION ALL
        SELECT  'border', '/', '=', '\', NULL
        UNION ALL
        SELECT  'body', '|', ' ', '|', 7
        UNION ALL
        SELECT  'bottom', '(', '_', ')', 7
        ),
        bottle_lines AS
        (
        SELECT  CASE
                WHEN n = 1 THEN 'mouth'
                WHEN n BETWEEN 2 AND 8 THEN 'neck'
                WHEN n = 11 THEN 'border'
                WHEN n BETWEEN 9 AND 13 THEN 'bell'
                WHEN n BETWEEN 13 AND 23 THEN 'body'
                WHEN n = 24 THEN 'bottom'
                END AS part,
                n
        FROM    lines
        ),
        bottle_width AS
        (
        SELECT  bp.*, n, COALESCE(w, (n - 6)) AS width
        FROM    bottle_lines bl
        JOIN
                bottle_parts bp
        ON      bp.part = bl.part
        ),
        bottle AS
        (
        SELECT  n, width, REPLICATE(' ', 18 - width) + lft + REPLICATE(body, width * 2) + rgt AS line
        FROM    bottle_width
        ),
        bubbles (pos, m, bubble) AS
        (
        SELECT  CAST(ROUND(RAND() * 4, 0, 1) AS INTEGER), 1, ':' AS bubble
        UNION ALL
        SELECT  pos + ASCII(SUBSTRING(CAST(NEWID() AS VARCHAR(MAX)), 1, 1)) % 3 - 1, m + 1, SUBSTRING(':.oO', m / 3 + 1, 1)
        FROM    bubbles
        WHERE   m < 12
        ),
        limits AS
        (
        SELECT  *,
                18 +
                CASE
                WHEN pos < 1 - width THEN 1 - width
                WHEN pos > width - 1 THEN width - 1
ELSE pos
END AS rpos
FROM    bottle bo
LEFT HASH JOIN
bubbles bu
ON      bu.m = 24 - bo.n
)
SELECT  SUBSTRING(line, 1, COALESCE(rpos, 1) - 1) + COALESCE(bubble, ' ') + SUBSTRING(line, COALESCE(rpos, 1) + 1, 200) AS bottle
FROM    limits
ORDER BY
n
bottle
.____.
| |
| |
| |
| |
| |
| |
| |
/ \
/ \
/==========\
/ O \
/ O \
| O |
| o |
| o |
| o |
| . |
| . |
| . |
| : |
| : |
| : |
(______________)

PostgreSQL fireworks

WITH    centers AS
(
SELECT  angle,
len,
ROUND(len * SIN(2 * PI() * angle)) AS x,
ROUND(len * COS(2 * PI() * angle)) AS y,
ROUND(len * 0.3)::INTEGER + 1 AS trace
FROM    (
SELECT  RANDOM() AS angle,
8 * (1 - POWER(RANDOM(), 3)) AS len
FROM    generate_series (1, 50) s
) q
),
traces AS
(
SELECT  *,
generate_series(1, trace) AS part
FROM    centers
),
parts AS
(
SELECT  CASE
WHEN trace = part THEN
LEAST(len * 0.2, 2)::INTEGER
ELSE
TRUNC(angle * 8 - 0.5)::INTEGER % 4 + 3
END AS symbol,
TRUNC(x + part * SIN(2 * PI() * angle)) AS x,
TRUNC(y + part * COS(2 * PI() * angle)) AS y
FROM    traces
)
SELECT  ARRAY_TO_STRING(
ARRAY(
SELECT  COALESCE(
(
SELECT  SUBSTR(E'.xX\\-/|', MIN(symbol) + 1, 1)
FROM    parts
WHERE   x = col - 14
AND y = row - 12
), ' ')
FROM    generate_series(1, 25) col
), ''
) AS FIREWORKS
FROM    generate_series(1, 24) row

fireworks
X
| X
x| | X
x | | /
X\ \x | /x
X\ \ \ /
X \ \ x / /xx
\\ x |. ///
X-- \ |
xxx-- .
Xx-- ..
xxx-/ --x
. \\ --X
. .
/ \
X/ | \ \x
/ |\ \\
// |x\ \ \x
x/ |\x \ X
X x\x X
X

MySQL postcard

SELECT  CONCAT(COALESCE(border, '|'), RPAD(COALESCE(postcard, ''), 61, COALESCE(filler, ' ')), COALESCE(border, '|')) AS postcard
FROM    (
SELECT  id2 * 5 + id1 AS line
FROM    (
SELECT 1 AS id1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
) q1
CROSS JOIN
(
SELECT 0 AS id2 UNION ALL SELECT 1
) q2
) dummy
LEFT JOIN
(
SELECT  3 AS line, CONCAT(' Dear ', USER(), ',') AS postcard
UNION ALL
SELECT  4, ' I wish you good luck in the New Year.'
UNION ALL
SELECT  5, ' May all your plans be optimal, all your queries be answered'
UNION ALL
SELECT  6, ' and all your estimations be correct.'
UNION ALL
SELECT  8, '               Yours sincerely,'
UNION ALL
SELECT  9, '                                                    Quassnoi'
) ln
ON      ln.line = dummy.line
LEFT JOIN
(
SELECT 1 AS line, '.' AS border, '_' AS filler
UNION ALL
SELECT 10 AS line, '.' AS border, '_' AS filler
) borders
ON      borders.line = dummy.line
ORDER BY
dummy.line
postcard
._____________________________________________________________.
| |
| Dear dbuser@localhost, |
| I wish you good luck in the New Year. |
| May all your plans be optimal, all your queries be answered |
| and all your estimations be correct. |
| |
| Yours sincerely, |
| Quassnoi |
._____________________________________________________________.
Happy New Year!

Written by Quassnoi

December 31st, 2009 at 11:00 pm

Posted in Miscellaneous

Grouping continuous ranges

with one comment

From Stack Overflow:

I have a ticketing system.

Now I have to select adjacent places when the user asks for 2 or 3 tickets.

Every ticket has a line and column number. The concept of adjacent places is places in the same line with adjacent columns numbers.

These tickets are in an SQL Server database.

Any ideas about this algorithm to search for available adjacent seats?

This is a problem known as grouping continuous ranges: finding the continuous ranges of the records (in a certain order) having the same value of the grouping column.

For the table in question, the groups will look like this:

Seats
Row Column Occupied Group
1 1 1 -
1 2 0 1
1 3 0 1
1 4 0 1
1 5 1 -
1 6 1 -
1 7 0 2
1 8 0 2
1 9 1 -
1 10 0 3

To find the spans of free seats with the required length, we need to group the continuous ranges of rows with occupied = 0 in column order (within the rows), calculate the counts of records in these groups and return the groups having sufficient value of COUNT(*).

To group something in SQL, we need to have an expression that would have the same value for all records belonging to the group.

How do we build such an expression for the continuous ranges?

In a movie theater, seats are numbered sequentially, so we can rely on the values of the column being continuous. Now, let's calculate only the records that describe the free seats, and build their row numbers in the column order:

Seats
Row Column Occupied ROW_NUMBER
1 2 0 1
1 3 0 2
1 4 0 3
1 7 0 4
1 8 0 5
1 10 0 6

We see that filtering only free seats broke the column numbering order (the columns numbers are not continuous anymore), but the ROW_NUMBERs are continuous (by definition). Each occupied seat breaks the continuity of the columns, and the spans of the free seats correspond to the ranges of columns with unbroken continuity.

If the continuity is not broken, column and ROW_NUMBER both increase by 1 with each record. This means that their difference will be constant:

Seats
Row Column Occupied ROW_NUMBER column - ROW_NUMBER
1 2 0 1 1
1 3 0 2 1
1 4 0 3 1
1 7 0 4 3
1 8 0 5 3
1 10 0 6 4

The value of difference between column and ROW_NUMBER, therefore, uniquely defines the continuous group the records belong to.

This means we can group on it and do other things (the rest being pure technical).

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

December 30th, 2009 at 11:00 pm

Posted in SQL Server

Linked lists in MySQL: multiple ordering

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Rick McIntosh asks:

I have two tables, one with a list of spreadsheets (a) and the other with a list of the column headings that show in the spreadsheets (b).

a
id parent
1 2
2 0
b
id parent aid
1 1 1
2 0 1
3 2 1
4 6 2
5 4 2
6 0 2

I want to bring the columns back in first the order of the spreadsheets as defined by the a.parent_id then ordered as b.parent_id:

id parent aid aparent
6 0 2 0
4 6 2 0
5 4 2 0
2 0 1 2
3 2 1 2
1 1 1 2

This can be done using the same recursion technique as the one that was used to build a simple linked list.

MySQL does not support recursion directly, but it can be emulated using subquery calls in the SELECT clause of the query, using session variables to store the recursion state.

In this case we need to do the following:

  1. Generate a dummy recordset for recursion that would contain as many rows as the resulting recordset. This is best done by issuing a JOIN on the FOREIGN KEY, without any ordering. The values of the recordset will not be used in the actual query.
  2. Initialize @a and @b to be the first value of a and a 0, accordingly.
  3. In the loop, make a query that would return the next item of b for the current value of @a, or, should it fail, return the first item of b for the next @a. This is best done using a UNION ALL with a LIMIT.
  4. Adjust @a so that is points to the correct item: just select the appropriate value from b

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

December 29th, 2009 at 11:00 pm

Posted in MySQL

Oracle: joining timestamps and time intervals

with one comment

From Stack Overflow:

I am making an inner join of two tables.

I have many time intervals represented by intervals and i want to get measure data from measures only within those intervals. The intervals do not overlap.

Here are the table layouts:

intervals
entry_time exit_time
measures
time measure

There are 1295 records in intervals and about a million records in measures. The intervals do not overlap.

The result I want to get is a table with in the first column the measure, then the time the measure has been done, the begin/end time of the considered interval (it would be repeated for row with a time within the considered range)

How can I make this faster?

Straight query

The query to get the results would look like this:

SELECT  measures.measure as measure,
measures.time as time,
intervals.entry_time as entry_time,
intervals.exit_time as exit_time
FROM    intervals
JOIN    measures
ON      measures.time BETWEEN intervals.entry_time AND intervals.exit_time
ORDER BY
time ASC

This looks quite simple. However, choosing the execution plan can be somewhat tricky task for this query.

The table layout gives us no hint on how many rows will be returned from the join. If all intervals begin before the time of the first measure and end after the time of the last measure, then every combination of rows will be returned from both tables and the resulting recordset will in fact be a cartesian product of two tables. If the intervals are short and sparse, few or even no rows can be returned from measures.

However, we know that the intervals do not overlap. This means that each measure may belong to at most one interval and the number of records in the final resultset will be no more than the number of records in measures.

Since the condition we join here is a pair of inequalities, only two methods can be used by the engine to perform the join, that is NESTED LOOPS and MERGE JOIN. HASH JOIN which is most efficient on large tables requires an equijoin and cannot be used for this query.

MERGE JOIN sorts both resultsets on the columns it joins on and gradually advances the internal pointer on both sorted row sources so that the values of the columns being joined on in both tables always match. In case of an equijoin, the engine returns only the records holding the current value of the pointer; in case of a join on inequality condition, the engine returns all records greater (or less) than the current value from the corresponding table.

MERGE JOIN, however, can only satisfy a single inequality condition while we here have two of them on two different columns. The engine should split this condition into the pair of inequalities: the first one will be used by the join; the second one will be just filtered. The query essentially turns into this one:

SELECT  measures.measure as measure,
measures.time as time,
intervals.entry_time as entry_time,
intervals.exit_time as exit_time
FROM    intervals
JOIN    measures
ON      measures.time >= intervals.entry_time
WHERE   measures.time  <= intervals.exit_time
ORDER BY
        time ASC


Here, the MERGE JOIN will be using the predicate in the ON clause and the filtering will use the predicate in the WHERE clause. The predicates are symmetrical and can easily be swapped.

However, the join will have to return all records to the filtering code. And with an inequality condition like the one we see above, there will be lots of records. If we take a normal situation: the interval bounds are more or less in accordance with the dates of the first and last measure and the intervals are distributed evenly, then the MERGE JOIN will return (1265 × 1,000,000) / 2 ≈ 600,000,000 records, each to be filtered on the next step.

From performance's point of view, this is hardly different from a cartesian join: in fact, it needs to process only as few as a half of the rows. These rows, however, need to be sorted beforehand (or taken from the index which is slow to traverse in its sort order), so this can actually even be slower than a cartesian join which does not require the rows to be in any certain order and can use an INDEX FAST FULL SCAN or a TABLE SCAN.

This means the only efficient way to run this query is using NESTED LOOPS. To benefit from this method, we should make an index on measures.time and convince Oracle to use it.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

December 28th, 2009 at 11:00 pm

Posted in Oracle

Oracle: Selecting records holding group-wise maximum

with one comment

It was quite a time since I posted my last article, and I have a bunch of unanswered questions that people asked me.

Sorry for not posting and answering for so long, had a few urgent things to do, and hope these breaks won't be so long in the future :)

Now, continuing the series on selecting records holding group-wise maximums:

How do I select the whole records, grouped on grouper and holding a group-wise maximum (or minimum) on other column?

Finally, Oracle.

We will create the very same table as we did for the previous systems:
Read the rest of this entry »

Written by Quassnoi

December 16th, 2009 at 11:00 pm

Posted in Oracle

SQL Server: Selecting records holding group-wise maximum (with ties)

Comments enabled. I *really* need your comment

A feedback on the yesterday's article on selecting records holding group-wise maximums in SQL Server.

Gonçalo asks:

Regarding the post I mention on the subject, wouldn't it be easier to obtain the result you're after using the SQL Server specific SELECT TOP x WITH TIES?

This clause (which is indeed specific to SQL Server) returns ties. TOP x WITH TIES means return TOP x records, plus all records holding the same values of the ORDER BY expressions as the last (x'th) record returned by the TOP.

With a little effort this can be used for one of the queries I wrote about yesterday, namely the one that returns all records holding MIN(orderer) within each group.

However, the efficiency of this solution varies greatly depending on the cardinality of both ordering and grouping columns.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

December 1st, 2009 at 11:00 pm

Posted in SQL Server

SQL Server: Selecting records holding group-wise maximum

with one comment

Continuing the series on selecting records holding group-wise maximums:

How do I select the whole records, grouped on grouper and holding a group-wise maximum (or minimum) on other column?

In this article, I'll consider SQL Server.

SQL Server has very rich SQL syntax and its optimizer is really powerful.

However, some tricks are still required to make the queries like this to run faster.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

November 30th, 2009 at 11:00 pm

Posted in SQL Server

PostgreSQL: Selecting records holding group-wise maximum

Comments enabled. I *really* need your comment

Continuing the series on selecting records holding group-wise maximums:

How do I select the whole records, grouped on grouper and holding a group-wise maximum (or minimum) on other column?

In this article, I'll describe several ways to do this in PostgreSQL 8.4.

PostgreSQL 8.4 syntax is much richer than that of MySQL. The former can use the analytic functions, recursive CTE's and proprietary syntax extensions, all of which can be used for this task.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

November 26th, 2009 at 11:00 pm

Posted in PostgreSQL