EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Happy New Year!. You're welcome to read and comment on it.

Archive for the ‘Miscellaneous’ Category

Happy New Year!

with 11 comments

This winter is anomalously warm in Europe, there is no snow and no New Year mood. So today we will be drawing a snowflake in PostgreSQL.

#1. A little theory

Core of a snowflake is six large symmetrical ice crystals growing from the common center. Out of these larger crystals other, smaller, crystals grow.

The overall shape of the snowflake is defined by how large do crystals grow and where exactly are they attached to each other.

These things are defined by fluctuations in air and temperature conditions around the snowflake. Because the flake itself is very small, in any given moment the conditions are nearly identical around each crystal, that’s why the offspring crystals pop up in almost same places and grow to almost same lengths. Different flakes, though, constantly move to and from each other and are subject to very different fluctuations, and that’s why they grow so unique.

Except for the root crystals (of which there are six), the child icicles grow in symmetrical pairs. More than that, each branch grows their own children (also in pairs), so on each step there are twice as many crystals, but they all share almost same length and angle. This gives the snowflake its symmetrical look.

So we can easily see that, despite the fact there may be many child crystals, the shape of a snowflake is defined by a relatively small number of parameters: how many children each crystal produces, where are they attached to it, at which angle they grow and to which length.

Now, let’s try to model it.
Read the rest of this entry »

Written by Quassnoi

December 31st, 2011 at 11:00 pm

Posted in Miscellaneous

Shared Plan and Algorithm Network Cache (SPANC)

with 5 comments

Due to the nature of my work I have to deal with various database systems.

While SQL is more or less standardized, the optimizers are implemented differently in the different systems. Some systems cannot join tables with anything other than nested loops, the other can only GROUP BY using a sort, etc.

So when you write a join in, say, MySQL, you cannot expect it to be a sort merge join (and you should consider this fact when designing the query). Or, when you write a DISTINCT in SQL Server, you can’t expect a loose index scan. These are limitations put by their optimizers.

However, in the last three months I noticed a great improvement in the queries where I could not expect any.

It started when I tried to debug this in SQL Server:

SELECT  DISTINCT order
FROM    orderItem

while yielded this plan:

Similar results were obtained on Oracle:

Plan hash value: 1345318323

---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                                         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                                  |             |       |   200 |     2  (50)| 00:00:01 |
|   1 |  REMOTE SPANC QUERY (SQLSERVER, MYSQL, POSTGRESQL, STACKOVERFLOW) |             |       |   200 |     2  (50)| 00:00:01 |
---------------------------------------------------------------------------------------------------------------------------------

, MySQL:

+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+
| id | select_type | table     | type  | possible_keys | key     | key_len | ref  | rows    | Extra                                               |
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+
|  1 | SIMPLE      | orderItem | spanc | NULL          | ALL     | NULL    | NULL |         | Using Oracle, PostgreSQL, SQL Server, StackOverflow |
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+

and PostgreSQL:

Seq Scan on OrderItem  (cost=0.00..6.44 width=4)
 -> Remote Scan on SPANC (Oracle, MySQL, SQL Server, StackOverflow)   (cost=0.00..100.00 width=4)

Network analysis has shown weird encrypted activity between the servers in my internal network which host SQL Server, Oracle, PostgreSQL and MySQL servers.

Ultimately, there was unencrypted activity outside of the internal network which turned out to be an HTTP POST request followed by several GET polls to http://stackoverflow.com/questions/5518080/distinct-optimization.

It seems that the developers of major database systems agreed to share the knowledge about the most efficient query plans in some kind of a distributed storage (which probably is called SPANC as we can see in the query plans) and provide an interface to access each other’s systems.

It also seems that these systems treat Stack Overflow as an external optimization engine where the most experienced developers can build their plans for them in a most efficient way.

I would be glad to have further clarification from the companies staff.

This also begs a question: how many of regular Stack Overflow participants are in fact query engines disguised as curious fellow developers?

It would be definitely nice to know.

Written by Quassnoi

April 1st, 2011 at 11:00 pm

Posted in Miscellaneous

Things SQL needs: SERIES()

with 9 comments

Recently I had to deal with several scenarios which required processing and aggregating continuous series of data.

I believe this could be best illustrated with an example:

id source value
1 1 10
2 1 20
3 2 15
4 2 25
5 1 45
6 3 50
7 3 35
8 1 40
9 1 10

The records are ordered by id, and within this order there are continuous series of records which share the same value of source. In the table above, the series are separated by thick lines.

We want to calculate some aggregates across each of the series: MIN, MAX, SUM, AVG, whatever:

source min max sum avg
1 10 20 30 15.00
2 15 25 40 20.00
1 45 45 45 45.00
3 35 50 85 42.50
1 10 40 50 25.00

This can be used for different things. I used that for:

  • Reading sensors from a moving elevator (thus tracking its position)
  • Recording user’s activity on a site
  • Tracking the primary node in a server cluster

, but almost any seasoned database developer can recall a need for such a query.

As you can see, the values of source are repeating so a mere GROUP BY won’t work here.

In the systems supporting window functions there is a workaround for that:

Read the rest of this entry »

Written by Quassnoi

February 18th, 2011 at 11:00 pm

Posted in Miscellaneous

Happy New Year!

with 9 comments

Some say SQL is not good at graphics.

Well, they have some point. Database engines lack scanner drivers, there is no easy way to do sepia, and magic wand, let’s be honest, is just poor.

However, you can make some New Year paintings with SQL.

Let’s make a New Year clock showing 12 o’clock in Oracle.

Read the rest of this entry »

Written by Quassnoi

December 31st, 2010 at 11:00 pm

Posted in Miscellaneous

Things SQL needs: determining range cardinality

with 14 comments

What is the problem with this query?

SELECT  *
FROM    orders
WHERE   quantity < = 4
        AND urgency <= 4


The problem is indexing strategy, of course. Which columns should we index?

If we index quantity, the optimizer will be able to use the index to filter on it. However, filtering on urgency will require scanning all records with quantity < 4 and applying the urgency filter to each record found.

Same with urgency. We can use range access on urgency using an index, but this will require filtering on quantity.

Why, create a composite index!, some will say.

Unfortunately, that won’t help much.

A composite B-Tree index maintains what is called a lexicographical order of the records. This means that an index on (quantity, urgency) will sort on quantity, and only if the quantities are equal, it will take the urgency into account.

The picture below shows how would the records be ordered in such an index:

As we can see, with a single index range scan (i. e. just following the arrows) we cannot select only the records within the dashed rectangle. There is no single index range that could be used to filter on both columns.

Even if we changed the field order in the index, it would just change the direction of the arrows connecting the records:
Read the rest of this entry »

Written by Quassnoi

May 19th, 2010 at 11:00 pm

Posted in Miscellaneous

Things SQL needs: MERGE JOIN that would seek

with 6 comments

One of the most known and least used join algorithms in SQL engines is MERGE JOIN.

This algorithm operates on two sorted recordsets, keeping two pointers that chase each other.

The Wikipedia entry above describes it quite well in terms of algorithms. I’ll just make an animated GIF to make it more clear:

This is quite a nice and elegant algorithm, which, unfortunately, has two major drawbacks:

  1. It needs the recordsets to be sorted
  2. Even with the recordsets sorted, it is no better than a HASH JOIN

The sorting part is essential for this algorithm and there is nothing that can be done with it: the recordsets should be sorted, period. Databases, however, often provide the records in the sorted order: from clustered tables, indexes, previously sorted and ordered subqueries, spool tables etc.

But even when the recordsets are already sorted, on equijoins the MERGE JOIN is hardly faster than a HASH JOIN.

Why?
Read the rest of this entry »

Written by Quassnoi

May 7th, 2010 at 11:00 pm

Posted in Miscellaneous

NoSQL

with 6 comments

I had a vision tonight.

Image by nataliej

A huge, dark, grim figure approached me, seized me with its long bony arms and made me see all the vanity of the world we are living in.

Bloated database engines, useless ACID requirements, meaningless joins are now in the past for me.

I decided to move to NoSQL.

Where do I begin?

Written by Quassnoi

April 1st, 2010 at 1:00 pm

Posted in Miscellaneous

Things SQL needs: sargability of monotonic functions

with 5 comments

Image by ndevil

I’m going to write a series of articles about the things SQL needs to work faster and more efficienly.

With these things implemented, most of the tricks I explain in my blog will become redundant and I’ll finally have some more time to spend with the family.

Ever wondered why a condition like this:

WHERE TRUNC(mydate) = TRUNC(SYSDATE)

, which searches for the current day’s records, is so elegant but so slow?

Of course this is because even if you create an index on mydate, this index cannot be used.

The expression in the left part of the equality is not a mydate. The database engine cannot find a way to use an index to search for it. It is said that this expression is not sargable.

Now, a little explanation about the indexes and sargability. If you are familiar with these, you can skip this chapter. But beware that this chapter is the only one illustrated, so skipping it will make the article too boring to read.

Ahem.

To locate a record in a B-Tree index, the keys of the index should be compared to the value being searched for.

Let’s consider this sample B-Tree index:


Read the rest of this entry »

Written by Quassnoi

February 19th, 2010 at 11:00 pm

Posted in Miscellaneous

Happy New Year!

with 5 comments

Oracle New Year tree

WITH    tree AS
        (
        SELECT  /*+ MATERIALIZE */
                '  ,,,"''' AS needles, 'Oo%$&' AS decorations
        FROM    dual
        ),
        branches AS
        (
        SELECT  level AS id, level - TRUNC((level - 2) / 6) * 2 AS s
        FROM    dual
        CONNECT BY
                level <= 24
        )
SELECT  RPAD(' ', 18, ' ') || '*' AS tree
FROM    dual
UNION ALL
SELECT  RPAD(' ', 18 - s, ' ') || '/' ||
        (
        SELECT  REPLACE(
                SYS_CONNECT_BY_PATH
                (
                CASE
                WHEN DBMS_RANDOM.value < 0.1 THEN
                        SUBSTR(decorations, TRUNC(DBMS_RANDOM.value * LENGTH(decorations)) + 1, 1)
                ELSE
                        SUBSTR(needles, TRUNC(DBMS_RANDOM.value * LENGTH(needles)) + 1, 1)
                END , '/'
                ),
                '/', ''
                )
        FROM    tree
        WHERE   CONNECT_BY_ISLEAF = 1
        CONNECT BY
                level <= s * 2 - 1
        )
        || '\'
FROM    branches
TREE
*
/,\
/,, \
/$," ,\
/, ',, \
/ ,'o'' ,\
/,, ,'",,,, \
/,',,$, '' ,,\
/,, ,'",,,, \
/",, ,,"' , ",\
/ $' ,, " ,' '\
/, , '"O,,'o"," "\
/,,,," , ',,',,',',\
/,',,"' ' ' ,,,,&"o,\
/,' ,,," ,',%" ,,,\
/,',,"' ' ' ,,,,&"o,\
/,$,,," ,,$'%",",,%,,''"\
/ ' ,, ," ,,& , ,, ,, \
/,",O' ,',, , ,$, "',",,,',\
/,," ",, ,,, ,',', " ,""',\
/ ,O ',,,,," o , '','', ,&,\
/,," ",, ,,, ,',', " ,""',\
/,"'',,""$"," ,$,%,' "" ,,,',",\
/, '" "O,,o' , ,'% $',,, ,""" , \
/, O' ,,, &,',,' , ',$, ,' ,,, ,,'\

SQL Server bottle of champagne

WITH    lines AS
        (
        SELECT  1 AS n
        UNION ALL
        SELECT  n + 1
        FROM    lines
        WHERE   n < 24
        ),
        bottle_parts (part, lft, body, rgt, w) AS
        (
        SELECT  'mouth', '.', '_', '.', 2
        UNION ALL
        SELECT  'neck', '|', ' ', '|', 2
        UNION ALL
        SELECT  'bell', '/', ' ', '\', NULL
        UNION ALL
        SELECT  'border', '/', '=', '\', NULL
        UNION ALL
        SELECT  'body', '|', ' ', '|', 7
        UNION ALL
        SELECT  'bottom', '(', '_', ')', 7
        ),
        bottle_lines AS
        (
        SELECT  CASE
                WHEN n = 1 THEN 'mouth'
                WHEN n BETWEEN 2 AND 8 THEN 'neck'
                WHEN n = 11 THEN 'border'
                WHEN n BETWEEN 9 AND 13 THEN 'bell'
                WHEN n BETWEEN 13 AND 23 THEN 'body'
                WHEN n = 24 THEN 'bottom'
                END AS part,
                n
        FROM    lines
        ),
        bottle_width AS
        (
        SELECT  bp.*, n, COALESCE(w, (n - 6)) AS width
        FROM    bottle_lines bl
        JOIN
                bottle_parts bp
        ON      bp.part = bl.part
        ),
        bottle AS
        (
        SELECT  n, width, REPLICATE(' ', 18 - width) + lft + REPLICATE(body, width * 2) + rgt AS line
        FROM    bottle_width
        ),
        bubbles (pos, m, bubble) AS
        (
        SELECT  CAST(ROUND(RAND() * 4, 0, 1) AS INTEGER), 1, ':' AS bubble
        UNION ALL
        SELECT  pos + ASCII(SUBSTRING(CAST(NEWID() AS VARCHAR(MAX)), 1, 1)) % 3 - 1, m + 1, SUBSTRING(':.oO', m / 3 + 1, 1)
        FROM    bubbles
        WHERE   m < 12
        ),
        limits AS
        (
        SELECT  *,
                18 +
                CASE
                WHEN pos < 1 - width THEN 1 - width
                WHEN pos > width - 1 THEN width - 1
                ELSE pos
                END AS rpos
        FROM    bottle bo
        LEFT HASH JOIN
                bubbles bu
        ON      bu.m = 24 - bo.n
        )
SELECT  SUBSTRING(line, 1, COALESCE(rpos, 1) - 1) + COALESCE(bubble, ' ') + SUBSTRING(line, COALESCE(rpos, 1) + 1, 200) AS bottle
FROM    limits
ORDER BY
        n
bottle
.____.
| |
| |
| |
| |
| |
| |
| |
/ \
/ \
/==========\
/ O \
/ O \
| O |
| o |
| o |
| o |
| . |
| . |
| . |
| : |
| : |
| : |
(______________)

PostgreSQL fireworks

WITH    centers AS
        (
        SELECT  angle,
                len,
                ROUND(len * SIN(2 * PI() * angle)) AS x,
                ROUND(len * COS(2 * PI() * angle)) AS y,
                ROUND(len * 0.3)::INTEGER + 1 AS trace
        FROM    (
                SELECT  RANDOM() AS angle,
                        8 * (1 - POWER(RANDOM(), 3)) AS len
                FROM    generate_series (1, 50) s
                ) q
        ),
        traces AS
        (
        SELECT  *,
                generate_series(1, trace) AS part
        FROM    centers
        ),
        parts AS
        (
        SELECT  CASE
                WHEN trace = part THEN
                        LEAST(len * 0.2, 2)::INTEGER
                ELSE
                        TRUNC(angle * 8 - 0.5)::INTEGER % 4 + 3
                END AS symbol,
                TRUNC(x + part * SIN(2 * PI() * angle)) AS x,
                TRUNC(y + part * COS(2 * PI() * angle)) AS y
        FROM    traces
        )
        SELECT  ARRAY_TO_STRING(
        ARRAY(
        SELECT  COALESCE(
                (
                SELECT  SUBSTR(E'.xX\\-/|', MIN(symbol) + 1, 1)
                FROM    parts
                WHERE   x = col - 14
                        AND y = row - 12
                ), ' ')
        FROM    generate_series(1, 25) col
        ), ''
        ) AS FIREWORKS
FROM    generate_series(1, 24) row

fireworks
X
| X
x| | X
x | | /
X\ \x | /x
X\ \ \ /
X \ \ x / /xx
\\ x |. ///
X– \ |
xxx– .
Xx– ..
xxx-/ –x
. \\ –X
. .
/ \
X/ | \ \x
/ |\ \\
// |x\ \ \x
x/ |\x \ X
X x\x X
X

MySQL postcard

SELECT  CONCAT(COALESCE(border, '|'), RPAD(COALESCE(postcard, ''), 61, COALESCE(filler, ' ')), COALESCE(border, '|')) AS postcard
FROM    (
        SELECT  id2 * 5 + id1 AS line
        FROM    (
                SELECT 1 AS id1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
                ) q1
        CROSS JOIN
                (
                SELECT 0 AS id2 UNION ALL SELECT 1
                ) q2
        ) dummy
LEFT JOIN
        (
        SELECT  3 AS line, CONCAT(' Dear ', USER(), ',') AS postcard
        UNION ALL
        SELECT  4, ' I wish you good luck in the New Year.'
        UNION ALL
        SELECT  5, ' May all your plans be optimal, all your queries be answered'
        UNION ALL
        SELECT  6, ' and all your estimations be correct.'
        UNION ALL
        SELECT  8, '               Yours sincerely,'
        UNION ALL
        SELECT  9, '                                                    Quassnoi'
        ) ln
ON      ln.line = dummy.line
LEFT JOIN
        (
        SELECT 1 AS line, '.' AS border, '_' AS filler
        UNION ALL
        SELECT 10 AS line, '.' AS border, '_' AS filler
        ) borders
ON      borders.line = dummy.line
ORDER BY
        dummy.line
postcard
._____________________________________________________________.
| |
| Dear dbuser@localhost, |
| I wish you good luck in the New Year. |
| May all your plans be optimal, all your queries be answered |
| and all your estimations be correct. |
| |
| Yours sincerely, |
| Quassnoi |
._____________________________________________________________.
Happy New Year!

Written by Quassnoi

December 31st, 2009 at 11:00 pm

Posted in Miscellaneous

What is entity-relationship model?

with 12 comments

Image by Samuel Mann

Image by Samuel Mann

A relational database, as we all know from the previous article, stores relations between integers, strings and other simple types in a very plain way: it just enumerates all values related.

This model is extremely flexible, since other relations can be easily derived from existing ones using mathematical formulae, and the relational database takes care of that.

However, database should reflect the real world in some way to be really of use. Since the relational databases store relations between mathematical abstractions and not real world things, we should make some kind of a mapping of ones to the others.

This is what entity-relationship model is for.

An entity-relationship model, as one can easily guess from its name, models relationships between entities.

But since we know that databases do essentially the same, how does it differ from the database model?

  • An entity-relationship model states which data and relations between them should be stored
  • A database model states how these relations are stored

In other words, ER model is design and database model is one of the ways to implement it. ER model is said to be above the database model in the waterfall developement.

Note that in both cases I use the word stored above. The model says nothing of data and relations between them that may or should exist, only of those that should be stored. Every human being participates in thousands if not millions relationships, but an ER model should state which of them are to be stored, and a relational model should decide how to store them. Of course it would be nice to store them all (and police and insurance would be just happy), but the technology does not allow it yet. Don’t know whether it is good or bad.

A typical entity-relationship diagram in Peter Chen’s notation looks like this:

An entity-relationship diagram

What does it mean?
Read the rest of this entry »

Written by Quassnoi

October 18th, 2009 at 11:00 pm

Posted in Miscellaneous