EXPLAIN EXTENDED

How to create fast database queries

Archive for the ‘Miscellaneous’ Category

Top 5 XKCD comics which can illustrate programming questions

with 8 comments

As many of you know, I actively participate on Stack Overflow, the leading Q/A website for software developers.

Kind folks on Stack Overflow have made their data open for examination, and anyone can query their database using this web interface at data.stackexchange.com.

Many of the questions and answers there are illustrated with links to XKCD, the webcomic created by Randall Munroe.

So I decided to see which of those comics best illustrate quirks and oddities we keyboard warriors have to deal with in our daily routine.

The query itself is quite simple:

SELECT  link, cnt, id [Post Link], score
FROM    (
        SELECT  link, id, score,
                COUNT(*) OVER (PARTITION BY link) cnt,
                ROW_NUMBER() OVER (PARTITION BY link ORDER BY score DESC) rn
        FROM    (
                SELECT  id, score,
                        'http' + SUBSTRING(start, 1, PATINDEX('%/%', SUBSTRING(start, 13, 30)) + 12)
                FROM    (
                        SELECT  id, score,
                                SUBSTRING(body, PATINDEX('%://xkcd.com/[0-9]%', body), 20) start
                        FROM    (
                                SELECT  TOP 400000
                                        id, body, score
                                FROM    posts
                                ORDER BY
                                        id DESC
                                UNION
                                SELECT  TOP 4000000
                                        postId, text, score
                                FROM    comments
                                ORDER BY
                                        id DESC
                                ) q
                        WHERE   body LIKE '%://xkcd.com/[0-9]%'
                        ) q
                ) q (id, score, link)
        ) q
WHERE   rn = 1
ORDER BY
        cnt DESC

The TOP clauses are there to limit the query size, as their hosting provider does not allow fulltext indexing, and without them the query would just time out. This query only searches recent comments and posts, so some famous and top-voted posts might not be present here. Still, it's a good sample.

You can see all results here, and we'll just discuss the top 5 entries.

#5. Random Number (10 links)


Of course the questions which link to it deal with random number generators of any kind and fallacies in their implementation.

Read the rest of this entry »

Written by Quassnoi

July 9th, 2014 at 8:00 pm

Posted in Miscellaneous

5 Claims About SQL, Explained

with 5 comments

1. Is SQL a declarative language?

First of all, let's define the terms.

A declarative language describes the result, not the process to achieve it. In a declarative language you say "what I need".

A procedural language describes control flow: the exact steps the machine needs to perform in order to achieve this result. In a procedural language you say "what you should do".

SQL, as it was originally designed, is a declarative language.


For instance, suppose you need to know how many customers over 40 live in each city. You issue a query like:
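
A minimal sketch of such a query, assuming a hypothetical customers table with city and age columns (these names are illustrative and do not come from the original article):

```sql
-- Hypothetical schema: customers (id, city, age)
SELECT  city, COUNT(*) AS customers_over_40
FROM    customers
WHERE   age > 40
GROUP BY
        city
```

The query states only which rows count and how they are grouped. Whether the engine scans the table, uses an index or hashes the cities is left to the optimizer.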

Read the rest of this entry on tech.pro »

Written by Quassnoi

October 28th, 2013 at 11:00 pm


NULL in SQL: explaining its behavior

with one comment

On my blog feedback page I get lots of questions which essentially boil down to one thing: "Those NULL things in databases work in a way I don't freaking get!"

Let's have them explained a little.

The Wikipedia page defines NULL this way:

Introduced by the creator of the relational database model, E. F. Codd, SQL Null serves to fulfill the requirement that all true relational database management systems (RDBMS) support a representation of "missing information and inapplicable information".

For people new to the subject, a good way to remember what null means is to remember that in terms of information, "lack of a value" is not the same thing as "a value of zero"; similarly, "lack of an answer" is not the same thing as "yes" or "no".

Rather than trying to come up with intuitive rules for how NULL behaves, it's easier to expand on the definition above to demonstrate the motivation behind introducing the NULL at all, and why it ended up so complex.
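
The behavior people usually stumble over can be seen in a sketch like the following. It uses PostgreSQL syntax (the choice of dialect is an assumption made for this example), and the column aliases are purely illustrative:

```sql
SELECT  1 = NULL     AS one_equals_null,   -- NULL: unknown, not false
        NULL = NULL  AS null_equals_null,  -- NULL: two unknowns are not known to be equal
        NULL IS NULL AS null_is_null,      -- true: IS NULL tests for the absence of a value
        0 IS NULL    AS zero_is_null;      -- false: zero is a value
```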

NULL is like an "unknown variable" in an algebraic equation

Let's step away from relations and tables for a moment, and remember some old school math. What is this?

Read the rest of this entry on tech.pro »

Written by Quassnoi

September 27th, 2013 at 11:00 pm


Happy New Year: drawing snowflakes in SQL

with 18 comments

This winter is anomalously warm in Europe: there is no snow and no New Year mood. So today we will be drawing a snowflake in PostgreSQL.

#1. A little theory

The core of a snowflake is six large symmetrical ice crystals growing from a common center. Out of these larger crystals grow other, smaller crystals.

The overall shape of the snowflake is defined by how large the crystals grow and where exactly they are attached to each other.

These things are defined by fluctuations in air and temperature conditions around the snowflake. Because the flake itself is very small, at any given moment the conditions are nearly identical around each crystal. That's why the offspring crystals pop up in almost the same places and grow to almost the same lengths. Different flakes, though, constantly move to and from each other and are subject to very different fluctuations, and that's why each of them grows to be unique.

Except for the root crystals (of which there are six), the child icicles grow in symmetrical pairs. More than that, each branch grows its own children (also in pairs), so on each step there are twice as many crystals, but they all share almost the same length and angle. This gives the snowflake its symmetrical look.

So we can easily see that, although there may be many child crystals, the shape of a snowflake is defined by a relatively small number of parameters: how many children each crystal produces, where they are attached to it, at which angle they grow and to which length.

Now, let's try to model it.
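
As a first, very rough sketch of that model, the following PostgreSQL query only counts the crystals in each generation. It assumes six root crystals and exactly one pair of children per crystal. The CTE name, the column names and the depth of three generations are made up for the illustration, and angles and lengths are left out entirely:

```sql
WITH RECURSIVE generations (gen, crystals) AS
        (
        -- Generation 0: the six root crystals
        SELECT  0, 6
        UNION ALL
        -- Every crystal grows one pair of children (an assumption)
        SELECT  gen + 1, crystals * 2
        FROM    generations
        WHERE   gen < 3
        )
SELECT  gen, crystals
FROM    generations
ORDER BY
        gen;
```

Under those assumptions the counts per generation are 6, 12, 24 and 48.
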
Read the rest of this entry »

Written by Quassnoi

December 31st, 2011 at 11:00 pm

Shared Plan and Algorithm Network Cache (SPANC)

with 5 comments

Due to the nature of my work I have to deal with various database systems.

While SQL is more or less standardized, the optimizers are implemented differently in different systems. Some systems cannot join tables with anything other than nested loops, others can only GROUP BY using a sort, etc.

So when you write a join in, say, MySQL, you cannot expect it to be a sort merge join (and you should consider this fact when designing the query). Or, when you write a DISTINCT in SQL Server, you can't expect a loose index scan. These are limitations imposed by their optimizers.

However, in the last three months I noticed a great improvement in the queries where I could not expect any.

It started when I tried to debug this in SQL Server:

SELECT  DISTINCT [order]
FROM    orderItem

which yielded this plan:

Similar results were obtained on Oracle:

Plan hash value: 1345318323

---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                                         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                                  |             |       |   200 |     2  (50)| 00:00:01 |
|   1 |  REMOTE SPANC QUERY (SQLSERVER, MYSQL, POSTGRESQL, STACKOVERFLOW) |             |       |   200 |     2  (50)| 00:00:01 |
---------------------------------------------------------------------------------------------------------------------------------

, MySQL:

+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+
| id | select_type | table     | type  | possible_keys | key     | key_len | ref  | rows    | Extra                                               |
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+
|  1 | SIMPLE      | orderItem | spanc | NULL          | ALL     | NULL    | NULL |         | Using Oracle, PostgreSQL, SQL Server, StackOverflow |
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+

and PostgreSQL:

Seq Scan on OrderItem  (cost=0.00..6.44 width=4)
 -> Remote Scan on SPANC (Oracle, MySQL, SQL Server, StackOverflow)   (cost=0.00..100.00 width=4)

Network analysis has shown weird encrypted activity between the machines in my internal network that host SQL Server, Oracle, PostgreSQL and MySQL.

Ultimately, there was unencrypted activity outside of the internal network which turned out to be an HTTP POST request followed by several GET polls to http://stackoverflow.com/questions/5518080/distinct-optimization.

It seems that the developers of major database systems agreed to share the knowledge about the most efficient query plans in some kind of a distributed storage (which probably is called SPANC as we can see in the query plans) and provide an interface to access each other's systems.

It also seems that these systems treat Stack Overflow as an external optimization engine where the most experienced developers can build their plans for them in the most efficient way.

I would be glad to have further clarification from the companies' staff.

This also raises a question: how many regular Stack Overflow participants are in fact query engines disguised as curious fellow developers?

It would definitely be nice to know.

Written by Quassnoi

April 1st, 2011 at 11:00 pm


Things SQL needs: SERIES()

with 9 comments

Recently I had to deal with several scenarios which required processing and aggregating continuous series of data.

I believe this could be best illustrated with an example:

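id   source   value
===================
1    1        10
2    1        20
===================
3    2        15
4    2        25
===================
5    1        45
===================
6    3        50
7    3        35
===================
8    1        40
9    1        10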

The records are ordered by id, and within this order there are continuous series of records which share the same value of source. In the table above, the series are separated by thick lines.

We want to calculate some aggregates across each of the series: MIN, MAX, SUM, AVG, whatever:

source   min   max   sum   avg
1        10    20    30    15.00
2        15    25    40    20.00
1        45    45    45    45.00
3        35    50    85    42.50
1        10    40    50    25.00

This can be used for different things. I used that for:

  • Reading sensors from a moving elevator (thus tracking its position)
  • Recording user's activity on a site
  • Tracking the primary node in a server cluster

, but almost any seasoned database developer can recall a need for such a query.

As you can see, the same values of source repeat in different series, so a mere GROUP BY on source won't work here: it would lump separate series together.

In the systems supporting window functions there is a workaround for that:
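
One commonly used workaround of this kind (not necessarily the one the rest of the article goes on to describe) labels each series with the difference between two ROW_NUMBER() values. The sketch below uses PostgreSQL syntax and inlines the sample rows in a CTE. The names readings, grouped and series_id are made up for the illustration:

```sql
WITH    readings (id, source, value) AS
        (
        VALUES  (1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 2, 25), (5, 1, 45),
                (6, 3, 50), (7, 3, 35), (8, 1, 40), (9, 1, 10)
        ),
        grouped AS
        (
        -- Overall position minus position within the same source:
        -- constant inside a series, different for each later series
        SELECT  id, source, value,
                ROW_NUMBER() OVER (ORDER BY id) -
                ROW_NUMBER() OVER (PARTITION BY source ORDER BY id) AS series_id
        FROM    readings
        )
SELECT  source, MIN(value), MAX(value), SUM(value), AVG(value)
FROM    grouped
GROUP BY
        source, series_id
ORDER BY
        MIN(id);
```

Within one series both row numbers grow by one per record, so their difference stays constant. Once another source interrupts, the difference for the next series of the same source is larger. Grouping by source and this difference therefore yields one row per series.
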

Read the rest of this entry »

Written by Quassnoi

February 18th, 2011 at 11:00 pm


Things SQL needs: determining range cardinality

with 15 comments

What is the problem with this query?

SELECT  *
FROM    orders
WHERE   quantity <= 4
        AND urgency <= 4


The problem is indexing strategy, of course. Which columns should we index?

If we index quantity, the optimizer will be able to use the index to filter on it. However, filtering on urgency will require scanning all records with quantity <= 4 and applying the urgency filter to each record found.

Same with urgency. We can use range access on urgency using an index, but this will require filtering on quantity.

"Why, create a composite index!" some will say.

Unfortunately, that won't help much.

A composite B-Tree index maintains what is called a lexicographical order of the records. This means that an index on (quantity, urgency) will sort on quantity, and only when the quantities are equal will it take the urgency into account.

The picture below shows how the records would be ordered in such an index:

As we can see, with a single index range scan (i.e. just following the arrows) we cannot select only the records within the dashed rectangle. There is no single index range that could be used to filter on both columns.
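
The interleaving can also be seen without the picture. A small sketch in PostgreSQL syntax, assuming quantities and urgencies that each range from 1 to 6 (these ranges, and the aliases q, u and in_rectangle, are made up for the illustration), lists the pairs in the order an index on (quantity, urgency) would keep them and flags those that satisfy both conditions:

```sql
SELECT  quantity, urgency,
        (quantity <= 4 AND urgency <= 4) AS in_rectangle
FROM    generate_series(1, 6) AS q (quantity)
CROSS JOIN
        generate_series(1, 6) AS u (urgency)
ORDER BY
        quantity, urgency;
```

For quantities 1 to 4 the flagged rows come in runs of four, each followed by two rows that fail the urgency condition, so no single contiguous range contains exactly the matching rows.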

Even if we changed the field order in the index, it would just change the direction of the arrows connecting the records:
Read the rest of this entry »

Written by Quassnoi

May 19th, 2010 at 11:00 pm


Things SQL needs: MERGE JOIN that would seek

with 7 comments

One of the best-known and least-used join algorithms in SQL engines is MERGE JOIN.

This algorithm operates on two sorted recordsets, keeping two pointers that chase each other.

The Wikipedia entry above describes it quite well in terms of algorithms. I'll just make an animated GIF to make it clearer:

This is quite a nice and elegant algorithm, which, unfortunately, has two major drawbacks:

  1. It needs the recordsets to be sorted
  2. Even with the recordsets sorted, it is no better than a HASH JOIN

The sorting part is essential for this algorithm and there is nothing that can be done about it: the recordsets should be sorted, period. Databases, however, often provide the records in sorted order: from clustered tables, indexes, previously sorted and ordered subqueries, spool tables etc.

But even when the recordsets are already sorted, on equijoins the MERGE JOIN is hardly faster than a HASH JOIN.

Why?
Read the rest of this entry »

Written by Quassnoi

May 7th, 2010 at 11:00 pm


NoSQL

with 6 comments

I had a vision tonight.

Image by nataliej

A huge, dark, grim figure approached me, seized me with its long bony arms and made me see all the vanity of the world we are living in.

Bloated database engines, useless ACID requirements, meaningless joins are now in the past for me.

I decided to move to NoSQL.

Where do I begin?

Written by Quassnoi

April 1st, 2010 at 1:00 pm


Things SQL needs: sargability of monotonic functions

with 5 comments

I'm going to write a series of articles about the things SQL needs to work faster and more efficiently.

With these things implemented, most of the tricks I explain in my blog will become redundant and I'll finally have some more time to spend with the family.

Ever wondered why a condition like this:

WHERE TRUNC(mydate) = TRUNC(SYSDATE)

, which searches for the current day's records, is so elegant but so slow?

Of course this is because even if you create an index on mydate, this index cannot be used.

The expression on the left-hand side of the equality is not mydate itself but a function of it. The database engine cannot find a way to use an index to search for it. Such an expression is said to be not sargable.
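
For this particular condition there is a well-known manual rewrite, shown here only as a sketch in Oracle syntax. The table name mytable is hypothetical. The rewrite relies on TRUNC being monotonic, so the dates that truncate to today form a single contiguous range:

```sql
-- Selects the same rows as TRUNC(mydate) = TRUNC(SYSDATE),
-- but mydate is left bare, so an index on mydate can be used
-- for a range scan
SELECT  *
FROM    mytable
WHERE   mydate >= TRUNC(SYSDATE)
        AND mydate < TRUNC(SYSDATE) + 1
```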

Now, a little explanation about the indexes and sargability. If you are familiar with these, you can skip this chapter. But beware that this chapter is the only one illustrated, so skipping it will make the article too boring to read.

Ahem.

To locate a record in a B-Tree index, the keys of the index should be compared to the value being searched for.

Let's consider this sample B-Tree index:


Read the rest of this entry »

Written by Quassnoi

February 19th, 2010 at 11:00 pm
