EXPLAIN EXTENDED

How to create fast database queries

Archive for 2009

MySQL: LIMIT on LIMIT

Comments enabled. I *really* need your comment

From Stack Overflow:

Here's my situation: I want to select first ten entries from a database with id = $id.

But I want the results to be listed in a certain priority: if criteria = $criteria, then I want those results displayed first.

Otherwise, I just want to keep displaying the rest of the rows.

This can be done with a very simple query:

SELECT  *
FROM    t_preference
ORDER BY
        category = 1 DESC, id DESC
LIMIT 10

In this query, category = 1 is a boolean expression which returns TRUE when category is equal to 1, FALSE otherwise.

Since TRUE is 1 and FALSE is 0, TRUE is greater than FALSE from ORDER BY's point of view. We should apply DESC for the TRUE results (i. e. rows with category = 1) to be returned first.

However, this query is not very index friendly, and it will need to select all rows and sort them.

To improve performance, we should split this query into two index-friendly queries and merge them with a UNION.

Let's create a sample table and see how do we do it:

Read the rest of this entry »

Written by Quassnoi

August 5th, 2009 at 11:00 pm

Posted in MySQL

SQL Server: deleting adjacent values (improved)

Comments enabled. I *really* need your comment

Yesterday, I wrote an article on how to delete adjacent values from a table in SQL Server:

A quick reminder of the problem, taken from Stack Overflow:

I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.

The table can contain many records (sometimes more than 500,000) so looping is too slow.

id unitId day interval amount
1 100 10 21 9.345
2 100 10 22 9.367
3 200 11 21 4.150
4 300 11 21 4.350
5 300 11 22 4.734
6 300 11 23 5.106
7 400 13 21 10.257
8 400 13 22 10.428

Values of (UnitID, Day, Interval) are unique.

In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).

Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.

To delete adjacent values, we should select continous ranges of intervals and delete each even row from each continous range.

In the previous article, I used a subquery to do this:

WITH   rows AS
        (
        SELECT  *,
                ROW_NUMBER() OVER
                (
                PARTITION BY
                        (
                        SELECT  TOP 1 qi.id AS mint
                        FROM    [20090803_adjacent].records qi
                        WHERE   qi.unitid = qo.unitid
                                AND qi.day = qo.day
                                AND qi.interval < = qo.interval
                                AND NOT EXISTS
                                (
                                SELECT  NULL
                                FROM    [20090803_adjacent].records t
                                WHERE   t.unitid = qi.unitid
                                        AND t.day = qi.day
                                        AND t.interval = qi.interval - 1
                                )
                        ORDER BY
                                qi.interval DESC
                        )
                ORDER BY interval
                ) AS rnm
        FROM    [20090803_adjacent].records qo
        )
DELETE
FROM    rows
WHERE   rnm % 2 = 0

This subquery traverses adjacent intervals until it find the first interval that has no predecessor.

Its id then serves as a range identifier in a PARTITION BY clause.

However, this query can be improved.

Assume we have the following intervals:

interval
3
4
5
6
10
11
12
20
21

We have three ranges here: 3:6, 10:12 and 20:21.

To identify a continuous range of intervals, we first should calculate a per-partition ROW_NUMBER of these intervals:

interval ROW_NUMBER
3 1
4 2
5 3
6 4
10 5
11 6
12 7
20 8
21 9

ROW_NUMBERs, by definition, don't have gaps within the partition, while the intervals can have gaps. However, inside each range ROW_NUMBERs increase with intervals.

This means that we can just subtract ROW_NUMBER from the interval:

interval ROW_NUMBER ROW_NUMBER - interval
3 1 2
4 2 2
5 3 2
6 4 2
10 5 5
11 6 5
12 7 5
20 8 12
21 9 12

Within each range, this difference is a constant. When a gap occurs, this difference changes.

Therefore, the value of this difference identifies a range within each partition, and can be used for partitioning ranges just as well.

This should be more efficient since it does not use the subqueries.

Let's create a sample table and check:

Read the rest of this entry »

Written by Quassnoi

August 4th, 2009 at 11:00 pm

Posted in SQL Server

SQL Server: deleting adjacent values

Comments enabled. I *really* need your comment

From Stack Overflow:

I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.

The table can contain many records (sometimes more than 500,000) so looping is too slow.

id unitId day interval amount
1 100 10 21 9.345
2 100 10 22 9.367
3 200 11 21 4.150
4 300 11 21 4.350
5 300 11 22 4.734
6 300 11 23 5.106
7 400 13 21 10.257
8 400 13 22 10.428

Values of (UnitID, Day, Interval) are unique.

In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).

Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.

The result of the query should be a table where no adjacent records exist, i. e. nothing should be returned after we run the following query:

SELECT  *
FROM    mytable qi
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable t
        WHERE   t.unitid = qi.unitid
                AND t.day = qi.day
                AND t.interval = qi.interval - 1
        )

However, we should leave as many records as possible, so that we delete only the records absolutely necessary to delete.

If we just delete everything, the query above will not return any records just as well, but it's not our goal.

The problem is that the records we delete can influence adjacency of other records. In the example above, record 6 is adjacent to 5, and record 5 is adjacent to 4.

So we could delete 4 and 6 (and leave only 5), and at the same time we could delete only 5 (and leave 4 and 6 which are not adjacent anymore since the 5 has been deleted).

In SQL Server, no DML query can see its own results, so a DELETE a with a naive EXISTS condition would delete 6 just as well, since it would not notice that 5 was deleted beforehand.

However, we can formulate the delete condition.

We need to split our records into the sets of continous ranges, without gaps within any range. All adjacent records should go into one range.

The first record of any range should never be deleted, since it is not adjacent to any record before it (if there were any record before this, that record would go into the range before the record in question, and the latter would not be first in its range).

The second record of any range should be deleted, since it's adjacent to the first record (that is guaranteed to stay in the table, as we decided a step earlier).

The third record should not be deleted, since it was adjacent to the second record which is doomed, and after the second record will be deleted, the third record will not be adjacent anymore, etc.

In other words, we should delete each even record from each continuous range. This algorithm guarantees that no adjacent records will be left and that we delete as few records as possible.

The only problem is to group the adjacent records and find their row numbers within each range so that we can filter the even rows.

It is best done by iterating the records up from the current and returning the id of the first matching record which is not superseded (i. e. for which there is no record with (uintId, day, interval - 1). This can be done in a subquery right in the PARTITION BY clause of the ROW_NUMBER function.

Let's create a sample table and compose the query:

Read the rest of this entry »

Written by Quassnoi

August 3rd, 2009 at 11:00 pm

Posted in SQL Server

MySQL: DISTINCT on DISTINCT

Comments enabled. I *really* need your comment

From Stack Overflow:

I currently have the following query:

SELECT  group_concat (DISTINCT usrFirst, usrLast) as receiver_name
FROM    …

It outputs my list of names as follows:

JohnDoe,BillSmith,DaveJones

, while desired ouput would be:

John Doe, Bill Smith, and Dave Jones

Basically, I need help figuring out three things:

  1. How can I put a space, between the first and last name?
  2. How can insert a space after each comma?
  3. How can I add an and right before the last name displayed?

This kind of output should be really processed on the client (since comma and the and word are very cultire specific).

From MySQL, we just need a recordset of distinct full names.

Let's create a sample table and see how do we do it:

Read the rest of this entry »

Written by Quassnoi

August 1st, 2009 at 11:00 pm

Posted in MySQL

MySQL: counting items in pairs

Comments enabled. I *really* need your comment

From Stack Overflow:

I currently have an item table, and a pair table.

The pair table simply contains two columns, which contain the PRIMARY KEY from the item table. There is an index on id1, id2 on pair.

A common query is to find a number of items that are featured in the least number of pairs:

SELECT  id, COUNT(*) AS count
FROM    item i
LEFT JOIN
pair p
ON      (p.id1 = i.id OR p.id2 = i.id)
GROUP BY
id
ORDER BY
count, RAND()
LIMIT 100

, but the query is horible performance wise.

Is there a better query, and/or data structure for this type of thing?

MySQL is not very good in optimizing OR conditions.

It's capable of doing index_merge, which would UNION the results of two indexes, however, as documentation states, this access method works only when comparing the fields against the constants.

This method, therefore, can be used to serve an OR condition neither in a JOIN nor in a subquery.

Let's create sample tables and see the execution plan for the query:

Read the rest of this entry »

Written by Quassnoi

July 31st, 2009 at 11:00 pm

Posted in MySQL

PostgreSQL: making duplicate records unique

Comments enabled. I *really* need your comment

From Stack Overflow:

In an existing application, I have a table which has no primary key, which (rarely) has duplicate rows in it.

For integration with another system, I need to add a column to the table that makes these duplicate rows unique.

The table is essentially:

txn# detail# amount

I could just use an incrementing seq# for every existing row, however, the application changes will be to take the existing almost key (basically (txn#, detail#)), and increment the sequence number for this combination

So, e. g., I'll have two rows for (txn#, detail#) = (1, 1), and have seq# 1 for the first and 2 for the second.

If (txn#, detail#) = (513, 44) has 3 duplicate rows, these would have seq# 1, 2 and 3 appropriately.

We just need to update the table with a condition that would distinguish between the rows somehow.

Fortunatly, every PostgreSQL table has an implicit primary key, ctid.

In fact, it's the pointer to the place in the datafile where the row data resides.

In new PostgreSQL 8.4, we can employ a window function to assign a seq to each row.

This script fills the table with duplicate values (two duplicates for each (txn, detail) pair):

CREATE TABLE t_dup (txn INT NOT NULL, detail INT NOT NULL, seq INT);

CREATE INDEX ix_dup_txn_detail ON t_dup (txn, detail);

INSERT
INTO    t_dup (txn, detail, seq)
SELECT  t, d, NULL
FROM    generate_series(1, 100) t,
generate_series(1, 1000) d,
generate_series (1, 2) s;

ANALYZE t_dup;

Here's the query to update it:

UPDATE  ONLY t_dup
SET     seq = rn
FROM    (
SELECT  ctid, ROW_NUMBER() OVER (PARTITION BY txn, detail) AS rn
FROM    t_dup
) d
WHERE   t_dup.ctid = d.ctid

This works perfectly and completes in 9 seconds, but, unfortunately, it's unavailable in older versions of PostgreSQL.

For PostgreSQL 8.3 and below, we need to use a subselect with a COUNT(*).

Fortunately, ctids are comparable, and we can use less than (<) operator on them to build an ordered set.

This is less efficient, but still works (if the keys are almost unique, i. e. there are not many duplicates of each key):

UPDATE	ONLY t_dup d
SET	seq =
(
SELECT	COUNT(*)
FROM	t_dup di
WHERE	di.txn = d.txn
AND di.detail = d.detail
AND di.ctid &lt;= d.ctid
	);

This takes a little bit longer (15 seconds), however still works.

Written by Quassnoi

July 30th, 2009 at 11:00 pm

Posted in PostgreSQL

PostgreSQL 8.4: efficient MIN(COUNT(*))

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Frances asks:

I have a table user_views which contains pages viewed by users.

How do I select the user that had the most and the fewest number of page views?

If you need them at the same time, you'll just have to group by myuser, then order by COUNT(*).

With new PostgreSQL 8.4, it's possible to do this in one CTE, to avoid double evaluation:

WITH    q AS
(
SELECT  myuser, COUNT(*)
FROM    user_pages
GROUP BY
myuser
)
SELECT  *
FROM    (
SELECT  'max'::TEXT AS which, myuser, count
FROM    q
ORDER BY
count DESC
LIMIT 1
) qmax
UNION ALL
SELECT  *
FROM    (
SELECT  'min'::TEXT AS which, myuser, count
FROM    q
ORDER BY
count
LIMIT 1
) qmin

However, it's takes quite a long time if you table is big.

This is probably an answer to the question you asked.

But as a bonus I'd like to tell how to optimize the query for the cases when you need just the user with minimal count of page views.

The main idea here that we should accumulate the least value of COUNT(*) calculated so far, and stop counting when we reach this threshold. This will save us some row lookups and improve the query.

Let's create a sample table:

Read the rest of this entry »

Written by Quassnoi

July 29th, 2009 at 11:00 pm

Posted in PostgreSQL

SQL Server: random records avoiding CTE reevaluation

Comments enabled. I *really* need your comment

From Stack Overflow:

I need to get 5 random records from a table plus a further record based on data from the users preferences.

How can I combine the two statements whilst ensuring that no results are repeated (i. e. the favourite is not present in the 5 random records)?

We should select the 6th record so that it would satisfy two conditions: first, it should be preferred by user, second, it should not be among the 5 random records selected earlier.

The problem with random records it that they are, um, random. We usually cannot reproduce 5 random records to apply further conditions on them.

That's where the CTEs would come handy, but there's a little problem.

A CTE, as Microsoft understands them, is just an alias for an online view.

Therefore, each time a CTE is used in a query, it may (or may not) be reevaluated.

It's not a problem for a plain CTEs which are built out of static columns.

However, this is totally a problem when we try to UNION two queries using an undeterministic CTE (like, ordered by NEWID()), as we would do in this case.

Let's create a sample table and see what happens:

Read the rest of this entry »

Written by Quassnoi

July 28th, 2009 at 11:00 pm

Posted in SQL Server

Dynamic pivot

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a table and want to transpose its rows to columns, similar to a pivot table but without summarising.

For example I have the following tables:

Question (QuestionID, QuestionText)
Response (ResponseID, ResponseText, QuestionID)

Basically I want to be able to create a dynamic table something like:

Question 1 Question 2 Question 3
Response 1.1 Response 1.2 Response 1.3
Response 2.1 Response 2.2 Response 2.3
Response 3.1 Response 3.2 Response 3.3
Response 4.1 Response 4.2 Response 4.3

The main requirement would be I don't know at design time what the question text will be.

First, let's put it straight: SQL is not a right tool to do this.

SQL operates on sets. An SQL query is a functions that returns a set. And the column layout is an implicit property of a set. It should be defined in design time. This is the way SQL works.

However, there are several approaches that can simplify the task. One of them is using SQL Server's XML abilities.

To arrange the answers in a table, we need to make the following:

  1. Select all questions. This rowset will define the columns of our table.
  2. Select the row numbers for all answers for the question that has the most answers. This rowset will define the rows of our table. The number of rows wil be exactly as the number of answers to the most answered question.
  3. Assign the ordinal number to each answer on any question. This is done using ROW_NUMBER(). This value will uniquely define the position of the answer in a table.
  4. For each row (taken from the rowset made on step 2), take all the questions and LEFT JOIN the answers that need to be put into this row, using the ordinal number calculated on the step 3 as a LEFT JOIN condition. This will put either the answer or an empty string into the corresponding cell.
  5. Transform the rowset retrieved on step above into an XML string made out of <td>'s and enclosed into a <tr>
  6. Transform the resulting rowset into an XML made out of the XML's from the previous step and enclosed into a <table>

Let's create a sample table and see how can we do this:

Read the rest of this entry »

Written by Quassnoi

July 27th, 2009 at 11:00 pm

Posted in SQL Server

MySQL: selecting rows before and after filtered one

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a basic MySQL table, terms, comprised of an id and term field.

I want to create an alphabetically sorted dictionary index (in the literal sense), that would list ten 10 terms above the selected term, and 20 below it

An example of this could be found on Urban Dictionary where on the left column you see the current term highlighted, and a number of terms above it, and some below, all sorted alphabetically.

As we all know, MySQL doesn't support a ROW_NUMBER() or a similar function so we end up resorting to user variables and sub-selects.

I also cannot create a view with user defined variables because MySQL doesn't allow that.

MySQL indeed doesn't support ROW_NUMBER() and this function indeed can be emulated using session variables:

However, it will be an overkill in this case, since a more simple and more efficient set-based solution using LIMIT can be used instead.

Let's create a sample table and see how:

Read the rest of this entry »

Written by Quassnoi

July 25th, 2009 at 11:00 pm

Posted in MySQL