EXPLAIN EXTENDED

How to create fast database queries

Archive for November, 2009

SQL Server: Selecting records holding group-wise maximum


Continuing the series on selecting records holding group-wise maximums:

How do I select whole records, grouped on the grouper column and holding a group-wise maximum (or minimum) on another column?

In this article, I'll consider SQL Server.

SQL Server has very rich SQL syntax and its optimizer is really powerful.

However, some tricks are still required to make queries like this run faster.
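The SQL Server solutions themselves follow below the fold; as a rough sketch of the analytic-function approach SQL Server supports, here is the ROW_NUMBER technique on a hypothetical table (demonstrated with SQLite's window functions, which use the same syntax; the table and column names are made up for the illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical sample data: grouper is the grouping column,
# other is the column being maximized, payload stands for the rest of the record
con.executescript("""
CREATE TABLE t (grouper INT, other INT, payload TEXT);
INSERT INTO t VALUES (1, 10, 'a'), (1, 20, 'b'), (2, 5, 'c'), (2, 7, 'd');
""")

# ROW_NUMBER() numbers the records within each group, largest `other` first;
# keeping rn = 1 returns the whole record holding the group-wise maximum
rows = con.execute("""
SELECT  grouper, other, payload
FROM    (
        SELECT  t.*,
                ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY other DESC) AS rn
        FROM    t
        ) q
WHERE   rn = 1
ORDER BY grouper
""").fetchall()
print(rows)  # [(1, 20, 'b'), (2, 7, 'd')]
```

Whether this beats the alternatives on SQL Server depends on the indexes, which is exactly what the article examines.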

Let's create a sample table:

Written by Quassnoi

November 30th, 2009 at 11:00 pm

Posted in SQL Server

PostgreSQL: Selecting records holding group-wise maximum


Continuing the series on selecting records holding group-wise maximums:

How do I select whole records, grouped on the grouper column and holding a group-wise maximum (or minimum) on another column?

In this article, I'll describe several ways to do this in PostgreSQL 8.4.

PostgreSQL 8.4's syntax is much richer than MySQL's. The former can use analytic functions, recursive CTE's and proprietary syntax extensions, all of which can be applied to this task.
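One of those proprietary extensions is DISTINCT ON, which keeps the first row per group under a given ordering. As a sketch of its semantics on a hypothetical table (the PostgreSQL form is shown in the comment; the runnable part uses a portable join against the per-group maximums, since DISTINCT ON is PostgreSQL-only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (grouper INT, other INT, payload TEXT);
INSERT INTO t VALUES (1, 10, 'a'), (1, 20, 'b'), (2, 5, 'c'), (2, 7, 'd');
""")

# PostgreSQL could express this as:
#   SELECT DISTINCT ON (grouper) * FROM t ORDER BY grouper, other DESC
# A portable equivalent joins the table back to its per-group maximums:
rows = con.execute("""
SELECT  t.grouper, t.other, t.payload
FROM    (
        SELECT  grouper, MAX(other) AS mx
        FROM    t
        GROUP BY grouper
        ) m
JOIN    t
ON      t.grouper = m.grouper
        AND t.other = m.mx
ORDER BY t.grouper
""").fetchall()
print(rows)  # [(1, 20, 'b'), (2, 7, 'd')]
```

Note the join-on-maximum form can return several rows per group when `other` is not unique, which DISTINCT ON never does; the article compares the actual PostgreSQL plans.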

Let's create a sample table:

Written by Quassnoi

November 26th, 2009 at 11:00 pm

Posted in PostgreSQL

MySQL: Selecting records holding group-wise maximum (resolving ties)


Continuing the series on selecting records holding group-wise maximums.

The approach shown in the previous article is quite efficient; however, it can only be relied upon if the column being maximized is unique.

If the column is not unique (even along with the grouping column), ties are possible: multiple records can satisfy the condition of holding the group-wise maximum.

In this article, I'll show how to resolve the ties.
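The article's own queries follow the cut; as a sketch of one common tie-breaking tactic, appending a unique column (here a hypothetical id, invented for the illustration) to the ordering makes the "best record per group" unambiguous:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Rows 2 and 3 tie on the group-wise maximum within group 1
con.executescript("""
CREATE TABLE t (id INTEGER PRIMARY KEY, grouper INT, other INT);
INSERT INTO t VALUES (1, 1, 10), (2, 1, 20), (3, 1, 20), (4, 2, 7);
""")

# Ordering by the unique id after the maximized column resolves the tie
# in favor of the smallest id, so exactly one record per group survives
rows = con.execute("""
SELECT  t.id, t.grouper, t.other
FROM    t
WHERE   t.id =
        (
        SELECT  ti.id
        FROM    t ti
        WHERE   ti.grouper = t.grouper
        ORDER BY ti.other DESC, ti.id
        LIMIT 1
        )
ORDER BY t.grouper
""").fetchall()
print(rows)  # [(2, 1, 20), (4, 2, 7)]
```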

Let's recreate the table we used in the previous article:

Written by Quassnoi

November 25th, 2009 at 11:00 pm

Posted in MySQL

MySQL: Selecting records holding group-wise maximum (on a unique column)


Answering questions asked on the site.

Juan asks:

Regarding this question:

I would like to know which database is the most efficient when performing this type of query, and whether the solutions proposed are already the most efficient.

I am using MySQL but I would like to know if PostgreSQL, Oracle, Microsoft SQL Server or DB2 are much more efficient.

A nice question: this is a common problem for most developers.

For those too lazy to follow the link, the problem in one sentence:

How do I select whole records, grouped on the grouper column and holding a group-wise maximum (or minimum) on another column?

I've already covered this in passing in various previous articles in my blog, and now I'll try to make a little summary.

According to the SQL standards, every expression used in the SELECT list of a query with GROUP BY should either be grouped by (that is, listed in the GROUP BY clause) or aggregated.

The reason behind this is quite logical: a grouping query takes the records that have something in common and shrinks them, making a single record out of the whole set constituting the group.

The grouping expressions, of course, share their values across all the records constituting the group (since they are what makes the group a group), and hence can be used as is.

Other expressions can vary, so some algorithm should be applied to shrink all the existing values into a single value. This is what an aggregate function does.

The problem is that the aggregate functions cannot be combined: you cannot use the result of one function as an input to another one. Hence, you cannot select a record in an aggregate query: you can only select a tuple of aggregated values. Each of these aggregated values may in fact never be found among the values of the original records. Say, with all values of column strictly positive, you will never find a record holding SUM(column): it is greater than any single value of column.

However, with MIN or MAX, it is guaranteed that at least one record holds the non-NULL value returned by these aggregate functions. The problem is to return the whole record without aggregating the other columns.
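A tiny illustration of the difference, on a throwaway table invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (val INT);
INSERT INTO t VALUES (3), (5), (8);
""")

total, largest = con.execute("SELECT SUM(val), MAX(val) FROM t").fetchone()
values = [v for (v,) in con.execute("SELECT val FROM t")]

# With strictly positive values, SUM(val) exceeds every single value,
# while MAX(val) is guaranteed to occur in at least one record
print(total in values)    # False: 16 is not held by any record
print(largest in values)  # True: some record holds 8
```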

As SQL evolved, different solutions have been proposed for this problem, and I'll try to cover them in this series of articles.

First, I'll cover MySQL (as per the request).

MySQL's dialect of SQL is much poorer than those used by the other systems. That's why most of the solutions available to the other systems, like analytic functions, DISTINCT ON etc., cannot be used in MySQL.

However, MySQL still provides some ways to make queries that solve this task more efficient.

Let's create a sample table:

Written by Quassnoi

November 24th, 2009 at 11:00 pm

Posted in MySQL

Recursive CTE’s: PostgreSQL


In the previous article on recursive CTE's in SQL Server I demonstrated that they are not really set-based.

SQL Server implements the recursive CTE's syntax, but forbids all operations that do not distribute over UNION ALL, and each recursive step sees only a single record from the previous step.

Now, let's check the same operations in PostgreSQL 8.4.

To do this, we will write a query that selects only the very first branch of a tree: that is, each item should be the first child of its parent. We select the item that is the first child of the root, then select the first child of that item, and so on.

This is a set-based operation.

Oracle's CONNECT BY syntax, despite being row-based, offers some limited set-based capabilities: you can use the ORDER SIBLINGS BY clause to define the order in which the siblings are returned. However, it would require some additional work to efficiently return only the first branch.

In a truly set-based system, this is much simpler.

Let's create a sample table:

Written by Quassnoi

November 23rd, 2009 at 11:00 pm

Posted in PostgreSQL

SQL Server: are the recursive CTE’s really set-based?


Answering questions asked on the site.

Karl asks:

I've read your article about adjacency lists vs. nested sets in Oracle. A nice explanation, thanks.

You write that recursive operations are row-based in Oracle and set-based in SQL Server.

Which approach is better in your opinion?

First, a quick reminder for those who have no idea what this is all about.

Recursive queries are useful for building hierarchy trees, traversing graphs, generating arbitrary rowsets, etc. Usually this involves joining a set with itself an arbitrary number of times.

A recursive query is usually defined by the anchor part (the initial rowset, or the base of recursion) and the recursive part (some operation over the previous resultset).
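Both parts are visible in a minimal recursive CTE; the sketch below uses a hypothetical adjacency list and SQLite, whose WITH RECURSIVE syntax mirrors that of SQL Server and PostgreSQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical adjacency list: node 1 is the root
con.executescript("""
CREATE TABLE edges (parent INT, child INT);
INSERT INTO edges VALUES (1, 2), (1, 3), (2, 4), (3, 5);
""")

rows = con.execute("""
WITH RECURSIVE tree (node, depth) AS
        (
        SELECT  1, 0                     -- anchor part: the base of recursion
        UNION ALL
        SELECT  e.child, t.depth + 1     -- recursive part: applied to the previous resultset
        FROM    tree t
        JOIN    edges e
        ON      e.parent = t.node
        )
SELECT  node, depth
FROM    tree
ORDER BY node
""").fetchall()
print(rows)  # [(1, 0), (2, 1), (3, 1), (4, 2), (5, 2)]
```

Whether the recursive part really sees the previous step as a set, or only one record at a time, is exactly the question this article and the next one examine.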

This is the same as plain recursion: take some initial data as a parameter, modify it in some way, then pass the result of the modification as a new parameter to the same function.

Since SQL operates on sets, the input and the output of this function should also be sets. And here lies the main difference between the approaches used by Oracle on one side, and by SQL Server, PostgreSQL and some other systems on the other side.

Oracle uses CONNECT BY approach, while SQL Server uses recursive CTE approach.

Written by Quassnoi

November 18th, 2009 at 11:00 pm

Posted in SQL Server

Selecting most relevant row


From Stack Overflow:

I have a whole bunch of contact data that may or may not contain all the info, like this:

Name Address1 Address2 Email
Test User Address1 Address2 test@example.com
NULL NULL NULL test@example.com

For each email address, I would love to extract the most relevant record, that is, the row with the contact information rather than the row with empty fields.

To do this, we need a sorting criterion: a way to tell that one record is more relevant than another. We can use the number of filled fields as the criterion.

Let's create a sample table:

Table creation details

This table has 20,000 records with 20 distinct emails.

To find the most filled record for each email, we need to do the following:

  1. Select the list of the emails
  2. For each email, sort the records so that the most filled record goes first. To do this, we just cast the boolean values (like name IS NULL) to integers and add them up: the record with the fewest NULLs has the lowest sum and sorts first
  3. Select the first record's id and join on this id

This guarantees that for each distinct email in the table, exactly one record will be selected (regardless of how many of its fields are actually filled).

And here is the query:

SELECT  ai.*
FROM    (
        SELECT  DISTINCT  email
        FROM    address
        ) ad
JOIN    address ai
ON      ai.id =
        (
        SELECT  id
        FROM    address ao
        WHERE   ao.email = ad.email
        ORDER BY
                (name IS NULL) + (address1 IS NULL) + (address2 IS NULL)
        LIMIT 1
        )

id name address1 address2 email
18390 Name 18390 Address 1 18390 Address 2 18390 test10@example.com
3591 Name 3591 Address 1 3591 test11@example.com
2052 Name 2052 Address 2 2052 test12@example.com
19473 Name 19473 Address 1 19473 Address 2 19473 test13@example.com
8234 Name 8234 Address 1 8234 Address 2 8234 test14@example.com
14355 Name 14355 Address 1 14355 Address 2 14355 test15@example.com
6916 Name 6916 Address 1 6916 Address 2 6916 test16@example.com
8937 Name 8937 Address 1 8937 Address 2 8937 test17@example.com
14378 Name 14378 Address 1 14378 Address 2 14378 test18@example.com
3099 Address 1 3099 Address 2 3099 test19@example.com
2821 Name 2821 Address 1 2821 Address 2 2821 test1@example.com
17580 Name 17580 Address 1 17580 Address 2 17580 test20@example.com
16902 Name 16902 Address 1 16902 test2@example.com
17703 Name 17703 Address 1 17703 Address 2 17703 test3@example.com
4324 Name 4324 Address 1 4324 Address 2 4324 test4@example.com
2545 Name 2545 Address 1 2545 Address 2 2545 test5@example.com
1546 Name 1546 Address 1 1546 Address 2 1546 test6@example.com
18707 Name 18707 Address 1 18707 Address 2 18707 test7@example.com
12048 Address 1 12048 Address 2 12048 test8@example.com
1009 Name 1009 Address 1 1009 Address 2 1009 test9@example.com
20 rows fetched in 0.0010s (0.2543s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 20 100.00
1 PRIMARY ai eq_ref PRIMARY PRIMARY 4 func 1 100.00 Using where
3 DEPENDENT SUBQUERY ao ref ix_address_email ix_address_email 767 ad.email 646 100.00 Using where; Using filesort
2 DERIVED address range ix_address_email 767 31 100.00 Using index for group-by
Field or reference 'ad.email' of SELECT #3 was resolved in SELECT #1
select `20091117_relevant`.`ai`.`id` AS `id`,`20091117_relevant`.`ai`.`name` AS `name`,`20091117_relevant`.`ai`.`address1` AS `address1`,`20091117_relevant`.`ai`.`address2` AS `address2`,`20091117_relevant`.`ai`.`email` AS `email` from (select distinct `20091117_relevant`.`address`.`email` AS `email` from `20091117_relevant`.`address`) `ad` join `20091117_relevant`.`address` `ai` where (`20091117_relevant`.`ai`.`id` = (select `20091117_relevant`.`ao`.`id` AS `id` from `20091117_relevant`.`address` `ao` where (`20091117_relevant`.`ao`.`email` = `ad`.`email`) order by ((isnull(`20091117_relevant`.`ao`.`name`) + isnull(`20091117_relevant`.`ao`.`address1`)) + isnull(`20091117_relevant`.`ao`.`address2`)) limit 1))

As you can see, not all records are filled completely.

Nevertheless, the query selects exactly one record per email, and it is the most filled record for that email.

Written by Quassnoi

November 17th, 2009 at 11:00 pm

Posted in MySQL

Inverting date ranges: MySQL


Answering questions asked on the site.

Maxym asks:

Inverting date ranges is an interesting article.

I had the same problem, and I solved it almost the same way in PostgreSQL. Your solution is great! The problem is to implement it in MySQL :) And of course performance matters there.

Do you have any idea how to transform this query for MySQL?

The query in question regards inverting contiguous non-overlapping date ranges: transforming them so that the gaps become the ranges and the ranges become the gaps.

The following set:

StartDate EndDate
2009-11-12 2009-11-13
2009-11-15 2009-11-19

should become this:

StartDate EndDate
NULL 2009-11-12
2009-11-13 2009-11-15
2009-11-19 NULL

The link above shows how to write this query in SQL Server; the same approach can be applied to Oracle and PostgreSQL 8.4 as well.

However, the query in question relies on FULL OUTER JOIN and ROW_NUMBER. MySQL lacks both of these.

To implement this query in MySQL we need to use a certain workaround.
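The MySQL workaround itself is below the cut; the transformation, stripped of SQL, is simple enough to sketch in a few lines (a plain illustration of the inversion, not the article's query):

```python
def invert_ranges(ranges):
    """Turn sorted, non-overlapping (start, end) ranges into their gaps,
    with None marking the open-ended first and last intervals."""
    inverted, prev_end = [], None
    for start, end in ranges:
        # the gap before this range runs from the previous range's end to this start
        inverted.append((prev_end, start))
        prev_end = end
    # the trailing gap is open-ended
    inverted.append((prev_end, None))
    return inverted

print(invert_ranges([('2009-11-12', '2009-11-13'),
                     ('2009-11-15', '2009-11-19')]))
# [(None, '2009-11-12'), ('2009-11-13', '2009-11-15'), ('2009-11-19', None)]
```

The SQL challenge is expressing this "pair each end with the next start" step without ROW_NUMBER or FULL OUTER JOIN.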

Let's create a sample table:

Written by Quassnoi

November 12th, 2009 at 11:00 pm

Posted in MySQL

SQL Server: clustered index and ordering


From Stack Overflow:

My table (SQL Server 2008) has more than 1 million records.

When I try to order records by datetime, it takes 1 second, but when I order by id (integer), it only takes about 0.1 second.

Is there any way to improve the efficiency?

The obvious answer would be to create an index on datetime.

However, things are not so simple.

Let's create a sample table and see why:

Written by Quassnoi

November 11th, 2009 at 11:00 pm

Posted in SQL Server

MySQL: average of sums using indexes


From Stack Overflow:

My table looks like:

person_id car_id miles
1 1 100
1 2 200
2 3 1000
2 4 500

I need to total the miles for each person and then average those totals.

This is a good question to demonstrate how to improve MySQL queries using covering indexes.

The query should return a single value, which is SUM(miles) / COUNT(DISTINCT person_id). Since this query does not relate the fields to each other (only the total sum and the total count matter), it could be performed using only the indexes on miles and person_id. However, MySQL cannot do this by default, and it requires a little help from the developer's side.

Let's create a sample table:

Table creation details

This table has 1,000,000 rows with 1,000 distinct person_id's. There are indexes on miles and person_id.

A plain query would look like this:

SELECT  SUM(miles) / COUNT(DISTINCT person_id) AS total
FROM    persons
total
4998778.6000
1 row fetched in 0.0001s (2.1468s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE persons ALL 1000066 100.00
select (sum(`20091110_indexes`.`persons`.`miles`) / count(distinct `20091110_indexes`.`persons`.`person_id`)) AS `total` from `20091110_indexes`.`persons`

The query takes more than 2 seconds.

If we look at the query plan, we will see that a full table scan is used. The query needs both person_id and miles, and no single index contains both values, so MySQL browses all the table records, which are quite large.

But despite the fact that the query needs both values, it does not need to correlate them with each other. It may calculate the SUM in one query (using the index on miles) and COUNT(DISTINCT person_id) in another. Since index records are much smaller than the table records, the separate aggregates over them are calculated much faster.

And we can improve the query even more by replacing COUNT(DISTINCT person_id) with a COUNT over DISTINCT person_id in a subquery. The latter query will use INDEX FOR GROUP BY and therefore be almost instant.

Here's what the new, improved query looks like:

SELECT  SUM(miles) /
        (
        SELECT  COUNT(*)
        FROM    (
                SELECT  DISTINCT person_id
                FROM    persons
                ) q
        ) AS total
FROM    persons
total
4998778.6000
1 row fetched in 0.0001s (0.7344s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY persons index ix_persons_miles 4 1000066 100.00 Using index
2 SUBQUERY Select tables optimized away
3 DERIVED persons range ix_persons_person 4 2021 100.00 Using index for group-by
select (sum(`20091110_indexes`.`persons`.`miles`) / (select count(0) AS `COUNT(*)` from (select distinct `20091110_indexes`.`persons`.`person_id` AS `person_id` from `20091110_indexes`.`persons`) `q`)) AS `total` from `20091110_indexes`.`persons`

This query uses separate indexes to calculate the aggregate values, and by using INDEX FOR GROUP BY, COUNT(DISTINCT person_id) is calculated almost instantly.

The overall query time is 750 ms: thrice as fast as the original query.

Written by Quassnoi

November 10th, 2009 at 11:00 pm

Posted in MySQL