EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Things SQL needs: sargability of monotonic functions. You're welcome to read and comment on it.

Archive for the ‘MySQL’ Category

Selecting most relevant row

From Stack Overflow:

I have whole bunch of contact data that may or may not contain all info, like this:

Name Address1 Address2 Email
Test User Address1 Address2 test@example.com
NULL NULL NULL test@example.com

For each email address, I would love to extract the most relevant record, that is the row with the contact information rather than the row with empty fields

To do this we need a sorting criterion: how to tell that a record is considered more relevant than another record. We can use the number of the fields filled as a criterion.

Let’s create a sample table:

Table creation details

This table has 20,000 records with 20 distinct emails.

To find out the most filled values, we need to do the following:

  1. Select the list of the emails
  2. For each email, sort the records so that the most filled record goes first. To do this, we just cast the boolean values (like name IS NULL) to the integers and add them up
  3. Select the first record’s id and join on this id

This guarantees that for each distinct email in the table, exactly one record will be selected (despite the actual number of records filled).

And here is the query:

SELECT  ai.*
FROM    (
        SELECT  DISTINCT  email
        FROM    address
        ) ad
JOIN    address ai
ON      ai.id =
        (
        SELECT  id
        FROM    address ao
        WHERE   ao.email = ad.email
        ORDER BY
                (name IS NULL) + (address1 IS NULL) + (address2 IS NULL)
        LIMIT 1
        )
id name address1 address2 email
18390 Name 18390 Address 1 18390 Address 2 18390 test10@example.com
3591 Name 3591 Address 1 3591 test11@example.com
2052 Name 2052 Address 2 2052 test12@example.com
19473 Name 19473 Address 1 19473 Address 2 19473 test13@example.com
8234 Name 8234 Address 1 8234 Address 2 8234 test14@example.com
14355 Name 14355 Address 1 14355 Address 2 14355 test15@example.com
6916 Name 6916 Address 1 6916 Address 2 6916 test16@example.com
8937 Name 8937 Address 1 8937 Address 2 8937 test17@example.com
14378 Name 14378 Address 1 14378 Address 2 14378 test18@example.com
3099 Address 1 3099 Address 2 3099 test19@example.com
2821 Name 2821 Address 1 2821 Address 2 2821 test1@example.com
17580 Name 17580 Address 1 17580 Address 2 17580 test20@example.com
16902 Name 16902 Address 1 16902 test2@example.com
17703 Name 17703 Address 1 17703 Address 2 17703 test3@example.com
4324 Name 4324 Address 1 4324 Address 2 4324 test4@example.com
2545 Name 2545 Address 1 2545 Address 2 2545 test5@example.com
1546 Name 1546 Address 1 1546 Address 2 1546 test6@example.com
18707 Name 18707 Address 1 18707 Address 2 18707 test7@example.com
12048 Address 1 12048 Address 2 12048 test8@example.com
1009 Name 1009 Address 1 1009 Address 2 1009 test9@example.com
20 rows fetched in 0.0010s (0.2543s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 20 100.00
1 PRIMARY ai eq_ref PRIMARY PRIMARY 4 func 1 100.00 Using where
3 DEPENDENT SUBQUERY ao ref ix_address_email ix_address_email 767 ad.email 646 100.00 Using where; Using filesort
2 DERIVED address range ix_address_email 767 31 100.00 Using index for group-by
Field or reference 'ad.email' of SELECT #3 was resolved in SELECT #1
select `20091117_relevant`.`ai`.`id` AS `id`,`20091117_relevant`.`ai`.`name` AS `name`,`20091117_relevant`.`ai`.`address1` AS `address1`,`20091117_relevant`.`ai`.`address2` AS `address2`,`20091117_relevant`.`ai`.`email` AS `email` from (select distinct `20091117_relevant`.`address`.`email` AS `email` from `20091117_relevant`.`address`) `ad` join `20091117_relevant`.`address` `ai` where (`20091117_relevant`.`ai`.`id` = (select `20091117_relevant`.`ao`.`id` AS `id` from `20091117_relevant`.`address` `ao` where (`20091117_relevant`.`ao`.`email` = `ad`.`email`) order by ((isnull(`20091117_relevant`.`ao`.`name`) + isnull(`20091117_relevant`.`ao`.`address1`)) + isnull(`20091117_relevant`.`ao`.`address2`)) limit 1))

As you can see, not all records are filled completely.

Nevertheless, the query selects exactly one record per email, and this is the most filled record for this email.

Written by Quassnoi

November 17th, 2009 at 11:00 pm

Posted in MySQL

Inverting date ranges: MySQL

Answering questions asked on the site.

Maxym asks:

Inverting date ranges is an interesting article.

I had the same problem, and I solved it almost the same way in PostgreSQL. Your solution is great! The problem is to implement it on MySQL :) And of course performance there.

Do you have any idea how to transform this query for MySQL?

The query in question regards inverting contiguous non-overlapping date ranges: transforming them so the the gaps become the ranges and the ranges become the gaps.

The following set:

StartDate EndDate
2009-11-12 2009-11-13
2009-11-15 2009-11-19

should become this:

StartDate EndDate
NULL 2009-11-12
2009-11-13 2009-11-15
2009-11-19 NULL

The link above shows how to make this query in SQL Server and can be applied to Oracle and PostgreSQL 8.4 as well.

However, the query in question relies on FULL OUTER JOIN and ROW_NUMBER. MySQL lacks both of these.

To implement this query in MySQL we need to use a certain workaround.

Let’s create a sample table:
Read the rest of this entry »

Written by Quassnoi

November 12th, 2009 at 11:00 pm

Posted in MySQL

MySQL: average of sums using indexes

From Stack Overflow:

My table looks like:

person_id car_id miles
1 1 100
1 2 200
2 3 1000
2 4 500

I need to total the miles for each person and then average those totals.

This is a good question to demonstrate how to improve MySQL queries using covering indexes.

The query should return a single value, which is SUM(miles) / COUNT(DISTINCT person_id). Since this query does not relate the fields to each other (only total sum and total count matter), it could be performed using only the indexes on miles and person_id. However, MySQL is not able to do this by default and it requires a little help from developer’s side.

Let’s create a sample table:

Table creation details

This table has 1,000,000 rows with 1,000 distinct person_id’s. There are indexes on miles and person_id.

A plain query would look like this:

SELECT  SUM(miles) / COUNT(DISTINCT person_id) AS total
FROM    persons
total
4998778.6000
1 row fetched in 0.0001s (2.1468s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE persons ALL 1000066 100.00
select (sum(`20091110_indexes`.`persons`.`miles`) / count(distinct `20091110_indexes`.`persons`.`person_id`)) AS `total` from `20091110_indexes`.`persons`

The query takes more than 2 seconds.

If we look into the query plan we will see that a full table scan is used. The query needs both the values of person_id and miles and no index contains both values. MySQL browses all table records which are quite large.

But despite the fact that the query needs both values, it does not need to correspond the values. It may calculate the SUM first (using an index on miles), and COUNT(DISTINCT person_id) in another query. Since index records are much less in size that the table records, the separate aggregates on them will be calculated much faster.

And we can improve the query even more by replacing COUNT(DISTINCT person_id) with a COUNT over DISTINCT person_id in a subquery. The latter query will use INDEX FOR GROUP BY and therefore be almost instant.

Here’s how the new improved query would look like:

SELECT  SUM(miles) /
        (
        SELECT  COUNT(*)
        FROM    (
                SELECT  DISTINCT person_id
                FROM    persons
                ) q
        ) AS total
FROM    persons
total
4998778.6000
1 row fetched in 0.0001s (0.7344s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY persons index ix_persons_miles 4 1000066 100.00 Using index
2 SUBQUERY Select tables optimized away
3 DERIVED persons range ix_persons_person 4 2021 100.00 Using index for group-by
select (sum(`20091110_indexes`.`persons`.`miles`) / (select count(0) AS `COUNT(*)` from (select distinct `20091110_indexes`.`persons`.`person_id` AS `person_id` from `20091110_indexes`.`persons`) `q`)) AS `total` from `20091110_indexes`.`persons`

This query uses separate indexes to calculate the aggregate values, and by using INDEX FOR GROUP-BY, COUNT(DISTINCT person_id) is calculated almost instantly.

Overall query performance is 750 ms, or thrice as fast as the original query.

Written by Quassnoi

November 10th, 2009 at 11:00 pm

Posted in MySQL

MySQL ORDER BY / LIMIT performance: late row lookups

From Stack Overflow:

When I run an SQL command like the one below, it takes more than 15 seconds:

SELECT  *
FROM    news
WHERE   cat_id = 4
ORDER BY
        id DESC
LIMIT   150000, 10

EXPLAIN shows that its using where and the index on (cat_id, id)

LIMIT 20, 10 on the same query only takes several milliseconds.

This task can be reformulated like this: take the last 150,010 rows in id order and return the first 10 of them

It means that though we only need 10 records we still need to count off the first 150,000.

The table has an index which keeps the records ordered. This allows us not to use a filesort. However, the query is still far from being efficient: 15 seconds for 150,000 records (which are already ordered) is way too much.

To better understand the reason behind the low performance let’s look into this picture:

Structure
Read the rest of this entry »

Written by Quassnoi

October 23rd, 2009 at 11:00 pm

Posted in MySQL

IN list vs. range condition: MySQL

Answering questions asked on the site.

Princess asks:

Hello, I’ve got a problem with SQL.

I have a table which contains the production details for the factories. They are not factories and items of course but I cannot disclose the project and need to obfuscate so let’s pretend they are :)

I need to select the items for which the first 5 factories have low production rate.

I tried to do the query like this:

SELECT  ProductionItem
FROM    FactoryProductions
WHERE   5 >= FactoryID
        AND 100 >= ProductionAmount

which returns correct results but is slow.

I have an index on (FactoryID, ProductionAmount).

There are 13 Factories and 2,300,000 Items

This is in MySQL

This is a nice illustration of how index range evaluation works in MySQL and how to optimize it.

We will assume that the ProductionAmount is usually much higher than 100

Now, let’s create a sample table:
Read the rest of this entry »

Written by Quassnoi

October 7th, 2009 at 11:00 pm

Posted in MySQL

Bitwise operations and indexes

From Stack Overflow:

Are the following queries efficient in MySQL:

SELECT  *
FROM    table
WHERE   field & number = number
-- to find values with superset of number's bits
SELECT  *
FROM    table
WHERE   field | number = number
-- to find values with subset of number's bits

, if an index for the field has been created?

If not, is there a way to make it run faster?

An index can be used for the following things:

  1. To limit the number of records scanned
  2. To lower the number of row lookups

When doing a full table scan, every record should be fetched and examined. If the table contains say, 1,000,000 records, and each record is 100 bytes long, then 100 Mb worth of data should be processed. These data are usually scanned sequentially.

An index row, on the other hand, contains the key value, the row pointer and some additional information (pointers to the parents and children). On a MyISAM table, each row of the index on an INT column occupies 10 bytes (4 (sizeof(INT)) + 6 (default MyISAM pointer size)) plus some overhead (block headers, leaf pointers etc).

Even if we cannot build a range on the index and need to look over all index values, 10 Mb is far less than 100 Mb.

However, scanning the index has two drawbacks:

  1. Traversing the B-Tree is more slow than a full table scan, since the former is not sequential
  2. When the condition we search for matches the index key, we should follow the pointer to fetch the unindexed values from the table row. This also takes some time.

Both these things can be optimized:

  • Oracle has INDEX FAST FULL SCAN access method. It scans the index rows in their physical order, not logical. The key order is not maintained in such a scan, but we not always need it anyway.

    This makes the index scanning sequential (and taking the constant time)

  • PostgreSQL has bitmap access method. PostgreSQL cannot disregard the index order as Oracle does so it has to traverse the tree anyway. But instead of following the pointer to fetch the values it fills a special bitmap with a bit set for each matching row.

    The order of bits corresponds to the physical order of rows.

    When it’s over with setting bits it just reads the bitmap and fetches the rows with the bits set. Since the bits are ordered, the rows fetched are ordered too, so this scan is also sequential. Scanning the bitmap takes constant time (which is usually negligible since bitmaps are very small), pure time for fetching the rows depends on the number of bits set but is no longer than a full scan (since the scan is sequential).

MySQL, however, does neither of these optimizations.

Read the rest of this entry »

Written by Quassnoi

October 1st, 2009 at 11:00 pm

Posted in MySQL

Adjacency list vs. nested sets: MySQL

Continuing the series:

What is better to store hierarchical data: nested sets model or adjacency list (parent-child) model?

For detailed explanations of the terms, see the first article in the series:

This is the last article of the series which covers MySQL.

MySQL differs from the other systems, since it is the only system of the big four that does not support recursion natively. It has neither recursive CTE’s nor CONNECT BY clause, not even rowset returning functions that help to emulate recursion in PostgreSQL 8.3.

MySQL supports a thing that all other systems either lack or implement inefficiently: session variables. They can be set in a SELECT clause and can be used to keep some kind of a state between the rows as they are processed and returned in a rowset.

This of course is against the whole spirit of SQL, since SQL implies operations on whole sets and session variables operate on rows and are totally dependent on the order they are returned or processed. But if used properly, this behavior can be exploited to emulate some things that MySQL lacks: analytic functions, efficient random row sampling etc.

Hierarchical functions are among the things that need to be emulated in MySQL using session variables to keep the function state.

Here’s the old article in my blog that shows how to do this:

On the other hand, MySQL implements one more thing that is useful for nested sets model: SPATIAL indexes.
Read the rest of this entry »

Written by Quassnoi

September 29th, 2009 at 11:00 pm

Posted in MySQL

NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL

Which method is best to select values present in one table but missing in another one?

This:

SELECT  l.*
FROM    t_left l
LEFT JOIN
        t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

, this:

SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
        (
        SELECT  value
        FROM    t_right r
        )

or this:

SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_right r
        WHERE   r.value = l.value
        )

Finally, it’s MySQL time.

As always, we will create the sample tables:

Read the rest of this entry »

Written by Quassnoi

September 18th, 2009 at 11:00 pm

Posted in MySQL

MySQL: emulating ROW_NUMBER with multiple ORDER BY conditions

Answering questions asked on the site.

Magerio asks:

Hello, thanks for your great post about advanced row sampling, it helped me a lot.

I have a question.

I am using your SQL query to show a portfolio of products and these products have a certain integer quality level (the lower, the better).

What if I want to show the best 3 products of each category?

Just a quick reminder: the article mentioned describes a query to emulate ROW_NUMBER() in MySQL. Since MySQL does not support analytic (window) functions natively, we had to do some kind of a trick to mimic their behaviour.

The original article described a query to obtain 15 first id’s for each grouper. The idea behind this query was quite simple:

  1. Take a list of groupers using DISTINCT
  2. For each grouper, find the 15th id using LIMIT in a correlated subquery. We have now a set of (grouper, id15) pairs: for any given grouper we have the id of the 15th row (ordered by id).
  3. Join this resultset with the main table so that for each grouper, the id’s less or equal to the 15th id are selected. We then will have 15 lowest id’s for each grouper that is exactly what we were after.

This emulated ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id): a function with a single PARTITION BY expression and a single ORDER BY expression.

But in your case there are multiple ORDER BY expressions. It may be not so obvious (since the question mentions only one ordering criterion, namely, quality). However, the solution above relied on the fact that the field in ORDER BY was unique, so we now that there are exactly 15 rows with id less or equal to the id returned with LIMIT 15. Since this is most probably not the case with quality, we should include id into the ORDER BY clause so that it makes the ordering conidition unique.

A subquery in MySQL, however, can only return a single value.

To work around this, we should use two subqueries that will return us a single column in an ORDER BY condition.

Let’s create a sample table:
Read the rest of this entry »

Written by Quassnoi

September 14th, 2009 at 11:00 pm

Posted in MySQL

MySQL: matching whole sets

From Stack Overflow:

I have a database that stores products available on the market and products still in development in two separate tables (market_product and dev_product).

A third table (substance) contains all substances a product can made of.

Other two tables (marked_product_comp and dev_product_comp) mantains product compositions.

I want to select products still in development that are made of the same ingredients of marketed products.

This means that we need to return every dev_product that has at least one market_product with complete substance match.

In the following example:

DevProductID Substance
1 Apple
1 Banana
2 Apple
2 Banana
2 Cabbage
MarketProductID Substance
1 Apple
2 Apple
2 Banana
3 Apple
3 Banana
3 Cabbage
3 Daikon

we should return development product 1, since it’s made of apples and bananas, and there is at least one product already on market (namely, market product 2) that is made of exactly same things and nothing else.

We should not return development product 3, because these is no complete match with market product 3, since is daikons are used to make the latter but not the former.

In one of the previous articles I wrote about a similar task in SQL Server:

However, this used FULL OUTER JOIN which MySQL lacks.

Let’s see how to implement this task in MySQL:

Read the rest of this entry »

Written by Quassnoi

September 8th, 2009 at 11:00 pm

Posted in MySQL