EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: 5 Claims About SQL, Explained. You're welcome to read and comment on it.

Selecting last forum posts

Comments enabled. I *really* need your comment

From Stack Overflow:

I built a custom forum for my site using MySQL.

The listing page is essentially a table with the following columns: Topic, Last Updated, and Replies.

The DB table has the following columns:

id name body date topic_id email

A topic has the topic_id of 0, and replies have the topic_id of their parent topic.

How do I query 20 last updated topics efficiently?

We have two kinds of topics here: those that had been answered and those that had not.

The queries for these two kinds of topics should be optimized using different methods and it's better to query for them using two different queries merged later using UNION ALL:

  • To find last 20 replies, we will employ the technique described in this article:

    This technique uses a composite index on (date, id) to find 20 latest replies to distinct topic_id's:

    1. First, it scans the index on (date, id) to retrieve records with a non-zero topic_id in descending date order.
    2. For each record found on step 1, it scans the index on (date, id) again in descending order and finds the first record with the same topic_id.
    3. If the id's of records found on steps 1 and 2 match, this means that the record found on step 1 is the latest for this topic_id and we should return it.
  • To find last 20 unreplied topics, we just use a good old NOT EXISTS.

    In this case a composite index on (topic_id, date) will come handy: we will need to select the last 20 unreplied topics, and this index will be useful to implement both filtering on topic_id and ordering by date DESC.

  • Finally, we will merge these resultsets using UNION ALL and limit the result.

Now, let's create a sample table:

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE wp_pod_tbl_forum (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(50) NOT NULL,
        body VARCHAR(1000) NOT NULL,
        date DATETIME NOT NULL,
        topic_id INT NOT NULL,
        email VARCHAR(100),
        KEY ix_forum_topic_date (topic_id, date),
        KEY ix_forum_date (date)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT
INTO    wp_pod_tbl_forum (id, name, body, date, topic_id)
SELECT  id,
        CONCAT('Post ', id),
        RPAD(CONCAT('Message ', id), 1000, ' '),
        CAST('2009-07-07 00:00:00' AS DATETIME) - INTERVAL CASE WHEN id <= 10 THEN 0 ELSE RAND(20090707) * 100000 END MINUTE,
        CASE WHEN id <= 1000 THEN 0 ELSE FLOOR(RAND(20090707 << 1) * 990) + 10 END
FROM    filler;

We have a random-filled forum table with 100,000 records.

The first 1,000 records are topics, all other are replies.

First 10 topics are unreplied.

Post dates are completely random except for the first 9 unreplied topics, which are the latest records in the whole table.

Now, let's compose the query.

This query selects the topic_id and the date of 20 last replies to 20 distinct topics:

SELECT  topic_id, date AS last_reply
FROM    wp_pod_tbl_forum fo
WHERE   id = (
        SELECT  id
        FROM    wp_pod_tbl_forum fp
        WHERE   fp.topic_id = fo.topic_id
        ORDER BY
                fp.date DESC, fp.id DESC
        LIMIT 1
        )
        AND fo.topic_id <> 0
ORDER BY
        fo.date DESC, fo.id DESC
LIMIT 20

topic_id last_reply
700 2009-07-06 23:59:00
405 2009-07-06 23:58:00
329 2009-07-06 23:43:00
606 2009-07-06 23:43:00
20 rows fetched in 0.0005s (0.0065s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY fo index ix_forum_topic_date ix_forum_date 8 39 128228.20 Using where
2 DEPENDENT SUBQUERY fp index ix_forum_topic_date ix_forum_date 8 1 4800.00 Using where

To find the number of replies and the topic name, we need to join it with the table again and calculate the total number of replies.

The latter is better done in a SELECT list subquery rather than in a GROUP'ed aggregate:

SELECT  id, name, last_reply,
        (
        SELECT  COUNT(*)
        FROM    wp_pod_tbl_forum fc
        WHERE   fc.topic_id = fl.topic_id
        ) AS replies
FROM    (
        SELECT  topic_id, date AS last_reply
        FROM    wp_pod_tbl_forum fo
        WHERE   id = (
                SELECT  id
                FROM    wp_pod_tbl_forum fp
                WHERE   fp.topic_id = fo.topic_id
                ORDER BY
                        fp.date DESC, fp.id DESC
                LIMIT 1
                )
                AND fo.topic_id <> 0
        ORDER BY
                fo.date DESC, fo.id DESC
        LIMIT 20
        ) fl
JOIN    wp_pod_tbl_forum ft
ON      ft.id = fl.topic_id

id name last_reply replies
700 Post 700 2009-07-06 23:59:00 101
405 Post 405 2009-07-06 23:58:00 132
329 Post 329 2009-07-06 23:43:00 96
606 Post 606 2009-07-06 23:43:00 124
20 rows fetched in 0.0007s (0.0100s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived3> ALL 20 100.00
1 PRIMARY ft eq_ref PRIMARY PRIMARY 4 fl.topic_id 1 100.00
3 DERIVED fo index ix_forum_topic_date ix_forum_date 8 39 128228.20 Using where
4 DEPENDENT SUBQUERY fp index ix_forum_topic_date ix_forum_date 8 1 4800.00 Using where
2 DEPENDENT SUBQUERY fc ref ix_forum_topic_date ix_forum_topic_date 4 fl.topic_id 48 100.00 Using index

Now, let's find the records for last 20 unanswered topics. We do it by using NOT EXISTS:

SELECT  id, name, date AS last_reply, 0 AS replies
FROM    wp_pod_tbl_forum t
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    wp_pod_tbl_forum r
        WHERE   r.topic_id = t.id
        )
        AND t.topic_id = 0
ORDER BY
       date DESC, id DESC
LIMIT  20

id name last_reply replies
10 Post 10 2009-07-07 00:00:00 0
9 Post 9 2009-07-07 00:00:00 0
8 Post 8 2009-07-07 00:00:00 0
7 Post 7 2009-07-07 00:00:00 0
6 Post 6 2009-07-07 00:00:00 0
5 Post 5 2009-07-07 00:00:00 0
4 Post 4 2009-07-07 00:00:00 0
3 Post 3 2009-07-07 00:00:00 0
2 Post 2 2009-07-07 00:00:00 0
1 Post 1 2009-07-07 00:00:00 0
10 rows fetched in 0.0004s (0.0253s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t ref ix_forum_topic_date ix_forum_topic_date 4 const 2000 100.00 Using where
2 DEPENDENT SUBQUERY r ref ix_forum_topic_date ix_forum_topic_date 4 20090707_group.t.id 48 100.00 Using index

There are but 10 unreplied topics in the table.

Note that we are selecting a constant zero as replies, since there are no actual replies to these topics, but UNION ALL requires us to have this column.

Finally, we merge these two queries:

SELECT  id, name, last_reply,
        (
        SELECT  COUNT(*)
        FROM    wp_pod_tbl_forum fc
        WHERE   fc.topic_id = fl.topic_id
        ) AS replies
FROM    (
        SELECT  topic_id, date AS last_reply
        FROM    wp_pod_tbl_forum fo
        WHERE   id = (
                SELECT  id
                FROM    wp_pod_tbl_forum fp
                WHERE   fp.topic_id = fo.topic_id
                ORDER BY
                        fp.date DESC, fp.id DESC
                LIMIT 1
                )
                AND fo.topic_id <> 0
        ORDER BY
                fo.date DESC, fo.id DESC
        LIMIT 20
        ) fl
JOIN    wp_pod_tbl_forum ft
ON      ft.id = fl.topic_id
UNION ALL
SELECT  *
FROM    (
        SELECT  id, name, date AS last_reply, 0 AS replies
        FROM    wp_pod_tbl_forum t
        WHERE   NOT EXISTS
                (
                SELECT  NULL
                FROM    wp_pod_tbl_forum r
                WHERE   r.topic_id = t.id
                )
                AND t.topic_id = 0
        ORDER BY
               date DESC, id DESC
        LIMIT  20
        ) fu
ORDER BY
        last_reply DESC, id DESC
LIMIT  20

id name last_reply replies
10 Post 10 2009-07-07 00:00:00 0
9 Post 9 2009-07-07 00:00:00 0
8 Post 8 2009-07-07 00:00:00 0
7 Post 7 2009-07-07 00:00:00 0
6 Post 6 2009-07-07 00:00:00 0
5 Post 5 2009-07-07 00:00:00 0
4 Post 4 2009-07-07 00:00:00 0
3 Post 3 2009-07-07 00:00:00 0
2 Post 2 2009-07-07 00:00:00 0
1 Post 1 2009-07-07 00:00:00 0
700 Post 700 2009-07-06 23:59:00 101
405 Post 405 2009-07-06 23:58:00 132
296 Post 296 2009-07-06 23:58:00 94
312 Post 312 2009-07-06 23:56:00 92
160 Post 160 2009-07-06 23:55:00 90
974 Post 974 2009-07-06 23:54:00 104
129 Post 129 2009-07-06 23:54:00 98
836 Post 836 2009-07-06 23:53:00 99
664 Post 664 2009-07-06 23:53:00 112
245 Post 245 2009-07-06 23:52:00 113
20 rows fetched in 0.0009s (0.0311s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived3> ALL 20 100.00
1 PRIMARY ft eq_ref PRIMARY PRIMARY 4 fl.topic_id 1 100.00
3 DERIVED fo index ix_forum_topic_date ix_forum_date 8 39 128228.20 Using where
4 DEPENDENT SUBQUERY fp index ix_forum_topic_date ix_forum_date 8 1 4800.00 Using where
2 DEPENDENT SUBQUERY fc ref ix_forum_topic_date ix_forum_topic_date 4 fl.topic_id 48 100.00 Using index
5 UNION <derived6> ALL 10 100.00
6 DERIVED t ref ix_forum_topic_date ix_forum_topic_date 4 2000 100.00 Using where
7 DEPENDENT SUBQUERY r ref ix_forum_topic_date ix_forum_topic_date 4 20090707_group.t.id 48 100.00 Using index
UNION RESULT <union1,5> ALL Using filesort

This query completes in 30 ms, which is allright even for a heavy loaded forum.

Note that this time hardly depends on the total number of columns in the system:

  • Both subqueries select first 20 values using appropriate indexes
  • NOT EXISTS uses the index on (topic_id, date) and stops on the first row
  • The only filesort we see in the plan sorts the final 40 records in constant time (and the modern computers are fast in sorting 40 records like these)

So as soon as the topics replies are balanced (i. e. you don't need to dig too deep into reply history to find 20 distinct topics), and there are no overgrown topics (which would hamper COUNT(*) performance), this query will complete in constant time which will most probably be dozens of milliseconds.

Written by Quassnoi

July 7th, 2009 at 11:00 pm

Posted in MySQL

Leave a Reply