EXPLAIN EXTENDED

How to create fast database queries

Indexing for ORDER BY / LIMIT

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Frode Underhill asks:

I have some applications that are logging to a MySQL database table.

The table is pretty standard on the form:

time
BIGINT(20)
source
TINYINT(4)
severity
ENUM
text
VARCHAR(255)

, where source identifies the system that generated the log entry.

There are very many entries in the table (>100 million), of which 99.9999% are debug or info messages.

I'm making an interface for browsing this log, which means I'll be doing queries like

SELECT  *
FROM    log
WHERE   source = 2
AND severity IN (1,2)
AND time > 12345
ORDER BY
time ASC
LIMIT 30

, if I want to find debug or info log entries from a certain point in time, or

SELECT  *
FROM    log
WHERE   source = 2
AND severity IN (1,2)
AND time < 12345 
ORDER BY
        time DESC
LIMIT 30

for finding entries right before a certain time.

How would one go about indexing & querying such a table?

I thought I had it figured out (I pretty much just tried every different combination of columns in an index), but there's always some set of parameters that results in a really slow query.

The problem is that you cannot use a single index both for filtering and ordering if you have a ranged condition (severity IN (1, 2) in this case).

Recently I wrote an article with a proposal to improve SQL optimizer to handle these conditions. If a range has low cardinality (this is, there are few values that con possibly satisfy the range), then the query could be improved by rewriting the range as a series of individual queries, each one using one of the values constituting the range in an equijoin:

No optimizers can handle this condition automatically yet, so we'll need to emulate it.

Since the severity field is defined as an enum with only 5 values possible, any range condition on this field can be satisfied by no more than 5 distinct values, thus making this table ideal for rewriting the query.

Let's create a sample table:

Table creation details

This MyISAM table has 12,250,000 records, with 10 random sources (distributed evenly) and 5 random severities (distributed logarithmically):

SELECT  severity, COUNT(*)
FROM    t_log
GROUP BY
severity;

severity COUNT(*)
DEBUG 11024646
INFO 1102668
WARNING 110557
ERROR 10948
FATAL 1181

We also created two indexes (one on (source, ts), the other one on (source, severity, ts)).

Now, let's try to run some queries as is:

SELECT  *
FROM    t_log
WHERE   source = 2
AND severity IN (1, 2)
AND ts <= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30

id ts source severity tx
1205 1277753995 2 DEBUG Message 1205
1227 1277753973 2 DEBUG Message 1227
1243 1277753957 2 DEBUG Message 1243
1546 1277753654 2 DEBUG Message 1546
1575 1277753625 2 DEBUG Message 1575
30 rows fetched in 0.0013s (0.0027s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_log range ix_log_source_ts,ix_log_source_severity_ts ix_log_source_ts 9 997923 100.00 Using where
select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` in (1,2)) and (`20100630_range`.`t_log`.`ts` <= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30

This is very fast. It uses the index which does not include severity: since the values 1 and 2 are very frequent, it's much more efficient just to filter them out. The index preserves the order, that's why there is no filesort in the plan.

SELECT  *
FROM    t_log
WHERE   source = 2
AND severity IN (4, 5)
AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30

id ts source severity tx
2333 1277752867 2 ERROR Message 2333
6139 1277749061 2 ERROR Message 6139
6369 1277748831 2 FATAL Message 6369
297128 1277458072 2 ERROR Message 297128
298729 1277456471 2 ERROR Message 298729
30 rows fetched in 0.0013s (0.0093s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_log range ix_log_source_ts,ix_log_source_severity_ts ix_log_source_severity_ts 10 1182 100.00 Using where; Using filesort
select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` in (4,5)) and (`20100630_range`.`t_log`.`ts` <= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30

This is very fast too. The index which includes severity is used (along with the filesort of course, because the order cannot be preserved with multiple values of severity), but the total number of records evaluated is so small that the filesort is not much of a problem.

Now, let's try to include 3 into the query above:

SELECT  *
FROM    t_log
WHERE   source = 2
AND severity IN (3, 4)
AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30

id ts source severity tx
1507 1277753693 2 WARNING Message 1507
2333 1277752867 2 ERROR Message 2333
4154 1277751046 2 WARNING Message 4154
30118 1277725082 2 WARNING Message 30118
31321 1277723879 2 ERROR Message 31321
30 rows fetched in 0.0013s (0.2496s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_log range ix_log_source_ts,ix_log_source_severity_ts ix_log_source_severity_ts 10 12168 100.00 Using where; Using filesort
select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` in (3,4)) and (`20100630_range`.`t_log`.`ts` <= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30

Now, this runs for almost 250 ms. Why?

There are 110,557 records with severity = 'WARNING'. This is too many for a filesort but too few for using where (filtering the records with the index that preserves the order). There will be too many records that will need to be skipped.

To work around this, we could combine the queries using UNION ALL. Since the original query uses ORDER BY and LIMIT, we may put them into two separate queries (which will yield 60 records) and finally apply it to the end resultset (to get the 30 records that are guaranteed to be contained among these 60):

SELECT  *
FROM    (
SELECT  *
FROM    t_log
WHERE   source = 2
AND severity = 3
AND ts &lt;= 1277754000
        ORDER BY
                source DESC, ts DESC
        LIMIT 30
        ) q
UNION ALL
SELECT  *
FROM    (
        SELECT  *
        FROM    t_log
        WHERE   source = 2
                AND severity = 4
                AND ts &lt;= 1277754000
        ORDER BY
                source DESC, ts DESC
        LIMIT 30
        ) q
ORDER BY
        ts DESC
LIMIT 30
id ts source severity tx
1507 1277753693 2 WARNING Message 1507
2333 1277752867 2 ERROR Message 2333
4154 1277751046 2 WARNING Message 4154
30118 1277725082 2 WARNING Message 30118
31321 1277723879 2 ERROR Message 31321
30 rows fetched in 0.0013s (0.0037s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 30 100.00
2 DERIVED t_log range ix_log_source_ts,ix_log_source_severity_ts ix_log_source_severity_ts 10 11094 100.00 Using where
3 UNION <derived4> ALL 30 100.00
4 DERIVED t_log range ix_log_source_ts,ix_log_source_severity_ts ix_log_source_severity_ts 10 1074 100.00 Using where
UNION RESULT <union1,3> ALL Using filesort
select `q`.`id` AS `id`,`q`.`ts` AS `ts`,`q`.`source` AS `source`,`q`.`severity` AS `severity`,`q`.`tx` AS `tx` from (select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` = 3) and (`20100630_range`.`t_log`.`ts` <= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30) `q` union all select `q`.`id` AS `id`,`q`.`ts` AS `ts`,`q`.`source` AS `source`,`q`.`severity` AS `severity`,`q`.`tx` AS `tx` from (select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` = 4) and (`20100630_range`.`t_log`.`ts` <= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30) `q` order by `ts` desc limit 30

This is much faster.

However, this solution requires composing the query dynamically, depending on the number of the severities in the condition. Is it possible to make this all in one static query that will accept the parameters in the IN list?

We can do it by using the applying the solution using to retrieve greatest-n-per-group in MySQL.

To do this, we will just select the 30th timestamp of each severity and find all records with the higher timestamps.

This can be done using a join:

SELECT  *
FROM    (
SELECT  l.*
FROM    (
SELECT  source,
severity,
(
SELECT  ts
FROM    t_log li
WHERE   li.source = ss.source
AND li.severity = ss.severity
AND ts &lt;= 1277754000
                        ORDER BY
                                li.source DESC, li.severity DESC, li.ts DESC
                        LIMIT 29, 1
                        ) AS mts
                FROM    (
                        SELECT  DISTINCT source, severity
                        FROM    t_log
                        WHERE   source = 2
                                AND severity IN (3, 4)
                        ) ss
                ) s
        JOIN    t_log l
        ON      l.source &gt;= s.source
AND l.source &lt;= s.source
                AND l.severity = s.severity
                AND l.ts &gt;= s.mts
AND l.ts &lt;= 1277754000
        ) q
ORDER BY
        ts DESC
LIMIT 30;

id ts source severity tx
1507 1277753693 2 WARNING Message 1507
2333 1277752867 2 ERROR Message 2333
4154 1277751046 2 WARNING Message 4154
30118 1277725082 2 WARNING Message 30118
31321 1277723879 2 ERROR Message 31321
30 rows fetched in 0.0014s (0.0040s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 60 100.00 Using filesort
2 DERIVED <derived3> ALL 2 100.00
2 DERIVED l range ix_log_source_ts,ix_log_source_severity_ts ix_log_source_severity_ts 10 30 40833332.00 Range checked for each record (index map: 0x3)
3 DERIVED <derived5> ALL 2 100.00
5 DERIVED t_log range ix_log_source_ts,ix_log_source_severity_ts ix_log_source_severity_ts 2 1 100.00 Using where; Using index for group-by
4 DEPENDENT SUBQUERY li ref ix_log_source_ts,ix_log_source_severity_ts ix_log_source_severity_ts 2 ss.source,ss.severity 245000 100.00 Using where; Using index
Field or reference 'ss.source' of SELECT #4 was resolved in SELECT #3
Field or reference 'ss.severity' of SELECT #4 was resolved in SELECT #3
select `q`.`id` AS `id`,`q`.`ts` AS `ts`,`q`.`source` AS `source`,`q`.`severity` AS `severity`,`q`.`tx` AS `tx` from (select `20100630_range`.`l`.`id` AS `id`,`20100630_range`.`l`.`ts` AS `ts`,`20100630_range`.`l`.`source` AS `source`,`20100630_range`.`l`.`severity` AS `severity`,`20100630_range`.`l`.`tx` AS `tx` from (select `ss`.`source` AS `source`,`ss`.`severity` AS `severity`,(select `20100630_range`.`li`.`ts` from `20100630_range`.`t_log` `li` where ((`20100630_range`.`li`.`source` = `ss`.`source`) and (`20100630_range`.`li`.`severity` = `ss`.`severity`) and (`20100630_range`.`li`.`ts` <= 1277754000)) order by `20100630_range`.`li`.`source` desc,`20100630_range`.`li`.`severity` desc,`20100630_range`.`li`.`ts` desc limit 29,1) AS `mts` from (select distinct `20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` in (3,4)))) `ss`) `s` join `20100630_range`.`t_log` `l` where ((`20100630_range`.`l`.`severity` = `s`.`severity`) and (`20100630_range`.`l`.`source` >= `s`.`source`) and (`20100630_range`.`l`.`source` <= `s`.`source`) and (`20100630_range`.`l`.`ts` >= `s`.`mts`) and (`20100630_range`.`l`.`ts` <= 1277754000))) `q` order by `q`.`ts` desc limit 30

All possible values of source and severity are selected using a loose scan (which is instant since there are few of them). Each pair of values is then used as a join condition. A single index range satisfies each pair of values, so each join iteration uses an index efficiently (actually, the access path is reevaluated for each iteration as shown by Range checked for each record (index map: 0x3).

The total number of records that would be returned by this query be there no LIMIT is 60 or maybe even more (in case of ties on ts). However, we don't need to resolve the ties in the subqueries, since the final ORDER BY / LIMIT does this for us.

The query completes in 4 ms which is instant. More than that, it does not need to be rewritten to handle different combinations of values: they could be provided in a single IN clause.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

June 30th, 2010 at 11:00 pm

Posted in MySQL

Leave a Reply