Archive for June 30th, 2010
Indexing for ORDER BY / LIMIT
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Frode Underhill asks:
I have some applications that are logging to a MySQL database table.
The table is pretty standard on the form:
time
BIGINT(20)
source
TINYINT(4)
severity
ENUM
text
VARCHAR(255)
, where
source
identifies the system that generated the log entry.There are very many entries in the table (>100 million), of which 99.9999% are debug or info messages.
I'm making an interface for browsing this log, which means I'll be doing queries like
SELECT * FROM log WHERE source = 2 AND severity IN (1,2) AND time > 12345 ORDER BY time ASC LIMIT 30, if I want to find debug or info log entries from a certain point in time, or
SELECT * FROM log WHERE source = 2 AND severity IN (1,2) AND time < 12345 ORDER BY time DESC LIMIT 30for finding entries right before a certain time.
How would one go about indexing & querying such a table?
I thought I had it figured out (I pretty much just tried every different combination of columns in an index), but there's always some set of parameters that results in a really slow query.
The problem is that you cannot use a single index both for filtering and ordering if you have a ranged condition (severity IN (1, 2)
in this case).
Recently I wrote an article with a proposal to improve SQL optimizer to handle these conditions. If a range has low cardinality (this is, there are few values that con possibly satisfy the range), then the query could be improved by rewriting the range as a series of individual queries, each one using one of the values constituting the range in an equijoin:
No optimizers can handle this condition automatically yet, so we'll need to emulate it.
Since the severity
field is defined as an enum
with only 5 values possible, any range condition on this field can be satisfied by no more than 5 distinct values, thus making this table ideal for rewriting the query.
Let's create a sample table:
Read the rest of this entry »