<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>EXPLAIN EXTENDED &#187; MySQL</title>
	<atom:link href="http://explainextended.com/category/mysql/feed/" rel="self" type="application/rss+xml" />
	<link>http://explainextended.com</link>
	<description>How to create fast database queries</description>
	<lastBuildDate>Wed, 08 May 2013 19:35:11 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Empty row if condition does not match</title>
		<link>http://explainextended.com/2013/01/03/empty-row-if-condition-does-not-match/</link>
		<comments>http://explainextended.com/2013/01/03/empty-row-if-condition-does-not-match/#comments</comments>
		<pubDate>Thu, 03 Jan 2013 19:00:19 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=5540</guid>
		<description><![CDATA[How to make a query always return a NULL row on a non-match]]></description>
				<content:encoded><![CDATA[<p>Just found that in a Google referer to the blog:</p>
<blockquote><p>I want SQL to return blank row even if the condition does not match</p></blockquote>
<p>This may be useful for certain ORMs which always expect a single row as a result of a query.</p>
<p>Say we have a query like this:</p>
<pre class="brush: sql">
SELECT  *
FROM    mytable
WHERE   id = 42
</pre>
<p>and want it to return a single row (possibly consisting of NULL values) no matter what.</p>
<p>If we had a join with the condition in the <code>ON</code> clause:</p>
<pre class="brush: sql">
SELECT  m.*
FROM    `values` v
JOIN    mytable m
ON      m.id = v.value
</pre>
<p>we could just rewrite the <code>INNER JOIN</code> as a <code>LEFT JOIN</code>:</p>
<pre class="brush: sql">
SELECT  m.*
FROM    `values` v
LEFT JOIN
        mytable m
ON      m.id = v.value
</pre>
<p>This way, we would have at least one record returned for each entry in <code>values</code>.</p>
<p>In our original query we don&#8217;t have a table to join with. But we can easily generate it:</p>
<pre class="brush: sql">
SELECT  m.*
FROM    (
        SELECT  42 AS value
        ) v
LEFT JOIN
        mytable m
ON      m.id = v.value
</pre>
<p>If <code>id</code> is a <code>PRIMARY KEY</code> on <code>mytable</code>, this query returns exactly one record, regardless of whether such an <code>id</code> exists in <code>mytable</code> or not.</p>
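<p>Here is a minimal sketch of both cases. The layout of <code>mytable</code> (the <code>name</code> column and the two sample rows) is an assumption made up for illustration and is not part of the original question:</p>
<pre class="brush: sql">
-- Hypothetical sample table and data, for illustration only
CREATE TABLE mytable (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(20)
);

INSERT
INTO    mytable (id, name)
VALUES  (1, 'one'),
        (42, 'forty-two');

-- id 42 exists: the matching row is returned
SELECT  m.*
FROM    (
        SELECT  42 AS value
        ) v
LEFT JOIN
        mytable m
ON      m.id = v.value;

-- id 43 does not exist: still one row, with NULL in every column
SELECT  m.*
FROM    (
        SELECT  43 AS value
        ) v
LEFT JOIN
        mytable m
ON      m.id = v.value;
</pre>
<p>Since <code>id</code> is declared <code>NOT NULL</code> in this sketch, the caller can tell a real row from the placeholder by checking whether the returned <code>id</code> is <code>NULL</code>.</p>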
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2013/01/03/empty-row-if-condition-does-not-match/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>MySQL: GROUP BY in UNION</title>
		<link>http://explainextended.com/2011/03/30/mysql-group-by-in-union/</link>
		<comments>http://explainextended.com/2011/03/30/mysql-group-by-in-union/#comments</comments>
		<pubDate>Wed, 30 Mar 2011 19:00:29 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=5304</guid>
		<description><![CDATA[In MySQL, a GROUP BY used inside a UNION still sorts, though it should not. This degrades performance and cannot be turned off with ORDER BY NULL unless used along with a LIMIT large enough]]></description>
				<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/5486857/using-order-by-null-in-a-union"><strong>Stack Overflow</strong></a>:</p>
<blockquote>
<p>I have a query where I have a custom developed <strong>UDF</strong> that is used to calculate whether or not certain points are within a polygon (first query in <code>UNION</code>) or circular (second query in <code>UNION</code>) shape.</p>
<pre class="brush: sql">
SELECT  a.geo_boundary_id, …
FROM     geo_boundary_vertex a, …
…
GROUP BY
        a.geo_boundary_id
UNION
SELECT  b.geo_boundary_id, …
FROM     geo_boundary b, …
…
GROUP BY
        b.geo_boundary_id
</pre>
<p>When I run an explain for the query I get <code>filesort</code> for both queries within the <code>UNION</code>.</p>
<p>Now, I can split the queries up and use the <code>ORDER BY NULL</code> trick to get rid of the <code>filesort</code> however when I attempt to add that to the end of a <code>UNION</code> it doesn&#8217;t work.</p>
<p>How do I get rid of the <code>filesort</code>?</p>
</blockquote>
<p>In <strong>MySQL</strong>, <code>GROUP BY</code> also implies <code>ORDER BY</code> on the same set of expressions in the same order. That&#8217;s why it adds an additional <code>filesort</code> operation to sort the resultset if it does not come out naturally sorted (say, from an index).</p>
<p>This behavior is not always desired, and the <strong>MySQL</strong> manual suggests adding <code>ORDER BY NULL</code> to queries where sorting is not required. This can improve their performance.</p>
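<p>For a standalone query, the trick looks like this. The table <code>orders</code> and the column <code>customer_id</code> are hypothetical names used only for this sketch, and the implicit sort it avoids is the behavior of the <strong>MySQL</strong> versions discussed in this post:</p>
<pre class="brush: sql">
-- Hypothetical table and column names, for illustration only
SELECT  customer_id, COUNT(*) AS cnt
FROM    orders
GROUP BY
        customer_id
ORDER BY
        NULL;   -- tells the optimizer the groups need not come out sorted
</pre>
<p>Running the query under <code>EXPLAIN</code> with and without the <code>ORDER BY NULL</code> should show whether <code>Using filesort</code> disappears from the <code>Extra</code> column.</p>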
<p>Let&#8217;s create a sample table and see:</p>
<p><span id="more-5304"></span></p>
<p><a href="#" onclick="xcollapse('X7998');return false;">Table creation details</a><br />
</p>
<div id="X7998" style="display: none; background: transparent;">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE grouping (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        value1 INT NOT NULL,
        value2 INT NOT NULL
) ENGINE=InnoDB;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT
INTO    grouping (value1, value2)
SELECT  CEILING(RAND(20110330) * 300000),
        CEILING(RAND(20110330 &lt;&lt; 1) * 300000)
FROM    filler
CROSS JOIN
        (
        SELECT  id
        FROM    filler
        LIMIT 30
        ) q;
</pre>
</div>
<p>The table contains <strong>3,000,000</strong> random records with <code>value1</code> and <code>value2</code> between <strong>1</strong> and <strong>300,000</strong>.</p>
<p>Here&#8217;s the plan we get with a mere <code>UNION</code> of two <code>GROUP BY</code> queries:</p>
<pre class="brush: sql">
SELECT  value1 AS value
FROM    grouping
GROUP BY
        value1
UNION
SELECT  value2 AS value
FROM    grouping
GROUP BY
        value2
LIMIT 10

</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>value</th>
</tr>
<tr>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">2</td>
</tr>
<tr>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">4</td>
</tr>
<tr>
<td class="integer">5</td>
</tr>
<tr>
<td class="integer">6</td>
</tr>
<tr>
<td class="integer">7</td>
</tr>
<tr>
<td class="integer">8</td>
</tr>
<tr>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">10</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0002s (8.4998s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">grouping</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">3000279</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">UNION</td>
<td class="varchar">grouping</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">3000279</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union1,2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select `20110330_group`.`grouping`.`value1` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value1` union select `20110330_group`.`grouping`.`value2` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value2` limit 10
</pre>
<p>We see that there is a <code>filesort</code> in each of the queries.</p>
<p><strong>MySQL</strong> does allow using <code>ORDER BY</code> in the queries merged with <code>UNION</code> or <code>UNION ALL</code>. To do this, we just need to wrap each query into a set of parentheses:</p>
<pre class="brush: sql">
(
SELECT  value1 AS value
FROM    grouping
GROUP BY
        value1
ORDER BY
        NULL
)
UNION
(
SELECT  value2 AS value
FROM    grouping
GROUP BY
        value2
ORDER BY
        NULL
)
LIMIT 10

</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>value</th>
</tr>
<tr>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">2</td>
</tr>
<tr>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">4</td>
</tr>
<tr>
<td class="integer">5</td>
</tr>
<tr>
<td class="integer">6</td>
</tr>
<tr>
<td class="integer">7</td>
</tr>
<tr>
<td class="integer">8</td>
</tr>
<tr>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">10</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0002s (8.4792s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">grouping</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">3000279</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">UNION</td>
<td class="varchar">grouping</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">3000279</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union1,2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
(select `20110330_group`.`grouping`.`value1` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value1` order by NULL) union (select `20110330_group`.`grouping`.`value2` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value2` order by NULL) limit 10
</pre>
<p>However, the plan remained the same. Why?</p>
<p><strong>MySQL</strong>&#8217;s <a href="http://dev.mysql.com/doc/refman/5.1/en/union.html"><strong>documentation</strong></a> says:</p>
<blockquote>
<p>Use of <code>ORDER BY</code> for individual <code>SELECT</code> statements implies nothing about the order in which the rows appear in the final result because UNION by default produces an unordered set of rows. Therefore, the use of <code>ORDER BY</code> in this context is typically in conjunction with <code>LIMIT</code>, so that it is used to determine the subset of the selected rows to retrieve for the <code>SELECT</code>, even though it does not necessarily affect the order of those rows in the final <code>UNION</code> result. If <code>ORDER BY</code> appears without <code>LIMIT</code> in a <code>SELECT</code>, it is optimized away because it will have no effect anyway.</p>
</blockquote>
<p>This means that the optimizer just removes <code>ORDER BY</code> from the <code>UNION</code> parts if they are not used along with <code>LIMIT</code>.</p>
<p>This is of course a good idea: since individual <code>ORDER BY</code> have no effect on the order of the final query anyway, there is no use in executing them or even taking them into account.</p>
<p>However, this idea would be much better if the same were also true for <code>GROUP BY</code>. Currently, the optimizer does not optimize away the implicit ordering of the <code>GROUP BY</code> queries that are parts of a <code>UNION</code>, and this cannot be cured with <code>ORDER BY NULL</code> (whose only purpose is <strong>not</strong> to order), since the optimizer removes it.</p>
<p>However, since only the <code>ORDER BY</code> clauses not accompanied by a <code>LIMIT</code> are thrown away, we could just add a <code>LIMIT</code>. Of course, it should be large enough to guarantee that all records are returned.</p>
<p>Let&#8217;s see:</p>
<pre class="brush: sql">
(
SELECT  value1 AS value
FROM    grouping
GROUP BY
        value1
ORDER BY
        NULL
LIMIT 10000000000
)
UNION
(
SELECT  value2 AS value
FROM    grouping
GROUP BY
        value2
ORDER BY
        NULL
LIMIT 10000000000
)
LIMIT 10

</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>value</th>
</tr>
<tr>
<td class="integer">12462</td>
</tr>
<tr>
<td class="integer">205466</td>
</tr>
<tr>
<td class="integer">89941</td>
</tr>
<tr>
<td class="integer">133309</td>
</tr>
<tr>
<td class="integer">96722</td>
</tr>
<tr>
<td class="integer">83683</td>
</tr>
<tr>
<td class="integer">128249</td>
</tr>
<tr>
<td class="integer">90196</td>
</tr>
<tr>
<td class="integer">66232</td>
</tr>
<tr>
<td class="integer">60571</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0002s (6.9842s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">grouping</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">3000279</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">UNION</td>
<td class="varchar">grouping</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">3000279</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union1,2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
(select `20110330_group`.`grouping`.`value1` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value1` order by NULL limit 10000000000) union (select `20110330_group`.`grouping`.`value2` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value2` order by NULL limit 10000000000) limit 10
</pre>
<p>Now there are no <code>filesort</code> operations in the plan, and the query runs about <strong>20%</strong> faster (roughly 7.0 seconds instead of 8.5).</p>
<p>This is not a very elegant solution, of course. What&#8217;s more, a similar trick was used in <strong>SQL Server 2000</strong>, which does not allow <code>ORDER BY</code> without a <code>TOP</code> in inline views. In <strong>SQL Server 2000</strong>, <code>TOP 100%</code> forced the order of the nested queries and usually made them spooled or otherwise materialized.</p>
<p>It was quite a nasty surprise when <strong>SQL Server 2005</strong> <q>improved</q> its optimizer to detect such tricks and ignore <code>ORDER BY</code> for <strong>TOP 100%</strong> queries. However, with all the improvements introduced in <strong>SQL Server 2005</strong>, most of these queries could simply be rewritten in a cleaner and more efficient way.</p>
<p>Nevertheless, this solution is still safe to use, because it does not change the semantics of the query (if <code>LIMIT</code> is chosen large enough), but is just an optimizer hack. In the worst case, the query will just become as slow as it initially was, and an extra <code>filesort</code> is not the worst of all things that can happen to a query.</p>
<p>Meanwhile, I&#8217;ve posted it as <a href="http://bugs.mysql.com/bug.php?id=60702">bug 60702</a> to the <strong>MySQL</strong> bug database and am hoping they&#8217;ll fix it in the next release.</p>
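<p>As for choosing a <code>LIMIT</code> that is <q>large enough</q>: one option is the largest value <code>LIMIT</code> accepts, <code>18446744073709551615</code>, which the <strong>MySQL</strong> manual itself uses to mean <q>all remaining rows</q>. The sketch below applies it to the query above. It assumes, without having been measured here, that the optimizer treats this value the same way as the <code>10000000000</code> used earlier:</p>
<pre class="brush: sql">
(
SELECT  value1 AS value
FROM    grouping
GROUP BY
        value1
ORDER BY
        NULL
LIMIT 18446744073709551615      -- can never cut off any rows
)
UNION
(
SELECT  value2 AS value
FROM    grouping
GROUP BY
        value2
ORDER BY
        NULL
LIMIT 18446744073709551615
)
LIMIT 10
</pre>
<p>With this value, the semantics of the query stay the same no matter how large the table grows.</p>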
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2011/03/30/mysql-group-by-in-union/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>MySQL: splitting aggregate queries</title>
		<link>http://explainextended.com/2011/03/28/mysql-splitting-aggregate-queries/</link>
		<comments>http://explainextended.com/2011/03/28/mysql-splitting-aggregate-queries/#comments</comments>
		<pubDate>Mon, 28 Mar 2011 19:00:27 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=5282</guid>
		<description><![CDATA[In MySQL, using MAX and MIN aggregates on different columns in a single query will prevent the GROUP BY optimization. To work around this, the queries should be split so that MAX and MIN on only one column are calculated in each query then joined on the grouping columns.]]></description>
				<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Victor</strong> asks:</p>
<blockquote>
<p>I have a table which I will call <code>sale</code> to protect the innocent:</p>
<table class="excel">
<caption>Sale</caption>
<tr>
<th>id</th>
<th>product</th>
<th>price</th>
<th>amount</th>
<th>date</th>
</tr>
</table>
<p>I need to retrieve ultimate values of <code>price</code>, <code>amount</code> and <code>date</code> for each product:</p>
<pre class="brush: sql">
SELECT  product,
        MIN(price), MAX(price),
        MIN(amount), MAX(amount),
        MIN(date), MAX(date)
FROM    sale
GROUP BY
        product
</pre>
<p>The query only returns about 100 records.</p>
<p>I have all these fields indexed (together with <code>product</code>), but this still produces a full table scan over <strong>3,000,000</strong> records.</p>
<p>How do I speed up this query?</p>
</blockquote>
<p>To retrieve the extreme (minimum and maximum) values of the fields, <strong>MySQL</strong> would just need to make a loose index scan over each index and take the max and min value of the field for each <code>product</code>.</p>
<p>However, the optimizer won&#8217;t do that when multiple indexes are involved. Instead, it will revert to a full scan.</p>
<p>There is a workaround for this. Let&#8217;s create a sample table and see it:</p>
<p><span id="more-5282"></span></p>
<p><a href="#" onclick="xcollapse('X8367');return false;">Table creation details</a><br />
</p>
<div id="X8367" style="display: none; background: transparent;">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE sale
        (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        product INT NOT NULL,
        amount INT NOT NULL,
        price NUMERIC(20, 2) NOT NULL,
        dt DATETIME NOT NULL,
        KEY ix_sale_product_amount (product, amount),
        KEY ix_sale_product_price (product, price),
        KEY ix_sale_product_dt (product, dt)
        )
ENGINE=InnoDB;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(50000);
COMMIT;

INSERT
INTO    sale (product, amount, price, dt)
SELECT  CEILING(RAND(20110227) * 100),
        CEILING(RAND(20110227 &lt;&lt; 1) * 1000) + 100,
        CEILING(RAND(20110227 &lt;&lt; 1 + 1) * 10000) / 100.00 + 20.00,
        &#039;2011-03-27&#039; - INTERVAL CEILING(RAND(20110227 &lt;&lt; 1 + 2) * 10000000) SECOND
FROM    filler
CROSS JOIN
        (
        SELECT  id
        FROM    filler
        ORDER BY
                id
        LIMIT 60
        ) q;
</pre>
</div>
<p>This table contains <strong>3M</strong> random records for <strong>100</strong> distinct products and is indexed appropriately.</p>
<p>Here&#8217;s a straightforward query:</p>
<pre class="brush: sql">
SELECT  product,
        MIN(price), MAX(price),
        MIN(amount), MAX(amount),
        MIN(dt), MAX(dt)
FROM    sale
GROUP BY
        product

</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>product</th>
<th>MIN(price)</th>
<th>MAX(price)</th>
<th>MIN(amount)</th>
<th>MAX(amount)</th>
<th>MIN(dt)</th>
<th>MAX(dt)</th>
</tr>
<tr>
<td class="integer">1</td>
<td class="decimal">20.01</td>
<td class="decimal">120.00</td>
<td class="integer">101</td>
<td class="integer">1100</td>
<td class="timestamp">2010-12-01 06:14:22</td>
<td class="timestamp">2011-03-26 23:42:32</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="decimal">20.01</td>
<td class="decimal">120.00</td>
<td class="integer">101</td>
<td class="integer">1100</td>
<td class="timestamp">2010-12-01 06:39:38</td>
<td class="timestamp">2011-03-26 23:58:25</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="decimal">20.01</td>
<td class="decimal">120.00</td>
<td class="integer">101</td>
<td class="integer">1100</td>
<td class="timestamp">2010-12-01 06:13:34</td>
<td class="timestamp">2011-03-26 23:54:32</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="decimal">20.01</td>
<td class="decimal">120.00</td>
<td class="integer">101</td>
<td class="integer">1100</td>
<td class="timestamp">2010-12-01 06:14:30</td>
<td class="timestamp">2011-03-26 23:58:45</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">100</td>
<td class="decimal">20.01</td>
<td class="decimal">120.00</td>
<td class="integer">101</td>
<td class="integer">1100</td>
<td class="timestamp">2010-12-01 06:32:58</td>
<td class="timestamp">2011-03-26 23:58:24</td>
</tr>
<tr class="statusbar">
<td colspan="100">100 rows fetched in 0.0052s (13.8279s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">sale</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">ix_sale_product_amount</td>
<td class="varchar">8</td>
<td class="varchar"></td>
<td class="bigint">3000409</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select `20110327_split`.`sale`.`product` AS `product`,min(`20110327_split`.`sale`.`price`) AS `MIN(price)`,max(`20110327_split`.`sale`.`price`) AS `MAX(price)`,min(`20110327_split`.`sale`.`amount`) AS `MIN(amount)`,max(`20110327_split`.`sale`.`amount`) AS `MAX(amount)`,min(`20110327_split`.`sale`.`dt`) AS `MIN(dt)`,max(`20110327_split`.`sale`.`dt`) AS `MAX(dt)` from `20110327_split`.`sale` group by `20110327_split`.`sale`.`product`
</pre>
<p>As expected, the query uses a full table scan (or, rather, a full index scan) and takes almost <strong>14</strong> seconds to complete.</p>
<p>In order to make the engine use three separate indexes, we would need to split the query into three queries, each searching for the extreme values of one column, and then combine the results.</p>
<p>Each of these queries would be a <code>MIN / MAX</code> query on a trailing column of a composite index, combined with a <code>GROUP BY</code> on the index&#8217;s leading columns, and, as such, would be subject to <a href="http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html#loose-index-scan">loose index scan optimization</a>.</p>
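<p>To check this for one of the queries in isolation, it can be run under <code>EXPLAIN</code>. This is a sketch against the sample table above. Which index the optimizer actually picks is its own decision, but when the loose index scan is chosen, the <code>Extra</code> column reads <code>Using index for group-by</code>:</p>
<pre class="brush: sql">
-- MIN / MAX on the trailing column of (product, price),
-- grouped by the leading column
EXPLAIN
SELECT  product, MIN(price), MAX(price)
FROM    sale
GROUP BY
        product;
</pre>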
<p>To combine the results, we will of course just join them on <code>product</code>. This will be quite efficient too since the resultsets are going to be quite small (100 records each, exactly).</p>
<p>And here is the query:</p>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  product, MIN(amount), MAX(amount)
        FROM    sale
        GROUP BY
                product
        ) qa
NATURAL JOIN
        (
        SELECT  product, MIN(price), MAX(price)
        FROM    sale
        GROUP BY
                product
        ) qp
NATURAL JOIN
        (
        SELECT  product, MIN(dt), MAX(dt)
        FROM    sale
        GROUP BY
                product
        ) qd

</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>product</th>
<th>MIN(amount)</th>
<th>MAX(amount)</th>
<th>MIN(price)</th>
<th>MAX(price)</th>
<th>MIN(dt)</th>
<th>MAX(dt)</th>
</tr>
<tr>
<td class="integer">1</td>
<td class="integer">101</td>
<td class="integer">1100</td>
<td class="decimal">20.01</td>
<td class="decimal">120.00</td>
<td class="timestamp">2010-12-01 06:14:22</td>
<td class="timestamp">2011-03-26 23:42:32</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="integer">101</td>
<td class="integer">1100</td>
<td class="decimal">20.01</td>
<td class="decimal">120.00</td>
<td class="timestamp">2010-12-01 06:39:38</td>
<td class="timestamp">2011-03-26 23:58:25</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">101</td>
<td class="integer">1100</td>
<td class="decimal">20.01</td>
<td class="decimal">120.00</td>
<td class="timestamp">2010-12-01 06:13:34</td>
<td class="timestamp">2011-03-26 23:54:32</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="integer">101</td>
<td class="integer">1100</td>
<td class="decimal">20.01</td>
<td class="decimal">120.00</td>
<td class="timestamp">2010-12-01 06:14:30</td>
<td class="timestamp">2011-03-26 23:58:45</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">100</td>
<td class="integer">101</td>
<td class="integer">1100</td>
<td class="decimal">20.01</td>
<td class="decimal">120.00</td>
<td class="timestamp">2010-12-01 06:32:58</td>
<td class="timestamp">2011-03-26 23:58:24</td>
</tr>
<tr class="statusbar">
<td colspan="100">100 rows fetched in 0.0053s (0.0105s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using join buffer</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived4&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using join buffer</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DERIVED</td>
<td class="varchar">sale</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_sale_product_dt</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">19</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar">sale</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_sale_product_price</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">19</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">sale</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_sale_product_amount</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">1067</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
select `qa`.`product` AS `product`,`qa`.`MIN(amount)` AS `MIN(amount)`,`qa`.`MAX(amount)` AS `MAX(amount)`,`qp`.`MIN(price)` AS `MIN(price)`,`qp`.`MAX(price)` AS `MAX(price)`,`qd`.`MIN(dt)` AS `MIN(dt)`,`qd`.`MAX(dt)` AS `MAX(dt)` from (select `20110327_split`.`sale`.`product` AS `product`,min(`20110327_split`.`sale`.`amount`) AS `MIN(amount)`,max(`20110327_split`.`sale`.`amount`) AS `MAX(amount)` from `20110327_split`.`sale` group by `20110327_split`.`sale`.`product`) `qa` join (select `20110327_split`.`sale`.`product` AS `product`,min(`20110327_split`.`sale`.`price`) AS `MIN(price)`,max(`20110327_split`.`sale`.`price`) AS `MAX(price)` from `20110327_split`.`sale` group by `20110327_split`.`sale`.`product`) `qp` join (select `20110327_split`.`sale`.`product` AS `product`,min(`20110327_split`.`sale`.`dt`) AS `MIN(dt)`,max(`20110327_split`.`sale`.`dt`) AS `MAX(dt)` from `20110327_split`.`sale` group by `20110327_split`.`sale`.`product`) `qd` where ((`qp`.`product` = `qa`.`product`) and (`qd`.`product` = `qa`.`product`))
</pre>
<p>Each individual query is executed with <code>Using index for group-by</code>, that is, by jumping straight to the min and max values for each <code>product</code> in a loose index scan.</p>
<p>The queries are combined in a join which, despite not using the indexes, is still very fast because only <strong>100</strong> records are joined (of course fitting into a join buffer).</p>
<p>The overall query time is only <strong>10 ms</strong>.</p>
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2011/03/28/mysql-splitting-aggregate-queries/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Late row lookups: InnoDB</title>
		<link>http://explainextended.com/2011/02/11/late-row-lookups-innodb/</link>
		<comments>http://explainextended.com/2011/02/11/late-row-lookups-innodb/#comments</comments>
		<pubDate>Fri, 11 Feb 2011 20:00:09 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=5214</guid>
		<description><![CDATA[A trick to avoid early row lookups is also useful for InnoDB tables]]></description>
				<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Aryé</strong> asks:</p>
<blockquote><p>Thanks for your article about <a href="/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/">late row lookups</a> in <strong>MySQL</strong>.</p>
<p>I have two questions for you please:</p>
<ul>
<li>Is this workaround specific to <strong>MyISAM</strong> engine?</li>
<li>How does <strong>PostgreSQL</strong> handle this?</li>
</ul>
</blockquote>
<p>The question concerns a certain workaround for <strong>MySQL</strong> <code>LIMIT … OFFSET</code> queries like this:</p>
<pre class="brush: sql">
SELECT  *
FROM    mytable
ORDER BY
        id
LIMIT   10
OFFSET  10000
</pre>
<p>which can be improved using a little rewrite:</p>
<pre class="brush: sql">
SELECT  m.*
FROM    (
        SELECT  id
        FROM    mytable
        ORDER BY
                id
        LIMIT   10
        OFFSET  10000
        ) q
JOIN    mytable m
ON      m.id = q.id
ORDER BY
        m.id
</pre>
<p>For the rationale behind this improvement, please read <a href="/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/">the original article</a>.</p>
<p>Now, to the questions.</p>
<p>The second question is easy: <strong>PostgreSQL</strong> won&#8217;t pull the fields from the table until it really needs them. If a query involving an <code>ORDER BY</code> along with <code>LIMIT</code> and <code>OFFSET</code> is optimized to use the index for the <code>ORDER BY</code> part, the table lookups won&#8217;t happen for the records skipped.</p>
<p>Though <strong>PostgreSQL</strong> does not reflect the table lookups in the <code>EXPLAIN</code> output, a simple test would show us that they are done only <code>LIMIT</code> times, not <code>OFFSET + LIMIT</code> (like <strong>MySQL</strong> does).</p>
<p>Now, let&#8217;s try to answer the first question: will this trick improve the queries against an <strong>InnoDB</strong> table?</p>
<p>To do this, we will create a sample table:</p>
<p><span id="more-5214"></span></p>
<p><a href="#" onclick="xcollapse('X10348');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X10348" style="display: none; background: transparent;">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE lookup (
        id INT NOT NULL PRIMARY KEY,
        value INT NOT NULL,
        shorttxt TEXT NOT NULL,
        longtxt TEXT NOT NULL
) ENGINE=InnoDB ROW_FORMAT=COMPACT;

CREATE INDEX
        ix_lookup_value
ON      lookup (value);

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT
INTO    lookup
SELECT  id, CEILING(RAND(20110211) * 1000000),
        RPAD(&#039;&#039;, CEILING(RAND(20110211 &lt;&lt; 1) * 100), &#039;*&#039;),
        RPAD(&#039;&#039;, CEILING(8192 + RAND(20110211 &lt;&lt; 1) * 100), &#039;*&#039;)
FROM    filler;
</pre>
</div>
<p>There is one indexed <code>INT</code> column and two <code>TEXT</code> columns, <code>shorttxt</code> storing short strings (<strong>1</strong> to <strong>100</strong> characters), <code>longtxt</code> storing long strings (<strong>8193</strong> to <strong>8293</strong> characters).</p>
<p>Let&#8217;s run some queries against this table.</p>
<h3>PRIMARY KEY and the INT column</h3>
<p>No rewrite:</p>
<pre class="brush: sql">
SELECT  value
FROM    lookup
ORDER BY
        id
LIMIT   10
OFFSET  90000
</pre>
<p><a href="#" onclick="xcollapse('X6751');return false;"><strong>Query results</strong></a><br />
</p>
<div id="X6751" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>value</th>
</tr>
<tr>
<td class="integer">12336</td>
</tr>
<tr>
<td class="integer">314476</td>
</tr>
<tr>
<td class="integer">535374</td>
</tr>
<tr>
<td class="integer">733443</td>
</tr>
<tr>
<td class="integer">61089</td>
</tr>
<tr>
<td class="integer">105117</td>
</tr>
<tr>
<td class="integer">342318</td>
</tr>
<tr>
<td class="integer">396237</td>
</tr>
<tr>
<td class="integer">954232</td>
</tr>
<tr>
<td class="integer">582449</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0002s (0.0415s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">lookup</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">90010</td>
<td class="double">622.36</td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select `20110211_late`.`lookup`.`value` AS `value` from `20110211_late`.`lookup` order by `20110211_late`.`lookup`.`id` limit 90000,10
</pre>
</div>
<p>Rewrite:</p>
<pre class="brush: sql">
SELECT  l.value
FROM    (
        SELECT  id
        FROM    lookup
        ORDER BY
                id
        LIMIT   10
        OFFSET  90000
        ) q
JOIN    lookup l
ON      l.id = q.id
ORDER BY
        q.id
</pre>
<p><a href="#" onclick="xcollapse('X5888');return false;"><strong>Query results</strong></a><br />
</p>
<div id="X5888" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>value</th>
</tr>
<tr>
<td class="integer">12336</td>
</tr>
<tr>
<td class="integer">314476</td>
</tr>
<tr>
<td class="integer">535374</td>
</tr>
<tr>
<td class="integer">733443</td>
</tr>
<tr>
<td class="integer">61089</td>
</tr>
<tr>
<td class="integer">105117</td>
</tr>
<tr>
<td class="integer">342318</td>
</tr>
<tr>
<td class="integer">396237</td>
</tr>
<tr>
<td class="integer">954232</td>
</tr>
<tr>
<td class="integer">582449</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0002s (0.0407s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">l</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">q.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">lookup</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">90010</td>
<td class="double">622.36</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
select `20110211_late`.`l`.`value` AS `value` from (select `20110211_late`.`lookup`.`id` AS `id` from `20110211_late`.`lookup` order by `20110211_late`.`lookup`.`id` limit 90000,10) `q` join `20110211_late`.`lookup` `l` where (`20110211_late`.`l`.`id` = `q`.`id`) order by `q`.`id`
</pre>
</div>
<p>As you can see, there is almost no difference (<strong>41 ms</strong> vs <strong>40 ms</strong>).</p>
<p><strong>InnoDB</strong> tables are clustered on their <code>PRIMARY KEY</code> columns, which means that the index on <code>id</code> (used to serve the <code>ORDER BY</code> condition) contains all the data the query needs. There is a negligible benefit from not looking up the <code>value</code> column in the index pages because the column is tiny and the index pages need to be read anyway.</p>
<h3>PRIMARY KEY and the short TEXT column</h3>
<p>Rewrite:</p>
<pre class="brush: sql">
SELECT  LENGTH(l.shorttxt)
FROM    (
        SELECT  id
        FROM    lookup
        ORDER BY
                id
        LIMIT   10
        OFFSET  90000
        ) q
JOIN    lookup l
ON      l.id = q.id
ORDER BY
        q.id
</pre>
<p><a href="#" onclick="xcollapse('X4628');return false;"><strong>Query results</strong></a><br />
</p>
<div id="X4628" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>LENGTH(l.shorttxt)</th>
</tr>
<tr>
<td class="bigint">17</td>
</tr>
<tr>
<td class="bigint">41</td>
</tr>
<tr>
<td class="bigint">52</td>
</tr>
<tr>
<td class="bigint">39</td>
</tr>
<tr>
<td class="bigint">36</td>
</tr>
<tr>
<td class="bigint">65</td>
</tr>
<tr>
<td class="bigint">15</td>
</tr>
<tr>
<td class="bigint">78</td>
</tr>
<tr>
<td class="bigint">44</td>
</tr>
<tr>
<td class="bigint">85</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0003s (0.0401s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">l</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">q.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">lookup</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">90010</td>
<td class="double">622.36</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
select length(`20110211_late`.`l`.`shorttxt`) AS `LENGTH(l.shorttxt)` from (select `20110211_late`.`lookup`.`id` AS `id` from `20110211_late`.`lookup` order by `20110211_late`.`lookup`.`id` limit 90000,10) `q` join `20110211_late`.`lookup` `l` where (`20110211_late`.`l`.`id` = `q`.`id`) order by `q`.`id`
</pre>
</div>
<p>No rewrite:</p>
<pre class="brush: sql">
SELECT  LENGTH(shorttxt)
FROM    lookup
ORDER BY
        id
LIMIT   10
OFFSET  90000
</pre>
<p><a href="#" onclick="xcollapse('X966');return false;"><strong>Query results</strong></a><br />
</p>
<div id="X966" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>LENGTH(shorttxt)</th>
</tr>
<tr>
<td class="bigint">17</td>
</tr>
<tr>
<td class="bigint">41</td>
</tr>
<tr>
<td class="bigint">52</td>
</tr>
<tr>
<td class="bigint">39</td>
</tr>
<tr>
<td class="bigint">36</td>
</tr>
<tr>
<td class="bigint">65</td>
</tr>
<tr>
<td class="bigint">15</td>
</tr>
<tr>
<td class="bigint">78</td>
</tr>
<tr>
<td class="bigint">44</td>
</tr>
<tr>
<td class="bigint">85</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0002s (0.0925s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">lookup</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">90010</td>
<td class="double">622.36</td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select length(`20110211_late`.`lookup`.`shorttxt`) AS `LENGTH(shorttxt)` from `20110211_late`.`lookup` order by `20110211_late`.`lookup`.`id` limit 90000,10
</pre>
</div>
<p>There is quite a significant difference (<strong>92 ms</strong> vs <strong>40 ms</strong>) the reasons for which we will discuss a little bit later, after we see the results of the third query.</p>
<h3>PRIMARY KEY and the long TEXT column</h3>
<p>Rewrite:</p>
<pre class="brush: sql">
SELECT  LENGTH(l.longtxt)
FROM    (
        SELECT  id
        FROM    lookup
        ORDER BY
                id
        LIMIT   10
        OFFSET  90000
        ) q
JOIN    lookup l
ON      l.id = q.id
ORDER BY
        q.id
</pre>
<p><a href="#" onclick="xcollapse('X1865');return false;"><strong>Query results</strong></a><br />
</p>
<div id="X1865" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>LENGTH(l.longtxt)</th>
</tr>
<tr>
<td class="bigint">8209</td>
</tr>
<tr>
<td class="bigint">8233</td>
</tr>
<tr>
<td class="bigint">8244</td>
</tr>
<tr>
<td class="bigint">8231</td>
</tr>
<tr>
<td class="bigint">8228</td>
</tr>
<tr>
<td class="bigint">8257</td>
</tr>
<tr>
<td class="bigint">8207</td>
</tr>
<tr>
<td class="bigint">8270</td>
</tr>
<tr>
<td class="bigint">8236</td>
</tr>
<tr>
<td class="bigint">8277</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0002s (0.0396s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">l</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">q.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">lookup</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">90010</td>
<td class="double">622.36</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
select length(`20110211_late`.`l`.`longtxt`) AS `LENGTH(l.longtxt)` from (select `20110211_late`.`lookup`.`id` AS `id` from `20110211_late`.`lookup` order by `20110211_late`.`lookup`.`id` limit 90000,10) `q` join `20110211_late`.`lookup` `l` where (`20110211_late`.`l`.`id` = `q`.`id`) order by `q`.`id`
</pre>
</div>
<p>No rewrite:</p>
<pre class="brush: sql">
SELECT  LENGTH(longtxt)
FROM    lookup
ORDER BY
        id
LIMIT   10
OFFSET  90000
</pre>
<p><a href="#" onclick="xcollapse('X759');return false;"><strong>Query results</strong></a><br />
</p>
<div id="X759" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>LENGTH(longtxt)</th>
</tr>
<tr>
<td class="bigint">8209</td>
</tr>
<tr>
<td class="bigint">8233</td>
</tr>
<tr>
<td class="bigint">8244</td>
</tr>
<tr>
<td class="bigint">8231</td>
</tr>
<tr>
<td class="bigint">8228</td>
</tr>
<tr>
<td class="bigint">8257</td>
</tr>
<tr>
<td class="bigint">8207</td>
</tr>
<tr>
<td class="bigint">8270</td>
</tr>
<tr>
<td class="bigint">8236</td>
</tr>
<tr>
<td class="bigint">8277</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0000s (30.3594s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">lookup</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">90010</td>
<td class="double">622.36</td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select length(`20110211_late`.`lookup`.`longtxt`) AS `LENGTH(longtxt)` from `20110211_late`.`lookup` order by `20110211_late`.`lookup`.`id` limit 90000,10
</pre>
</div>
<p>The query employing the trick runs for the same <strong>40 ms</strong>, while the straightforward query takes as much as <strong>30</strong> seconds (<strong>30,359 ms</strong>, to be exact).</p>
<p>Why such a difference?</p>
<p>The reason is that <code>InnoDB</code>, despite the fact it stores the data in the clustered index, is still able to move some data out of the index. This is called <q>external storage</q>.</p>
<p>With the <code>COMPACT</code> row format I used to create the table, <code>InnoDB</code>, when inserting a record with two <code>TEXT</code> columns, will first try to fit both of them on the page. If this is not possible, it will split the longest of the columns in two parts: the first <strong>768</strong> bytes will be stored on the page and the remaining data will be stored on a separate page (or pages), with a pointer to these data stored in the original clustered index record. This will be repeated until all <code>TEXT</code> columns fit on the page or there is no more space there (in which case an error would be thrown).</p>
<p>This means that all <code>TEXT</code> columns shorter than <strong>768</strong> bytes will be stored completely on the page, while the longer ones can be either split or stored as a whole (with at least the first <strong>768</strong> bytes still being on the page).</p>
<p>With the column lengths chosen (<strong>1</strong> to <strong>100</strong> and <strong>8K</strong> to <strong>8K + 100</strong>, respectively), it is easy to see that <code>shorttxt</code> will <em>always</em> be stored on-page, while <code>longtxt</code> will <em>always</em> be split (since <code>InnoDB</code> allows at most <code>8K</code> per record (minus some overhead) to be stored on one page).</p>
<p>Now this becomes clearer. As with <strong>MyISAM</strong>, the straightforward query involving <code>longtxt</code> has to perform two page lookups for each record scanned: the first one on the clustered index, the second one on the external storage. This takes a lot of time and may even pollute the <strong>InnoDB</strong> cache with unneeded data (which would lead to an increased cache miss ratio).</p>
<p>The query on <code>shorttxt</code> is not that bad, since it only requires some extra CPU cycles per record to calculate the string length.</p>
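<p>The storage assumptions above can be checked directly. This is a minimal sketch, assuming the <code>lookup</code> table from the creation script lives in the current schema; the column aliases (<code>short_min</code>, <code>over_prefix</code> and so on) are only illustrative:</p>
<pre class="brush: sql">
-- Confirm the engine and the row format the table was actually created with
SELECT  table_name, engine, row_format
FROM    information_schema.tables
WHERE   table_schema = DATABASE()
        AND table_name = &#039;lookup&#039;;

-- Compare the column lengths with the 768-byte on-page prefix
SELECT  MIN(LENGTH(shorttxt)) AS short_min,
        MAX(LENGTH(shorttxt)) AS short_max,
        MIN(LENGTH(longtxt)) AS long_min,
        MAX(LENGTH(longtxt)) AS long_max,
        SUM(LENGTH(longtxt) &gt; 768) AS over_prefix  -- rows whose longtxt exceeds the prefix
FROM    lookup;
</pre>
<p>A value longer than <strong>768</strong> bytes is only a candidate for external storage: it is moved off-page when the record as a whole does not fit, which is the case for every <code>longtxt</code> value in this table.</p>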
<p>Now, let&#8217;s check one more query which orders by a secondary indexed field rather than <code>id</code>:</p>
<h3>Secondary index and the short text column</h3>
<p>Rewrite:</p>
<pre class="brush: sql">
SELECT  LENGTH(l.shorttxt)
FROM    (
        SELECT  id, value
        FROM    lookup
        ORDER BY
                value
        LIMIT   10
        OFFSET  90000
        ) q
JOIN    lookup l
ON      l.id = q.id
ORDER BY
        q.value
</pre>
<p><a href="#" onclick="xcollapse('X145');return false;"><strong>Query results</strong></a><br />
</p>
<div id="X145" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>LENGTH(l.shorttxt)</th>
</tr>
<tr>
<td class="bigint">14</td>
</tr>
<tr>
<td class="bigint">85</td>
</tr>
<tr>
<td class="bigint">16</td>
</tr>
<tr>
<td class="bigint">34</td>
</tr>
<tr>
<td class="bigint">77</td>
</tr>
<tr>
<td class="bigint">78</td>
</tr>
<tr>
<td class="bigint">3</td>
</tr>
<tr>
<td class="bigint">49</td>
</tr>
<tr>
<td class="bigint">53</td>
</tr>
<tr>
<td class="bigint">60</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0002s (0.0262s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">l</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">q.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">lookup</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">ix_lookup_value</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">90010</td>
<td class="double">622.36</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
select length(`20110211_late`.`l`.`shorttxt`) AS `LENGTH(l.shorttxt)` from (select `20110211_late`.`lookup`.`id` AS `id`,`20110211_late`.`lookup`.`value` AS `value` from `20110211_late`.`lookup` order by `20110211_late`.`lookup`.`value` limit 90000,10) `q` join `20110211_late`.`lookup` `l` where (`20110211_late`.`l`.`id` = `q`.`id`) order by `q`.`value`
</pre>
</div>
<p>No rewrite:</p>
<pre class="brush: sql">
SELECT  LENGTH(shorttxt)
FROM    lookup
ORDER BY
        value
LIMIT   10
OFFSET  90000
</pre>
<p><a href="#" onclick="xcollapse('X8905');return false;"><strong>Query results</strong></a><br />
</p>
<div id="X8905" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>LENGTH(shorttxt)</th>
</tr>
<tr>
<td class="bigint">14</td>
</tr>
<tr>
<td class="bigint">85</td>
</tr>
<tr>
<td class="bigint">16</td>
</tr>
<tr>
<td class="bigint">34</td>
</tr>
<tr>
<td class="bigint">77</td>
</tr>
<tr>
<td class="bigint">78</td>
</tr>
<tr>
<td class="bigint">3</td>
</tr>
<tr>
<td class="bigint">49</td>
</tr>
<tr>
<td class="bigint">53</td>
</tr>
<tr>
<td class="bigint">60</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0002s (0.2663s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">lookup</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">ix_lookup_value</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">90010</td>
<td class="double">622.36</td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select length(`20110211_late`.`lookup`.`shorttxt`) AS `LENGTH(shorttxt)` from `20110211_late`.`lookup` order by `20110211_late`.`lookup`.`value` limit 90000,10
</pre>
</div>
<p>The execution times for these queries vary tenfold: <strong>26 ms</strong> vs <strong>266 ms</strong>.</p>
<p>As with <strong>MyISAM</strong>, the secondary index requires an extra lookup to retrieve the actual values from the table (the only difference is that any index is secondary in <strong>MyISAM</strong>, including that used to police the <code>PRIMARY KEY</code>).</p>
<p>The first query does not perform these row lookups on the skipped records and hence is ten times as fast. It is even faster than the queries ordering on the <code>PRIMARY KEY</code>, because the secondary index contains significantly less data than the <code>PRIMARY KEY</code>, holds many more records per page, is hence shallower and can be traversed faster.</p>
<p>The second query does perform the early row lookups, as usual.</p>
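<p>One way to confirm that the inner query of the rewrite is served by the secondary index alone is to run it under <code>EXPLAIN</code> on its own. A sketch against the same <code>lookup</code> table (<strong>InnoDB</strong> secondary indexes carry the primary key value, so <code>id</code> is available without visiting the clustered index):</p>
<pre class="brush: sql">
-- The inner query of the rewrite, examined in isolation
EXPLAIN
SELECT  id, value
FROM    lookup
ORDER BY
        value
LIMIT   10
OFFSET  90000;
</pre>
<p>The <code>Extra</code> column should report <code>Using index</code>, as it does in the <code>DERIVED</code> row of the plan above. Adding <code>shorttxt</code> to the select list should make it disappear, since that column is not part of <code>ix_lookup_value</code>.</p>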
<h3>Conclusion</h3>
<p>The trick used to avoid early row lookups for the <code>LIMIT … OFFSET</code> queries is useful on <strong>InnoDB</strong> tables too, though to a different extent, depending on the <code>ORDER BY</code> condition and the columns involved:</p>
<ul>
<li>It&#8217;s very useful on queries involving columns stored off-page (long <code>TEXT</code>, <code>BLOB</code> and <code>VARCHAR</code> columns)</li>
<li>It&#8217;s very useful on <code>ORDER BY</code> conditions served by secondary indexes</li>
<li>It&#8217;s quite useful on moderate sized columns (still stored on page) or CPU-intensive expressions</li>
<li>It&#8217;s almost useless on short columns without complex CPU-intensive processing</li>
</ul>
<p>Hope that helps.</p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2011/02/11/late-row-lookups-innodb/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>10 things in MySQL (that won&#8217;t work as expected)</title>
		<link>http://explainextended.com/2010/11/03/10-things-in-mysql-that-wont-work-as-expected/</link>
		<comments>http://explainextended.com/2010/11/03/10-things-in-mysql-that-wont-work-as-expected/#comments</comments>
		<pubDate>Wed, 03 Nov 2010 20:00:48 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=5003</guid>
		<description><![CDATA[10 things in MySQL (that won't work as expected).]]></description>
				<content:encoded><![CDATA[<p>(I just discovered <a href="http://www.cracked.com/">cracked.com</a>)</p>
<h3 class="cracked">#10. Searching for a NULL</h3>
<p><img src="http://explainextended.com/wp-content/uploads/2010/11/MG_2538-e1288838824800.jpg" alt="" title="LifeView" width="700" height="467" class="aligncenter size-full wp-image-5102 noborder" /></p>
<pre class="brush: sql">
SELECT  *
FROM    a
WHERE   a.column = NULL
</pre>
<p>In <strong>SQL</strong>, a <code>NULL</code> is never equal to anything, even another <code>NULL</code>. This query won&#8217;t return anything and in fact will be thrown out by the optimizer when building the plan.</p>
<p>When searching for <code>NULL</code> values, use this instead:</p>
<pre class="brush: sql">
SELECT  *
FROM    a
WHERE   a.column IS NULL
</pre>
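<p>The difference between the comparison forms can be seen side by side without any table. A small sketch; <code>&lt;=&gt;</code> is <strong>MySQL</strong>&#8217;s null-safe equality operator, added here only for comparison:</p>
<pre class="brush: sql">
SELECT  NULL = NULL   AS eq,        -- NULL: the result is unknown, not true
        NULL IS NULL  AS is_null,   -- 1
        NULL &lt;=&gt; NULL AS null_safe; -- 1
</pre>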
<p><span id="more-5003"></span></p>
<h3 class="cracked">#9. LEFT JOIN with additional conditions</h3>
<p><img src="http://explainextended.com/wp-content/uploads/2010/11/MG_2692-e1288839117298.jpg" alt="" title="Street" width="700" height="467" class="aligncenter size-full wp-image-5103 noborder" /></p>
<pre class="brush: sql">
SELECT  *
FROM    a
LEFT JOIN
        b
ON      b.a = a.id
WHERE   b.column = &#039;something&#039;
</pre>
<p>A <code>LEFT JOIN</code> is like <code>INNER JOIN</code> except that it will return each record from <code>a</code> at least once, substituting missing fields from <code>b</code> with <code>NULL</code> values, if there are no actual matching records.</p>
<p>The <code>WHERE</code> condition, however, is evaluated after the <code>LEFT JOIN</code>, so the query above checks <code>column</code> <em>after</em> the tables have been joined. And as we learned earlier, no <code>NULL</code> value can satisfy an equality condition, so the records from <code>a</code> without a corresponding record from <code>b</code> will unavoidably be filtered out.</p>
<p>Essentially, this query is an <code>INNER JOIN</code>, only less efficient.</p>
<p>To match only the records with <code>b.column = 'something'</code> (while still returning all records from <code>a</code>), this condition should be moved into <code>ON</code> clause:</p>
<pre class="brush: sql">
SELECT  *
FROM    a
LEFT JOIN
        b
ON      b.a = a.id
        AND b.column = &#039;something&#039;
</pre>
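<p>The two placements of the condition can be compared on a tiny data set. This is a sketch with hypothetical temporary tables <code>lj_a</code> and <code>lj_b</code> and a hypothetical column <code>val</code> standing in for <code>column</code>:</p>
<pre class="brush: sql">
CREATE TEMPORARY TABLE lj_a (id INT NOT NULL PRIMARY KEY);
CREATE TEMPORARY TABLE lj_b (a INT NOT NULL, val VARCHAR(20));

INSERT INTO lj_a VALUES (1), (2);
INSERT INTO lj_b VALUES (1, &#039;something&#039;);

-- Condition in WHERE: the row with id = 2 is filtered out, one row is returned
SELECT  a.id, b.val
FROM    lj_a a
LEFT JOIN
        lj_b b
ON      b.a = a.id
WHERE   b.val = &#039;something&#039;;

-- Condition in ON: both rows of lj_a are returned, val is NULL for id = 2
SELECT  a.id, b.val
FROM    lj_a a
LEFT JOIN
        lj_b b
ON      b.a = a.id
        AND b.val = &#039;something&#039;;
</pre>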
<h3 class="cracked">#8. Less than a value but not a NULL</h3>
<p><img src="http://explainextended.com/wp-content/uploads/2010/11/MG_2919-e1288839226492.jpg" alt="" title="Restaurant" width="700" height="467" class="aligncenter size-full wp-image-5104 noborder" /></p>
<p>Quite often I see queries like this:</p>
<pre class="brush: sql">
SELECT  *
FROM    b
WHERE   b.column &lt; &#039;something&#039;
        AND b.column IS NOT NULL
</pre>
<p>This is actually not an error: this query is valid and will do what&#8217;s intended. However, <code>IS NOT NULL</code> here is redundant.</p>
<p>If <code>b.column</code> is a <code>NULL</code>, then <code>b.column &lt; 'something'</code> will never be satisfied, since any comparison to <code>NULL</code> evaluates to a boolean <code>NULL</code> and does not pass the filter.</p>
<p>It is interesting that this additional <code>NULL</code> check is never used for <q>greater than</q> queries (like in <code>b.column &gt; 'something'</code>).</p>
<p>This is because <code>NULL</code>s go first in <code>ORDER BY</code> in <strong>MySQL</strong> and hence are incorrectly considered <q>less</q> than any other value by some people.</p>
<p>This query can be simplified:</p>
<pre class="brush: sql">
SELECT  *
FROM    b
WHERE   b.column &lt; &#039;something&#039;
</pre>
<p>and will still never return a <code>NULL</code> in <code>b.column</code>.</p>
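<p>A quick way to check this, sketched here as a standalone query that needs no tables (the column aliases are made up for the illustration):</p>
<pre class="brush: sql">
-- Each comparison against NULL yields NULL rather than true or false,
-- so none of them would pass a WHERE filter
SELECT  NULL &lt; 'something' AS less_than,
        NULL &gt; 'something' AS greater_than,
        NULL = 'something' AS equal_to
</pre>
<p>All three columns should come back as <code>NULL</code>, whichever direction the comparison goes.</p>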
<h3 class="cracked">#7. Joining on NULL</h3>
<p><img src="http://explainextended.com/wp-content/uploads/2010/11/MG_3163-e1288839302867.jpg" alt="" title="Helicopter" width="700" height="467" class="aligncenter size-full wp-image-5105 noborder" /></p>
<pre class="brush: sql">
SELECT  *
FROM    a
JOIN    b
ON      a.column = b.column
</pre>
<p>When <code>column</code> is nullable in both tables, this query won't return a match of two <code>NULL</code>s for the reasons described above: no <code>NULL</code>s are equal.</p>
<p>Here's a query that does match them:</p>
<pre class="brush: sql">
SELECT  *
FROM    a
JOIN    b
ON      a.column = b.column
        OR (a.column IS NULL AND b.column IS NULL)
</pre>
<p><strong>MySQL</strong>'s optimizer treats this as an equijoin and provides a special access type for it, <code>ref_or_null</code>.</p>
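<p>As a shorter spelling of the same condition, <strong>MySQL</strong> also has a <code>NULL</code>-safe equality operator, <code>&lt;=&gt;</code>. Here is a minimal sketch using the same placeholder tables. Whether the optimizer picks the same plan for it is worth checking with <code>EXPLAIN</code> on your own data:</p>
<pre class="brush: sql">
-- The NULL-safe operator returns 1 when both sides are NULL
-- and 0 when only one of them is, so it never evaluates to NULL
SELECT  *
FROM    a
JOIN    b
ON      a.column &lt;=&gt; b.column
</pre>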
<h3 class="cracked">#6. NOT IN with NULL values</h3>
<p><img src="http://explainextended.com/wp-content/uploads/2010/11/MG_4025-e1288839428333.jpg" alt="" title="Neva" width="700" height="467" class="aligncenter size-full wp-image-5106 noborder" /></p>
<pre class="brush: sql">
SELECT  a.*
FROM    a
WHERE   a.column NOT IN
        (
        SELECT column
        FROM    b
        )
</pre>
<p>This query will never return anything if there is but a single <code>NULL</code> in <code>b.column</code>. As with other predicates, both <code>IN</code> and <code>NOT IN</code> against <code>NULL</code> evaluate to <code>NULL</code>.</p>
<p>This should be rewritten using a <code>NOT EXISTS</code>:</p>
<pre class="brush: sql">
SELECT  a.*
FROM    a
WHERE   NOT EXISTS
        (
        SELECT NULL
        FROM    b
        WHERE   b.column = a.column
        )
</pre>
<p>Unlike <code>IN</code>, <code>EXISTS</code> always evaluates to either <code>true</code> or <code>false</code>.</p>
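<p>The three-valued logic behind this can be seen on constants. This is only an illustrative sketch with made-up values and aliases:</p>
<pre class="brush: sql">
SELECT  1 NOT IN (2, 3) AS no_nulls,        -- 1: no match, no NULLs in the list
        1 NOT IN (2, NULL) AS with_null,    -- NULL: 1 might be equal to the NULL
        1 NOT IN (1, NULL) AS with_match    -- 0: a definite match is found
</pre>
<p>Once the list contains a <code>NULL</code>, <code>NOT IN</code> can only yield <code>NULL</code> or false, so no row can pass the filter.</p>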
<h3 class="cracked">#5. Ordering random samples</h3>
<p><img src="http://explainextended.com/wp-content/uploads/2010/11/MG_8806-e1288839714383.jpg" alt="" title="Camel" width="700" height="467" class="aligncenter size-full wp-image-5109 noborder" /></p>
<pre class="brush: sql">
SELECT  *
FROM    a
ORDER BY
        RAND(), column
LIMIT 10
</pre>
<p>This query attempts to select <strong>10</strong> random records ordered by <code>column</code>.</p>
<p><code>ORDER BY</code> orders the output lexicographically: that is, the records are only ordered on the second expression when the values of the first expression are equal.</p>
<p>However, the results of <code>RAND()</code> are, well, random. It's practically impossible for two values of <code>RAND()</code> to match, so ordering on <code>column</code> after <code>RAND()</code> is quite useless.</p>
<p>To order the randomly sampled records, use this query:</p>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  *
        FROM    a
        ORDER BY
                RAND()
        LIMIT 10
        ) q
ORDER BY
        column
</pre>
<h3 class="cracked">#4. Sampling arbitrary record from a group</h3>
<p><img src="http://explainextended.com/wp-content/uploads/2010/11/MG_8273-e1288839506460.jpg" alt="" title="Inscription" width="700" height="467" class="aligncenter size-full wp-image-5107 noborder" /></p>
<p>This query intends to select one record from each group (defined by <code>grouper</code>):</p>
<pre class="brush: sql">
SELECT  DISTINCT(grouper), a.*
FROM    a
</pre>
<p><code>DISTINCT</code> is not a function, it's a part of the <code>SELECT</code> clause. It applies to all columns in the <code>SELECT</code> list, and the parentheses here may just be omitted. This query may and will select duplicates on <code>grouper</code> (if the values in at least one of the other columns differ).</p>
<p>Sometimes, it's worked around using this query (which relies on <strong>MySQL</strong>'s extensions to <code>GROUP BY</code>):</p>
<pre class="brush: sql">
SELECT  a.*
FROM    a
GROUP BY
        grouper
</pre>
<p>Unaggregated columns returned within each group are arbitrarily taken.</p>
<p>At first, this appears to be a nice solution, but it has quite a serious drawback. It relies on the assumption that all values returned, though taken arbitrarily from the group, will still belong to one record.</p>
<p>Though with the current implementation it seems to be so, it's not documented and can change at any moment (especially if <strong>MySQL</strong> ever learns to apply <code>index_union</code> after <code>GROUP BY</code>). So it's not safe to rely on this behavior.</p>
<p>This query would be easy to rewrite in a cleaner way if <strong>MySQL</strong> supported analytic functions. However, it's still possible to make do without them, if the table has a <code>PRIMARY KEY</code> defined:</p>
<pre class="brush: sql">
SELECT  a.*
FROM    (
        SELECT  DISTINCT grouper
        FROM    a
        ) ao
JOIN    a
ON      a.id = 
        (
        SELECT  id
        FROM    a ai
        WHERE   ai.grouper = ao.grouper
        LIMIT 1
        )
</pre>
<h3 class="cracked">#3. Sampling first record from a group</h3>
<p><img src="http://explainextended.com/wp-content/uploads/2010/11/MG_8468-e1288839620474.jpg" alt="" title="Thermae" width="700" height="467" class="aligncenter size-full wp-image-5108 noborder" /></p>
<p>This is a variation of the previous query:</p>
<pre class="brush: sql">
SELECT  a.*
FROM    a
GROUP BY
        grouper
ORDER BY
        MIN(id) DESC
</pre>
<p>Unlike the previous query, this one attempts to select the record holding the minimal <code>id</code>.</p>
<p>Again: it is not guaranteed that the unaggregated values returned by <code>a.*</code> will belong to a record holding <code>MIN(id)</code> (or even to a single record at all).</p>
<p>Here's how to do it in a clean way:</p>
<pre class="brush: sql">
SELECT  a.*
FROM    (
        SELECT  DISTINCT grouper
        FROM    a
        ) ao
JOIN    a
ON      a.id = 
        (
        SELECT  id
        FROM    a ai
        WHERE   ai.grouper = ao.grouper
        ORDER BY
                ai.grouper, ai.id
        LIMIT 1
        )
</pre>
<p>This query is just like the previous one but with <code>ORDER BY</code> added to ensure that the first record in <code>id</code> order will be returned.</p>
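<p>On a server that does have window functions (<strong>MySQL 8.0</strong> and later, which is an assumption about your environment rather than something this post relies on), the same result could be sketched with <code>ROW_NUMBER()</code>. The alias <code>rn</code> is made up for the example:</p>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  a.*,
                -- Number the records within each group in id order
                ROW_NUMBER() OVER (PARTITION BY grouper ORDER BY id) AS rn
        FROM    a
        ) q
WHERE   rn = 1
</pre>
<p>Note that the output carries the extra <code>rn</code> column. Dropping <code>ORDER BY id</code> from the window would give an arbitrary record per group, as in the previous section.</p>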
<h3 class="cracked">#2. IN and comma-separated list of values</h3>
<p><img src="http://explainextended.com/wp-content/uploads/2010/11/MG_9734-e1288839801704.jpg" alt="" title="Mushrooms" width="700" height="467" class="aligncenter size-full wp-image-5110 noborder" /></p>
<p>This query attempts to match the value of <code>column</code> against any of those provided in a comma-separated string:</p>
<pre class="brush: sql">
SELECT  *
FROM    a
WHERE   column IN (&#039;1, 2, 3&#039;)
</pre>
<p>This does not work because the string is not expanded in the <code>IN</code> list.</p>
<p>Instead, if <code>column</code> is a <code>VARCHAR</code>, it is compared (as a string) to the whole list (also as a string), and will not match unless it holds that exact string. If <code>column</code> is of a numeric type, the list is cast into the numeric type as well (and only the first item will match, at best).</p>
<p>The correct way to deal with this query would be rewriting it as a proper <code>IN</code> list</p>
<pre class="brush: sql">
SELECT  *
FROM    a
WHERE   column IN (1, 2, 3)
</pre>
<p>, or as an inline view:</p>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  1 AS id
        UNION ALL
        SELECT  2 AS id
        UNION ALL
        SELECT  3 AS id
        ) q
JOIN    a
ON      a.column = q.id
</pre>
<p>, but this is not always possible.</p>
<p>To work around this without changing the query parameters, one can use <code>FIND_IN_SET</code>:</p>
<pre class="brush: sql">
SELECT  *
FROM    a
WHERE   FIND_IN_SET(column, &#039;1,2,3&#039;)
</pre>
<p>This function, however, is not sargable and a full table scan will be performed on <code>a</code>.</p>
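<p>One more caveat: <code>FIND_IN_SET</code> compares the list items verbatim, so the spaces after the commas matter. A small sketch on constants (the aliases are made up):</p>
<pre class="brush: sql">
SELECT  FIND_IN_SET('2', '1,2,3') AS without_spaces,    -- 2: found at the second position
        FIND_IN_SET('2', '1, 2, 3') AS with_spaces,     -- 0: the item is ' 2', not '2'
        FIND_IN_SET('2', REPLACE('1, 2, 3', ' ', '')) AS spaces_stripped    -- 2
</pre>
<p>If the incoming string may contain spaces, stripping them with <code>REPLACE</code> first is one possible way around it.</p>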
<h3 class="cracked">#1. LEFT JOIN with COUNT(*)</h3>
<p><img src="http://explainextended.com/wp-content/uploads/2010/11/MG_3971-e1288839937884.jpg" alt="" title="Nevsky" width="700" height="467" class="aligncenter size-full wp-image-5125 noborder" /></p>
<pre class="brush: sql">
SELECT  a.id, COUNT(*)
FROM    a
LEFT JOIN
        b
ON      b.a = a.id
GROUP BY
        a.id
</pre>
<p>This query intends to count number of matches in <code>b</code> for each record in <code>a</code>.</p>
<p>The problem is that <code>COUNT(*)</code> will never return a <strong>0</strong> in such a query. If there is no match for a certain record in <code>a</code>, the record will still be returned and counted.</p>
<p><code>COUNT</code> should be made to count only the actual records in <code>b</code>. Since <code>COUNT</code>, when called with an argument, ignores <code>NULL</code>s, we can pass <code>b.a</code> to it. As a join key, it can never be a <code>NULL</code> in an actual match, but will be if there was no match:</p>
<pre class="brush: sql">
SELECT  a.id, COUNT(b.a)
FROM    a
LEFT JOIN
        b
ON      b.a = a.id
GROUP BY
        a.id
</pre>
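<p>To see the difference, both counts can be put side by side. A sketch with the same placeholder tables and made-up aliases:</p>
<pre class="brush: sql">
SELECT  a.id,
        COUNT(*) AS joined_rows,    -- at least 1, even when b has no match
        COUNT(b.a) AS matches       -- 0 when b has no match
FROM    a
LEFT JOIN
        b
ON      b.a = a.id
GROUP BY
        a.id
</pre>
<p>For the records in <code>a</code> that do have matches in <code>b</code>, the two columns should agree.</p>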
<p><em><strong>P.S.</strong> In case you were wondering: no, the pictures don't have any special meaning. I just liked them.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/11/03/10-things-in-mysql-that-wont-work-as-expected/feed/</wfw:commentRss>
		<slash:comments>32</slash:comments>
		</item>
		<item>
		<title>Mixed ASC/DESC sorting in MySQL</title>
		<link>http://explainextended.com/2010/11/02/mixed-ascdesc-sorting-in-mysql/</link>
		<comments>http://explainextended.com/2010/11/02/mixed-ascdesc-sorting-in-mysql/#comments</comments>
		<pubDate>Tue, 02 Nov 2010 20:00:16 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=5001</guid>
		<description><![CDATA[A way to work around limitations of ASC / DESC indexes in MySQL.]]></description>
				<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/4020625/how-can-i-optimize-this-confusingly-slow-query-in-mysql"><strong>Stack Overflow</strong></a>:</p>
<blockquote><p>I have a table of blog posts, each with a foreign key back to it&#8217;s author. There are less than <strong>15,000</strong> entries in this table.</p>
<p>This query scans over <strong>19,000</strong> rows (per <code>EXPLAIN</code>), requires a <code>filesort</code> (that might be regular <strong>MySQL</strong> behavior), and takes over <strong>400ms</strong> to return <strong>5</strong> rows. possibly because of the complicated <code>WHERE</code> used to check if the item is actually published. </p>
<pre class="brush: sql">
SELECT  blog_post.id,
        blog_post.title,
        blog_post.author_id,
        blog_post.has_been_fact_checked,
        blog_post.published_date,
        blog_post.ordering,
        auth_user.username,
        auth_user.email
FROM    blog_post
INNER JOIN
        auth_user
ON      auth_user.id = blog_post.author_id
WHERE   blog_post.is_approved = 1
        AND blog_post.has_been_fact_checked = 1
        AND blog_post.published_date &lt;=  &#039;2010-10-25 22:40:05&#039;
ORDER BY
        blog_post.published_date DESC,
        blog_post.ordering ASC,
        blog_post.id DESC
LIMIT 5
</pre>
<p><!-- --><br />
How can I wrangle this query under control?</p>
</blockquote>
<p>This query is quite simple: a filtering condition with two equalities and a range and an order by. The range in the filter fits the <code>ORDER BY</code> perfectly, so the whole query could be executed using an index scan without any filesorts.</p>
<p>The only problem is that we have mixed directions in <code>ORDER BY</code>, and <strong>MySQL</strong> does not support <code>ASC / DESC</code> clauses for indexes.</p>
<p>With a simple table redesign, the problem could easily be solved: we would just need to reverse the order in the <code>ordering</code> column, say, by creating its copy and storing <em>negative</em> values in it. This way, we could just create a composite index (with all columns ascending) and rewrite the query to use the reverse column instead. That&#8217;s what many engines do, <strong>MediaWiki</strong> (which <strong>Wikipedia</strong> runs on) being one of <a href="http://www.mysqlperformanceblog.com/2006/05/09/descending-indexing-and-loose-index-scan/">the most well-known examples</a>.</p>
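<p>A rough sketch of such a redesign follows. The column <code>ordering_neg</code> and the index name are hypothetical, and the application (or a trigger) would have to keep the new column in sync on every write:</p>
<pre class="brush: sql">
-- Hypothetical copy of the ordering column holding negated values
ALTER TABLE blog_post
ADD COLUMN ordering_neg INT;

UPDATE  blog_post
SET     ordering_neg = -ordering;

-- All columns ascending; the query below can read this index backwards
CREATE INDEX ix_blogpost_published_ordering_neg
ON      blog_post (is_approved, has_been_fact_checked, published_date, ordering_neg, id);

SELECT  blog_post.id,
        blog_post.title,
        blog_post.published_date,
        blog_post.ordering
FROM    blog_post
WHERE   blog_post.is_approved = 1
        AND blog_post.has_been_fact_checked = 1
        AND blog_post.published_date &lt;= '2010-10-25 22:40:05'
ORDER BY
        blog_post.published_date DESC,
        blog_post.ordering_neg DESC,    -- same as ordering ASC for non-NULL values
        blog_post.id DESC
LIMIT 5;
</pre>
<p>With every direction in <code>ORDER BY</code> being the same, the sort order matches the index read in reverse, so the <code>filesort</code> should no longer be needed.</p>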
<p><q>This solution is nice</q>, I hear people saying, <q>but requires a database redesign which as we all know is never as simple as some development pundits on their blogs seem to think</q>.</p>
<p>OK, this is a good point. Let&#8217;s see what we can do with the current design, and, as always, create a sample table to do it:<br />
<span id="more-5001"></span><br />
<a href="#" onclick="xcollapse('X4226');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X4226" style="display: none; background: transparent;">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE blog_post
        (
        id INT NOT NULL PRIMARY KEY,
        title TEXT NOT NULL,
        is_approved BOOLEAN NOT NULL,
        has_been_fact_checked BOOLEAN NOT NULL,
        published_date DATETIME,
        ordering INT,
        author_id INT NOT NULL,
        KEY ix_blogpost_approved_checked_published_ordering_id
                (
                is_approved,
                has_been_fact_checked,
                published_date,
                ordering
                )
        )
ENGINE=InnoDB;

CREATE TABLE auth_user
        (
        id INT NOT NULL PRIMARY KEY, 
        username VARCHAR(32) NOT NULL,
        email VARCHAR(256) NOT NULL
        )
ENGINE=InnoDB;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(500000);
COMMIT;

INSERT
INTO    auth_user
SELECT  id, CONCAT(&#039;author&#039;, id), CONCAT(&#039;author&#039;, id, &#039;@example.com&#039;)
FROM    filler
LIMIT 1000;

INSERT
INTO    blog_post
SELECT  id, CONCAT(&#039;Post &#039;, id),
        RAND(20101102) &gt; 0.05,
        RAND(20101102 &lt;&lt; 1) &gt; 0.05,
        CAST(&#039;2010-11-02&#039; AS DATE) - INTERVAL RAND(20101102 &lt;&lt; 2) * 100000 HOUR,
        CEILING(RAND(20101102 &lt;&lt; 3) * 100),
        CEILING(RAND(20101102 &lt;&lt; 4) * 1000)
FROM    filler;
</pre>
</div>
<p>There are <strong>1,000</strong> authors and <strong>500,000</strong> blog posts in the respective tables. The blog posts intentionally have duplicates on <code>published_date</code>.</p>
<p>The original query:</p>
<pre class="brush: sql">
SELECT  blog_post.id,
        blog_post.title,
        blog_post.author_id,
        blog_post.has_been_fact_checked,
        blog_post.published_date,
        blog_post.ordering,
        auth_user.username,
        auth_user.email
FROM    blog_post
INNER JOIN
        auth_user
ON      auth_user.id = blog_post.author_id
WHERE   blog_post.is_approved = 1
        AND blog_post.has_been_fact_checked = 1
        AND blog_post.published_date &lt;= &#039;2010-10-25 22:40:05&#039;
ORDER BY
        blog_post.published_date DESC,
        blog_post.ordering ASC,
        blog_post.id DESC
LIMIT 5

</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>title</th>
<th>author_id</th>
<th>has_been_fact_checked</th>
<th>published_date</th>
<th>ordering</th>
<th>username</th>
<th>email</th>
</tr>
<tr>
<td class="integer">156410</td>
<td class="blob">Post 156410</td>
<td class="integer">984</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">1</td>
<td class="varchar">author984</td>
<td class="varchar">author984@example.com</td>
</tr>
<tr>
<td class="integer">451417</td>
<td class="blob">Post 451417</td>
<td class="integer">140</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">8</td>
<td class="varchar">author140</td>
<td class="varchar">author140@example.com</td>
</tr>
<tr>
<td class="integer">262749</td>
<td class="blob">Post 262749</td>
<td class="integer">719</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">19</td>
<td class="varchar">author719</td>
<td class="varchar">author719@example.com</td>
</tr>
<tr>
<td class="integer">415157</td>
<td class="blob">Post 415157</td>
<td class="integer">430</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">22</td>
<td class="varchar">author430</td>
<td class="varchar">author430@example.com</td>
</tr>
<tr>
<td class="integer">307578</td>
<td class="blob">Post 307578</td>
<td class="integer">611</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">72</td>
<td class="varchar">author611</td>
<td class="varchar">author611@example.com</td>
</tr>
<tr class="statusbar">
<td colspan="100">5 rows fetched in 0.0004s (0.4558s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">blog_post</td>
<td class="varchar">ref</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">2</td>
<td class="varchar">const,const</td>
<td class="bigint">189693</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">auth_user</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20101102_desc.blog_post.author_id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select `20101102_desc`.`blog_post`.`id` AS `id`,`20101102_desc`.`blog_post`.`title` AS `title`,`20101102_desc`.`blog_post`.`author_id` AS `author_id`,`20101102_desc`.`blog_post`.`has_been_fact_checked` AS `has_been_fact_checked`,`20101102_desc`.`blog_post`.`published_date` AS `published_date`,`20101102_desc`.`blog_post`.`ordering` AS `ordering`,`20101102_desc`.`auth_user`.`username` AS `username`,`20101102_desc`.`auth_user`.`email` AS `email` from `20101102_desc`.`blog_post` join `20101102_desc`.`auth_user` where ((`20101102_desc`.`auth_user`.`id` = `20101102_desc`.`blog_post`.`author_id`) and (`20101102_desc`.`blog_post`.`has_been_fact_checked` = 1) and (`20101102_desc`.`blog_post`.`is_approved` = 1) and (`20101102_desc`.`blog_post`.`published_date` &lt;= &#39;2010-10-25 22:40:05&#39;)) order by `20101102_desc`.`blog_post`.`published_date` desc,`20101102_desc`.`blog_post`.`ordering`,`20101102_desc`.`blog_post`.`id` desc limit 5
</pre>
<p>Just as the asker reported, this query is slow: it takes <strong>455 ms</strong> and uses a <code>filesort</code>.</p>
<p>Now, as we cannot use a single index on mixed <code>ASC / DESC</code> ordering conditions, we should squeeze the most from our existing index.</p>
<p>The longest common prefix of the index and the <code>ORDER BY</code> expression (which implicitly includes all columns from the <code>WHERE</code> clause) is <code>(is_approved, has_been_fact_checked, published_date)</code>.</p>
<p>What does it mean? It means that the chunks of data sharing the distinct values of these three columns will always go in the same order both in the index and in the query, though the order may and will differ within each chunk.</p>
<p>See the queries below, first one with the initial <code>ORDER BY</code> sorting:</p>
<pre class="brush: sql">
SELECT  published_date, id, ordering
FROM    blog_post
WHERE   is_approved = 1
        AND has_been_fact_checked = 1
ORDER BY
        published_date DESC,
        ordering ASC,
        id DESC
LIMIT 16
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>published_date</th>
<th>id</th>
<th>ordering</th>
</tr>
<tr>
<td class="timestamp">2010-11-02 00:00:00</td>
<td class="integer">111429</td>
<td class="integer">97</td>
</tr>
<tr>
<td class="timestamp" colspan="3">&nbsp;</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">48638</td>
<td class="integer">6</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">245245</td>
<td class="integer">7</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">153099</td>
<td class="integer">19</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">334624</td>
<td class="integer">33</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">200844</td>
<td class="integer">45</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">163114</td>
<td class="integer">52</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">312847</td>
<td class="integer">54</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">205531</td>
<td class="integer">72</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">265795</td>
<td class="integer">94</td>
</tr>
<tr>
<td class="timestamp" colspan="3">&nbsp;</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 22:00:00</td>
<td class="integer">409475</td>
<td class="integer">24</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 22:00:00</td>
<td class="integer">213483</td>
<td class="integer">45</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 22:00:00</td>
<td class="integer">72571</td>
<td class="integer">57</td>
</tr>
<tr>
<td class="timestamp" colspan="3">&nbsp;</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 21:00:00</td>
<td class="integer">102956</td>
<td class="integer">32</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 21:00:00</td>
<td class="integer">422579</td>
<td class="integer">48</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 21:00:00</td>
<td class="integer">280297</td>
<td class="integer">98</td>
</tr>
</table>
</div>
<p>, the second one with the index sorting:</p>
<pre class="brush: sql">
SELECT  published_date, id, ordering
FROM    blog_post
WHERE   is_approved = 1
        AND has_been_fact_checked = 1
ORDER BY
        published_date DESC,
        ordering DESC,
        id DESC
LIMIT 16
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>published_date</th>
<th>id</th>
<th>ordering</th>
</tr>
<tr>
<td class="timestamp">2010-11-02 00:00:00</td>
<td class="integer">111429</td>
<td class="integer">97</td>
</tr>
<tr>
<td class="timestamp" colspan="3">&nbsp;</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">265795</td>
<td class="integer">94</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">205531</td>
<td class="integer">72</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">312847</td>
<td class="integer">54</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">163114</td>
<td class="integer">52</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">200844</td>
<td class="integer">45</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">334624</td>
<td class="integer">33</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">153099</td>
<td class="integer">19</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">245245</td>
<td class="integer">7</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 23:00:00</td>
<td class="integer">48638</td>
<td class="integer">6</td>
</tr>
<tr>
<td class="timestamp" colspan="3">&nbsp;</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 22:00:00</td>
<td class="integer">72571</td>
<td class="integer">57</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 22:00:00</td>
<td class="integer">213483</td>
<td class="integer">45</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 22:00:00</td>
<td class="integer">409475</td>
<td class="integer">24</td>
</tr>
<tr>
<td class="timestamp" colspan="3">&nbsp;</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 21:00:00</td>
<td class="integer">280297</td>
<td class="integer">98</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 21:00:00</td>
<td class="integer">422579</td>
<td class="integer">48</td>
</tr>
<tr>
<td class="timestamp">2010-11-01 21:00:00</td>
<td class="integer">102956</td>
<td class="integer">32</td>
</tr>
</table>
</div>
<p>As you can see, the chunks of data occupy the same places within both recordsets. The post dated <strong>2010-11-01 00:00:00</strong> plus one day, that is <strong>2010-11-02 00:00:00</strong>, is at row <strong>1</strong>. Those dated <strong>2010-11-01 23:00:00</strong> are at rows <strong>2</strong> to <strong>10</strong>, etc. The orders differ only within the chunks, but not between chunks.</p>
<p>This means that if we select the same chunks as the original query does and reorder them, we get the same order.</p>
<p>And we can select the chunks quite efficiently using the index, since their order, as we have shown above, is the same for both queries.</p>
<p>The original query had <code>LIMIT 5</code>, which means we have to select at most <strong>5</strong> chunks: each chunk holds at least one record, so <strong>5</strong> chunks are always enough. To do this, we just need to count off <strong>5</strong> distinct dates satisfying the condition.</p>
<p>This can be done using a technique I described in one of my earlier posts:</p>
<ul>
<li><a href="/2010/08/24/20-latest-unique-records/">20 latest unique records</a></li>
</ul>
<h3>Selecting 5 dates</h3>
<p>Here&#8217;s a query to select <strong>5</strong> distinct dates less than the given one:</p>
<pre class="brush: sql">
SELECT  bp.published_date
FROM    blog_post bp
WHERE   bp.is_approved = 1
        AND bp.has_been_fact_checked = 1
        AND bp.published_date &lt; &#039;2010-10-25 22:40:05&#039;
        AND bp.id =
        (
        SELECT  id
        FROM    blog_post bpi
        WHERE   bpi.is_approved = 1
                AND bpi.has_been_fact_checked = 1
                AND bpi.published_date = bp.published_date
        ORDER BY
                bpi.is_approved DESC, bpi.has_been_fact_checked DESC, bpi.published_date DESC
        LIMIT 1
        )
ORDER BY
        bp.is_approved DESC, bp.has_been_fact_checked DESC, bp.published_date DESC
LIMIT 5
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>published_date</th>
</tr>
<tr>
<td class="timestamp">2010-10-25 22:00:00</td>
</tr>
<tr>
<td class="timestamp">2010-10-25 21:00:00</td>
</tr>
<tr>
<td class="timestamp">2010-10-25 20:00:00</td>
</tr>
<tr>
<td class="timestamp">2010-10-25 19:00:00</td>
</tr>
<tr>
<td class="timestamp">2010-10-25 18:00:00</td>
</tr>
<tr class="statusbar">
<td colspan="100">5 rows fetched in 0.0001s (0.0018s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">bp</td>
<td class="varchar">range</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">11</td>
<td class="varchar"></td>
<td class="bigint">189693</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">bpi</td>
<td class="varchar">ref</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">11</td>
<td class="varchar">const,const,20101102_desc.bp.published_date</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20101102_desc.bp.published_date&#39; of SELECT #2 was resolved in SELECT #1
select `20101102_desc`.`bp`.`published_date` AS `published_date` from `20101102_desc`.`blog_post` `bp` where ((`20101102_desc`.`bp`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bp`.`is_approved` = 1) and (`20101102_desc`.`bp`.`published_date` &lt; &#39;2010-10-25 22:40:05&#39;) and (`20101102_desc`.`bp`.`id` = (select `20101102_desc`.`bpi`.`id` from `20101102_desc`.`blog_post` `bpi` where ((`20101102_desc`.`bpi`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpi`.`is_approved` = 1) and (`20101102_desc`.`bpi`.`published_date` = `20101102_desc`.`bp`.`published_date`)) order by `20101102_desc`.`bpi`.`is_approved` desc,`20101102_desc`.`bpi`.`has_been_fact_checked` desc,`20101102_desc`.`bpi`.`published_date` desc limit 1))) order by `20101102_desc`.`bp`.`is_approved` desc,`20101102_desc`.`bp`.`has_been_fact_checked` desc,`20101102_desc`.`bp`.`published_date` desc limit 5
</pre>
<p>This uses the index access and is very efficient.</p>
<h3>Selecting chunks of data</h3>
<p>Now we need to select all records within the chunks defined by the dates. This can be done with a simple <code>BETWEEN</code> condition. The upper bound of the <code>BETWEEN</code> will be the date that we used as the parameter to the query, and the lower one will be the last date of those selected on the previous stage.</p>
<p>Since it should be a scalar, we will just replace <code>LIMIT 5</code> (which returns 5 records) with <code>LIMIT 4, 1</code> (which returns the last of these <strong>5</strong> records) and put it into a scalar subquery:</p>
<pre class="brush: sql">
SELECT  *
FROM    blog_post bpo
WHERE   is_approved = 1
        AND has_been_fact_checked = 1
        AND published_date BETWEEN
        (
        SELECT  bp.published_date
        FROM    blog_post bp
        WHERE   bp.is_approved = 1
                AND bp.has_been_fact_checked = 1
                AND bp.published_date &lt; &#039;2010-10-25 22:40:05&#039;
                AND bp.id =
                (
                SELECT  id
                FROM    blog_post bpi
                WHERE   bpi.is_approved = 1
                        AND bpi.has_been_fact_checked = 1
                        AND bpi.published_date = bp.published_date
                ORDER BY
                        bpi.is_approved DESC, bpi.has_been_fact_checked DESC, bpi.published_date DESC
                LIMIT 1
                )
        ORDER BY
                bp.is_approved DESC, bp.has_been_fact_checked DESC, bp.published_date DESC
        LIMIT 4, 1
        )
        AND &#039;2010-10-25 22:40:05&#039;
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>title</th>
<th>is_approved</th>
<th>has_been_fact_checked</th>
<th>published_date</th>
<th>ordering</th>
<th>author_id</th>
</tr>
<tr>
<td class="integer">418694</td>
<td class="blob">Post 418694</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 18:00:00</td>
<td class="integer">16</td>
<td class="integer">579</td>
</tr>
<tr>
<td class="integer">354679</td>
<td class="blob">Post 354679</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 18:00:00</td>
<td class="integer">44</td>
<td class="integer">733</td>
</tr>
<tr>
<td class="integer">480570</td>
<td class="blob">Post 480570</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 18:00:00</td>
<td class="integer">70</td>
<td class="integer">62</td>
</tr>
<tr>
<td class="integer">402339</td>
<td class="blob">Post 402339</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 18:00:00</td>
<td class="integer">85</td>
<td class="integer">498</td>
</tr>
<tr>
<td class="integer">282148</td>
<td class="blob">Post 282148</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 18:00:00</td>
<td class="integer">87</td>
<td class="integer">152</td>
</tr>
<tr>
<td class="integer">358065</td>
<td class="blob">Post 358065</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 18:00:00</td>
<td class="integer">88</td>
<td class="integer">301</td>
</tr>
<tr>
<td class="timestamp" colspan="100">&nbsp;</td>
</tr>
<tr>
<td class="integer">328191</td>
<td class="blob">Post 328191</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 19:00:00</td>
<td class="integer">33</td>
<td class="integer">627</td>
</tr>
<tr>
<td class="integer">95345</td>
<td class="blob">Post 95345</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 19:00:00</td>
<td class="integer">39</td>
<td class="integer">497</td>
</tr>
<tr>
<td class="integer">73886</td>
<td class="blob">Post 73886</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 19:00:00</td>
<td class="integer">77</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="timestamp" colspan="100">&nbsp;</td>
</tr>
<tr>
<td class="integer">29391</td>
<td class="blob">Post 29391</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 20:00:00</td>
<td class="integer">13</td>
<td class="integer">259</td>
</tr>
<tr>
<td class="integer">632</td>
<td class="blob">Post 632</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 20:00:00</td>
<td class="integer">15</td>
<td class="integer">952</td>
</tr>
<tr>
<td class="timestamp" colspan="100">&nbsp;</td>
</tr>
<tr>
<td class="integer">93117</td>
<td class="blob">Post 93117</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 21:00:00</td>
<td class="integer">65</td>
<td class="integer">773</td>
</tr>
<tr>
<td class="integer">278821</td>
<td class="blob">Post 278821</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 21:00:00</td>
<td class="integer">72</td>
<td class="integer">49</td>
</tr>
<tr>
<td class="timestamp" colspan="100">&nbsp;</td>
</tr>
<tr>
<td class="integer">156410</td>
<td class="blob">Post 156410</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">1</td>
<td class="integer">984</td>
</tr>
<tr>
<td class="integer">451417</td>
<td class="blob">Post 451417</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">8</td>
<td class="integer">140</td>
</tr>
<tr>
<td class="integer">262749</td>
<td class="blob">Post 262749</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">19</td>
<td class="integer">719</td>
</tr>
<tr>
<td class="integer">415157</td>
<td class="blob">Post 415157</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">22</td>
<td class="integer">430</td>
</tr>
<tr>
<td class="integer">225161</td>
<td class="blob">Post 225161</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">72</td>
<td class="integer">398</td>
</tr>
<tr>
<td class="integer">307578</td>
<td class="blob">Post 307578</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">72</td>
<td class="integer">611</td>
</tr>
<tr>
<td class="integer">426932</td>
<td class="blob">Post 426932</td>
<td class="tinyint">1</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">92</td>
<td class="integer">64</td>
</tr>
<tr class="statusbar">
<td colspan="100">20 rows fetched in 0.0012s (0.0041s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">bpo</td>
<td class="varchar">range</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">11</td>
<td class="varchar"></td>
<td class="bigint">20</td>
<td class="double">75.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">SUBQUERY</td>
<td class="varchar">bp</td>
<td class="varchar">range</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">11</td>
<td class="varchar"></td>
<td class="bigint">189693</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">bpi</td>
<td class="varchar">ref</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">11</td>
<td class="varchar">20101102_desc.bp.published_date</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20101102_desc.bp.published_date&#39; of SELECT #3 was resolved in SELECT #2
select `20101102_desc`.`bpo`.`id` AS `id`,`20101102_desc`.`bpo`.`title` AS `title`,`20101102_desc`.`bpo`.`is_approved` AS `is_approved`,`20101102_desc`.`bpo`.`has_been_fact_checked` AS `has_been_fact_checked`,`20101102_desc`.`bpo`.`published_date` AS `published_date`,`20101102_desc`.`bpo`.`ordering` AS `ordering`,`20101102_desc`.`bpo`.`author_id` AS `author_id` from `20101102_desc`.`blog_post` `bpo` where ((`20101102_desc`.`bpo`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpo`.`is_approved` = 1) and (`20101102_desc`.`bpo`.`published_date` between (select `20101102_desc`.`bp`.`published_date` from `20101102_desc`.`blog_post` `bp` where ((`20101102_desc`.`bp`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bp`.`is_approved` = 1) and (`20101102_desc`.`bp`.`published_date` &lt; &#39;2010-10-25 22:40:05&#39;) and (`20101102_desc`.`bp`.`id` = (select `20101102_desc`.`bpi`.`id` from `20101102_desc`.`blog_post` `bpi` where ((`20101102_desc`.`bpi`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpi`.`is_approved` = 1) and (`20101102_desc`.`bpi`.`published_date` = `20101102_desc`.`bp`.`published_date`)) order by `20101102_desc`.`bpi`.`is_approved` desc,`20101102_desc`.`bpi`.`has_been_fact_checked` desc,`20101102_desc`.`bpi`.`published_date` desc limit 1))) order by `20101102_desc`.`bp`.`is_approved` desc,`20101102_desc`.`bp`.`has_been_fact_checked` desc,`20101102_desc`.`bp`.`published_date` desc limit 4,1) and &#39;2010-10-25 22:40:05&#39;))
</pre>
<p>This query returns (almost instantly) <strong>20</strong> records in <strong>5</strong> chunks of different sizes (split for readability). Note that the chunks come in the reverse of the order in which the dates were returned on the previous step: we omitted the <code>ORDER BY</code> clause and, by default, the index is traversed in the forward (<code>ASC</code>) direction.</p>
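<p>The lower bound of the range can also be inspected on its own. The sketch below is a simplified stand-in for the scalar subquery above, not the index-friendly form: it assumes that a plain <code>DISTINCT</code> over the dates is equivalent to the correlated <code>id</code> lookup, and it spells <code>LIMIT 4, 1</code> as <code>LIMIT 1 OFFSET 4</code>:</p>
<pre class="brush: sql">
-- Fifth most recent distinct published_date before the parameter.
-- LIMIT 1 OFFSET 4 is another spelling of LIMIT 4, 1: skip 4 rows, return 1.
SELECT  DISTINCT published_date
FROM    blog_post
WHERE   is_approved = 1
        AND has_been_fact_checked = 1
        AND published_date &lt; &#039;2010-10-25 22:40:05&#039;
ORDER BY
        published_date DESC
LIMIT 1 OFFSET 4
</pre>
<p>On the sample data this should match the earliest <code>published_date</code> among the chunks above. If fewer than <strong>5</strong> distinct dates satisfy the filter, the query returns no row, and a scalar subquery built this way would yield a <code>NULL</code> lower bound.</p>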
<h3>Ordering and joining</h3>
<p>Now, all that&#8217;s left to do is to add the correct <code>ORDER BY</code> condition to rearrange the data within the chunks, join the authors table and apply the <code>LIMIT</code>:</p>
<pre class="brush: sql">
SELECT  bpo.id,
        bpo.title,
        bpo.author_id,
        bpo.has_been_fact_checked,
        bpo.published_date,
        bpo.ordering,
        au.username,
        au.email
FROM    blog_post bpo
JOIN    auth_user au
ON      au.id = bpo.author_id
WHERE   bpo.is_approved = 1
        AND bpo.has_been_fact_checked = 1
        AND bpo.published_date BETWEEN
        (
        SELECT  bp.published_date
        FROM    blog_post bp
        WHERE   bp.is_approved = 1
                AND bp.has_been_fact_checked = 1
                AND bp.published_date &lt; &#039;2010-10-25 22:40:05&#039;
                AND bp.id =
                (
                SELECT  id
                FROM    blog_post bpi
                WHERE   bpi.is_approved = 1
                        AND bpi.has_been_fact_checked = 1
                        AND bpi.published_date = bp.published_date
                ORDER BY
                        bpi.is_approved DESC, bpi.has_been_fact_checked DESC, bpi.published_date DESC
                LIMIT 1
                )
        ORDER BY
                bp.is_approved DESC, bp.has_been_fact_checked DESC, bp.published_date DESC
        LIMIT 4, 1
        )
        AND &#039;2010-10-25 22:40:05&#039;
ORDER BY
        bpo.published_date DESC,
        bpo.ordering ASC,
        bpo.id DESC
LIMIT 5
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>title</th>
<th>author_id</th>
<th>has_been_fact_checked</th>
<th>published_date</th>
<th>ordering</th>
<th>username</th>
<th>email</th>
</tr>
<tr>
<td class="integer">156410</td>
<td class="blob">Post 156410</td>
<td class="integer">984</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">1</td>
<td class="varchar">author984</td>
<td class="varchar">author984@example.com</td>
</tr>
<tr>
<td class="integer">451417</td>
<td class="blob">Post 451417</td>
<td class="integer">140</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">8</td>
<td class="varchar">author140</td>
<td class="varchar">author140@example.com</td>
</tr>
<tr>
<td class="integer">262749</td>
<td class="blob">Post 262749</td>
<td class="integer">719</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">19</td>
<td class="varchar">author719</td>
<td class="varchar">author719@example.com</td>
</tr>
<tr>
<td class="integer">415157</td>
<td class="blob">Post 415157</td>
<td class="integer">430</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">22</td>
<td class="varchar">author430</td>
<td class="varchar">author430@example.com</td>
</tr>
<tr>
<td class="integer">307578</td>
<td class="blob">Post 307578</td>
<td class="integer">611</td>
<td class="tinyint">1</td>
<td class="timestamp">2010-10-25 22:00:00</td>
<td class="integer">72</td>
<td class="varchar">author611</td>
<td class="varchar">author611@example.com</td>
</tr>
<tr class="statusbar">
<td colspan="100">5 rows fetched in 0.0004s (0.0044s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">bpo</td>
<td class="varchar">range</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">11</td>
<td class="varchar"></td>
<td class="bigint">20</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">au</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20101102_desc.bpo.author_id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">SUBQUERY</td>
<td class="varchar">bp</td>
<td class="varchar">range</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">11</td>
<td class="varchar"></td>
<td class="bigint">189693</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">bpi</td>
<td class="varchar">ref</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">ix_blogpost_approved_checked_published_ordering_id</td>
<td class="varchar">11</td>
<td class="varchar">20101102_desc.bp.published_date</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20101102_desc.bp.published_date&#39; of SELECT #3 was resolved in SELECT #2
select `20101102_desc`.`bpo`.`id` AS `id`,`20101102_desc`.`bpo`.`title` AS `title`,`20101102_desc`.`bpo`.`author_id` AS `author_id`,`20101102_desc`.`bpo`.`has_been_fact_checked` AS `has_been_fact_checked`,`20101102_desc`.`bpo`.`published_date` AS `published_date`,`20101102_desc`.`bpo`.`ordering` AS `ordering`,`20101102_desc`.`au`.`username` AS `username`,`20101102_desc`.`au`.`email` AS `email` from `20101102_desc`.`blog_post` `bpo` join `20101102_desc`.`auth_user` `au` where ((`20101102_desc`.`au`.`id` = `20101102_desc`.`bpo`.`author_id`) and (`20101102_desc`.`bpo`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpo`.`is_approved` = 1) and (`20101102_desc`.`bpo`.`published_date` between (select `20101102_desc`.`bp`.`published_date` from `20101102_desc`.`blog_post` `bp` where ((`20101102_desc`.`bp`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bp`.`is_approved` = 1) and (`20101102_desc`.`bp`.`published_date` &lt; &#39;2010-10-25 22:40:05&#39;) and (`20101102_desc`.`bp`.`id` = (select `20101102_desc`.`bpi`.`id` from `20101102_desc`.`blog_post` `bpi` where ((`20101102_desc`.`bpi`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpi`.`is_approved` = 1) and (`20101102_desc`.`bpi`.`published_date` = `20101102_desc`.`bp`.`published_date`)) order by `20101102_desc`.`bpi`.`is_approved` desc,`20101102_desc`.`bpi`.`has_been_fact_checked` desc,`20101102_desc`.`bpi`.`published_date` desc limit 1))) order by `20101102_desc`.`bp`.`is_approved` desc,`20101102_desc`.`bp`.`has_been_fact_checked` desc,`20101102_desc`.`bp`.`published_date` desc limit 4,1) and &#39;2010-10-25 22:40:05&#39;)) order by `20101102_desc`.`bpo`.`published_date` desc,`20101102_desc`.`bpo`.`ordering`,`20101102_desc`.`bpo`.`id` desc limit 5
</pre>
<p>The records have been rearranged (using a <code>filesort</code>) and a <code>LIMIT</code> was applied.</p>
<p>Since the records are only taken from <strong>5</strong> chunks with <strong>20</strong> records in total, the <code>filesort</code> is not a problem at all and the query completes in only <strong>4 ms</strong>.</p>
<h3>Conclusion</h3>
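<p>At the time of writing, <strong>MySQL</strong> does not support <code>ASC / DESC</code> clauses in the indices: they are parsed but ignored (descending indexes only appeared in <strong>MySQL 8.0</strong>).</p>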
<p>However, with an <code>ORDER BY / LIMIT</code> condition that mixes the orders of columns, it is possible to use an index.</p>
<p>To do that, we should limit the recordset to the minimal set that would still retain the order on the leading columns that share one direction. This can be done by selecting the distinct set of those leading columns up to the original limit, and filtering on them using the index.</p>
<p>Then we can fall back to a <code>filesort</code> to order this (much smaller) set on the remaining columns, which is much more efficient than sorting the whole table.</p>
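<p>The general shape of the technique can be sketched on a made-up example. The table <code>t</code>, its columns <code>a</code> and <code>b</code>, the index on <code>(a, b)</code> and the limit of <strong>10</strong> are all assumptions for illustration, and <code>a</code> is assumed to be <code>NOT NULL</code>:</p>
<pre class="brush: sql">
-- Hypothetical table t (a, b, ...) with an index on (a, b).
-- Goal: ORDER BY a DESC, b ASC LIMIT 10 without sorting the whole table.
SELECT  *
FROM    t
WHERE   a &gt;= COALESCE(
        (
        -- The 10th largest distinct value of a. Ten distinct values cover
        -- at least ten rows, so no row below this bound can make the cut.
        SELECT  DISTINCT a
        FROM    t
        ORDER BY
                a DESC
        LIMIT 9, 1
        ),
        (
        -- Fewer than 10 distinct values: fall back to the whole range.
        SELECT  MIN(a)
        FROM    t
        )
        )
ORDER BY
        a DESC, b ASC
LIMIT 10
</pre>
<p>The outer <code>ORDER BY</code> still needs a sort, but only over the rows that share the top distinct values of <code>a</code>. The <code>COALESCE</code> is an addition of this sketch: it guards against the subquery returning no row when the table holds fewer distinct values than the limit.</p>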
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/11/02/mixed-ascdesc-sorting-in-mysql/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Correlated LEFT JOIN in a subquery</title>
		<link>http://explainextended.com/2010/10/20/correlated-left-join-in-a-subquery/</link>
		<comments>http://explainextended.com/2010/10/20/correlated-left-join-in-a-subquery/#comments</comments>
		<pubDate>Wed, 20 Oct 2010 19:00:20 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4983</guid>
		<description><![CDATA[Answering questions asked on the site. Beren asks: In my database, I store the information about the game matches. A match can be played between two players (mostly) or, occasionally, between two teams of 2, 3 or 4 players. I store the matches in a table like this: Game id player1 player2 type , where [...]]]></description>
				<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Beren</strong> asks:</p>
<blockquote><p>
In my database, I store the information about the game matches.</p>
<p>A match can be played between two players (mostly) or, occasionally, between two teams of <strong>2</strong>, <strong>3</strong> or <strong>4</strong> players.</p>
<p>I store the matches in a table like this:</p>
<table class="excel">
<caption>Game</caption>
<tr>
<th>id</th>
<th>player1</th>
<th>player2</th>
<th>type</th>
</tr>
</table>
<p>, where <code>type</code> is <code>ENUM(player, team)</code>.</p>
<p>If <code>type</code> is <code>player</code>, the ids of players are stored in the record; if type is <code>team</code>, those of teams are stored.</p>
<p>Now, the tricky part. For a given game, I need to select two lists of players from both sides, comma separated, be they single players or members of a team.</p>
<p>Are two separate queries required for this?
</p></blockquote>
<p>This of course can be easily done using a single query.</p>
<p>Let&#8217;s make a sample table and see:<br />
<span id="more-4983"></span><br />
<a href="#" onclick="xcollapse('X2860');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X2860" style="display: none; background: transparent;">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE game
        (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(30) NOT NULL,
        player1 INT NOT NULL,
        player2 INT NOT NULL,
        type ENUM(&#039;player&#039;, &#039;team&#039;)
        )
ENGINE=InnoDB;

CREATE TABLE player
        (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(30) NOT NULL
        )
ENGINE=InnoDB;

CREATE TABLE team
        (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(30) NOT NULL
        )
ENGINE=InnoDB;

CREATE TABLE player_team
        (
        team INT NOT NULL,
        player INT NOT NULL,
        PRIMARY KEY (team, player)
        )
ENGINE=InnoDB;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(500000);
COMMIT;

INSERT
INTO    player
SELECT  id, CONCAT(&#039;Player &#039;, id)
FROM    filler
ORDER BY
        id
LIMIT 100000;

INSERT
INTO    team
SELECT  id, CONCAT(&#039;Team &#039;, id)
FROM    filler
ORDER BY
        id
LIMIT 3000;

INSERT
INTO    player_team
SELECT  t.id,
        FLOOR(RAND(20101020) * 25000) * 4 + num
FROM    team t
JOIN    (
        SELECT  1 AS num
        UNION ALL
        SELECT  2
        UNION ALL
        SELECT  3
        UNION ALL
        SELECT  4
        ) n
ON      n.num &lt;= ((t.id - 1) % 3) + 2;

INSERT
INTO    game
SELECT  id, CONCAT(&#039;Game &#039;, id),
        CASE n
        WHEN 1 THEN
                CEILING(RAND(20101020 &lt;&lt; 2) * 100000)
        ELSE
                FLOOR(RAND(20101020 &lt;&lt; 3) * 1000) * 3 + n - 1
        END,
        CASE n
        WHEN 1 THEN
                CEILING(RAND(20101020 &lt;&lt; 4) * 100000)
        ELSE
                FLOOR(RAND(20101020 &lt;&lt; 5) * 1000) * 3 + n - 1
        END,
        CASE n
        WHEN 1 THEN
                &#039;player&#039;
        ELSE
                &#039;team&#039;
        END
FROM    (
        SELECT  id, -FLOOR(LOG10(POWER(30, -4) + RAND(20100930) * (1 - POWER(30, -4))) / LOG10(30)) AS n
        FROM    filler
        ) q;
</pre>
</div>
<p>The tables contain <strong>500,000</strong> games, <strong>100,000</strong> players and <strong>3,000</strong> teams.</p>
<p>The participants in the games are distributed logarithmically: about <strong>97%</strong> of all games (<strong>29</strong> out of every <strong>30</strong>) are single player, about <strong>97%</strong> of the remaining games are played by <strong>2&times;2</strong> teams, etc.:</p>
<div class="terminal">
<table class="terminal">
<tr>
<th>players</th>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="bigint">483335</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="bigint">16111</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="bigint">539</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="bigint">15</td>
</tr>
</table>
</div>
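<p>The query behind these counts is not shown. A minimal sketch that should give the same kind of breakdown, assuming the <code>players</code> column means the number of players on the <code>player1</code> side of each game, could look like this:</p>
<pre class="brush: sql">
-- Assumption: players = size of the player1 side (1 for a single player,
-- the number of team members otherwise).
SELECT  CASE g.type
        WHEN &#039;player&#039; THEN
                1
        ELSE
                (
                SELECT  COUNT(*)
                FROM    player_team pt
                WHERE   pt.team = g.player1
                )
        END AS players,
        COUNT(*)
FROM    game g
GROUP BY
        players
</pre>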
<p>The model above may seem quite weird: single players could instead be assigned to fake one-player teams, so that the joins would always be performed through the <code>player_team</code> table. However, that is not required and is in fact less efficient. We don&#8217;t really need that extra lookup if we can find a player right away by its id, and since the majority of games are single player, the majority of queries will be too.</p>
<p>Let&#8217;s see how we can build a list of players&#8217; names. First, let&#8217;s do it for the <code>player1</code> part only:</p>
<pre class="brush: sql">
SELECT  g.id, g.name, GROUP_CONCAT(p.name SEPARATOR &#039;, &#039;) AS players
FROM    game g
LEFT JOIN
        player_team pt
ON      g.type = &#039;team&#039;
        AND pt.team = g.player1
JOIN    player p
ON      p.id = CASE g.type WHEN &#039;player&#039; THEN g.player1 WHEN &#039;team&#039; THEN pt.player END
GROUP BY
        g.id
LIMIT 50
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>name</th>
<th>players</th>
</tr>
<tr>
<td class="integer">1</td>
<td class="varchar">Game 1</td>
<td class="varchar">Player 88151</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="varchar">Game 2</td>
<td class="varchar">Player 60091</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="varchar">Game 3</td>
<td class="varchar">Player 36003</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="varchar">Game 4</td>
<td class="varchar">Player 99741</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="varchar">Game 5</td>
<td class="varchar">Player 90694</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">37</td>
<td class="varchar">Game 37</td>
<td class="varchar">Player 63486</td>
</tr>
<tr>
<td class="integer">38</td>
<td class="varchar">Game 38</td>
<td class="varchar">Player 78226, Player 82189</td>
</tr>
<tr>
<td class="integer">39</td>
<td class="varchar">Game 39</td>
<td class="varchar">Player 34156</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">48</td>
<td class="varchar">Game 48</td>
<td class="varchar">Player 66763</td>
</tr>
<tr>
<td class="integer">49</td>
<td class="varchar">Game 49</td>
<td class="varchar">Player 85215</td>
</tr>
<tr>
<td class="integer">50</td>
<td class="varchar">Game 50</td>
<td class="varchar">Player 25784</td>
</tr>
<tr class="statusbar">
<td colspan="100">50 rows fetched in 0.0014s (0.0029s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">g</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">50</td>
<td class="double">1004398.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">pt</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20101020_matches.g.player1</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">p</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
</table>
</div>
<pre>
select `20101020_matches`.`g`.`id` AS `id`,`20101020_matches`.`g`.`name` AS `name`,group_concat(`20101020_matches`.`p`.`name` separator &#39;, &#39;) AS `players` from `20101020_matches`.`game` `g` left join `20101020_matches`.`player_team` `pt` on(((`20101020_matches`.`pt`.`team` = `20101020_matches`.`g`.`player1`) and (`20101020_matches`.`g`.`type` = &#39;team&#39;))) join `20101020_matches`.`player` `p` where (`20101020_matches`.`p`.`id` = coalesce(`20101020_matches`.`pt`.`player`,`20101020_matches`.`g`.`player1`)) group by `20101020_matches`.`g`.`id` limit 50
</pre>
<p>We see that the query returns the lists of players correctly both for single-player games and team ones (like the one you can see at <strong>Game 38</strong>). The query works in the following way:</p>
<ul>
<li>For each record from <code>game</code>, it determines whether the game is single player or a team match.</li>
<li>If the game is a team match, it joins the <code>player_team</code> table. The query is written so that the game type is a part of the <code>LEFT JOIN</code> condition, which can be short-circuited by the optimizer. This means that if a game is a single player game, no actual lookup is performed against <code>player_team</code>: the optimizer skips this step altogether, knowing for sure that it just needs to return a <code>NULL</code>.
</li>
<li>Finally, the <code>player</code> table is joined, using <code>CASE</code>. The case will evaluate to either <code>player_team.player</code> or to <code>game.player1</code> (depending on the game type).</li>
</ul>
<p>Theoretically, the <code>CASE</code> could be replaced with a mere <code>COALESCE(player_team.player, game.player1)</code>; however, if the game type were <code>team</code> but the team itself were empty, this expression would incorrectly evaluate to <code>game.player1</code>, regardless of the game type. So it&#8217;s better to leave it as is.</p>
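<p>The difference is easy to see on literals. In this sketch, <code>42</code> stands in for <code>game.player1</code> (a team id in a <code>team</code> game) and <code>NULL</code> for the missing <code>player_team.player</code> of an empty team; both are made-up values for illustration:</p>
<pre class="brush: sql">
-- An empty team in a game of type &#039;team&#039;: the LEFT JOIN yields NULL.
SELECT  COALESCE(NULL, 42) AS with_coalesce,
        CASE &#039;team&#039; WHEN &#039;player&#039; THEN 42 WHEN &#039;team&#039; THEN NULL END AS with_case
</pre>
<p><code>COALESCE</code> falls back to <code>42</code>, which would then be looked up in <code>player</code> as if the team id were a player id, while the <code>CASE</code> yields <code>NULL</code> and the join matches nothing.</p>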
<p>Now, we have an aggregate of the players&#8217; names from one side. How do we do the same for both sides?</p>
<p>Since we need to calculate aggregates for two different columns, it&#8217;s better to replace the <code>JOINs</code> with correlated subqueries. However, there is a little problem.</p>
<p>Normally, a correlated subquery just replaces the <code>ON</code> condition with a <code>WHERE</code> condition. But that works with a plain <code>JOIN</code>, and here we have a <code>LEFT JOIN</code>. We cannot just replace it with <code>WHERE</code>: it won&#8217;t match anything.</p>
<p>Unfortunately, <code>MySQL</code> has a very limited ability to correlate subqueries. For instance, a correlated value cannot be nested more than one level deep, nor can it be used in an <code>ON</code> clause of a join. That&#8217;s why we will need to make one extra join in our subqueries.</p>
<p>To rewrite the aggregate query as subqueries, we will need to join <code>game</code> to itself in the subqueries and use the values of the joined table instead of the correlated values:</p>
<pre class="brush: sql">
SELECT  g.id, g.name,
        (
        SELECT  GROUP_CONCAT(p.name SEPARATOR &#039;, &#039;) AS players
        FROM    game gi
        LEFT JOIN  
                player_team pt
        ON      gi.type = &#039;team&#039;
                AND pt.team = gi.player1
        JOIN    player p
        ON      p.id = CASE gi.type WHEN &#039;player&#039; THEN gi.player1 WHEN &#039;team&#039; THEN pt.player END
        WHERE   gi.id = g.id
        ) AS playerlist1,
        (
        SELECT  GROUP_CONCAT(p.name SEPARATOR &#039;, &#039;) AS players
        FROM    game gi
        LEFT JOIN
                player_team pt
        ON      gi.type = &#039;team&#039;
                AND pt.team = gi.player2
        JOIN    player p
        ON      p.id = CASE gi.type WHEN &#039;player&#039; THEN gi.player2 WHEN &#039;team&#039; THEN pt.player END
        WHERE   gi.id = g.id
        ) AS playerlist2
FROM    game g
ORDER BY
        id
LIMIT 50
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>name</th>
<th>playerlist1</th>
<th>playerlist2</th>
</tr>
<tr>
<td class="integer">1</td>
<td class="varchar">Game 1</td>
<td class="varchar">Player 88151</td>
<td class="varchar">Player 6037</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="varchar">Game 2</td>
<td class="varchar">Player 60091</td>
<td class="varchar">Player 54100</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="varchar">Game 3</td>
<td class="varchar">Player 36003</td>
<td class="varchar">Player 52387</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="varchar">Game 4</td>
<td class="varchar">Player 99741</td>
<td class="varchar">Player 99634</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="varchar">Game 5</td>
<td class="varchar">Player 90694</td>
<td class="varchar">Player 41009</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">37</td>
<td class="varchar">Game 37</td>
<td class="varchar">Player 63486</td>
<td class="varchar">Player 3322</td>
</tr>
<tr>
<td class="integer">38</td>
<td class="varchar">Game 38</td>
<td class="varchar">Player 78226, Player 82189</td>
<td class="varchar">Player 4761, Player 98354</td>
</tr>
<tr>
<td class="integer">39</td>
<td class="varchar">Game 39</td>
<td class="varchar">Player 34156</td>
<td class="varchar">Player 82799</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">48</td>
<td class="varchar">Game 48</td>
<td class="varchar">Player 66763</td>
<td class="varchar">Player 41569</td>
</tr>
<tr>
<td class="integer">49</td>
<td class="varchar">Game 49</td>
<td class="varchar">Player 85215</td>
<td class="varchar">Player 6814</td>
</tr>
<tr>
<td class="integer">50</td>
<td class="varchar">Game 50</td>
<td class="varchar">Player 25784</td>
<td class="varchar">Player 9361</td>
</tr>
<tr class="statusbar">
<td colspan="100">50 rows fetched in 0.0018s (0.0050s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">g</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">50</td>
<td class="double">1004398.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">gi</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20101020_matches.g.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">pt</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20101020_matches.gi.player2</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">p</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">gi</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20101020_matches.g.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">pt</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20101020_matches.gi.player1</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">p</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20101020_matches.g.id&#39; of SELECT #2 was resolved in SELECT #1
Field or reference &#39;20101020_matches.g.id&#39; of SELECT #3 was resolved in SELECT #1
select `20101020_matches`.`g`.`id` AS `id`,`20101020_matches`.`g`.`name` AS `name`,(select group_concat(`20101020_matches`.`p`.`name` separator &#39;, &#39;) AS `players` from `20101020_matches`.`game` `gi` left join `20101020_matches`.`player_team` `pt` on(((`20101020_matches`.`pt`.`team` = `20101020_matches`.`gi`.`player1`) and (`20101020_matches`.`gi`.`type` = &#39;team&#39;))) join `20101020_matches`.`player` `p` where ((`20101020_matches`.`gi`.`id` = `20101020_matches`.`g`.`id`) and (`20101020_matches`.`p`.`id` = (case `20101020_matches`.`gi`.`type` when &#39;player&#39; then `20101020_matches`.`gi`.`player1` when &#39;team&#39; then `20101020_matches`.`pt`.`player` end)))) AS `(
        SELECT  GROUP_CONCAT(p.name SEPARATOR &#39;, &#39;) AS players
        FROM    game gi
        LEFT JOIN  
                player_team pt
        ON      gi.type = &#39;team&#39;
                AND pt.team = gi.player1
        JOIN    player p
        ON      `,(select group_concat(`20101020_matches`.`p`.`name` separator &#39;, &#39;) AS `players` from `20101020_matches`.`game` `gi` left join `20101020_matches`.`player_team` `pt` on(((`20101020_matches`.`pt`.`team` = `20101020_matches`.`gi`.`player2`) and (`20101020_matches`.`gi`.`type` = &#39;team&#39;))) join `20101020_matches`.`player` `p` where ((`20101020_matches`.`gi`.`id` = `20101020_matches`.`g`.`id`) and (`20101020_matches`.`p`.`id` = (case `20101020_matches`.`gi`.`type` when &#39;player&#39; then `20101020_matches`.`gi`.`player2` when &#39;team&#39; then `20101020_matches`.`pt`.`player` end)))) AS `(
        SELECT  GROUP_CONCAT(p.name SEPARATOR &#39;, &#39;) AS players
        FROM    game gi
        LEFT JOIN
                player_team pt
        ON      gi.type = &#39;team&#39;
                AND pt.team = gi.player2
        JOIN    player p
        ON      p.` from `20101020_matches`.`game` `g` order by `20101020_matches`.`g`.`id` limit 50
</pre>
<p>Instead of using the correlated values <code>g.type</code> and <code>g.player*</code> directly, we join another instance of <code>game</code> (aliased as <code>gi</code>) on <code>g.id</code> and use its fields in the join conditions.</p>
<p>This of course adds some overhead to the query. However, the <code>PRIMARY KEY</code> join on an already cached record is quite fast.</p>
<p>Let&#8217;s run two queries that select player lists for all games and compare the execution times.</p>
<p>First, the grouping query:</p>
<pre class="brush: sql">
SELECT  SUM(LENGTH(players))
FROM    (
        SELECT  g.id, g.name, GROUP_CONCAT(p.name SEPARATOR &#039;, &#039;) AS players
        FROM    game g
        LEFT JOIN
                player_team pt
        ON      g.type = &#039;team&#039;
                AND pt.team = g.player1
        JOIN    player p
        ON      p.id = CASE g.type WHEN &#039;player&#039; THEN g.player1 WHEN &#039;team&#039; THEN pt.player END
        GROUP BY
                g.id
        ORDER BY
                NULL
        ) q
</pre>
<p><a href="#" onclick="xcollapse('X4493');return false;"><strong>Show query details</strong></a><br />
</p>
<div id="X4493" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>SUM(LENGTH(players))</th>
</tr>
<tr>
<td class="decimal">6183688</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (8.3437s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">500000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">g</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">502199</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">pt</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20101020_matches.g.player1</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">p</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
</table>
</div>
<pre>
select sum(length(`q`.`players`)) AS `SUM(LENGTH(players))` from (select `20101020_matches`.`g`.`id` AS `id`,`20101020_matches`.`g`.`name` AS `name`,group_concat(`20101020_matches`.`p`.`name` separator &#39;, &#39;) AS `players` from `20101020_matches`.`game` `g` left join `20101020_matches`.`player_team` `pt` on(((`20101020_matches`.`pt`.`team` = `20101020_matches`.`g`.`player1`) and (`20101020_matches`.`g`.`type` = &#39;team&#39;))) join `20101020_matches`.`player` `p` where (`20101020_matches`.`p`.`id` = (case `20101020_matches`.`g`.`type` when &#39;player&#39; then `20101020_matches`.`g`.`player1` when &#39;team&#39; then `20101020_matches`.`pt`.`player` end)) group by `20101020_matches`.`g`.`id` order by NULL) `q`
</pre>
</div>
<p>And the subqueries:</p>
<pre class="brush: sql">
SELECT  g.id, g.name,
        SUM(LENGTH(
        (
        SELECT  GROUP_CONCAT(p.name SEPARATOR &#039;, &#039;) AS players
        FROM    game gi
        LEFT JOIN  
                player_team pt
        ON      gi.type = &#039;team&#039;
                AND pt.team = gi.player1
        JOIN    player p
        ON      p.id = CASE gi.type WHEN &#039;player&#039; THEN gi.player1 WHEN &#039;team&#039; THEN pt.player END
        WHERE   gi.id = g.id
        )
        )) AS len1, 
        SUM(LENGTH(
        (
        SELECT  GROUP_CONCAT(p.name SEPARATOR &#039;, &#039;) AS players
        FROM    game gi
        LEFT JOIN
                player_team pt
        ON      gi.type = &#039;team&#039;
                AND pt.team = gi.player2
        JOIN    player p
        ON      p.id = CASE gi.type WHEN &#039;player&#039; THEN gi.player2 WHEN &#039;team&#039; THEN pt.player END
        WHERE   gi.id = g.id
        )
        )) AS len2
FROM    game g
</pre>
<p><a href="#" onclick="xcollapse('X4844');return false;"><strong>Show query details</strong></a><br />
</p>
<div id="X4844" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>name</th>
<th>len1</th>
<th>len2</th>
</tr>
<tr>
<td class="integer">1</td>
<td class="varchar">Game 1</td>
<td class="decimal">6183688</td>
<td class="decimal">6184211</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (12.8749s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">g</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">502199</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">gi</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20101020_matches.g.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">pt</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20101020_matches.gi.player2</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">p</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">gi</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20101020_matches.g.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">pt</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20101020_matches.gi.player1</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">p</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20101020_matches.g.id&#39; of SELECT #2 was resolved in SELECT #1
Field or reference &#39;20101020_matches.g.id&#39; of SELECT #3 was resolved in SELECT #1
select `20101020_matches`.`g`.`id` AS `id`,`20101020_matches`.`g`.`name` AS `name`,sum(length((select group_concat(`20101020_matches`.`p`.`name` separator &#39;, &#39;) AS `players` from `20101020_matches`.`game` `gi` left join `20101020_matches`.`player_team` `pt` on(((`20101020_matches`.`pt`.`team` = `20101020_matches`.`gi`.`player1`) and (`20101020_matches`.`gi`.`type` = &#39;team&#39;))) join `20101020_matches`.`player` `p` where ((`20101020_matches`.`gi`.`id` = `20101020_matches`.`g`.`id`) and (`20101020_matches`.`p`.`id` = (case `20101020_matches`.`gi`.`type` when &#39;player&#39; then `20101020_matches`.`gi`.`player1` when &#39;team&#39; then `20101020_matches`.`pt`.`player` end)))))) AS `len1`,sum(length((select group_concat(`20101020_matches`.`p`.`name` separator &#39;, &#39;) AS `players` from `20101020_matches`.`game` `gi` left join `20101020_matches`.`player_team` `pt` on(((`20101020_matches`.`pt`.`team` = `20101020_matches`.`gi`.`player2`) and (`20101020_matches`.`gi`.`type` = &#39;team&#39;))) join `20101020_matches`.`player` `p` where ((`20101020_matches`.`gi`.`id` = `20101020_matches`.`g`.`id`) and (`20101020_matches`.`p`.`id` = (case `20101020_matches`.`gi`.`type` when &#39;player&#39; then `20101020_matches`.`gi`.`player2` when &#39;team&#39; then `20101020_matches`.`pt`.`player` end)))))) AS `len2` from `20101020_matches`.`game` `g`
</pre>
</div>
<p>The second query takes only about <strong>54%</strong> longer to run (<strong>12.87</strong> versus <strong>8.34</strong> seconds), but it parses and returns twice as much data.</p>
<p>That means that it&#8217;s more efficient to run a single query to get both lists of players than to run the grouping query twice.</p>
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer questions about database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/10/20/correlated-left-join-in-a-subquery/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>OLAP in MySQL: four ways to filter on higher level dimensions</title>
		<link>http://explainextended.com/2010/09/30/olap-in-mysql-four-ways-to-filter-on-higher-level-dimensions/</link>
		<comments>http://explainextended.com/2010/09/30/olap-in-mysql-four-ways-to-filter-on-higher-level-dimensions/#comments</comments>
		<pubDate>Thu, 30 Sep 2010 19:00:44 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4915</guid>
		<description><![CDATA[Answering questions asked on the site. Denis Kuzmenok asks: I need some help with a query I&#8217;m fighting with. I have these tables: Product id parent name Site id region name Price id product site value For the products with a certain product.parent, I need to select minimal and maximal price within the given site.region, [...]]]></description>
				<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Denis Kuzmenok</strong> asks:</p>
<blockquote>
<p>I need some help with a query I&#8217;m fighting with.</p>
<p>I have these tables:</p>
<table class="excel">
<caption>Product</caption>
<tr>
<th>id</th>
<th>parent</th>
<th>name</th>
</tr>
</table>
<table class="excel">
<caption>Site</caption>
<tr>
<th>id</th>
<th>region</th>
<th>name</th>
</tr>
</table>
<table class="excel">
<caption>Price</caption>
<tr>
<th>id</th>
<th>product</th>
<th>site</th>
<th>value</th>
</tr>
</table>
<p>For the products with a certain <code>product.parent</code>, I need to select minimal and maximal price within the given <code>site.region</code>, and return first <strong>10</strong> products ordered by the minimal price.</p>
<p>Each <code>parent</code> has <strong>100</strong> to <strong>10,000</strong> products, and there are <strong>10,000,000</strong> records in <code>price</code> table.
</p></blockquote>
<p>We have a classical <strong>OLAP</strong> task here: a fact table (<code>price</code>) and two dimension tables (<code>product</code> and <code>site</code>).</p>
<p>The task would be almost trivial if we were given the exact values of <code>product</code> and <code>site</code>. This way, we could build a composite index on these fields and <code>value</code>, filter on the exact values and get the first <strong>10</strong> values from the index.</p>
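<p>As a rough sketch of that trivial case (the ids <strong>42</strong> and <strong>7</strong> are made up for illustration, and a composite index on <code>(site, product, value)</code> is assumed to exist), the query could look like this:</p>
<pre class="brush: sql">
-- Hypothetical exact values: product 42 on site 7.
-- With an index on (site, product, value), the rows for this pair
-- are already stored in value order, so no sorting is needed.
SELECT  value
FROM    price
WHERE   site = 7
        AND product = 42
ORDER BY
        value
LIMIT 10
</pre>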
<p>However, we are given not the values of the dimensions themselves but those of the higher levels (<code>parent</code> for <code>product</code> and <code>region</code> for <code>site</code>).</p>
<p>Since the values of the higher levels are not stored in the fact table, we cannot index them there. We need to join the dimension tables and filter on them.</p>
<p>This article describes four ways to join the tables; their efficiency varies depending on the density of the dimension values.</p>
<p>Since <em>nested loops</em> is the only join algorithm <strong>MySQL</strong> is capable of, we basically need to define the order in which the tables are joined.</p>
<p>There are three tables and, hence, <strong>3! = 6</strong> permutations that define all possible join orders:</p>
<table class="excel">
<tr>
<td>product</td>
<td>site</td>
<td>price</td>
</tr>
<tr>
<td>product</td>
<td>price</td>
<td>site</td>
</tr>
<tr>
<td>site</td>
<td>price</td>
<td>product</td>
</tr>
<tr>
<td>site</td>
<td>product</td>
<td>price</td>
</tr>
<tr>
<td>price</td>
<td>product</td>
<td>site</td>
</tr>
<tr>
<td>price</td>
<td>site</td>
<td>product</td>
</tr>
</table>
<p>However, the two dimension tables are completely independent of each other. This means that if they come one after another in the join, their order does not actually matter: both will be searched for independent values. This reduces the number of distinct combinations to four:</p>
<table class="excel">
<tr>
<td colspan="2">product/site</td>
<td>price</td>
</tr>
<tr>
<td>product</td>
<td>price</td>
<td>site</td>
</tr>
<tr>
<td>site</td>
<td>price</td>
<td>product</td>
</tr>
<tr>
<td>price</td>
<td colspan="2">product/site</td>
</tr>
</table>
<p>The joins must be designed so that the tables with the most selective conditions go first. This means that the join order is determined by the density of the values satisfying the criteria: the more values from a table satisfy the search criteria, the later that table should come in the join, so that by the time it is joined, most values have already been sifted out.</p>
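<p>One way to estimate the density before choosing a join order is to count the members behind each filter. This is only a sketch: the values <strong>100</strong> and <strong>30</strong> are sample filter values, not something the question prescribes:</p>
<pre class="brush: sql">
-- Number of dimension members satisfying each filter.
-- Fewer matching members generally means the table
-- is a better candidate to come earlier in the join.
SELECT  (
        SELECT  COUNT(*)
        FROM    product
        WHERE   parent = 100
        ) AS products,
        (
        SELECT  COUNT(*)
        FROM    site
        WHERE   region = 30
        ) AS sites
</pre>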
<p>Now, let&#8217;s build the queries for each of the join orders. To do this, we will create sample tables:<br />
<span id="more-4915"></span><br />
<a href="#" onclick="xcollapse('X3970');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X3970" style="display: none; background: transparent;">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE product (
        id INT NOT NULL PRIMARY KEY,
        parent INT NOT NULL,
        name TEXT NOT NULL
) ENGINE=InnoDB;

CREATE TABLE price (
        id INT NOT NULL PRIMARY KEY,
        product INT NOT NULL,
        site INT NOT NULL,
        value FLOAT NOT NULL
) ENGINE=InnoDB;

CREATE TABLE site (
        id INT NOT NULL PRIMARY KEY,
        region INT NOT NULL,
        name TEXT NOT NULL
) ENGINE=InnoDB;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(250000);
COMMIT;

INSERT
INTO    product
SELECT  id, -FLOOR(LOG10(POWER(10, -10) + RAND(20100930) * (1 - POWER(10, -10))) * 500 / 10), CONCAT(&#039;Product &#039;, id)
FROM    filler
ORDER BY
        id
LIMIT 250000;

INSERT
INTO    site
SELECT  id, -FLOOR(LOG10(POWER(10, -10) + RAND(20100930) * (1 - POWER(10, -10))) * 100 / 10), CONCAT(&#039;Site &#039;, id)
FROM    filler
ORDER BY
        id
LIMIT 50000;

SET @r := 0;
INSERT
INTO    price
SELECT  @r := @r + 1,
        FLOOR(RAND(20100930 &lt;&lt; 2) * 250000) + 1,
        FLOOR(RAND(20100930 &lt;&lt; 3) * 50000) + 1,
        (RAND(20100930 &lt;&lt; 4) * 100) + 20
FROM    product
CROSS JOIN
        (
        SELECT  id
        FROM    filler
        LIMIT 25
        ) q;

CREATE INDEX ix_product_parent ON product (parent);
CREATE INDEX ix_site_region ON site (region);
CREATE INDEX ix_price_site_product_value_id ON price (site, product, value, id);
CREATE INDEX ix_price_product_value_id ON price (product, value, id);
CREATE INDEX ix_price_value_id ON price (value, id);
</pre>
</div>
<p>Members of the dimensions are distributed unevenly (the level values are generated from the logarithm of a random number), so the number of members decreases from the lower to the higher level values.</p>
<p>Say, <strong>11,379</strong> products of <strong>250,000</strong> (<strong>4.55%</strong> of all products) belong to the parent <strong>1</strong>, while only <strong>2,906</strong> products (<strong>1.16%</strong>) belong to the parent <strong>30</strong>:</p>
<div class="terminal">
<table class="terminal">
<tr>
<th>parent</th>
<th>total</th>
<th>percent</th>
</tr>
<tr>
<td class="integer">1</td>
<td class="bigint">11379</td>
<td class="decimal">4.55</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="bigint">10724</td>
<td class="decimal">4.29</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="bigint">10259</td>
<td class="decimal">4.10</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="bigint">9830</td>
<td class="decimal">3.93</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="bigint">9502</td>
<td class="decimal">3.80</td>
</tr>
<tr>
<td class="integer">6</td>
<td class="bigint">8880</td>
<td class="decimal">3.55</td>
</tr>
<tr>
<td class="integer">7</td>
<td class="bigint">8555</td>
<td class="decimal">3.42</td>
</tr>
<tr>
<td class="integer">8</td>
<td class="bigint">8072</td>
<td class="decimal">3.23</td>
</tr>
<tr>
<td class="integer">9</td>
<td class="bigint">7715</td>
<td class="decimal">3.09</td>
</tr>
<tr>
<td class="integer">10</td>
<td class="bigint">7522</td>
<td class="decimal">3.01</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">252</td>
<td class="bigint">1</td>
<td class="decimal">0.00</td>
</tr>
<tr>
<td class="integer">253</td>
<td class="bigint">1</td>
<td class="decimal">0.00</td>
</tr>
<tr>
<td class="integer">263</td>
<td class="bigint">1</td>
<td class="decimal">0.00</td>
</tr>
<tr>
<td class="integer">309</td>
<td class="bigint">1</td>
<td class="decimal">0.00</td>
</tr>
<tr>
<td class="integer"></td>
<td class="bigint">250000</td>
<td class="decimal">100.00</td>
</tr>
</table>
</div>
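<p>A breakdown like the one above could be obtained with a grouping query along these lines. This is an assumption about how such a report might be built, not necessarily the query that produced the table:</p>
<pre class="brush: sql">
-- Products per parent with their share of the total.
-- WITH ROLLUP appends a summary row with a NULL parent.
SELECT  parent,
        COUNT(*) AS total,
        ROUND(COUNT(*) * 100 / (SELECT COUNT(*) FROM product), 2) AS percent
FROM    product
GROUP BY
        parent WITH ROLLUP
</pre>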
<p>The same distribution applies to the sites:</p>
<div class="terminal">
<table class="terminal">
<tr>
<th>region</th>
<th>total</th>
<th>percent</th>
</tr>
<tr>
<td class="integer">1</td>
<td class="bigint">10446</td>
<td class="decimal">20.89</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="bigint">8010</td>
<td class="decimal">16.02</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="bigint">6602</td>
<td class="decimal">13.20</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="bigint">5119</td>
<td class="decimal">10.24</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="bigint">4096</td>
<td class="decimal">8.19</td>
</tr>
<tr>
<td class="integer">6</td>
<td class="bigint">3296</td>
<td class="decimal">6.59</td>
</tr>
<tr>
<td class="integer">7</td>
<td class="bigint">2604</td>
<td class="decimal">5.21</td>
</tr>
<tr>
<td class="integer">8</td>
<td class="bigint">2006</td>
<td class="decimal">4.01</td>
</tr>
<tr>
<td class="integer">9</td>
<td class="bigint">1614</td>
<td class="decimal">3.23</td>
</tr>
<tr>
<td class="integer">10</td>
<td class="bigint">1243</td>
<td class="decimal">2.49</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">43</td>
<td class="bigint">1</td>
<td class="decimal">0.00</td>
</tr>
<tr>
<td class="integer">45</td>
<td class="bigint">1</td>
<td class="decimal">0.00</td>
</tr>
<tr>
<td class="integer">46</td>
<td class="bigint">1</td>
<td class="decimal">0.00</td>
</tr>
<tr>
<td class="integer">50</td>
<td class="bigint">1</td>
<td class="decimal">0.00</td>
</tr>
<tr>
<td class="integer"></td>
<td class="bigint">50000</td>
<td class="decimal">100.00</td>
</tr>
<tr class="statusbar">
<td colspan="100">47 rows fetched in 0.0014s (0.0709s)</td>
</tr>
</table>
</div>
<h3>Product/site, price</h3>
<p>This is the most straightforward query. Since the product and site tables come first in the join, it works best for <strong>sparse</strong> products and <strong>sparse</strong> sites.</p>
<p>Both dimension tables are filtered first and the results of the filtering are cross joined. With sparse dimensions, this still yields relatively few combinations to look up in the prices.</p>
<p>Here&#8217;s the query:</p>
<pre class="brush: sql">
SELECT  p.*, MIN(r.value) AS min_value
FROM    product p
CROSS JOIN
        site s
STRAIGHT_JOIN
        price r
ON      r.product = p.id
        AND r.site = s.id
WHERE   p.parent = 100
        AND s.region = 30
GROUP BY
        p.id
ORDER BY
        min_value
LIMIT 10
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>parent</th>
<th>name</th>
<th>min_value</th>
</tr>
<tr>
<td class="integer">195818</td>
<td class="integer">100</td>
<td class="blob">Product 195818</td>
<td class="float">30.3642</td>
</tr>
<tr>
<td class="integer">54185</td>
<td class="integer">100</td>
<td class="blob">Product 54185</td>
<td class="float">58.377</td>
</tr>
<tr class="statusbar">
<td colspan="100">2 rows fetched in 0.0002s (0.0131s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">p</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY,ix_product_parent</td>
<td class="varchar">ix_product_parent</td>
<td class="varchar">4</td>
<td class="varchar">const</td>
<td class="bigint">136</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">s</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY,ix_site_region</td>
<td class="varchar">ix_site_region</td>
<td class="varchar">4</td>
<td class="varchar">const</td>
<td class="bigint">12</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">r</td>
<td class="varchar">ref</td>
<td class="varchar">ix_price_site_product_value_id,ix_price_product_value_id</td>
<td class="varchar">ix_price_site_product_value_id</td>
<td class="varchar">8</td>
<td class="varchar">20100930_prices.s.id,20100930_prices.p.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
select `20100930_prices`.`p`.`id` AS `id`,`20100930_prices`.`p`.`parent` AS `parent`,`20100930_prices`.`p`.`name` AS `name`,min(`20100930_prices`.`r`.`value`) AS `min_value` from `20100930_prices`.`product` `p` join `20100930_prices`.`site` `s` straight_join `20100930_prices`.`price` `r` where ((`20100930_prices`.`r`.`site` = `20100930_prices`.`s`.`id`) and (`20100930_prices`.`r`.`product` = `20100930_prices`.`p`.`id`) and (`20100930_prices`.`s`.`region` = 30) and (`20100930_prices`.`p`.`parent` = 100)) group by `20100930_prices`.`p`.`id` order by min(`20100930_prices`.`r`.`value`) limit 10
</pre>
<p>We see that there is a <code>filesort</code> operation in the plan, but since there are relatively few values to sort, this operation is quite fast.</p>
<p>This query completes in <strong>13 ms</strong>.</p>
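<p>The original question also asks for the maximal price. A sketch of the same query extended with it (not benchmarked here) only needs one more aggregate:</p>
<pre class="brush: sql">
-- Same join order as above; MAX(r.value) is computed
-- over the same group of price rows as MIN(r.value).
SELECT  p.*, MIN(r.value) AS min_value, MAX(r.value) AS max_value
FROM    product p
CROSS JOIN
        site s
STRAIGHT_JOIN
        price r
ON      r.product = p.id
        AND r.site = s.id
WHERE   p.parent = 100
        AND s.region = 30
GROUP BY
        p.id
ORDER BY
        min_value
LIMIT 10
</pre>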
<h3>Site, price, product</h3>
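<p>This query is quite similar to the previous one, differing only in the <code>JOIN</code> order: the sites are filtered first, their prices are looked up next, and each price is then checked against the product&#8217;s <code>parent</code>. It is best for <strong>sparse</strong> sites and <strong>dense</strong> products:</p>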
<pre class="brush: sql">
SELECT  p.*, MIN(r.value) AS min_value
FROM    site s
STRAIGHT_JOIN
        price r
ON      r.site = s.id
STRAIGHT_JOIN
        product p
ON      p.parent = 1
        AND p.id = r.product
WHERE   s.region = 30
GROUP BY
        p.id
ORDER BY
        min_value
LIMIT 10
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>parent</th>
<th>name</th>
<th>min_value</th>
</tr>
<tr>
<td class="integer">82678</td>
<td class="integer">1</td>
<td class="blob">Product 82678</td>
<td class="float">22.7273</td>
</tr>
<tr>
<td class="integer">35757</td>
<td class="integer">1</td>
<td class="blob">Product 35757</td>
<td class="float">23.1834</td>
</tr>
<tr>
<td class="integer">244754</td>
<td class="integer">1</td>
<td class="blob">Product 244754</td>
<td class="float">23.5035</td>
</tr>
<tr>
<td class="integer">217731</td>
<td class="integer">1</td>
<td class="blob">Product 217731</td>
<td class="float">24.3615</td>
</tr>
<tr>
<td class="integer">99861</td>
<td class="integer">1</td>
<td class="blob">Product 99861</td>
<td class="float">25.5681</td>
</tr>
<tr>
<td class="integer">127003</td>
<td class="integer">1</td>
<td class="blob">Product 127003</td>
<td class="float">26.8058</td>
</tr>
<tr>
<td class="integer">69848</td>
<td class="integer">1</td>
<td class="blob">Product 69848</td>
<td class="float">26.9732</td>
</tr>
<tr>
<td class="integer">128191</td>
<td class="integer">1</td>
<td class="blob">Product 128191</td>
<td class="float">32.3508</td>
</tr>
<tr>
<td class="integer">99234</td>
<td class="integer">1</td>
<td class="blob">Product 99234</td>
<td class="float">32.9937</td>
</tr>
<tr>
<td class="integer">242035</td>
<td class="integer">1</td>
<td class="blob">Product 242035</td>
<td class="float">33.9945</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0004s (0.0300s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">s</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY,ix_site_region</td>
<td class="varchar">ix_site_region</td>
<td class="varchar">4</td>
<td class="varchar">const</td>
<td class="bigint">12</td>
<td class="double">100.00</td>
<td class="varchar">Using index; Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">r</td>
<td class="varchar">ref</td>
<td class="varchar">ix_price_site_product_value_id,ix_price_product_value_id</td>
<td class="varchar">ix_price_site_product_value_id</td>
<td class="varchar">4</td>
<td class="varchar">20100930_prices.s.id</td>
<td class="bigint">54</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">p</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY,ix_product_parent</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20100930_prices.r.product</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
</table>
</div>
<pre>
select `20100930_prices`.`p`.`id` AS `id`,`20100930_prices`.`p`.`parent` AS `parent`,`20100930_prices`.`p`.`name` AS `name`,min(`20100930_prices`.`r`.`value`) AS `min_value` from `20100930_prices`.`site` `s` straight_join `20100930_prices`.`price` `r` straight_join `20100930_prices`.`product` `p` where ((`20100930_prices`.`r`.`site` = `20100930_prices`.`s`.`id`) and (`20100930_prices`.`p`.`id` = `20100930_prices`.`r`.`product`) and (`20100930_prices`.`p`.`parent` = 1) and (`20100930_prices`.`s`.`region` = 30)) group by `20100930_prices`.`p`.`id` order by min(`20100930_prices`.`r`.`value`) limit 10
</pre>
<p>This query completes in <strong>30 ms</strong>: the conditions, taken together, are less selective than in the previous query.</p>
<h3>Product, price, site</h3>
<p>This query is best for <strong>sparse</strong> products and <strong>dense</strong> sites.</p>
<p>In this query, we will replace the <code>JOIN</code> against the site table with an <code>IN</code> predicate, since we don&#8217;t actually need any information from that table:</p>
<pre class="brush: sql">
SELECT  p.*, MIN(r.value) AS min_value
FROM    product p
STRAIGHT_JOIN
        price r
ON      r.product = p.id
WHERE   p.parent = 100
        AND r.site IN
        (
        SELECT  id
        FROM    site
        WHERE   region = 1
        )
GROUP BY
        p.id
ORDER BY
        min_value
LIMIT 10
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>parent</th>
<th>name</th>
<th>min_value</th>
</tr>
<tr>
<td class="integer">44246</td>
<td class="integer">100</td>
<td class="blob">Product 44246</td>
<td class="float">20.4741</td>
</tr>
<tr>
<td class="integer">54185</td>
<td class="integer">100</td>
<td class="blob">Product 54185</td>
<td class="float">20.5249</td>
</tr>
<tr>
<td class="integer">241517</td>
<td class="integer">100</td>
<td class="blob">Product 241517</td>
<td class="float">20.7065</td>
</tr>
<tr>
<td class="integer">131185</td>
<td class="integer">100</td>
<td class="blob">Product 131185</td>
<td class="float">21.1123</td>
</tr>
<tr>
<td class="integer">61725</td>
<td class="integer">100</td>
<td class="blob">Product 61725</td>
<td class="float">21.1389</td>
</tr>
<tr>
<td class="integer">80285</td>
<td class="integer">100</td>
<td class="blob">Product 80285</td>
<td class="float">21.3354</td>
</tr>
<tr>
<td class="integer">22533</td>
<td class="integer">100</td>
<td class="blob">Product 22533</td>
<td class="float">21.4976</td>
</tr>
<tr>
<td class="integer">226588</td>
<td class="integer">100</td>
<td class="blob">Product 226588</td>
<td class="float">21.5061</td>
</tr>
<tr>
<td class="integer">114991</td>
<td class="integer">100</td>
<td class="blob">Product 114991</td>
<td class="float">21.5185</td>
</tr>
<tr>
<td class="integer">5571</td>
<td class="integer">100</td>
<td class="blob">Product 5571</td>
<td class="float">21.7684</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0004s (0.0302s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">p</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY,ix_product_parent</td>
<td class="varchar">ix_product_parent</td>
<td class="varchar">4</td>
<td class="varchar">const</td>
<td class="bigint">136</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">r</td>
<td class="varchar">ref</td>
<td class="varchar">ix_price_product_value_id</td>
<td class="varchar">ix_price_product_value_id</td>
<td class="varchar">4</td>
<td class="varchar">20100930_prices.p.id</td>
<td class="bigint">12</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">site</td>
<td class="varchar">unique_subquery</td>
<td class="varchar">PRIMARY,ix_site_region</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
</table>
</div>
<pre>
select `20100930_prices`.`p`.`id` AS `id`,`20100930_prices`.`p`.`parent` AS `parent`,`20100930_prices`.`p`.`name` AS `name`,min(`20100930_prices`.`r`.`value`) AS `min_value` from `20100930_prices`.`product` `p` straight_join `20100930_prices`.`price` `r` where ((`20100930_prices`.`r`.`product` = `20100930_prices`.`p`.`id`) and (`20100930_prices`.`p`.`parent` = 100) and &lt;in_optimizer&gt;(`20100930_prices`.`r`.`site`,&lt;exists&gt;(&lt;primary_index_lookup&gt;(&lt;cache&gt;(`20100930_prices`.`r`.`site`) in site on PRIMARY where ((`20100930_prices`.`site`.`region` = 1) and (&lt;cache&gt;(`20100930_prices`.`r`.`site`) = `20100930_prices`.`site`.`id`)))))) group by `20100930_prices`.`p`.`id` order by min(`20100930_prices`.`r`.`value`) limit 10
</pre>
<p>Again, this query completes in <strong>30 ms</strong>.</p>
<h3>Price, product/site</h3>
<p>This query is best for <strong>dense</strong> products and <strong>dense</strong> sites. It is the most complex of the four queries and requires some explanation.</p>
<p>At first sight, it may seem that <code>price</code> cannot be filtered on. However, that is not true.</p>
<p>What we need is the <strong>10</strong> lowest values from <code>price</code> satisfying certain conditions (the right products and sites, and, additionally, the products should be unique).</p>
<p>This means that we could make use of an index on <code>price.value</code>, scanning it until 10 records satisfying these conditions are returned. In this case, <code>ORDER BY</code> and <code>LIMIT 10</code> would serve as filtering conditions: the scanning would cease as soon as the limit is reached.</p>
<p>And of course, the denser the products and sites are, the more likely each record is to satisfy the conditions, and the sooner the query completes.</p>
<p>Filtering on products and sites is easy, but to ensure that the products are unique, we will use the trick described in my previous article:</p>
<ul>
<li><strong><a href="/2010/08/24/20-latest-unique-records/">20 latest unique records</a></strong></li>
</ul>
<p>Basically, when filtering a price, we should not only make sure that it holds the correct product, but that it&#8217;s the first (lowest) price for a given product.</p>
<p>This can be easily done with an additional <code>ORDER BY / LIMIT</code> in a subquery.</p>
<p>Here&#8217;s the query:</p>
<pre class="brush: sql">
SELECT  p.*, value AS min_value
FROM    price r
STRAIGHT_JOIN
        product p
ON      p.id = r.product
WHERE   p.parent = 1
        AND r.id =
        (
        SELECT  ri.id
        FROM    price ri
        WHERE   ri.product = p.id
                AND ri.site IN
                (
                SELECT  id
                FROM    site
                WHERE   region = 1
                )
        ORDER BY
                ri.product, ri.value, ri.id
        LIMIT 1
        )
ORDER BY
        r.value, r.id
LIMIT 10
        
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>parent</th>
<th>name</th>
<th>min_value</th>
</tr>
<tr>
<td class="integer">80226</td>
<td class="integer">1</td>
<td class="blob">Product 80226</td>
<td class="float">20.0021</td>
</tr>
<tr>
<td class="integer">168980</td>
<td class="integer">1</td>
<td class="blob">Product 168980</td>
<td class="float">20.0028</td>
</tr>
<tr>
<td class="integer">185520</td>
<td class="integer">1</td>
<td class="blob">Product 185520</td>
<td class="float">20.0041</td>
</tr>
<tr>
<td class="integer">131416</td>
<td class="integer">1</td>
<td class="blob">Product 131416</td>
<td class="float">20.0054</td>
</tr>
<tr>
<td class="integer">72154</td>
<td class="integer">1</td>
<td class="blob">Product 72154</td>
<td class="float">20.0055</td>
</tr>
<tr>
<td class="integer">76637</td>
<td class="integer">1</td>
<td class="blob">Product 76637</td>
<td class="float">20.0067</td>
</tr>
<tr>
<td class="integer">95946</td>
<td class="integer">1</td>
<td class="blob">Product 95946</td>
<td class="float">20.0091</td>
</tr>
<tr>
<td class="integer">28136</td>
<td class="integer">1</td>
<td class="blob">Product 28136</td>
<td class="float">20.0102</td>
</tr>
<tr>
<td class="integer">21426</td>
<td class="integer">1</td>
<td class="blob">Product 21426</td>
<td class="float">20.0142</td>
</tr>
<tr>
<td class="integer">24976</td>
<td class="integer">1</td>
<td class="blob">Product 24976</td>
<td class="float">20.0145</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0004s (0.0144s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">r</td>
<td class="varchar">index</td>
<td class="varchar">PRIMARY,ix_price_product_value_id</td>
<td class="varchar">ix_price_value_id</td>
<td class="varchar">8</td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">62505352.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">p</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY,ix_product_parent</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20100930_prices.r.product</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ri</td>
<td class="varchar">ref</td>
<td class="varchar">ix_price_product_value_id</td>
<td class="varchar">ix_price_product_value_id</td>
<td class="varchar">4</td>
<td class="varchar">20100930_prices.p.id</td>
<td class="bigint">12</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">site</td>
<td class="varchar">unique_subquery</td>
<td class="varchar">PRIMARY,ix_site_region</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100930_prices.p.id&#39; of SELECT #2 was resolved in SELECT #1
select `20100930_prices`.`p`.`id` AS `id`,`20100930_prices`.`p`.`parent` AS `parent`,`20100930_prices`.`p`.`name` AS `name`,`20100930_prices`.`r`.`value` AS `min_value` from `20100930_prices`.`price` `r` straight_join `20100930_prices`.`product` `p` where ((`20100930_prices`.`p`.`id` = `20100930_prices`.`r`.`product`) and (`20100930_prices`.`p`.`parent` = 1) and (`20100930_prices`.`r`.`id` = (select `20100930_prices`.`ri`.`id` from `20100930_prices`.`price` `ri` where ((`20100930_prices`.`ri`.`product` = `20100930_prices`.`p`.`id`) and &lt;in_optimizer&gt;(`20100930_prices`.`ri`.`site`,&lt;exists&gt;(&lt;primary_index_lookup&gt;(&lt;cache&gt;(`20100930_prices`.`ri`.`site`) in site on PRIMARY where ((`20100930_prices`.`site`.`region` = 1) and (&lt;cache&gt;(`20100930_prices`.`ri`.`site`) = `20100930_prices`.`site`.`id`)))))) order by `20100930_prices`.`ri`.`product`,`20100930_prices`.`ri`.`value`,`20100930_prices`.`ri`.`id` limit 1))) order by `20100930_prices`.`r`.`value`,`20100930_prices`.`r`.`id` limit 10
</pre>
<p>We see that this query completes in only <strong>14 ms</strong> despite the fact that we used the most populated <code>parent</code> and <code>region</code>. As I said before, the <code>LIMIT 10</code> served as another filtering condition.</p>
<h3>Summary</h3>
<p>Let&#8217;s run all possible queries for all possible distributions and record the query times in a table:</p>
<p><strong>P</strong>, <strong>S</strong> and <strong>R</strong> stand for <strong>p</strong>roduct, <strong>s</strong>ite and p<strong>r</strong>ice in the join order.</p>
<p><strong>+Inf</strong> means that the query did not complete in a reasonable time and had to be terminated. The fastest query for each combination is shown in red.</p>
<table class="excel">
<caption>Query time, <strong>ms</strong></caption>
<tr style="text-align: center">
<td colspan="2" rowspan="3"/>
<td colspan="8"><strong>Product</strong></td>
</tr>
<tr style="text-align: center">
<td colspan="4">Sparse (parent <strong>100</strong>)</td>
<td colspan="4">Dense (parent <strong>1</strong>)</td>
</tr>
<tr>
<th>P/S,R</th>
<th>S,R,P</th>
<th>P,R,S</th>
<th>R,P/S</th>
<th>P/S,R</th>
<th>S,R,P</th>
<th>P,R,S</th>
<th>R,P/S</th>
</tr>
<tr style="text-align: right">
<td rowspan="2" style="text-align: center"><strong>Site</strong></td>
<td style="text-align: center">Sparse (region <strong>30</strong>)</td>
<td style="border-style: none; color:red; ">13</td>
<td style="border-style: none;">27</td>
<td style="border-style: none;">30</td>
<td style="border-style: none solid none none;"><em>+Inf</em></td>
<td style="border-style: none;">621</td>
<td style="border-style: none; color:red; ">30</td>
<td style="border-style: none;">6,703</td>
<td style="border-style: none;">51,468</td>
</tr>
<tr style="text-align: right">
<td style="text-align: center">Dense (region <strong>1</strong>)</td>
<td style="border-style: none;">7,679</td>
<td style="border-style: none;">4,427</td>
<td style="border-style: none; color:red; ">30</td>
<td style="border-style: none solid none none;">1,051</td>
<td style="border-style: none;"><em>+Inf</em></td>
<td style="border-style: none;">5,914</td>
<td style="border-style: none;">6,736</td>
<td style="border-style: none; color:red; ">14</td>
</tr>
</table>
<h3>Conclusion</h3>
<p>When the results are limited, <strong>MySQL</strong> allows filtering fact tables efficiently on higher-level dimensions despite the fact that these dimensions cannot be indexed. However, in each case the selectivity of the dimension level should be taken into account and the appropriate query should be used.</p>
<p>It should be noted that the queries involving a <code>CROSS JOIN</code> and an index scan on the fact table perform intolerably poorly in the edge cases (too dense and too sparse dimensions, respectively). On the other hand, the queries involving a join differ only in the join order, which can be predicted by <strong>MySQL</strong>.</p>
<p>This means that when the dimension selectivity is unknown, a query using a plain join of all three tables (without forcing the join order) should be the query of choice:</p>
<pre class="brush: sql">
SELECT  p.*, MIN(r.value) AS min_value
FROM    product p
CROSS JOIN
        site s
JOIN    price r
ON      r.product = p.id
        AND r.site = s.id
WHERE   p.parent = 1
        AND s.region = 1
GROUP BY
        p.id
ORDER BY
        min_value
LIMIT 10
</pre>
<p>With this query, <strong>MySQL</strong> can select a join order quite close to optimal, and the query would complete in reasonable time.</p>
<p>However, when the distribution on both dimensions is known, the appropriate query from the table above can be chosen.</p>
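<p>A minimal sketch of how both cases could be checked in practice, assuming the <code>product</code>, <code>site</code> and <code>price</code> tables used above. The first two statements count the rows behind a given parent and region (the literal values are only examples), and the <code>EXPLAIN</code> lists the tables in the order in which the optimizer decided to read them:</p>
<pre class="brush: sql">
-- How dense is this parent? (1 is an example value)
SELECT  COUNT(*) AS products
FROM    product
WHERE   parent = 1;

-- How dense is this region? (1 is an example value)
SELECT  COUNT(*) AS sites
FROM    site
WHERE   region = 1;

-- Which join order does the optimizer pick for the plain join?
-- The rows of the EXPLAIN output are listed in join order.
EXPLAIN
SELECT  p.*, MIN(r.value) AS min_value
FROM    product p
CROSS JOIN
        site s
JOIN    price r
ON      r.product = p.id
        AND r.site = s.id
WHERE   p.parent = 1
        AND s.region = 1
GROUP BY
        p.id
ORDER BY
        min_value
LIMIT 10;
</pre>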
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/09/30/olap-in-mysql-four-ways-to-filter-on-higher-level-dimensions/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>20 latest unique records</title>
		<link>http://explainextended.com/2010/08/24/20-latest-unique-records/</link>
		<comments>http://explainextended.com/2010/08/24/20-latest-unique-records/#comments</comments>
		<pubDate>Tue, 24 Aug 2010 19:00:38 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4884</guid>
		<description><![CDATA[From Stack Overflow: I have a logfile which logs the insert/delete/updates from all kinds of tables. I would like to get an overview of for example the last 20 people which records where updated, ordered by the last update (datetime DESC) A common solution for such a task would be writing an aggregate query with [...]]]></description>
				<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/3555118/mysql-select-20-latest-entries-in-logfile-from-unique-persons"><strong>Stack Overflow</strong></a>:</p>
<blockquote>
<p>I have a logfile which logs the insert/delete/updates from all kinds of tables.</p>
<p>I would like to get an overview of for example the last 20 people which records where updated, ordered by the last update (<code>datetime DESC</code>)</p>
</blockquote>
<p>A common solution for such a task would be writing an aggregate query with <code>ORDER BY</code> and <code>LIMIT</code>:</p>
<pre class="brush: sql">
SELECT  person, MAX(ts) AS last_update
FROM    logfile
GROUP BY
        person
ORDER BY
        last_update DESC
LIMIT 20
</pre>
<p>What&#8217;s wrong with this solution? Performance, as usual.</p>
<p>Since <code>last_update</code> is an aggregate, it cannot be indexed. And <code>ORDER BY</code> on unindexed fields results in our good old friend, <code>filesort</code>.</p>
<p>Note that even in this case the indexes can be used and the full table scan can be avoided: if there is an index on <code>(person, ts)</code>, <code>MySQL</code> will tend to use a <a href="http://dev.mysql.com/doc/refman/5.5/en/loose-index-scan.html">loose index scan</a> on this index, which can save this query if there are relatively few persons in the table. However, if there are many (which is what we can expect for a log table), loose index scan can even degrade performance and generally will be avoided by <code>MySQL</code>.</p>
<p>We should use another approach here. Let&#8217;s create a sample table and test this approach:<br />
<span id="more-4884"></span><br />
<a href="#" onclick="xcollapse('X8484');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X8484" style="display: none; background: transparent;">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE logfile (
        id INT NOT NULL PRIMARY KEY,
        sparse INT NOT NULL,
        dense INT NOT NULL,
        ts DATETIME NOT NULL,
        stuffing VARCHAR(100) NOT NULL,
        KEY ix_logfile_ts_id (ts, id),
        KEY ix_logfile_sparse_ts_id (sparse, ts, id),
        KEY ix_logfile_dense_ts_id (dense, ts, id)
) ENGINE=InnoDB;
       
DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(500000);
COMMIT;

INSERT
INTO    logfile
SELECT  id,
        CEILING(RAND(20100824) * 30),
        CEILING(RAND(20100824 &lt;&lt; 1) * 30000),
        &#039;2010-08-24&#039; - INTERVAL RAND(20100824 &lt;&lt; 2) * 10000000 SECOND,
        LPAD(&#039;&#039;, 100, &#039;*&#039;)
FROM    filler;
</pre>
</div>
<p>This table has <strong>500,000</strong> records.</p>
<p>Instead of a single field, <code>person</code>, I created two different fields: <code>sparse</code> and <code>dense</code>. The first one has <strong>30</strong> distinct values, while the second one has <strong>30,000</strong>. This will help us to see how data distribution affects performance of different queries.</p>
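<p>A quick way to check this distribution, as a sketch against the table created above:</p>
<pre class="brush: sql">
-- Total number of records and the number of distinct values in each column
SELECT  COUNT(*) AS records,
        COUNT(DISTINCT sparse) AS sparse_values,
        COUNT(DISTINCT dense) AS dense_values
FROM    logfile;
</pre>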
<p>Let&#8217;s run our original queries. We&#8217;ll adjust them a little to help <code>MySQL</code> to pick correct plans:</p>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  sparse, MAX(ts) AS last_update
        FROM    logfile
        GROUP BY
                sparse
        ) q
ORDER BY
        last_update DESC
LIMIT 20;
</pre>
<p><a href="#" onclick="xcollapse('X4382');return false;"><strong>View query results</strong></a><br />
</p>
<div id="X4382" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>sparse</th>
<th>last_update</th>
</tr>
<tr>
<td class="integer">15</td>
<td class="timestamp">2010-08-23 23:59:58</td>
</tr>
<tr>
<td class="integer">26</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">11</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">30</td>
<td class="timestamp">2010-08-23 23:59:42</td>
</tr>
<tr>
<td class="integer">29</td>
<td class="timestamp">2010-08-23 23:59:32</td>
</tr>
<tr>
<td class="integer">13</td>
<td class="timestamp">2010-08-23 23:58:54</td>
</tr>
<tr>
<td class="integer">27</td>
<td class="timestamp">2010-08-23 23:58:53</td>
</tr>
<tr>
<td class="integer">7</td>
<td class="timestamp">2010-08-23 23:58:46</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="timestamp">2010-08-23 23:58:00</td>
</tr>
<tr>
<td class="integer">12</td>
<td class="timestamp">2010-08-23 23:57:44</td>
</tr>
<tr>
<td class="integer">14</td>
<td class="timestamp">2010-08-23 23:57:24</td>
</tr>
<tr>
<td class="integer">6</td>
<td class="timestamp">2010-08-23 23:56:58</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="timestamp">2010-08-23 23:56:48</td>
</tr>
<tr>
<td class="integer">24</td>
<td class="timestamp">2010-08-23 23:56:13</td>
</tr>
<tr>
<td class="integer">17</td>
<td class="timestamp">2010-08-23 23:56:12</td>
</tr>
<tr>
<td class="integer">23</td>
<td class="timestamp">2010-08-23 23:55:08</td>
</tr>
<tr>
<td class="integer">19</td>
<td class="timestamp">2010-08-23 23:55:07</td>
</tr>
<tr>
<td class="integer">20</td>
<td class="timestamp">2010-08-23 23:53:44</td>
</tr>
<tr>
<td class="integer">10</td>
<td class="timestamp">2010-08-23 23:51:52</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="timestamp">2010-08-23 23:50:53</td>
</tr>
<tr class="statusbar">
<td colspan="100">20 rows fetched in 0.0005s (0.0026s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">30</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">logfile</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_logfile_sparse_ts_id</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">500133</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
select `q`.`sparse` AS `sparse`,`q`.`last_update` AS `last_update` from (select `20100824_latest`.`logfile`.`sparse` AS `sparse`,max(`20100824_latest`.`logfile`.`ts`) AS `last_update` from `20100824_latest`.`logfile` group by `20100824_latest`.`logfile`.`sparse`) `q` order by `q`.`last_update` desc limit 20
</pre>
</div>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  dense, MAX(ts) AS last_update
        FROM    logfile
        GROUP BY
                dense
        ) q
ORDER BY
        last_update DESC
LIMIT 20;
</pre>
<p><a href="#" onclick="xcollapse('X2066');return false;"><strong>View query results</strong></a><br />
</p>
<div id="X2066" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>dense</th>
<th>last_update</th>
</tr>
<tr>
<td class="integer">25324</td>
<td class="timestamp">2010-08-23 23:59:58</td>
</tr>
<tr>
<td class="integer">13060</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">3268</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">2327</td>
<td class="timestamp">2010-08-23 23:59:42</td>
</tr>
<tr>
<td class="integer">23968</td>
<td class="timestamp">2010-08-23 23:59:32</td>
</tr>
<tr>
<td class="integer">1622</td>
<td class="timestamp">2010-08-23 23:58:54</td>
</tr>
<tr>
<td class="integer">29693</td>
<td class="timestamp">2010-08-23 23:58:53</td>
</tr>
<tr>
<td class="integer">655</td>
<td class="timestamp">2010-08-23 23:58:46</td>
</tr>
<tr>
<td class="integer">5802</td>
<td class="timestamp">2010-08-23 23:58:07</td>
</tr>
<tr>
<td class="integer">11843</td>
<td class="timestamp">2010-08-23 23:58:00</td>
</tr>
<tr>
<td class="integer">18894</td>
<td class="timestamp">2010-08-23 23:57:44</td>
</tr>
<tr>
<td class="integer">6180</td>
<td class="timestamp">2010-08-23 23:57:26</td>
</tr>
<tr>
<td class="integer">9398</td>
<td class="timestamp">2010-08-23 23:57:24</td>
</tr>
<tr>
<td class="integer">18012</td>
<td class="timestamp">2010-08-23 23:56:58</td>
</tr>
<tr>
<td class="integer">25758</td>
<td class="timestamp">2010-08-23 23:56:49</td>
</tr>
<tr>
<td class="integer">2379</td>
<td class="timestamp">2010-08-23 23:56:48</td>
</tr>
<tr>
<td class="integer">821</td>
<td class="timestamp">2010-08-23 23:56:39</td>
</tr>
<tr>
<td class="integer">4186</td>
<td class="timestamp">2010-08-23 23:56:13</td>
</tr>
<tr>
<td class="integer">20198</td>
<td class="timestamp">2010-08-23 23:56:12</td>
</tr>
<tr>
<td class="integer">18615</td>
<td class="timestamp">2010-08-23 23:56:01</td>
</tr>
<tr class="statusbar">
<td colspan="100">20 rows fetched in 0.0005s (0.5000s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">30000</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">logfile</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_logfile_dense_ts_id</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">500133</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
select `q`.`dense` AS `dense`,`q`.`last_update` AS `last_update` from (select `20100824_latest`.`logfile`.`dense` AS `dense`,max(`20100824_latest`.`logfile`.`ts`) AS `last_update` from `20100824_latest`.`logfile` group by `20100824_latest`.`logfile`.`dense`) `q` order by `q`.`last_update` desc limit 20
</pre>
</div>
<p>We see that both queries use the same plan and return <strong>20</strong> records, but the first one is instant, while the second one runs for <strong>500 ms</strong>. Both queries use <strong>filesort</strong>, but in the second case it has to sort <strong>30,000</strong> records (compared to <strong>30</strong> in the first case).</p>
<p>In this case, it is better to use another approach.</p>
<p>With our original query, we take each person and see which record is the latest for this person. But we can just as well do it the other way round: take the records in descending order, one by one, and for each record see if it&#8217;s the latest for its person. If it is, we should return it; if it&#8217;s not, this means that the latest record for this person has already been returned (remember, we take them in descending order).</p>
<p>It&#8217;s easy to see that <strong>20</strong> records returned this way will, first, belong to <strong>20</strong> different people, and, second, be the latest records of their respective persons. This is exactly what we need.</p>
<p>The records can easily be scanned in the descending order using the index on <code>(ts, id)</code>. But how do we check if the record is the latest? It&#8217;s simple: we just take the last record for the given person from the index on <code>(person, ts, id)</code> and compare its <code>id</code>. It takes but a single index seek per record and is almost instant.</p>
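<p>As a sketch, the check for a single person looks like this (<code>sparse</code> plays the role of <code>person</code>, and <strong>15</strong> is just an example value). The <code>ORDER BY</code> lists all three indexed columns so that the index on <code>(sparse, ts, id)</code> can be read backwards from the end of the range for that value:</p>
<pre class="brush: sql">
-- Latest record for one person: one seek in the (sparse, ts, id) index
SELECT  id
FROM    logfile
WHERE   sparse = 15
ORDER BY
        sparse DESC, ts DESC, id DESC
LIMIT 1;
</pre>
<p>A record is the latest for its person when its own <code>id</code> equals the <code>id</code> returned by this lookup.</p>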
<p>Here&#8217;s the query to do it:</p>
<pre class="brush: sql">
SELECT  id, sparse, dense, ts
FROM    logfile lf
WHERE   id = 
        (
        SELECT  id
        FROM    logfile lfi
        WHERE   lfi.sparse = lf.sparse
        ORDER BY
                sparse DESC, ts DESC, id DESC
        LIMIT 1
        )
ORDER BY
        ts DESC, id DESC
LIMIT 20
        
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>sparse</th>
<th>dense</th>
<th>ts</th>
</tr>
<tr>
<td class="integer">121946</td>
<td class="integer">15</td>
<td class="integer">25324</td>
<td class="timestamp">2010-08-23 23:59:58</td>
</tr>
<tr>
<td class="integer">276499</td>
<td class="integer">11</td>
<td class="integer">3268</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">62419</td>
<td class="integer">26</td>
<td class="integer">13060</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">254750</td>
<td class="integer">30</td>
<td class="integer">2327</td>
<td class="timestamp">2010-08-23 23:59:42</td>
</tr>
<tr>
<td class="integer">96079</td>
<td class="integer">29</td>
<td class="integer">23968</td>
<td class="timestamp">2010-08-23 23:59:32</td>
</tr>
<tr>
<td class="integer">290657</td>
<td class="integer">13</td>
<td class="integer">1622</td>
<td class="timestamp">2010-08-23 23:58:54</td>
</tr>
<tr>
<td class="integer">278842</td>
<td class="integer">27</td>
<td class="integer">29693</td>
<td class="timestamp">2010-08-23 23:58:53</td>
</tr>
<tr>
<td class="integer">329318</td>
<td class="integer">7</td>
<td class="integer">655</td>
<td class="timestamp">2010-08-23 23:58:46</td>
</tr>
<tr>
<td class="integer">384956</td>
<td class="integer">5</td>
<td class="integer">11843</td>
<td class="timestamp">2010-08-23 23:58:00</td>
</tr>
<tr>
<td class="integer">386333</td>
<td class="integer">12</td>
<td class="integer">18894</td>
<td class="timestamp">2010-08-23 23:57:44</td>
</tr>
<tr>
<td class="integer">260404</td>
<td class="integer">14</td>
<td class="integer">9398</td>
<td class="timestamp">2010-08-23 23:57:24</td>
</tr>
<tr>
<td class="integer">471000</td>
<td class="integer">6</td>
<td class="integer">18012</td>
<td class="timestamp">2010-08-23 23:56:58</td>
</tr>
<tr>
<td class="integer">172079</td>
<td class="integer">2</td>
<td class="integer">2379</td>
<td class="timestamp">2010-08-23 23:56:48</td>
</tr>
<tr>
<td class="integer">112653</td>
<td class="integer">24</td>
<td class="integer">4186</td>
<td class="timestamp">2010-08-23 23:56:13</td>
</tr>
<tr>
<td class="integer">291683</td>
<td class="integer">17</td>
<td class="integer">20198</td>
<td class="timestamp">2010-08-23 23:56:12</td>
</tr>
<tr>
<td class="integer">144673</td>
<td class="integer">23</td>
<td class="integer">25055</td>
<td class="timestamp">2010-08-23 23:55:08</td>
</tr>
<tr>
<td class="integer">172118</td>
<td class="integer">19</td>
<td class="integer">29039</td>
<td class="timestamp">2010-08-23 23:55:07</td>
</tr>
<tr>
<td class="integer">198913</td>
<td class="integer">20</td>
<td class="integer">9887</td>
<td class="timestamp">2010-08-23 23:53:44</td>
</tr>
<tr>
<td class="integer">491436</td>
<td class="integer">10</td>
<td class="integer">17752</td>
<td class="timestamp">2010-08-23 23:51:52</td>
</tr>
<tr>
<td class="integer">346651</td>
<td class="integer">4</td>
<td class="integer">10951</td>
<td class="timestamp">2010-08-23 23:50:53</td>
</tr>
<tr class="statusbar">
<td colspan="100">20 rows fetched in 0.0007s (0.0034s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">lf</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">ix_logfile_ts_id</td>
<td class="varchar">12</td>
<td class="varchar"></td>
<td class="bigint">20</td>
<td class="double">2500660.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">lfi</td>
<td class="varchar">ref</td>
<td class="varchar">ix_logfile_sparse_ts_id</td>
<td class="varchar">ix_logfile_sparse_ts_id</td>
<td class="varchar">4</td>
<td class="varchar">20100824_latest.lf.sparse</td>
<td class="bigint">27785</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100824_latest.lf.sparse&#39; of SELECT #2 was resolved in SELECT #1
select `20100824_latest`.`lf`.`id` AS `id`,`20100824_latest`.`lf`.`sparse` AS `sparse`,`20100824_latest`.`lf`.`dense` AS `dense`,`20100824_latest`.`lf`.`ts` AS `ts` from `20100824_latest`.`logfile` `lf` where (`20100824_latest`.`lf`.`id` = (select `20100824_latest`.`lfi`.`id` from `20100824_latest`.`logfile` `lfi` where (`20100824_latest`.`lfi`.`sparse` = `20100824_latest`.`lf`.`sparse`) order by `20100824_latest`.`lfi`.`sparse` desc,`20100824_latest`.`lfi`.`ts` desc,`20100824_latest`.`lfi`.`id` desc limit 1)) order by `20100824_latest`.`lf`.`ts` desc,`20100824_latest`.`lf`.`id` desc limit 20
</pre>
<p>As we can see, this query uses two different indexes. The first one, on <code>(ts, id)</code>, is used to scan all records according to the overall timeline; the second one, on <code>(sparse, ts, id)</code>, is used to find the <code>id</code> of the latest entry for a person and check if it&#8217;s the same as the record selected from the general timeline.</p>
<p>The query is instant: <strong>3 ms</strong>.</p>
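<p>The equivalence between the query and the &quot;walk the timeline, keep the first record per person&quot; description can be checked on a small scale. The following is only a sketch under stated assumptions: it uses <strong>SQLite</strong> through Python's built-in <code>sqlite3</code> module as a stand-in for <strong>MySQL</strong>, the data are randomly generated, and the helper names <code>build_logfile</code> and <code>latest_by_scan</code> are made up for the illustration. It says nothing about performance, only about the results being the same.</p>

```python
import random
import sqlite3

def build_logfile(rows=2000, persons=30, seed=20100824):
    """Create an in-memory table shaped like logfile (hypothetical random data)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logfile (id INTEGER PRIMARY KEY, sparse INT, ts INT)")
    conn.executemany(
        "INSERT INTO logfile (id, sparse, ts) VALUES (?, ?, ?)",
        [(i, rng.randint(1, persons), rng.randint(1, 100000)) for i in range(1, rows + 1)],
    )
    conn.execute("CREATE INDEX ix_logfile_ts_id ON logfile (ts, id)")
    conn.execute("CREATE INDEX ix_logfile_sparse_ts_id ON logfile (sparse, ts, id)")
    return conn

# The query from the article, minus the dense column
LATEST_SQL = """
SELECT  id, sparse, ts
FROM    logfile lf
WHERE   id =
        (
        SELECT  id
        FROM    logfile lfi
        WHERE   lfi.sparse = lf.sparse
        ORDER BY
                sparse DESC, ts DESC, id DESC
        LIMIT 1
        )
ORDER BY
        ts DESC, id DESC
LIMIT ?
"""

def latest_by_scan(conn, limit):
    """Walk the timeline newest-first and keep the first record seen per person."""
    seen, result = set(), []
    for row in conn.execute("SELECT id, sparse, ts FROM logfile ORDER BY ts DESC, id DESC"):
        if row[1] not in seen:
            seen.add(row[1])
            result.append(row)
            if len(result) == limit:
                break
    return result

conn = build_logfile()
by_query = conn.execute(LATEST_SQL, (20,)).fetchall()
by_scan = latest_by_scan(conn, 20)
print(by_query == by_scan)
```

<p>Both approaches pick, for each person, the record with the greatest <code>(ts, id)</code>, and both return the records newest first, so the two lists should be identical.</p>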
<p>Let&#8217;s check the same query on a column with lots of values:</p>
<pre class="brush: sql">
SELECT  id, sparse, dense, ts
FROM    logfile lf
WHERE   id = 
        (
        SELECT  id
        FROM    logfile lfi
        WHERE   lfi.dense = lf.dense
        ORDER BY
                dense DESC, ts DESC, id DESC
        LIMIT 1
        )
ORDER BY
        ts DESC, id DESC
LIMIT 20

</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>sparse</th>
<th>dense</th>
<th>ts</th>
</tr>
<tr>
<td class="integer">121946</td>
<td class="integer">15</td>
<td class="integer">25324</td>
<td class="timestamp">2010-08-23 23:59:58</td>
</tr>
<tr>
<td class="integer">276499</td>
<td class="integer">11</td>
<td class="integer">3268</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">62419</td>
<td class="integer">26</td>
<td class="integer">13060</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">254750</td>
<td class="integer">30</td>
<td class="integer">2327</td>
<td class="timestamp">2010-08-23 23:59:42</td>
</tr>
<tr>
<td class="integer">96079</td>
<td class="integer">29</td>
<td class="integer">23968</td>
<td class="timestamp">2010-08-23 23:59:32</td>
</tr>
<tr>
<td class="integer">290657</td>
<td class="integer">13</td>
<td class="integer">1622</td>
<td class="timestamp">2010-08-23 23:58:54</td>
</tr>
<tr>
<td class="integer">278842</td>
<td class="integer">27</td>
<td class="integer">29693</td>
<td class="timestamp">2010-08-23 23:58:53</td>
</tr>
<tr>
<td class="integer">329318</td>
<td class="integer">7</td>
<td class="integer">655</td>
<td class="timestamp">2010-08-23 23:58:46</td>
</tr>
<tr>
<td class="integer">277612</td>
<td class="integer">15</td>
<td class="integer">5802</td>
<td class="timestamp">2010-08-23 23:58:07</td>
</tr>
<tr>
<td class="integer">384956</td>
<td class="integer">5</td>
<td class="integer">11843</td>
<td class="timestamp">2010-08-23 23:58:00</td>
</tr>
<tr>
<td class="integer">386333</td>
<td class="integer">12</td>
<td class="integer">18894</td>
<td class="timestamp">2010-08-23 23:57:44</td>
</tr>
<tr>
<td class="integer">201899</td>
<td class="integer">7</td>
<td class="integer">6180</td>
<td class="timestamp">2010-08-23 23:57:26</td>
</tr>
<tr>
<td class="integer">260404</td>
<td class="integer">14</td>
<td class="integer">9398</td>
<td class="timestamp">2010-08-23 23:57:24</td>
</tr>
<tr>
<td class="integer">471000</td>
<td class="integer">6</td>
<td class="integer">18012</td>
<td class="timestamp">2010-08-23 23:56:58</td>
</tr>
<tr>
<td class="integer">451808</td>
<td class="integer">26</td>
<td class="integer">25758</td>
<td class="timestamp">2010-08-23 23:56:49</td>
</tr>
<tr>
<td class="integer">172079</td>
<td class="integer">2</td>
<td class="integer">2379</td>
<td class="timestamp">2010-08-23 23:56:48</td>
</tr>
<tr>
<td class="integer">367042</td>
<td class="integer">11</td>
<td class="integer">821</td>
<td class="timestamp">2010-08-23 23:56:39</td>
</tr>
<tr>
<td class="integer">112653</td>
<td class="integer">24</td>
<td class="integer">4186</td>
<td class="timestamp">2010-08-23 23:56:13</td>
</tr>
<tr>
<td class="integer">291683</td>
<td class="integer">17</td>
<td class="integer">20198</td>
<td class="timestamp">2010-08-23 23:56:12</td>
</tr>
<tr>
<td class="integer">127839</td>
<td class="integer">11</td>
<td class="integer">18615</td>
<td class="timestamp">2010-08-23 23:56:01</td>
</tr>
<tr class="statusbar">
<td colspan="100">20 rows fetched in 0.0007s (0.0031s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">lf</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">ix_logfile_ts_id</td>
<td class="varchar">12</td>
<td class="varchar"></td>
<td class="bigint">20</td>
<td class="double">2500660.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">lfi</td>
<td class="varchar">ref</td>
<td class="varchar">ix_logfile_dense_ts_id</td>
<td class="varchar">ix_logfile_dense_ts_id</td>
<td class="varchar">4</td>
<td class="varchar">20100824_latest.lf.dense</td>
<td class="bigint">8</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100824_latest.lf.dense&#39; of SELECT #2 was resolved in SELECT #1
select `20100824_latest`.`lf`.`id` AS `id`,`20100824_latest`.`lf`.`sparse` AS `sparse`,`20100824_latest`.`lf`.`dense` AS `dense`,`20100824_latest`.`lf`.`ts` AS `ts` from `20100824_latest`.`logfile` `lf` where (`20100824_latest`.`lf`.`id` = (select `20100824_latest`.`lfi`.`id` from `20100824_latest`.`logfile` `lfi` where (`20100824_latest`.`lfi`.`dense` = `20100824_latest`.`lf`.`dense`) order by `20100824_latest`.`lfi`.`dense` desc,`20100824_latest`.`lfi`.`ts` desc,`20100824_latest`.`lfi`.`id` desc limit 1)) order by `20100824_latest`.`lf`.`ts` desc,`20100824_latest`.`lf`.`id` desc limit 20
</pre>
<p>We see that the query is instant again, despite the data distribution being completely different. This is because the query only skips the records which are not the latest for their persons, and the total number of records to scan is defined by how many records we browse before we encounter the <strong>20th</strong> unique value in our scan. This number drops quickly towards <strong>20</strong> as the number of distinct persons in the table grows. Even in the worst case of only <strong>20</strong> distinct persons, if the records are spread evenly between them, the scan takes about <strong>72</strong> records on average and with <strong>99%</strong> probability won&#8217;t exceed <strong>150</strong> records.</p>
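<p>An estimate like this can be checked with a short calculation. The sketch below assumes that every record belongs to a uniformly random person, independently of the other records (a simplification; real logs are rarely that even), and computes the exact probability that the <strong>20th</strong> distinct person turns up within a given number of scanned records. The function names are made up for the illustration.</p>

```python
from bisect import bisect_left

def scan_length_cdf(persons, limit, max_rows):
    """cdf[n] = P(the limit-th distinct person is found within the first n rows).

    Assumes each scanned row belongs to a uniformly random person, independently.
    """
    # state[k] = probability that exactly k distinct persons have been seen so far;
    # state[limit] is absorbing, because the scan stops there.
    state = [0.0] * (limit + 1)
    state[0] = 1.0
    cdf = [state[limit]]
    for _ in range(max_rows):
        nxt = [0.0] * (limit + 1)
        nxt[limit] = state[limit]
        for k in range(limit):
            p_new = (persons - k) / persons   # chance that the next row is a new person
            nxt[k] += state[k] * (1.0 - p_new)
            nxt[k + 1] += state[k] * p_new
        state = nxt
        cdf.append(state[limit])
    return cdf

def rows_needed(persons, limit, confidence, max_rows=5000):
    """Smallest n for which the scan finishes within n rows with the given probability."""
    cdf = scan_length_cdf(persons, limit, max_rows)
    n = bisect_left(cdf, confidence)          # cdf never decreases, so bisection works
    return n if n != len(cdf) else None       # None: not reached within max_rows

for persons in (20, 30, 30000):
    print(persons, rows_needed(persons, 20, 0.99))
```

<p>When there are fewer distinct persons than the limit, <code>rows_needed</code> returns <code>None</code>: the scan can never collect enough persons, which is the marginal case discussed next.</p>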
<p>The only problem that can arise here is that the number of distinct persons in the table is <em>less</em> than the <code>LIMIT</code> we set. In this case, no new records after the limit is reached can be returned, and a full index scan (accompanied by an index seek once per record) will ultimately be performed.</p>
<p>To work around this, the following simple query can be run in advance:</p>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    (
        SELECT  DISTINCT sparse
        FROM    logfile
        LIMIT 20
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">20</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.0015s)</td>
</tr>
</table>
</div>
<p><a href="#" onclick="xcollapse('X1238');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X1238" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">Select tables optimized away</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">logfile</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_logfile_sparse_ts_id</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">19</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by; Using temporary</td>
</tr>
</table>
</div>
<pre>
select count(0) AS `COUNT(*)` from (select distinct `20100824_latest`.`logfile`.`sparse` AS `sparse` from `20100824_latest`.`logfile` limit 20) `q`
</pre>
</div>
<p>This query will return the actual number of distinct persons in the table if there are fewer than <strong>20</strong> (or <strong>20</strong> if there are more).</p>
<p>This query is instant even for the dense data:</p>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    (
        SELECT  DISTINCT dense
        FROM    logfile
        LIMIT 20
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">20</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.0024s)</td>
</tr>
</table>
</div>
<p><a href="#" onclick="xcollapse('X1940');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X1940" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">Select tables optimized away</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">logfile</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">ix_logfile_dense_ts_id</td>
<td class="varchar">16</td>
<td class="varchar"></td>
<td class="bigint">500132</td>
<td class="double">12.50</td>
<td class="varchar">Using index; Using temporary</td>
</tr>
</table>
</div>
<pre>
select count(0) AS `COUNT(*)` from (select distinct `20100824_latest`.`logfile`.`dense` AS `dense` from `20100824_latest`.`logfile` limit 20) `q`
</pre>
</div>
<p>This needs to be run as a separate query because <strong>MySQL</strong> does not allow expressions or subqueries in the <code>LIMIT</code> clause: it only accepts integer constants (or placeholders in prepared statements and local variables in stored programs). The result of this query should be substituted into the <code>LIMIT</code> clause on the client or in a dynamically composed query on the server.</p>
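<p>On the client, the two steps could look roughly like this. This is a minimal sketch, not a definitive implementation: it again uses <strong>SQLite</strong> via Python's <code>sqlite3</code> module in place of <strong>MySQL</strong>, the function name <code>latest_unique</code> and the sample data are hypothetical, and with a real <strong>MySQL</strong> driver the placeholder syntax may differ.</p>

```python
import sqlite3

def latest_unique(conn, wanted=20):
    """Return up to `wanted` latest records, one per person, capping LIMIT first."""
    # Step 1: count the distinct persons, but no further than `wanted`
    (available,) = conn.execute(
        "SELECT COUNT(*) FROM (SELECT DISTINCT sparse FROM logfile LIMIT ?) q",
        (wanted,),
    ).fetchone()
    # Step 2: pass the capped value to the main query as a bound parameter
    return conn.execute(
        """
        SELECT  id, sparse, ts
        FROM    logfile lf
        WHERE   id =
                (
                SELECT  id
                FROM    logfile lfi
                WHERE   lfi.sparse = lf.sparse
                ORDER BY
                        sparse DESC, ts DESC, id DESC
                LIMIT 1
                )
        ORDER BY
                ts DESC, id DESC
        LIMIT ?
        """,
        (available,),
    ).fetchall()

# Hypothetical data: 100 records spread over only 5 persons
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logfile (id INTEGER PRIMARY KEY, sparse INT, ts INT)")
conn.executemany(
    "INSERT INTO logfile (id, sparse, ts) VALUES (?, ?, ?)",
    [(i, i % 5, 1000 + i) for i in range(1, 101)],
)
rows = latest_unique(conn, wanted=20)
print(len(rows))
```

<p>With only <strong>5</strong> persons in the sample table, the main query runs with <code>LIMIT 5</code> and can stop as soon as the fifth person is found instead of scanning the rest of the timeline.</p>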
<h3>Summary</h3>
<p>To select a number of latest unique records from a table, one can use aggregate functions. However, this can decrease the query performance.</p>
<p>This can be done more efficiently by creating two different indexes on the table and checking the records taken from the general timeline against the end of the index on the person&#8217;s timeline.</p>
<p>To avoid performance degradation in marginal cases (when the total number of persons in the table is less than <code>LIMIT</code>), it is possible to make an additional check for the total number of distinct records and adjust the <code>LIMIT</code> clause if there are not enough records.</p>
<p><strong>P. S.</strong> I decided to enable comments for the technical posts as well. You are welcome to comment.</p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/08/24/20-latest-unique-records/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Indexing for ORDER BY / LIMIT</title>
		<link>http://explainextended.com/2010/06/30/indexing-for-order-by-limit/</link>
		<comments>http://explainextended.com/2010/06/30/indexing-for-order-by-limit/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 19:00:34 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4831</guid>
		<description><![CDATA[Answering questions asked on the site. Frode Underhill asks: I have some applications that are logging to a MySQL database table. The table is pretty standard on the form: timeBIGINT(20) sourceTINYINT(4) severityENUM textVARCHAR(255) , where source identifies the system that generated the log entry. There are very many entries in the table (>100 million), of [...]]]></description>
				<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Frode Underhill</strong> asks:</p>
<blockquote><p>I have some applications that are logging to a <strong>MySQL</strong> database table.</p>
<p>The table is pretty standard on the form:</p>
<table class="excel">
<tr>
<th>time<br/><code>BIGINT(20)</code></th>
<th>source<br/><code>TINYINT(4)</code></th>
<th>severity<br/><code>ENUM</code></th>
<th>text<br/><code>VARCHAR(255)</code></th>
</tr>
</table>
<p>, where <code>source</code> identifies the system that generated the log entry.</p>
<p>There are very many entries in the table (<strong>&gt;100 million</strong>), of which <strong>99.9999%</strong> are debug or info messages.</p>
<p>I&#8217;m making an interface for browsing this log, which means I&#8217;ll be doing queries like</p>
<pre class="brush: sql">
SELECT  *
FROM    log 
WHERE   source = 2
        AND severity IN (1,2) 
        AND time &gt; 12345 
ORDER BY
        time ASC
LIMIT 30
</pre>
<p>, if I want to find debug or info log entries from a certain point in time, or </p>
<pre class="brush: sql">
SELECT  *
FROM    log 
WHERE   source = 2
        AND severity IN (1,2) 
        AND time &lt; 12345 
ORDER BY
        time DESC
LIMIT 30
</pre>
<p>for finding entries right before a certain time.</p>
<p>How would one go about indexing &#038; querying such a table?</p>
<p>I thought I had it figured out (I pretty much just tried every different combination of columns in an index), but there&#8217;s always some set of parameters that results in a really slow query.
</p></blockquote>
<p>The problem is that you cannot use a single index both for filtering and for ordering if the filter contains a range condition (<code>severity IN (1, 2)</code> in this case) on a column that comes before the ordering column in the index.</p>
<p>Recently I wrote an article with a proposal to improve <strong>SQL</strong> optimizers to handle these conditions. If a range has low cardinality (that is, there are few values that can possibly satisfy the range), then the query could be improved by rewriting the range as a series of individual queries, each one using one of the values constituting the range in an equijoin:</p>
<ul>
<li><a href="/2010/05/19/things-sql-needs-determining-range-cardinality/"><strong>Things SQL needs: determining range cardinality</strong></a></li>
</ul>
<p>No optimizers can handle this condition automatically yet, so we&#8217;ll need to emulate it.</p>
<p>Since the <code>severity</code> field is defined as an <code>enum</code> with only <strong>5</strong> values possible, any range condition on this field can be satisfied by no more than <strong>5</strong> distinct values, thus making this table ideal for rewriting the query.</p>
<p>Let&#8217;s create a sample table:<br />
<span id="more-4831"></span><br />
<a href="#" onclick="xcollapse('X1733');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X1733" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_log (
        id INT NOT NULL,
        ts BIGINT NOT NULL,
        source TINYINT(4) NOT NULL,
        severity ENUM(&#039;DEBUG&#039;,&#039;INFO&#039;,&#039;WARNING&#039;,&#039;ERROR&#039;,&#039;FATAL&#039;) NOT NULL,
        tx VARCHAR(255)
) ENGINE=MyISAM;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(3500);
COMMIT;

INSERT
INTO    t_log
SELECT  (f1.id - 1) * 3000 + f2.id,
        UNIX_TIMESTAMP(&#039;2010-06-29&#039; - INTERVAL (f1.id - 1) * 3000 + f2.id SECOND),
        CEILING(RAND(20100629) * 10),
        5 - FLOOR(LOG10(CEILING(RAND(20100629 &lt;&lt; 1) * 99999))),
        CONCAT(&#039;Message &#039;, (f1.id - 1) * 3000 + f2.id)
FROM    filler f1
CROSS JOIN
        filler f2;

CREATE INDEX ix_log_source_ts ON t_log (source, ts);

CREATE INDEX ix_log_source_severity_ts ON t_log (source, severity, ts);
</pre>
</div>
<p>This <strong>MyISAM</strong> table has <strong>12,250,000</strong> records, with <strong>10</strong> random sources (distributed evenly) and <strong>5</strong> random severities (distributed logarithmically):</p>
<pre class="brush: sql">
SELECT  severity, COUNT(*)
FROM    t_log
GROUP BY
        severity;

</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>severity</th>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="char">DEBUG</td>
<td class="bigint">11024646</td>
</tr>
<tr>
<td class="char">INFO</td>
<td class="bigint">1102668</td>
</tr>
<tr>
<td class="char">WARNING</td>
<td class="bigint">110557</td>
</tr>
<tr>
<td class="char">ERROR</td>
<td class="bigint">10948</td>
</tr>
<tr>
<td class="char">FATAL</td>
<td class="bigint">1181</td>
</tr>
</table>
</div>
<p>We also created two indexes (one on <code>(source, ts)</code>, the other one on <code>(source, severity, ts)</code>).</p>
<p>Now, let&#8217;s try to run some queries as is:</p>
<pre class="brush: sql">
SELECT  *
FROM    t_log
WHERE   source = 2
        AND severity IN (1, 2)
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30

</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>ts</th>
<th>source</th>
<th>severity</th>
<th>tx</th>
</tr>
<tr>
<td class="integer">1205</td>
<td class="bigint">1277753995</td>
<td class="tinyint">2</td>
<td class="char">DEBUG</td>
<td class="varchar">Message 1205</td>
</tr>
<tr>
<td class="integer">1227</td>
<td class="bigint">1277753973</td>
<td class="tinyint">2</td>
<td class="char">DEBUG</td>
<td class="varchar">Message 1227</td>
</tr>
<tr>
<td class="integer">1243</td>
<td class="bigint">1277753957</td>
<td class="tinyint">2</td>
<td class="char">DEBUG</td>
<td class="varchar">Message 1243</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">1546</td>
<td class="bigint">1277753654</td>
<td class="tinyint">2</td>
<td class="char">DEBUG</td>
<td class="varchar">Message 1546</td>
</tr>
<tr>
<td class="integer">1575</td>
<td class="bigint">1277753625</td>
<td class="tinyint">2</td>
<td class="char">DEBUG</td>
<td class="varchar">Message 1575</td>
</tr>
<tr class="statusbar">
<td colspan="100">30 rows fetched in 0.0013s (0.0027s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">t_log</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_ts</td>
<td class="varchar">9</td>
<td class="varchar"></td>
<td class="bigint">997923</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
</table>
</div>
<pre>
select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` in (1,2)) and (`20100630_range`.`t_log`.`ts` &lt;= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30
</pre>
<p>This is very fast. It uses the index which does not include <code>severity</code>: since the values <strong>1</strong> and <strong>2</strong> are very frequent, it&#8217;s much more efficient to scan the index in order and filter the records on <code>severity</code> as we go. The index preserves the order, that&#8217;s why there is no <code>filesort</code> in the plan.</p>
<pre class="brush: sql">
SELECT  *
FROM    t_log
WHERE   source = 2
        AND severity IN (4, 5)
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30

</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>ts</th>
<th>source</th>
<th>severity</th>
<th>tx</th>
</tr>
<tr>
<td class="integer">2333</td>
<td class="bigint">1277752867</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 2333</td>
</tr>
<tr>
<td class="integer">6139</td>
<td class="bigint">1277749061</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 6139</td>
</tr>
<tr>
<td class="integer">6369</td>
<td class="bigint">1277748831</td>
<td class="tinyint">2</td>
<td class="char">FATAL</td>
<td class="varchar">Message 6369</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">297128</td>
<td class="bigint">1277458072</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 297128</td>
</tr>
<tr>
<td class="integer">298729</td>
<td class="bigint">1277456471</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 298729</td>
</tr>
<tr class="statusbar">
<td colspan="100">30 rows fetched in 0.0013s (0.0093s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">t_log</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">10</td>
<td class="varchar"></td>
<td class="bigint">1182</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using filesort</td>
</tr>
</table>
</div>
<pre>
select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` in (4,5)) and (`20100630_range`.`t_log`.`ts` &lt;= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30
</pre>
<p>This is very fast too. The index which includes <code>severity</code> is used (along with the <code>filesort</code> of course, because the order cannot be preserved with multiple values of <code>severity</code>), but the total number of records evaluated is so small that the <code>filesort</code> is not much of a problem.</p>
<p>Now, let&#8217;s change the condition in the query above to cover severities <strong>3</strong> and <strong>4</strong>:</p>
<pre class="brush: sql">
SELECT  *
FROM    t_log
WHERE   source = 2
        AND severity IN (3, 4)
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30

</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>ts</th>
<th>source</th>
<th>severity</th>
<th>tx</th>
</tr>
<tr>
<td class="integer">1507</td>
<td class="bigint">1277753693</td>
<td class="tinyint">2</td>
<td class="char">WARNING</td>
<td class="varchar">Message 1507</td>
</tr>
<tr>
<td class="integer">2333</td>
<td class="bigint">1277752867</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 2333</td>
</tr>
<tr>
<td class="integer">4154</td>
<td class="bigint">1277751046</td>
<td class="tinyint">2</td>
<td class="char">WARNING</td>
<td class="varchar">Message 4154</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">30118</td>
<td class="bigint">1277725082</td>
<td class="tinyint">2</td>
<td class="char">WARNING</td>
<td class="varchar">Message 30118</td>
</tr>
<tr>
<td class="integer">31321</td>
<td class="bigint">1277723879</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 31321</td>
</tr>
<tr class="statusbar">
<td colspan="100">30 rows fetched in 0.0013s (0.2496s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">t_log</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">10</td>
<td class="varchar"></td>
<td class="bigint">12168</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using filesort</td>
</tr>
</table>
</div>
<pre>
select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` in (3,4)) and (`20100630_range`.`t_log`.`ts` &lt;= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30
</pre>
<p>Now, this runs for almost <strong>250 ms</strong>. Why?</p>
<p>There are <strong>110,557</strong> records with <code>severity = 'WARNING'</code> in the table, roughly a tenth of them for <code>source = 2</code>. This is too many for a <code>filesort</code> but too few for <code>using where</code> (filtering the records while scanning the index that preserves the order): too many records would need to be skipped before <strong>30</strong> matching ones are found.</p>
<p>To work around this, we could combine the queries using <code>UNION ALL</code>. Since the original query uses <code>ORDER BY</code> and <code>LIMIT</code>, we can apply them to each of the two queries separately (which yields at most <strong>60</strong> records) and then once more to the combined resultset (the <strong>30</strong> records we need are guaranteed to be among these <strong>60</strong>):</p>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  *
        FROM    t_log
        WHERE   source = 2
                AND severity = 3
                AND ts &lt;= 1277754000
        ORDER BY
                source DESC, ts DESC
        LIMIT 30
        ) q
UNION ALL
SELECT  *
FROM    (
        SELECT  *
        FROM    t_log
        WHERE   source = 2
                AND severity = 4
                AND ts &lt;= 1277754000
        ORDER BY
                source DESC, ts DESC
        LIMIT 30
        ) q
ORDER BY
        ts DESC
LIMIT 30
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>ts</th>
<th>source</th>
<th>severity</th>
<th>tx</th>
</tr>
<tr>
<td class="integer">1507</td>
<td class="bigint">1277753693</td>
<td class="tinyint">2</td>
<td class="varchar">WARNING</td>
<td class="varchar">Message 1507</td>
</tr>
<tr>
<td class="integer">2333</td>
<td class="bigint">1277752867</td>
<td class="tinyint">2</td>
<td class="varchar">ERROR</td>
<td class="varchar">Message 2333</td>
</tr>
<tr>
<td class="integer">4154</td>
<td class="bigint">1277751046</td>
<td class="tinyint">2</td>
<td class="varchar">WARNING</td>
<td class="varchar">Message 4154</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">30118</td>
<td class="bigint">1277725082</td>
<td class="tinyint">2</td>
<td class="varchar">WARNING</td>
<td class="varchar">Message 30118</td>
</tr>
<tr>
<td class="integer">31321</td>
<td class="bigint">1277723879</td>
<td class="tinyint">2</td>
<td class="varchar">ERROR</td>
<td class="varchar">Message 31321</td>
</tr>
<tr class="statusbar">
<td colspan="100">30 rows fetched in 0.0013s (0.0037s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">30</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">t_log</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">10</td>
<td class="varchar"></td>
<td class="bigint">11094</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">UNION</td>
<td class="varchar">&lt;derived4&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">30</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DERIVED</td>
<td class="varchar">t_log</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">10</td>
<td class="varchar"></td>
<td class="bigint">1074</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union1,3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">Using filesort</td>
</tr>
</table>
</div>
<pre>
select `q`.`id` AS `id`,`q`.`ts` AS `ts`,`q`.`source` AS `source`,`q`.`severity` AS `severity`,`q`.`tx` AS `tx` from (select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` = 3) and (`20100630_range`.`t_log`.`ts` &lt;= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30) `q` union all select `q`.`id` AS `id`,`q`.`ts` AS `ts`,`q`.`source` AS `source`,`q`.`severity` AS `severity`,`q`.`tx` AS `tx` from (select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` = 4) and (`20100630_range`.`t_log`.`ts` &lt;= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30) `q` order by `ts` desc limit 30
</pre>
<p>This is much faster: about <strong>4 ms</strong> instead of <strong>250 ms</strong>.</p>
<p>However, this solution requires composing the query dynamically, depending on the number of the severities in the condition. Is it possible to make this all in one static query that will accept the parameters in the <code>IN</code> list?</p>
<p>We can do it by applying the solution used to retrieve <q>greatest-n-per-group</q> in <strong>MySQL</strong>.</p>
<p>To do this, we will just select the <strong>30</strong>th latest timestamp for each <code>severity</code> and find all records whose timestamps are greater than or equal to it.</p>
<p>This can be done using a join:</p>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  l.*
        FROM    (
                SELECT  source,
                        severity,
                        (
                        SELECT  ts
                        FROM    t_log li
                        WHERE   li.source = ss.source
                                AND li.severity = ss.severity
                                AND ts &lt;= 1277754000
                        ORDER BY
                                li.source DESC, li.severity DESC, li.ts DESC
                        LIMIT 29, 1
                        ) AS mts
                FROM    (
                        SELECT  DISTINCT source, severity
                        FROM    t_log
                        WHERE   source = 2
                                AND severity IN (3, 4)
                        ) ss
                ) s
        JOIN    t_log l
        ON      l.source &gt;= s.source
                AND l.source &lt;= s.source
                AND l.severity = s.severity
                AND l.ts &gt;= s.mts
                AND l.ts &lt;= 1277754000
        ) q
ORDER BY
        ts DESC
LIMIT 30
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>ts</th>
<th>source</th>
<th>severity</th>
<th>tx</th>
</tr>
<tr>
<td class="integer">1507</td>
<td class="bigint">1277753693</td>
<td class="tinyint">2</td>
<td class="char">WARNING</td>
<td class="varchar">Message 1507</td>
</tr>
<tr>
<td class="integer">2333</td>
<td class="bigint">1277752867</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 2333</td>
</tr>
<tr>
<td class="integer">4154</td>
<td class="bigint">1277751046</td>
<td class="tinyint">2</td>
<td class="char">WARNING</td>
<td class="varchar">Message 4154</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">30118</td>
<td class="bigint">1277725082</td>
<td class="tinyint">2</td>
<td class="char">WARNING</td>
<td class="varchar">Message 30118</td>
</tr>
<tr>
<td class="integer">31321</td>
<td class="bigint">1277723879</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 31321</td>
</tr>
<tr class="statusbar">
<td colspan="100">30 rows fetched in 0.0014s (0.0040s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">60</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">2</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">l</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">10</td>
<td class="varchar"></td>
<td class="bigint">30</td>
<td class="double">40833332.00</td>
<td class="varchar">Range checked for each record (index map: 0x3)</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived5&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">2</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">5</td>
<td class="varchar">DERIVED</td>
<td class="varchar">t_log</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">2</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index for group-by</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">li</td>
<td class="varchar">ref</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">2</td>
<td class="varchar">ss.source,ss.severity</td>
<td class="bigint">245000</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;ss.source&#39; of SELECT #4 was resolved in SELECT #3
Field or reference &#39;ss.severity&#39; of SELECT #4 was resolved in SELECT #3
select `q`.`id` AS `id`,`q`.`ts` AS `ts`,`q`.`source` AS `source`,`q`.`severity` AS `severity`,`q`.`tx` AS `tx` from (select `20100630_range`.`l`.`id` AS `id`,`20100630_range`.`l`.`ts` AS `ts`,`20100630_range`.`l`.`source` AS `source`,`20100630_range`.`l`.`severity` AS `severity`,`20100630_range`.`l`.`tx` AS `tx` from (select `ss`.`source` AS `source`,`ss`.`severity` AS `severity`,(select `20100630_range`.`li`.`ts` from `20100630_range`.`t_log` `li` where ((`20100630_range`.`li`.`source` = `ss`.`source`) and (`20100630_range`.`li`.`severity` = `ss`.`severity`) and (`20100630_range`.`li`.`ts` &lt;= 1277754000)) order by `20100630_range`.`li`.`source` desc,`20100630_range`.`li`.`severity` desc,`20100630_range`.`li`.`ts` desc limit 29,1) AS `mts` from (select distinct `20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` in (3,4)))) `ss`) `s` join `20100630_range`.`t_log` `l` where ((`20100630_range`.`l`.`severity` = `s`.`severity`) and (`20100630_range`.`l`.`source` &gt;= `s`.`source`) and (`20100630_range`.`l`.`source` &lt;= `s`.`source`) and (`20100630_range`.`l`.`ts` &gt;= `s`.`mts`) and (`20100630_range`.`l`.`ts` &lt;= 1277754000))) `q` order by `q`.`ts` desc limit 30
</pre>
<p>All possible values of <code>source</code> and <code>severity</code> are selected using a loose index scan (<code>Using index for group-by</code>), which is instant since there are few of them. Each pair of values is then used as a join condition. A single index range satisfies each pair of values, so each join iteration uses the index efficiently (actually, the access path is reevaluated for each iteration, as shown by <code>Range checked for each record (index map: 0x3)</code>).</p>
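<p>To make the building blocks easier to follow, here is a sketch of the two inner steps run on their own, with the values of one pair (<code>source = 2</code>, <code>severity = 3</code>) substituted by hand. It is an illustration of what the combined query does per pair, not a separate solution:</p>
<pre class="brush: sql">
-- Step 1: the distinct (source, severity) pairs that match the filter.
-- This is the part served by the loose index scan.
SELECT  DISTINCT source, severity
FROM    t_log
WHERE   source = 2
        AND severity IN (3, 4);

-- Step 2: the 30th latest timestamp for one of these pairs.
-- LIMIT 29, 1 skips 29 records and returns the next one.
SELECT  ts
FROM    t_log
WHERE   source = 2
        AND severity = 3
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, severity DESC, ts DESC
LIMIT 29, 1;
</pre>
<p>The timestamp returned by the second step is the lower bound (<code>mts</code>) of the index range that the join reads for that pair.</p>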
<p>The total number of records that this query would return, were there no <code>LIMIT</code>, is <strong>60</strong> or maybe even more (in case of ties on <code>ts</code>). However, we don&#8217;t need to resolve the ties in the subqueries, since the final <code>ORDER BY / LIMIT</code> does this for us.</p>
<p>The query completes in <strong>4 ms</strong>, which is instant. Moreover, it does not need to be rewritten to handle different combinations of values: they can all be provided in a single <code>IN</code> clause.</p>
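<p>One caveat worth keeping in mind: if a pair has fewer than <strong>30</strong> matching records, the <code>LIMIT 29, 1</code> subquery returns no row, <code>mts</code> becomes <code>NULL</code>, and the comparison <code>l.ts &gt;= s.mts</code> filters out every record of that pair. A possible way around it, sketched below, is to wrap the subquery in <code>COALESCE</code>. This assumes that <code>ts</code> is never negative, so that <code>0</code> acts as <q>no lower bound</q>; the variant has not been benchmarked here:</p>
<pre class="brush: sql">
-- Variant of the inner query that builds the per-pair lower bounds.
-- COALESCE replaces a missing 30th timestamp with 0 (assumption: ts &gt;= 0),
-- so that pairs with fewer than 30 records still contribute all of their rows.
SELECT  source,
        severity,
        COALESCE(
        (
        SELECT  ts
        FROM    t_log li
        WHERE   li.source = ss.source
                AND li.severity = ss.severity
                AND li.ts &lt;= 1277754000
        ORDER BY
                li.source DESC, li.severity DESC, li.ts DESC
        LIMIT 29, 1
        ), 0) AS mts
FROM    (
        SELECT  DISTINCT source, severity
        FROM    t_log
        WHERE   source = 2
                AND severity IN (3, 4)
        ) ss
</pre>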
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/06/30/indexing-for-order-by-limit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
