<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>EXPLAIN EXTENDED &#187; MySQL</title>
	<atom:link href="http://explainextended.com/category/mysql/feed/" rel="self" type="application/rss+xml" />
	<link>http://explainextended.com</link>
	<description>How to create fast database queries</description>
	<lastBuildDate>Tue, 09 Mar 2010 18:43:36 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Aggregates and LEFT JOIN</title>
		<link>http://explainextended.com/2010/03/05/aggregates-and-left-join/</link>
		<comments>http://explainextended.com/2010/03/05/aggregates-and-left-join/#comments</comments>
		<pubDate>Fri, 05 Mar 2010 20:00:09 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4548</guid>
		<description><![CDATA[From Stack Overflow:
I have a table product with products and table sale with all sale operations that were done on these products.
I would like to get 10 most often sold products today and what I did is this:

SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
       [...]]]></description>
			<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/2388122/how-to-increase-last-day-count-query-performance"><strong>Stack Overflow</strong></a>:</p>
<blockquote><p>I have a table <code>product</code> with products and table <code>sale</code> with all sale operations that were done on these products.</p>
<p>I would like to get <strong>10</strong> most often sold products today and what I did is this:</p>
<pre class="brush: sql">
SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
        sale s
ON      s.product_id = p.id
        AND s.dt &gt;= &#039;2010-01-01&#039;
        AND s.dt &lt; &#039;2010-01-02&#039;
GROUP BY
        p.id
ORDER BY
        sumsell DESC
LIMIT 10
</pre>
<p>, but performance of it is very slow.</p>
<p>What can I do to increase performance of this particular query?
</p></blockquote>
<p>The query involves a <code>LEFT JOIN</code>, which in <strong>MySQL</strong> means that <code>product</code> will be made the leading table in the query. Each record of <code>product</code> will be taken and checked against the <code>sale</code> table to count the matching records. If no matching records are found, <strong>0</strong> is returned.</p>
<p>Let&#8217;s create the sample tables:<br />
<span id="more-4548"></span><br />
<a href="#" onclick="xcollapse('X9754');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X9754" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE product (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE sale (
        id INT NOT NULL PRIMARY KEY,
        product_id INT NOT NULL,
        amount FLOAT NOT NULL,
        dt DATETIME NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(500000);
COMMIT;

INSERT
INTO    product
SELECT  id, CONCAT(&#039;Product &#039;, id)
FROM    filler;

INSERT
INTO    sale (id, product_id, amount, dt)
SELECT  id,
        FLOOR(RAND(20100305) * 500000) + 1,
        RAND(20100305 &lt;&lt; 1) * 100 + 1,
        &#039;2010-03-06&#039; - INTERVAL id MINUTE
FROM    filler;

CREATE INDEX ix_sale_product_dt ON sale (product_id, dt);
CREATE INDEX ix_sale_dt_product ON sale (dt, product_id);
</pre>
</div>
<p>The tables contain <strong>500,000</strong> products and <strong>500,000</strong> random sales (<strong>1,440</strong> sales per day).</p>
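<p>As a quick sanity check of the generated data, a query along these lines counts the sales per calendar day. It is only a sketch and not part of the original setup script; the aliases <code>sale_date</code> and <code>sales</code> are made up for illustration. Since the script inserts one sale per minute, each full day should show <strong>1,440</strong> rows:</p>
<pre class="brush: sql">
-- Count the generated sales per calendar day, most recent days first
SELECT  DATE(dt) AS sale_date, COUNT(*) AS sales
FROM    sale
GROUP BY
        sale_date
ORDER BY
        sale_date DESC
LIMIT 5
</pre>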
<p>Now, let&#8217;s run a query similar to the author&#8217;s. I adjusted the period so that fewer than <strong>10</strong> sales were made during it and the rows contributed by the <code>LEFT JOIN</code> (products with no sales) can be seen in the result:</p>
<pre class="brush: sql">
SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
        sale s
ON      s.product_id = p.id
        AND s.dt &gt;= &#039;2010-01-01&#039;
        AND s.dt &lt; &#039;2010-01-01 00:07:00&#039;
GROUP BY
        p.id
ORDER BY
        sumsell DESC, p.id
LIMIT 10
</pre>
<p><a href="#" onclick="xcollapse('X3182');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X3182" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>name</th>
<th>sumsell</th>
</tr>
<tr>
<td class="integer">50630</td>
<td class="varchar">Product 50630</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">143395</td>
<td class="varchar">Product 143395</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">222114</td>
<td class="varchar">Product 222114</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">322966</td>
<td class="varchar">Product 322966</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">340133</td>
<td class="varchar">Product 340133</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">345789</td>
<td class="varchar">Product 345789</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">462937</td>
<td class="varchar">Product 462937</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="varchar">Product 1</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="varchar">Product 2</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="varchar">Product 3</td>
<td class="bigint">0</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0009s (3.0312s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">p</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">5007270.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">s</td>
<td class="varchar">ref</td>
<td class="varchar">ix_sale_product_dt,ix_sale_dt_product</td>
<td class="varchar">ix_sale_product_dt</td>
<td class="varchar">4</td>
<td class="varchar">20100305_left.p.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
select `20100305_left`.`p`.`id` AS `id`,`20100305_left`.`p`.`name` AS `name`,count(`20100305_left`.`s`.`id`) AS `sumsell` from `20100305_left`.`product` `p` left join `20100305_left`.`sale` `s` on(((`20100305_left`.`s`.`product_id` = `20100305_left`.`p`.`id`) and (`20100305_left`.`s`.`dt` &gt;= &#39;2010-01-01&#39;) and (`20100305_left`.`s`.`dt` &lt; &#39;2010-01-01 00:07:00&#39;))) where 1 group by `20100305_left`.`p`.`id` order by count(`20100305_left`.`s`.`id`) desc,`20100305_left`.`p`.`id` limit 10
</pre>
</div>
<p>The query runs for <strong>3 seconds</strong>.</p>
<p>We see that, first, <code>product</code> is made leading, and, second, only a part of the index on <code>sale (product_id, dt)</code> is used (<code>key_len</code> is <strong>4</strong>, the size of <code>product_id</code> alone): each sale is only filtered on product, not on date.</p>
<p>Since there were only <strong>7</strong> sales during the period we have chosen, it would be wiser to make <code>sale</code> leading in the join, so that it is filtered on date first and the resulting recordset is then joined to <code>product</code>. This would result in at most <strong>7</strong> <code>PRIMARY KEY</code> seeks instead of <strong>500,000</strong> index lookups and would be much more efficient.</p>
<p>However, this is only possible with an <code>INNER JOIN</code>, and if fewer than <strong>10</strong> products were sold within the time period, we will not see the rest.</p>
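<p>For illustration, the <code>INNER JOIN</code> variant could look like the sketch below (it is not the final query). Because the date filter moves to the <code>WHERE</code> clause, the optimizer is free to read <code>sale</code> first through the index on <code>(dt, product_id)</code>; whether it actually does so depends on the plan it chooses. On the sample data, this should return only the <strong>7</strong> products sold during the period, with no zero-count rows to fill the remaining places:</p>
<pre class="brush: sql">
-- INNER JOIN sketch: products without sales in the period are dropped
SELECT  p.*, COUNT(*) AS sumsell
FROM    sale s
JOIN    product p
ON      p.id = s.product_id
WHERE   s.dt &gt;= &#039;2010-01-01&#039;
        AND s.dt &lt; &#039;2010-01-01 00:07:00&#039;
GROUP BY
        p.id
ORDER BY
        sumsell DESC, p.id
LIMIT 10
</pre>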
<p>To work around this, we need to emulate the <code>LEFT JOIN</code>:</p>
<ol>
<li>
<p>Find the products sold within the time period, using an <code>INNER JOIN</code> of <code>product</code> with the resultset containing aggregated sales.</p>
</li>
<li>
<p>Find the products <strong>not</strong> sold within the time period, using a <code>NOT EXISTS</code> predicate.</p>
</li>
<li>
<p>Concatenate the two resultsets using <code>UNION ALL</code>.</p>
</li>
</ol>
<p>Step <strong>2</strong> implies that <code>product</code> is leading again, so normally it would not be much of an improvement. But in our case, we don&#8217;t need the whole recordset: we only need the top <strong>10</strong> products.</p>
<p>So we can just order and limit the recordsets retrieved in steps <strong>1</strong> and <strong>2</strong> to ten records each, concatenate them, then order and limit the resulting recordset again to ten records.</p>
<p>The second resultset will contain a hardcoded <strong>0</strong> in <code>sumsell</code>, so we just need to order it on <code>product.id</code>. Since <code>product</code> is an <strong>InnoDB</strong> table and <code>product.id</code> is a clustered <code>PRIMARY KEY</code>, the first ten records can be read in key order without sorting.</p>
<p>Here&#8217;s the query:</p>
<pre class="brush: sql">
SELECT  p.*, sumsell
FROM    (
        SELECT  *
        FROM    (
                SELECT  product_id, sumsell
                FROM    (
                        SELECT  product_id, COUNT(*) AS sumsell
                        FROM    sale si
                        WHERE   dt &gt;= &#039;2010-01-01&#039;
                                AND dt &lt; &#039;2010-01-01 00:07:00&#039;
                        GROUP BY
                                product_id
                        ) si
                ORDER BY
                        sumsell DESC, product_id
                LIMIT 10
                ) q1
        UNION ALL
        SELECT  *
        FROM    (
                SELECT  p.id, 0
                FROM    product p
                WHERE   NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    sale si
                        WHERE   product_id = p.id
                                AND dt &gt;= &#039;2010-01-01&#039;
                                AND dt &lt; &#039;2010-01-01 00:07:00&#039;
                        )
                ORDER BY
                        p.id
                LIMIT 10
                ) q2
        ORDER BY
                sumsell DESC, product_id
        LIMIT 10
        ) q
JOIN    product p
ON      p.id = q.product_id
</pre>
<p><a href="#" onclick="xcollapse('X1041');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X1041" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>name</th>
<th>sumsell</th>
</tr>
<tr>
<td class="integer">50630</td>
<td class="varchar">Product 50630</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">143395</td>
<td class="varchar">Product 143395</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">222114</td>
<td class="varchar">Product 222114</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">322966</td>
<td class="varchar">Product 322966</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">340133</td>
<td class="varchar">Product 340133</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">345789</td>
<td class="varchar">Product 345789</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">462937</td>
<td class="varchar">Product 462937</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="varchar">Product 1</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="varchar">Product 2</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="varchar">Product 3</td>
<td class="bigint">0</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0012s (0.0064s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">p</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">q.product_id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">7</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived4&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">7</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DERIVED</td>
<td class="varchar">si</td>
<td class="varchar">range</td>
<td class="varchar">ix_sale_dt_product</td>
<td class="varchar">ix_sale_dt_product</td>
<td class="varchar">8</td>
<td class="varchar"></td>
<td class="bigint">6</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index; Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">5</td>
<td class="varchar">UNION</td>
<td class="varchar">&lt;derived6&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">6</td>
<td class="varchar">DERIVED</td>
<td class="varchar">p</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">5007270.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">7</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">si</td>
<td class="varchar">ref</td>
<td class="varchar">ix_sale_product_dt,ix_sale_dt_product</td>
<td class="varchar">ix_sale_product_dt</td>
<td class="varchar">4</td>
<td class="varchar">20100305_left.p.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union2,5&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">Using filesort</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100305_left.p.id&#39; of SELECT #7 was resolved in SELECT #6
select `20100305_left`.`p`.`id` AS `id`,`20100305_left`.`p`.`name` AS `name`,`q`.`sumsell` AS `sumsell` from (select `q1`.`product_id` AS `product_id`,`q1`.`sumsell` AS `sumsell` from (select `si`.`product_id` AS `product_id`,`si`.`sumsell` AS `sumsell` from (select `20100305_left`.`si`.`product_id` AS `product_id`,count(0) AS `sumsell` from `20100305_left`.`sale` `si` where ((`20100305_left`.`si`.`dt` &gt;= &#39;2010-01-01&#39;) and (`20100305_left`.`si`.`dt` &lt; &#39;2010-01-01 00:07:00&#39;)) group by `20100305_left`.`si`.`product_id`) `si` order by `si`.`sumsell` desc,`si`.`product_id` limit 10) `q1` union all select `q2`.`id` AS `id`,`q2`.`0` AS `0` from (select `20100305_left`.`p`.`id` AS `id`,0 AS `0` from `20100305_left`.`product` `p` where (not(exists(select NULL AS `NULL` from `20100305_left`.`sale` `si` where ((`20100305_left`.`si`.`product_id` = `20100305_left`.`p`.`id`) and (`20100305_left`.`si`.`dt` &gt;= &#39;2010-01-01&#39;) and (`20100305_left`.`si`.`dt` &lt; &#39;2010-01-01 00:07:00&#39;))))) order by `20100305_left`.`p`.`id` limit 10) `q2` order by `sumsell` desc,`product_id` limit 10) `q` join `20100305_left`.`product` `p` where (`20100305_left`.`p`.`id` = `q`.`product_id`)
</pre>
</div>
<p>This query completes in less than <strong>7 ms</strong> (which is comparable to the time measurement error).</p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/03/05/aggregates-and-left-join/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Matching sets: aggregates vs. first miss</title>
		<link>http://explainextended.com/2010/02/25/matching-sets-aggregates-vs-first-miss/</link>
		<comments>http://explainextended.com/2010/02/25/matching-sets-aggregates-vs-first-miss/#comments</comments>
		<pubDate>Thu, 25 Feb 2010 20:00:47 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4436</guid>
		<description><![CDATA[From Stack Overflow:

Here is my schema:

Suppliers

sid
sname
address 



Parts

pid
pname
color



Catalog

sid
pid
cost


I need to find the sids of suppliers who supply every red part or every green part.

This task requires matching the sets.
We need to compare two sets here: the first one is the set of the parts of given color; the second one is the set of parts provided [...]]]></description>
			<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/2328457/mysql-how-can-i-condense-this-verbose-query"><strong>Stack Overflow</strong></a>:</p>
<blockquote><p>
Here is my schema:</p>
<table class="excel">
<caption>Suppliers</caption>
<tr>
<th>sid</th>
<th>sname</th>
<th>address </th>
</tr>
</table>
<table class="excel">
<caption>Parts</caption>
<tr>
<th>pid</th>
<th>pname</th>
<th>color</th>
</tr>
</table>
<table class="excel">
<caption>Catalog</caption>
<tr>
<th>sid</th>
<th>pid</th>
<th>cost</th>
</tr>
</table>
<p>I need to find the sids of suppliers who supply every <strong>red</strong> part or every <strong>green</strong> part.
</p></blockquote>
<p>This task requires matching the sets.</p>
<p>We need to compare two sets here: the first one is the set of the parts of given color; the second one is the set of parts provided by a given supplier. The former should be the subset of the latter.</p>
<p>Unlike other engines, <strong>MySQL</strong> does not provide set operators like <code>EXCEPT</code> or <code>MINUS</code> that make it easy to check the subset / superset relationship. We have to use record-based solutions.</p>
<p>There are two ways to check that:</p>
<ul>
<li><q>First miss</q> technique: test each record from the subset candidate against the superset candidate, returning <code>FALSE</code> if there is no match.</li>
<li><q>Aggregate</q> technique: compare the number of records in the subset candidate to the number of records in the intersection of the two sets. If the numbers are equal, the sets match.</li>
</ul>
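<p>Both techniques can be sketched for a single supplier and a single color. The supplier id <strong>1</strong>, the color <strong>red</strong> and the alias <code>supplies_all_red</code> are arbitrary choices made for illustration; the table and column names follow the schema from the question, and the aggregate version assumes that a supplier lists each part at most once. Each query returns <strong>1</strong> if the supplier provides every red part and <strong>0</strong> otherwise:</p>
<pre class="brush: sql">
-- First miss: look for a red part that supplier 1 does not provide;
-- the check can stop at the first such part
SELECT  NOT EXISTS
        (
        SELECT  NULL
        FROM    parts p
        WHERE   p.color = &#039;red&#039;
                AND NOT EXISTS
                (
                SELECT  NULL
                FROM    catalog c
                WHERE   c.sid = 1
                        AND c.pid = p.pid
                )
        ) AS supplies_all_red;

-- Aggregate: compare the number of red parts
-- to the number of red parts provided by supplier 1
SELECT  (
        SELECT  COUNT(*)
        FROM    parts
        WHERE   color = &#039;red&#039;
        ) =
        (
        SELECT  COUNT(*)
        FROM    parts p
        JOIN    catalog c
        ON      c.pid = p.pid
        WHERE   p.color = &#039;red&#039;
                AND c.sid = 1
        ) AS supplies_all_red;
</pre>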
<p>Let&#8217;s test which way is faster in which cases. To do this, we will need some sample tables:<br />
<span id="more-4436"></span><br />
<a href="#" onclick="xcollapse('X6352');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X6352" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE suppliers (
        sid INT NOT NULL PRIMARY KEY,
        sname VARCHAR(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE parts (
        pid INT NOT NULL PRIMARY KEY,
        pname VARCHAR(100) NOT NULL,
        color VARCHAR(100) NOT NULL,
        KEY ix_parts_color (color)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE catalog (
        sid INT NOT NULL,
        pid INT NOT NULL,
        cost REAL NOT NULL,
        PRIMARY KEY pk_catalog_sp(sid, pid),
        KEY ix_catalog_pid (pid)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(10000);
COMMIT;

INSERT
INTO    suppliers
SELECT  id, CONCAT(&#039;Supplier &#039;, id)
FROM    filler
ORDER BY
        id
LIMIT 10000;

INSERT
INTO    parts
SELECT  id, CONCAT(&#039;Part &#039;, id), CASE WHEN id &lt;= 20 THEN ELT(id % 2 + 1, &#039;red&#039;, &#039;green&#039;) ELSE &#039;blue&#039; END
FROM    filler
ORDER BY
        id
LIMIT 2000;

INSERT
INTO    catalog
SELECT  sid, pid, 100
FROM    (
        SELECT  sid, pid, RAND(20100225) AS rnd
        FROM    suppliers
        JOIN    parts
        ON      color IN (&#039;red&#039;, &#039;green&#039;)
        ) q
WHERE   rnd &lt; 0.4;

INSERT
INTO    catalog
SELECT  sid, pid, 100
FROM    (
        SELECT  sid, pid, RAND(20100225 &lt;&lt; 1) AS rnd
        FROM    (
                SELECT  sid
                FROM    suppliers
                ORDER BY
                        sid
                LIMIT 200
                ) q
        JOIN    parts
        ON      color = &#039;blue&#039;
        ) q2
WHERE   rnd &lt; 0.998;
</pre>
</div>
<p>There are <strong>10,000</strong> suppliers and <strong>2,000</strong> parts.</p>
<p>The parts can be red, green or blue. There are <strong>10</strong> red and <strong>10</strong> green parts, and they are distributed evenly across the suppliers. With blue, the situation is different: there are <strong>1,980</strong> blue parts, and only the first <strong>200</strong> suppliers provide them. However, each of these suppliers is very likely to have any given blue part available.</p>
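<p>The distribution described above can be checked with a couple of queries like these (a sketch; the aliases <code>parts_total</code> and <code>suppliers_total</code> are made up for illustration). The first one should show <strong>10</strong> red, <strong>10</strong> green and <strong>1,980</strong> blue parts; the second one counts how many suppliers provide at least one part of each color:</p>
<pre class="brush: sql">
-- Number of parts of each color
SELECT  color, COUNT(*) AS parts_total
FROM    parts
GROUP BY
        color;

-- Number of suppliers providing at least one part of each color
SELECT  p.color, COUNT(DISTINCT c.sid) AS suppliers_total
FROM    parts p
JOIN    catalog c
ON      c.pid = p.pid
GROUP BY
        p.color;
</pre>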
<h3>First miss</h3>
<p>The first miss is a combination of <code>NOT IN</code> / <code>NOT EXISTS</code> clauses that immediately return <code>FALSE</code> whenever a single non-matching record is found. Since <strong>MySQL</strong> can only do nested loops, the performance of these queries is heavily dependent on proper indexing.</p>
<p>Let&#8217;s run this query to search for red or green parts:</p>
<pre class="brush: sql">
SELECT  *
FROM    suppliers s
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  &#039;red&#039; AS color
                UNION ALL
                SELECT  &#039;green&#039; AS color
                ) ci
        WHERE   color NOT IN
                (
                SELECT  color
                FROM    parts p
                WHERE   p.pid NOT IN
                        (
                        SELECT  pid
                        FROM    catalog c
                        WHERE   c.sid = s.sid
                        )
                )
        )
</pre>
<p><a href="#" onclick="xcollapse('X8563');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X8563" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>sid</th>
<th>sname</th>
</tr>
<tr>
<td class="integer">7442</td>
<td class="varchar">Supplier 7442</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0003s (2.1719s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">s</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10342</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">2</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">5</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">p</td>
<td class="varchar">ref</td>
<td class="varchar">ix_parts_color</td>
<td class="varchar">ix_parts_color</td>
<td class="varchar">302</td>
<td class="varchar">func</td>
<td class="bigint">439</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">6</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">c</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY,ix_catalog_pid</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">8</td>
<td class="varchar">20100225_sets.s.sid,func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union3,4&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100225_sets.s.sid&#39; of SELECT #6 was resolved in SELECT #1
select `20100225_sets`.`s`.`sid` AS `sid`,`20100225_sets`.`s`.`sname` AS `sname` from `20100225_sets`.`suppliers` `s` where exists(select NULL AS `NULL` from (select &#39;red&#39; AS `color` union all select &#39;green&#39; AS `color`) `ci` where (not(&lt;in_optimizer&gt;(`ci`.`color`,&lt;exists&gt;(select 1 AS `Not_used` from `20100225_sets`.`parts` `p` where ((not(&lt;in_optimizer&gt;(`20100225_sets`.`p`.`pid`,&lt;exists&gt;(select 1 AS `Not_used` from `20100225_sets`.`catalog` `c` where ((`20100225_sets`.`c`.`sid` = `20100225_sets`.`s`.`sid`) and (&lt;cache&gt;(`20100225_sets`.`p`.`pid`) = `20100225_sets`.`c`.`pid`)))))) and (convert(&lt;cache&gt;(`ci`.`color`) using utf8) = `20100225_sets`.`p`.`color`)))))))
</pre>
</div>
<p>This query runs for <strong>2.17 s</strong>.</p>
<p>The same query for the blue parts:</p>
<pre class="brush: sql">
SELECT  *
FROM    suppliers s
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  &#039;blue&#039; AS color
                ) ci
        WHERE   color NOT IN
                (
                SELECT  color
                FROM    parts p
                WHERE   p.pid NOT IN
                        (
                        SELECT  pid
                        FROM    catalog c
                        WHERE   c.sid = s.sid
                        )
                )
        )
</pre>
<p><a href="#" onclick="xcollapse('X8564');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X8564" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>sid</th>
<th>sname</th>
</tr>
<tr>
<td class="integer">10</td>
<td class="varchar">Supplier 10</td>
</tr>
<tr>
<td class="integer">96</td>
<td class="varchar">Supplier 96</td>
</tr>
<tr>
<td class="integer">101</td>
<td class="varchar">Supplier 101</td>
</tr>
<tr class="statusbar">
<td colspan="100">3 rows fetched in 0.0005s (2.4375s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">s</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10342</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">system</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">p</td>
<td class="varchar">ref</td>
<td class="varchar">ix_parts_color</td>
<td class="varchar">ix_parts_color</td>
<td class="varchar">302</td>
<td class="varchar">func</td>
<td class="bigint">439</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">5</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">c</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY,ix_catalog_pid</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">8</td>
<td class="varchar">20100225_sets.s.sid,func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100225_sets.s.sid&#39; of SELECT #5 was resolved in SELECT #1
select `20100225_sets`.`s`.`sid` AS `sid`,`20100225_sets`.`s`.`sname` AS `sname` from `20100225_sets`.`suppliers` `s` where exists(select NULL AS `NULL` from (select &#39;blue&#39; AS `color`) `ci` where (not(&lt;in_optimizer&gt;(&#39;blue&#39;,&lt;exists&gt;(select 1 AS `Not_used` from `20100225_sets`.`parts` `p` where ((not(&lt;in_optimizer&gt;(`20100225_sets`.`p`.`pid`,&lt;exists&gt;(select 1 AS `Not_used` from `20100225_sets`.`catalog` `c` where ((`20100225_sets`.`c`.`sid` = `20100225_sets`.`s`.`sid`) and (&lt;cache&gt;(`20100225_sets`.`p`.`pid`) = `20100225_sets`.`c`.`pid`)))))) and (convert(&lt;cache&gt;(&#39;blue&#39;) using utf8) = `20100225_sets`.`p`.`color`)))))))
</pre>
</div>
<p>This time the query is a little bit slower: <strong>2.43 s</strong>.</p>
<h3>Aggregate</h3>
<p>In the aggregate solution, we first calculate the total number of parts for each color and then count the parts of this color supplied by each supplier. If the counts match, the sets match too.</p>
<p>Let&#8217;s try it on red and green parts first:</p>
<pre class="brush: sql">
SELECT  c.sid, p.color
FROM    (
        SELECT  color, COUNT(*) AS total
        FROM    parts
        WHERE   color IN (&#039;red&#039;, &#039;green&#039;)
        GROUP BY
                color
        ) t
JOIN    parts p
ON      p.color = t.color
JOIN    catalog c
ON      c.pid = p.pid
GROUP BY
        sid, color, total
HAVING  COUNT(*) = total
</pre>
<p><a href="#" onclick="xcollapse('X8565');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X8565" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>sid</th>
<th>color</th>
</tr>
<tr>
<td class="integer">7442</td>
<td class="varchar">green</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0003s (1.1406s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">2</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">p</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY,ix_parts_color</td>
<td class="varchar">ix_parts_color</td>
<td class="varchar">302</td>
<td class="varchar">t.color</td>
<td class="bigint">439</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">c</td>
<td class="varchar">ref</td>
<td class="varchar">ix_catalog_pid</td>
<td class="varchar">ix_catalog_pid</td>
<td class="varchar">4</td>
<td class="varchar">20100225_sets.p.pid</td>
<td class="bigint">109</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">parts</td>
<td class="varchar">range</td>
<td class="varchar">ix_parts_color</td>
<td class="varchar">ix_parts_color</td>
<td class="varchar">302</td>
<td class="varchar"></td>
<td class="bigint">19</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
select `20100225_sets`.`c`.`sid` AS `sid`,`20100225_sets`.`p`.`color` AS `color` from (select `20100225_sets`.`parts`.`color` AS `color`,count(0) AS `total` from `20100225_sets`.`parts` where (`20100225_sets`.`parts`.`color` in (&#39;red&#39;,&#39;green&#39;)) group by `20100225_sets`.`parts`.`color`) `t` join `20100225_sets`.`parts` `p` join `20100225_sets`.`catalog` `c` where ((`20100225_sets`.`p`.`color` = `t`.`color`) and (`20100225_sets`.`c`.`pid` = `20100225_sets`.`p`.`pid`)) group by `20100225_sets`.`c`.`sid`,`20100225_sets`.`p`.`color`,`t`.`total` having (count(0) = `t`.`total`)
</pre>
</div>
<p>And the same query on the blue parts:</p>
<pre class="brush: sql">
SELECT  c.sid, p.color
FROM    (
        SELECT  color, COUNT(*) AS total
        FROM    parts
        WHERE   color IN (&#039;blue&#039;)
        GROUP BY
                color
        ) t
JOIN    parts p
ON      p.color = t.color
JOIN    catalog c
ON      c.pid = p.pid
GROUP BY
        sid, color, total
HAVING  COUNT(*) = total
</pre>
<p><a href="#" onclick="xcollapse('X8566');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X8566" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>sid</th>
<th>color</th>
</tr>
<tr>
<td class="integer">10</td>
<td class="varchar">blue</td>
</tr>
<tr>
<td class="integer">96</td>
<td class="varchar">blue</td>
</tr>
<tr>
<td class="integer">101</td>
<td class="varchar">blue</td>
</tr>
<tr class="statusbar">
<td colspan="100">3 rows fetched in 0.0004s (3.0906s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">system</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">p</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY,ix_parts_color</td>
<td class="varchar">ix_parts_color</td>
<td class="varchar">302</td>
<td class="varchar">const</td>
<td class="bigint">1319</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">c</td>
<td class="varchar">ref</td>
<td class="varchar">ix_catalog_pid</td>
<td class="varchar">ix_catalog_pid</td>
<td class="varchar">4</td>
<td class="varchar">20100225_sets.p.pid</td>
<td class="bigint">109</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">parts</td>
<td class="varchar">ref</td>
<td class="varchar">ix_parts_color</td>
<td class="varchar">ix_parts_color</td>
<td class="varchar">302</td>
<td class="varchar"></td>
<td class="bigint">1319</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
select `20100225_sets`.`c`.`sid` AS `sid`,`20100225_sets`.`p`.`color` AS `color` from (select `20100225_sets`.`parts`.`color` AS `color`,count(0) AS `total` from `20100225_sets`.`parts` where (`20100225_sets`.`parts`.`color` = &#39;blue&#39;) group by `20100225_sets`.`parts`.`color`) `t` join `20100225_sets`.`parts` `p` join `20100225_sets`.`catalog` `c` where ((`20100225_sets`.`p`.`color` = &#39;blue&#39;) and (`20100225_sets`.`c`.`pid` = `20100225_sets`.`p`.`pid`)) group by `20100225_sets`.`c`.`sid` having (count(0) = &#39;1980&#39;)
</pre>
</div>
<p>We see that the queries complete in <strong>1.14 seconds</strong> and <strong>3.09 seconds</strong>, respectively.</p>
<p>The <q>aggregate</q> method is more efficient for the red and green parts, while the <q>first miss</q> is more efficient for the blue parts.</p>
<h3>Analysis</h3>
<p>What is the reason for such a difference in performance?</p>
<p>The <q>first miss</q> method generally needs to parse fewer records (only those before the first miss), but each record is searched for in a nested loop, starting from the index root. The aggregate method needs to parse all records to calculate the <code>COUNT(*)</code>, but these records are fetched in a sequential index access.</p>
<p>The red and green parts have a large number of small sets, with at most <strong>10</strong> records in each. The probability of a miss is relatively small: a large number of records has to be parsed before any record is missed in a set. The aggregates, on the other hand, can be calculated very fast, since there are relatively few records to aggregate.</p>
<p>With the blue parts, the situation is different. There are few large sets, and calculating the aggregates requires fetching, sorting and grouping too many records. First misses, on the other hand, occur almost instantly: the vast majority of the suppliers do not offer any blue parts at all.</p>
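<p>To make the contrast concrete, both approaches can be sketched for a single color. This is a simplified illustration rather than the exact queries benchmarked above; it assumes the <code>suppliers</code>, <code>parts</code> and <code>catalog</code> tables from this article and that each <code>(sid, pid)</code> pair occurs at most once in <code>catalog</code>:</p>
<pre class="brush: sql">
-- First miss: keep a supplier unless some blue part is absent from its catalog.
-- The inner NOT EXISTS is probed once per (supplier, part) pair and the outer
-- NOT EXISTS stops at the first part that is missing.
SELECT  s.sid
FROM    suppliers s
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    parts p
        WHERE   p.color = &#039;blue&#039;
                AND NOT EXISTS
                (
                SELECT  NULL
                FROM    catalog c
                WHERE   c.sid = s.sid
                        AND c.pid = p.pid
                )
        );

-- Aggregate: count the blue parts each supplier offers and compare
-- that count with the total number of blue parts.
SELECT  c.sid
FROM    parts p
JOIN    catalog c
ON      c.pid = p.pid
WHERE   p.color = &#039;blue&#039;
GROUP BY
        c.sid
HAVING  COUNT(*) = (SELECT COUNT(*) FROM parts WHERE color = &#039;blue&#039;);
</pre>
<p>The two sketches differ in one corner case: when no part of the given color exists, the first query returns every supplier, while the second returns none.</p>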
<h3>Summary</h3>
<p>As often happens, the performance of the two methods of comparing sets depends on the data distribution.</p>
<p>Sets with fewer records and a lower probability of a miss will benefit from the aggregate method, since the performance gain from sequential access to the records outweighs the need to parse a larger number of records.</p>
<p>Sets with more records and a higher probability of a miss will return the misses very soon, so the first miss method is more beneficial for them.</p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/02/25/matching-sets-aggregates-vs-first-miss/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tags in nested sets: efficient indexing</title>
		<link>http://explainextended.com/2010/02/16/tags-in-nested-sets-efficient-indexing/</link>
		<comments>http://explainextended.com/2010/02/16/tags-in-nested-sets-efficient-indexing/#comments</comments>
		<pubDate>Tue, 16 Feb 2010 20:00:11 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4326</guid>
		<description><![CDATA[Answering questions asked on the site.
Travis asks:
I just read your excellent post: Hierarchical data in MySQL: parents and children in one query.
I am currently trying to do something very similar to this.  The main difference is, I am working with data that is in a nested set.
I would like to construct a query that [...]]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Travis</strong> asks:</p>
<blockquote><p>I just read your excellent post: <a href="/2009/07/20/hierarchical-data-in-mysql-parents-and-children-in-one-query/"><strong>Hierarchical data in MySQL: parents and children in one query</strong></a>.</p>
<p>I am currently trying to do something very similar to this.  The main difference is, I am working with data that is in a nested set.</p>
<p>I would like to construct a query that will return a resultset of nodes matched with <code>LIKE</code>, the ancestors of each of these nodes, and the immediate children of each of these nodes.
</p></blockquote>
<p>This is a very interesting question: it lets us demonstrate the use of three types of indexes available for <strong>MyISAM</strong> tables: <code>BTREE</code>, <code>SPATIAL</code> and <code>FULLTEXT</code>.</p>
<p><a href="http://en.wikipedia.org/wiki/Nested_set_model">Nested sets</a> is one of the two models most often used to store hierarchical data in a relational table (the other being the <a href="http://en.wikipedia.org/wiki/Adjacency_list">adjacency list model</a>). Unlike the adjacency list, nested sets do not require the ability to write recursive queries, which <strong>SQL</strong> originally lacked and <strong>MySQL</strong> still lacks (although it can be emulated to some extent). The model is widely used in the <strong>MySQL</strong> world.</p>
<p>I described both models and compared them in an article I wrote some time ago:</p>
<ul>
<li><a href="/2009/09/29/adjacency-list-vs-nested-sets-mysql/"><strong>Adjacency list vs. nested sets: MySQL</strong></a></li>
</ul>
<p>The main problem with the nested sets model is that, though it is extremely fast at selecting the descendants of a given node, it is very slow at selecting its ancestors.</p>
<p>This is because of the way <strong>B-Tree</strong> indexes work. They can be used to query for values of a column within the range of two constants, but not for the values of two columns holding a single constant between them. One needs the first condition to select the descendants (found between the <code>lft</code> and <code>rgt</code> of the given node), and the second condition to select the ancestors (with <code>lft</code> and <code>rgt</code> containing the <code>lft</code> and <code>rgt</code> of the given node).</p>
<p>That&#8217;s why selecting the descendants is fast and selecting the ancestors is slow.</p>
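<p>As a rough illustration, the two access patterns look like this. The sketch assumes a table with <code>id</code>, <code>lft</code> and <code>rgt</code> columns, such as the <code>t_hierarchy</code> table created later in this article, and an arbitrarily chosen node <code>id</code>; both queries also return the node itself:</p>
<pre class="brush: sql">
-- Descendants: one column (hc.lft) lies between two constants taken
-- from the given node, so a B-Tree index on lft can serve the range.
SELECT  hc.id
FROM    t_hierarchy hp
JOIN    t_hierarchy hc
ON      hc.lft BETWEEN hp.lft AND hp.rgt
WHERE   hp.id = 3;

-- Ancestors: one constant (hc.lft) lies between two columns
-- (ha.lft and ha.rgt), which no single B-Tree range can cover.
SELECT  ha.id
FROM    t_hierarchy hc
JOIN    t_hierarchy ha
ON      hc.lft BETWEEN ha.lft AND ha.rgt
WHERE   hc.id = 3;
</pre>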
<p>To work around this, the sets that form the hierarchy can be described as geometrical objects, with the larger sets containing the smaller sets. These sets can be indexed with a <code>SPATIAL</code> index, which is designed specifically for this purpose, so that both descendants and ancestors can be queried very efficiently.</p>
<p>Unfortunately, finding the depth level is quite a hard task for the nested sets model even with the <code>SPATIAL</code> indexes.</p>
<p>It would be quite an easy task if <strong>MySQL</strong> supported recursion: we could just run a query to find the siblings of each record by skipping their whole domains recursively.</p>
<p>However, <strong>MySQL</strong>&#8217;s recursion support is very limited and relies on session variables, which are not recommended for use in complex queries.</p>
<p>To cope with this, we need to mix the nested sets and the adjacency list models. The hierarchy will be stored in two seemingly redundant ways: the unique <code>parent</code> and the <code>LineString</code> representing the nested sets.</p>
<p>This lets us use the <strong>R-Tree</strong> index to find all ancestors of a given node and the <strong>B-Tree</strong> index to find its immediate children.</p>
<p>Finally, the question mentions using <code>LIKE</code> to find the initial nodes. A <code>LIKE</code> predicate with leading wildcards is not sargable in <strong>MySQL</strong>. However, it seems that the leading wildcards are only used to split the words. In this case, a <code>FULLTEXT</code> index and a <code>MATCH</code> query would be much more efficient: a <code>FULLTEXT</code> index allows indexing a single record with several keys (each one corresponding to a single word in the column&#8217;s text), so a search for a word in a space-separated or comma-separated list uses the index and is much faster than scanning the whole table.</p>
<p>Hence, the query would use all three main types of indexes: <code>BTREE</code>, <code>SPATIAL</code> and <code>FULLTEXT</code>.</p>
<p>To illustrate everything said above, let&#8217;s create a sample table:<br />
<span id="more-4326"></span><br />
<a href="#" onclick="xcollapse('X9716');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X9716" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_hierarchy (
        id INT NOT NULL PRIMARY KEY,
        parent INT NOT NULL,
        lft INT NOT NULL,
        rgt INT NOT NULL,
        sets LineString NOT NULL,
        data VARCHAR(100) NOT NULL,
        tags TEXT NOT NULL
) ENGINE=MyISAM;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

CREATE PROCEDURE prc_hierarchy(width INT)
main:BEGIN
        DECLARE last INT;
        DECLARE level INT;
        SET last = 0;
        SET level = 0;
        WHILE width &gt;= 1 DO
                INSERT
                INTO    t_hierarchy
                SELECT  COALESCE(h.id, 0) * 5 + f.id,
                        COALESCE(h.id, 0),
                        COALESCE(h.lft, 0) + 1 + (f.id - 1) * width,
                        COALESCE(h.lft, 0) + f.id * width,
                        LineString(
                        Point(-1, COALESCE(h.lft, 0) + 1 + (f.id - 1) * width),
                        Point(1, COALESCE(h.lft, 0) + f.id * width)
                        ),
                        CONCAT(&#039;Value &#039;, COALESCE(h.id, 0) * 5 + f.id),
                        (
                        SELECT  GROUP_CONCAT(CONCAT(&#039;tag&#039;, FLOOR(RAND(20100216 + width * 1000) * 300000)) SEPARATOR &#039; &#039;)
                        FROM    (
                                SELECT  1
                                UNION ALL
                                SELECT  1
                                UNION ALL
                                SELECT  1
                                ) q
                        )
                FROM    filler f
                LEFT JOIN
                        t_hierarchy h
                ON      h.id &gt;= last;
                SET width = width / 5;
                SET last = last + POWER(5, level);
                SET level = level + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(5);
CALL prc_hierarchy(117187);
COMMIT;

CREATE INDEX ix_hierarchy_parent ON t_hierarchy (parent);
CREATE UNIQUE INDEX ix_hierarchy_lft ON t_hierarchy (lft);
CREATE UNIQUE INDEX ix_hierarchy_rgt ON t_hierarchy (rgt);
CREATE SPATIAL INDEX sx_hierarchy_sets ON t_hierarchy (sets);
CREATE FULLTEXT INDEX fx_hierarchy_tags ON t_hierarchy (tags);
</pre>
</div>
<p>This table represents an <strong>8</strong> level hierarchy with <strong>5</strong> children for each non-leaf item, and combines the nested sets and adjacency list models.</p>
<p>The sets are stored in the plain <code>lft</code> and <code>rgt</code> columns as well as in the combined column of type <code>LineString</code> which represents a diagonal of a box horizontally spanning the interval from <code>lft</code> to <code>rgt</code>.</p>
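<p>To see how the three columns relate for a single row, the geometry can be printed as text. This is only an inspection sketch: it assumes the table above has been populated, and it uses <code>AsText()</code>, which later <strong>MySQL</strong> versions spell <code>ST_AsText()</code>:</p>
<pre class="brush: sql">
-- The Y coordinates of the two points should repeat lft and rgt,
-- while the X coordinates are the fixed -1 and 1 used at creation time.
SELECT  id, lft, rgt, AsText(sets) AS sets_wkt
FROM    t_hierarchy
WHERE   id = 3;
</pre>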
<h3>Selecting nodes using FULLTEXT</h3>
<p>First, let&#8217;s select the nodes tagged with a given tag.</p>
<p><code>RLIKE</code> can be used for that but it is not very efficient:</p>
<pre class="brush: sql">
SELECT  id, tags
FROM    t_hierarchy
WHERE   tags RLIKE &#039;[[:&lt;:]]tag13480[[:&gt;:]]&#039;
</pre>
<p><a href="#" onclick="xcollapse('X10844');return false;"><strong>View query results</strong></a><br />
</p>
<div id="X10844" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>tags</th>
</tr>
<tr>
<td class="integer">3</td>
<td class="blob">tag13480 tag33087 tag124996</td>
</tr>
<tr>
<td class="integer">440166</td>
<td class="blob">tag248489 tag271789 tag13480</td>
</tr>
<tr>
<td class="integer">212847</td>
<td class="blob">tag104605 tag13480 tag53585</td>
</tr>
<tr>
<td class="integer">161378</td>
<td class="blob">tag181040 tag231320 tag13480</td>
</tr>
<tr>
<td class="integer">324394</td>
<td class="blob">tag13480 tag181947 tag269297</td>
</tr>
<tr>
<td class="integer">400074</td>
<td class="blob">tag13480 tag176642 tag242772</td>
</tr>
<tr class="statusbar">
<td colspan="100">6 rows fetched in 0.0006s (1.2500s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">t_hierarchy</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">488280</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
</table>
</div>
<pre>
select `20100216_index`.`t_hierarchy`.`id` AS `id`,`20100216_index`.`t_hierarchy`.`tags` AS `tags` from `20100216_index`.`t_hierarchy` where (`20100216_index`.`t_hierarchy`.`tags` regexp &#39;[[:&lt;:]]tag13480[[:&gt;:]]&#39;)
</pre>
</div>
<p>This query uses a full table scan and runs for <strong>1.25 seconds</strong>.</p>
<p>To improve the query, we should rewrite the <code>WHERE</code> condition using the <code>MATCH</code> predicate (which in turn allows a <code>FULLTEXT</code> index to be used for the search):</p>
<pre class="brush: sql">
SELECT  id, tags
FROM    t_hierarchy
WHERE   MATCH(tags) AGAINST(&#039;+tag13480&#039; IN BOOLEAN MODE)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>tags</th>
</tr>
<tr>
<td class="integer">3</td>
<td class="blob">tag13480 tag33087 tag124996</td>
</tr>
<tr>
<td class="integer">440166</td>
<td class="blob">tag248489 tag271789 tag13480</td>
</tr>
<tr>
<td class="integer">212847</td>
<td class="blob">tag104605 tag13480 tag53585</td>
</tr>
<tr>
<td class="integer">161378</td>
<td class="blob">tag181040 tag231320 tag13480</td>
</tr>
<tr>
<td class="integer">324394</td>
<td class="blob">tag13480 tag181947 tag269297</td>
</tr>
<tr>
<td class="integer">400074</td>
<td class="blob">tag13480 tag176642 tag242772</td>
</tr>
<tr class="statusbar">
<td colspan="100">6 rows fetched in 0.0005s (0.0043s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">t_hierarchy</td>
<td class="varchar">fulltext</td>
<td class="varchar">fx_hierarchy_tags</td>
<td class="varchar">fx_hierarchy_tags</td>
<td class="varchar">0</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
</table>
</div>
<pre>
select `20100216_index`.`t_hierarchy`.`id` AS `id`,`20100216_index`.`t_hierarchy`.`tags` AS `tags` from `20100216_index`.`t_hierarchy` where (match `20100216_index`.`t_hierarchy`.`tags` against (&#39;+tag13480&#39; in boolean mode))
</pre>
<p>This query returns the same results but does it much faster, in only <strong>4 ms</strong>.</p>
<h3>Selecting ancestors using SPATIAL</h3>
<p>Now that we have a list of nodes, we should build the list of ancestors for each node.</p>
<p>Since our model combines adjacency list and nested sets, it is possible to use either representation to build a query. However, the adjacency list model requires recursion, and, while it is possible to emulate it, the emulation only works for a single-node query.</p>
<p>With nested sets, selecting the list of ancestors is much simpler: we should just select all records whose <code>sets</code> contain the <code>sets</code> of the node. This can be done using <code>MBRContains</code> (which is capable of using the <code>SPATIAL</code> index).</p>
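<p>For a single node, a minimal sketch of such a query could look like this. The node <code>id</code> is picked from the tag search results above purely for illustration, and the node itself is returned along with its ancestors:</p>
<pre class="brush: sql">
-- Every row whose box contains the box of the given node
-- is either that node or one of its ancestors.
SELECT  ha.id
FROM    t_hierarchy hc
JOIN    t_hierarchy ha
ON      MBRContains(ha.sets, hc.sets)
WHERE   hc.id = 161378;
</pre>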
<p>The query, however, will return us the ancestors in a plain list. To find out the level of each ancestor, we should put in some more effort. Since the sets are nested and the <code>lft</code> and <code>rgt</code> fields naturally maintain the hierarchical order, it is enough just to enumerate the ancestors in that order. It would be very simple to do if only <strong>MySQL</strong> supported <code>ROW_NUMBER()</code>. But it doesn&#8217;t, of course, so to enumerate the ancestors we should self-join them and just count the number of each ancestor&#8217;s ancestors:</p>
<pre class="brush: sql">
SELECT  hc.id, ha.id, COUNT(*) AS cnt
FROM    t_hierarchy hc
STRAIGHT_JOIN
        t_hierarchy ha
ON      MBRContains(ha.sets, hc.sets)
STRAIGHT_JOIN
        t_hierarchy hcnt
ON      MBRContains(hcnt.sets, ha.sets)
WHERE   MATCH(hc.tags) AGAINST(&#039;+tag13480&#039; IN BOOLEAN MODE)
GROUP BY
        hc.id, ha.id
ORDER BY
        hc.id, cnt
</pre>
<p><a href="#" onclick="xcollapse('X8308');return false;"><strong>View query results</strong></a><br />
</p>
<div id="X8308" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>id</th>
<th>cnt</th>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">3</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">161378</td>
<td class="integer">1</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">161378</td>
<td class="integer">10</td>
<td class="bigint">2</td>
</tr>
<tr>
<td class="integer">161378</td>
<td class="integer">51</td>
<td class="bigint">3</td>
</tr>
<tr>
<td class="integer">161378</td>
<td class="integer">257</td>
<td class="bigint">4</td>
</tr>
<tr>
<td class="integer">161378</td>
<td class="integer">1290</td>
<td class="bigint">5</td>
</tr>
<tr>
<td class="integer">161378</td>
<td class="integer">6454</td>
<td class="bigint">6</td>
</tr>
<tr>
<td class="integer">161378</td>
<td class="integer">32275</td>
<td class="bigint">7</td>
</tr>
<tr>
<td class="integer">161378</td>
<td class="integer">161378</td>
<td class="bigint">8</td>
</tr>
<tr>
<td class="integer">212847</td>
<td class="integer">2</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">212847</td>
<td class="integer">13</td>
<td class="bigint">2</td>
</tr>
<tr>
<td class="integer">212847</td>
<td class="integer">67</td>
<td class="bigint">3</td>
</tr>
<tr>
<td class="integer">212847</td>
<td class="integer">340</td>
<td class="bigint">4</td>
</tr>
<tr>
<td class="integer">212847</td>
<td class="integer">1702</td>
<td class="bigint">5</td>
</tr>
<tr>
<td class="integer">212847</td>
<td class="integer">8513</td>
<td class="bigint">6</td>
</tr>
<tr>
<td class="integer">212847</td>
<td class="integer">42569</td>
<td class="bigint">7</td>
</tr>
<tr>
<td class="integer">212847</td>
<td class="integer">212847</td>
<td class="bigint">8</td>
</tr>
<tr>
<td class="integer">324394</td>
<td class="integer">3</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">324394</td>
<td class="integer">20</td>
<td class="bigint">2</td>
</tr>
<tr>
<td class="integer">324394</td>
<td class="integer">103</td>
<td class="bigint">3</td>
</tr>
<tr>
<td class="integer">324394</td>
<td class="integer">518</td>
<td class="bigint">4</td>
</tr>
<tr>
<td class="integer">324394</td>
<td class="integer">2594</td>
<td class="bigint">5</td>
</tr>
<tr>
<td class="integer">324394</td>
<td class="integer">12975</td>
<td class="bigint">6</td>
</tr>
<tr>
<td class="integer">324394</td>
<td class="integer">64878</td>
<td class="bigint">7</td>
</tr>
<tr>
<td class="integer">324394</td>
<td class="integer">324394</td>
<td class="bigint">8</td>
</tr>
<tr>
<td class="integer">400074</td>
<td class="integer">4</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">400074</td>
<td class="integer">25</td>
<td class="bigint">2</td>
</tr>
<tr>
<td class="integer">400074</td>
<td class="integer">127</td>
<td class="bigint">3</td>
</tr>
<tr>
<td class="integer">400074</td>
<td class="integer">639</td>
<td class="bigint">4</td>
</tr>
<tr>
<td class="integer">400074</td>
<td class="integer">3200</td>
<td class="bigint">5</td>
</tr>
<tr>
<td class="integer">400074</td>
<td class="integer">16002</td>
<td class="bigint">6</td>
</tr>
<tr>
<td class="integer">400074</td>
<td class="integer">80014</td>
<td class="bigint">7</td>
</tr>
<tr>
<td class="integer">400074</td>
<td class="integer">400074</td>
<td class="bigint">8</td>
</tr>
<tr>
<td class="integer">440166</td>
<td class="integer">5</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">440166</td>
<td class="integer">27</td>
<td class="bigint">2</td>
</tr>
<tr>
<td class="integer">440166</td>
<td class="integer">140</td>
<td class="bigint">3</td>
</tr>
<tr>
<td class="integer">440166</td>
<td class="integer">704</td>
<td class="bigint">4</td>
</tr>
<tr>
<td class="integer">440166</td>
<td class="integer">3521</td>
<td class="bigint">5</td>
</tr>
<tr>
<td class="integer">440166</td>
<td class="integer">17606</td>
<td class="bigint">6</td>
</tr>
<tr>
<td class="integer">440166</td>
<td class="integer">88033</td>
<td class="bigint">7</td>
</tr>
<tr>
<td class="integer">440166</td>
<td class="integer">440166</td>
<td class="bigint">8</td>
</tr>
<tr class="statusbar">
<td colspan="100">41 rows fetched in 0.0036s (0.0258s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">hc</td>
<td class="varchar">fulltext</td>
<td class="varchar">sx_hierarchy_sets,fx_hierarchy_tags</td>
<td class="varchar">fx_hierarchy_tags</td>
<td class="varchar">0</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">ha</td>
<td class="varchar">ALL</td>
<td class="varchar">sx_hierarchy_sets</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">488280</td>
<td class="double">100.00</td>
<td class="varchar">Range checked for each record (index map: 0x10)</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">hcnt</td>
<td class="varchar">ALL</td>
<td class="varchar">sx_hierarchy_sets</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">488280</td>
<td class="double">100.00</td>
<td class="varchar">Range checked for each record (index map: 0x10)</td>
</tr>
</table>
</div>
<pre>
select `20100216_index`.`hc`.`id` AS `id`,`20100216_index`.`ha`.`id` AS `id`,count(0) AS `cnt` from `20100216_index`.`t_hierarchy` `hc` straight_join `20100216_index`.`t_hierarchy` `ha` straight_join `20100216_index`.`t_hierarchy` `hcnt` where ((match `20100216_index`.`hc`.`tags` against (&#39;+tag13480&#39; in boolean mode)) and contains(`20100216_index`.`hcnt`.`sets`,`20100216_index`.`ha`.`sets`) and contains(`20100216_index`.`ha`.`sets`,`20100216_index`.`hc`.`sets`)) group by `20100216_index`.`hc`.`id`,`20100216_index`.`ha`.`id` order by `20100216_index`.`hc`.`id`,count(0)
</pre>
</div>
<p>The algorithm that calculates the level of each ancestor is not very efficient. However, it does its job quite well, and the query completes in only <strong>26 ms</strong>.</p>
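<p>For comparison, on a server that does provide window functions (<strong>MySQL 8.0</strong> and later, which is an assumption beyond the setup used in this article), the second self-join could be replaced with <code>ROW_NUMBER()</code>. This is a sketch only and was not part of the measurements above; the column aliases are made up for the example:</p>
<pre class="brush: sql">
-- Ancestors of a node have smaller lft values the closer they are to the root,
-- so numbering them by lft within each node yields the level directly.
SELECT  hc.id AS node_id,
        ha.id AS ancestor_id,
        ROW_NUMBER() OVER (PARTITION BY hc.id ORDER BY ha.lft) AS lvl
FROM    t_hierarchy hc
JOIN    t_hierarchy ha
ON      MBRContains(ha.sets, hc.sets)
WHERE   MATCH(hc.tags) AGAINST(&#039;+tag13480&#039; IN BOOLEAN MODE)
ORDER BY
        node_id, lvl;
</pre>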
<h3>Selecting immediate children using BTREE</h3>
<p>Selecting the immediate children is the easiest task: we just need an equijoin on <code>parent</code>. The level of each of the node&#8217;s children will be that of the node plus <strong>1</strong>, so calculating it is quite simple too:</p>
<pre class="brush: sql">
SELECT  h.id, hc.id, cnt + 1
FROM    (
        SELECT  hn.id, COUNT(*) AS cnt
        FROM    t_hierarchy hn
        STRAIGHT_JOIN
                t_hierarchy hcnt
        ON      MBRContains(hcnt.sets, hn.sets)
        WHERE   MATCH(hn.tags) AGAINST(&#039;+tag13480&#039; IN BOOLEAN MODE)
        GROUP BY
                hn.id
        ) h
JOIN    t_hierarchy hc
ON      hc.parent = h.id
</pre>
<p><a href="#" onclick="xcollapse('X9305');return false;"><strong>View query results</strong></a><br />
</p>
<div id="X9305" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>id</th>
<th>cnt + 1</th>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">16</td>
<td class="bigint">2</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">17</td>
<td class="bigint">2</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">18</td>
<td class="bigint">2</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">19</td>
<td class="bigint">2</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">20</td>
<td class="bigint">2</td>
</tr>
<tr class="statusbar">
<td colspan="100">5 rows fetched in 0.0006s (0.0093s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">6</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">hc</td>
<td class="varchar">ref</td>
<td class="varchar">ix_hierarchy_parent</td>
<td class="varchar">ix_hierarchy_parent</td>
<td class="varchar">4</td>
<td class="varchar">h.id</td>
<td class="bigint">9207</td>
<td class="double">100.01</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">hn</td>
<td class="varchar">fulltext</td>
<td class="varchar">sx_hierarchy_sets,fx_hierarchy_tags</td>
<td class="varchar">fx_hierarchy_tags</td>
<td class="varchar">0</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">hcnt</td>
<td class="varchar">range</td>
<td class="varchar">sx_hierarchy_sets</td>
<td class="varchar">sx_hierarchy_sets</td>
<td class="varchar">34</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">48828000.00</td>
<td class="varchar">Range checked for each record (index map: 0x10)</td>
</tr>
</table>
</div>
<pre>
select `h`.`id` AS `id`,`20100216_index`.`hc`.`id` AS `id`,(`h`.`cnt` + 1) AS `cnt + 1` from (select `20100216_index`.`hn`.`id` AS `id`,count(0) AS `cnt` from `20100216_index`.`t_hierarchy` `hn` straight_join `20100216_index`.`t_hierarchy` `hcnt` where ((match `20100216_index`.`hn`.`tags` against (&#39;+tag13480&#39; in boolean mode)) and contains(`20100216_index`.`hcnt`.`sets`,`20100216_index`.`hn`.`sets`)) group by `20100216_index`.`hn`.`id`) `h` join `20100216_index`.`t_hierarchy` `hc` where (`20100216_index`.`hc`.`parent` = `h`.`id`)
</pre>
</div>
<h3>Putting it together</h3>
<p>Now we just need to combine the two queries and format the output nicely:</p>
<pre class="brush: sql">
SELECT  h.id,
        CONCAT(LPAD(&#039;&#039;, (level - 1) * 2, &#039; &#039;), h.data) AS name,
        CASE WHEN hq.id = hq.node THEN &#039;*&#039; ELSE &#039;&#039; END AS hit
FROM    (
        SELECT  hc.id AS node, ha.id AS id, COUNT(*) AS level
        FROM    t_hierarchy hc
        STRAIGHT_JOIN
                t_hierarchy ha
        ON      MBRContains(ha.sets, hc.sets)
        STRAIGHT_JOIN
                t_hierarchy hcnt
        ON      MBRContains(hcnt.sets, ha.sets)
        WHERE   MATCH(hc.tags) AGAINST(&#039;+tag13480&#039; IN BOOLEAN MODE)
        GROUP BY
                hc.id, ha.id
        UNION ALL
        SELECT  h.id, hc.id, cnt + 1 AS level
        FROM    (
                SELECT  hn.id, COUNT(*) AS cnt
                FROM    t_hierarchy hn
                STRAIGHT_JOIN
                        t_hierarchy hcnt
                ON      MBRContains(hcnt.sets, hn.sets)
                WHERE   MATCH(hn.tags) AGAINST(&#039;+tag13480&#039; IN BOOLEAN MODE)
                GROUP BY
                        hn.id
                ) h
        JOIN    t_hierarchy hc
        ON      hc.parent = h.id
        ) hq
JOIN    t_hierarchy h
ON      h.id = hq.id
ORDER BY
        node, level, lft
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>name</th>
<th>hit</th>
</tr>
<tr>
<td class="integer">3</td>
<td class="blob">Value 3</td>
<td class="varchar">*</td>
</tr>
<tr>
<td class="integer">16</td>
<td class="blob">  Value 16</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">17</td>
<td class="blob">  Value 17</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">18</td>
<td class="blob">  Value 18</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">19</td>
<td class="blob">  Value 19</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">20</td>
<td class="blob">  Value 20</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">1</td>
<td class="blob">Value 1</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">10</td>
<td class="blob">  Value 10</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">51</td>
<td class="blob">    Value 51</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">257</td>
<td class="blob">      Value 257</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">1290</td>
<td class="blob">        Value 1290</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">6454</td>
<td class="blob">          Value 6454</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">32275</td>
<td class="blob">            Value 32275</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">161378</td>
<td class="blob">              Value 161378</td>
<td class="varchar">*</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="blob">Value 2</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">13</td>
<td class="blob">  Value 13</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">67</td>
<td class="blob">    Value 67</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">340</td>
<td class="blob">      Value 340</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">1702</td>
<td class="blob">        Value 1702</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">8513</td>
<td class="blob">          Value 8513</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">42569</td>
<td class="blob">            Value 42569</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">212847</td>
<td class="blob">              Value 212847</td>
<td class="varchar">*</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="blob">Value 3</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">20</td>
<td class="blob">  Value 20</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">103</td>
<td class="blob">    Value 103</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">518</td>
<td class="blob">      Value 518</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">2594</td>
<td class="blob">        Value 2594</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">12975</td>
<td class="blob">          Value 12975</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">64878</td>
<td class="blob">            Value 64878</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">324394</td>
<td class="blob">              Value 324394</td>
<td class="varchar">*</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="blob">Value 4</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">25</td>
<td class="blob">  Value 25</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">127</td>
<td class="blob">    Value 127</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">639</td>
<td class="blob">      Value 639</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">3200</td>
<td class="blob">        Value 3200</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">16002</td>
<td class="blob">          Value 16002</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">80014</td>
<td class="blob">            Value 80014</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">400074</td>
<td class="blob">              Value 400074</td>
<td class="varchar">*</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="blob">Value 5</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">27</td>
<td class="blob">  Value 27</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">140</td>
<td class="blob">    Value 140</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">704</td>
<td class="blob">      Value 704</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">3521</td>
<td class="blob">        Value 3521</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">17606</td>
<td class="blob">          Value 17606</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">88033</td>
<td class="blob">            Value 88033</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="integer">440166</td>
<td class="blob">              Value 440166</td>
<td class="varchar">*</td>
</tr>
<tr class="statusbar">
<td colspan="100">46 rows fetched in 0.0039s (0.0466s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">46</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">h</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">hq.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">hc</td>
<td class="varchar">fulltext</td>
<td class="varchar">sx_hierarchy_sets,fx_hierarchy_tags</td>
<td class="varchar">fx_hierarchy_tags</td>
<td class="varchar">0</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">ha</td>
<td class="varchar">range</td>
<td class="varchar">sx_hierarchy_sets</td>
<td class="varchar">sx_hierarchy_sets</td>
<td class="varchar">34</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">48828000.00</td>
<td class="varchar">Range checked for each record (index map: 0x10)</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">hcnt</td>
<td class="varchar">range</td>
<td class="varchar">sx_hierarchy_sets</td>
<td class="varchar">sx_hierarchy_sets</td>
<td class="varchar">34</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">48828000.00</td>
<td class="varchar">Range checked for each record (index map: 0x10)</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">UNION</td>
<td class="varchar">&lt;derived4&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">6</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">UNION</td>
<td class="varchar">hc</td>
<td class="varchar">ref</td>
<td class="varchar">ix_hierarchy_parent</td>
<td class="varchar">ix_hierarchy_parent</td>
<td class="varchar">4</td>
<td class="varchar">h.id</td>
<td class="bigint">9207</td>
<td class="double">100.01</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DERIVED</td>
<td class="varchar">hn</td>
<td class="varchar">fulltext</td>
<td class="varchar">sx_hierarchy_sets,fx_hierarchy_tags</td>
<td class="varchar">fx_hierarchy_tags</td>
<td class="varchar">0</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DERIVED</td>
<td class="varchar">hcnt</td>
<td class="varchar">range</td>
<td class="varchar">sx_hierarchy_sets</td>
<td class="varchar">sx_hierarchy_sets</td>
<td class="varchar">34</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">48828000.00</td>
<td class="varchar">Range checked for each record (index map: 0x10)</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union2,3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select `20100216_index`.`h`.`id` AS `id`,concat(convert(lpad(&#39;&#39;,((`hq`.`level` - 1) * 2),&#39; &#39;) using utf8),`20100216_index`.`h`.`data`) AS `name`,(case when (`hq`.`id` = `hq`.`node`) then &#39;*&#39; else &#39;&#39; end) AS `hit` from (select `20100216_index`.`hc`.`id` AS `node`,`20100216_index`.`ha`.`id` AS `id`,count(0) AS `level` from `20100216_index`.`t_hierarchy` `hc` straight_join `20100216_index`.`t_hierarchy` `ha` straight_join `20100216_index`.`t_hierarchy` `hcnt` where (match `20100216_index`.`hc`.`tags` against (&#39;+tag13480&#39; in boolean mode)) group by `20100216_index`.`hc`.`id`,`20100216_index`.`ha`.`id` union all select `h`.`id` AS `id`,`20100216_index`.`hc`.`id` AS `id`,(`h`.`cnt` + 1) AS `level` from (select `20100216_index`.`hn`.`id` AS `id`,count(0) AS `cnt` from `20100216_index`.`t_hierarchy` `hn` straight_join `20100216_index`.`t_hierarchy` `hcnt` where ((match `20100216_index`.`hn`.`tags` against (&#39;+tag13480&#39; in boolean mode)) and contains(`20100216_index`.`hcnt`.`sets`,`20100216_index`.`hn`.`sets`)) group by `20100216_index`.`hn`.`id`) `h` join `20100216_index`.`t_hierarchy` `hc`) `hq` join `20100216_index`.`t_hierarchy` `h` where (`20100216_index`.`h`.`id` = `hq`.`id`) order by `hq`.`node`,`hq`.`level`,`20100216_index`.`h`.`lft`
</pre>
<p>For each matching node, the query returns the node itself, its ancestors and its immediate children. The matching nodes are marked with an asterisk in the <code>hit</code> field.</p>
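<p>The logic of the combined query can be sketched outside the database. The snippet below is only an illustration under stated assumptions: the five-node tree, the <code>NODES</code> dictionary and the function names are hypothetical, and plain <code>lft</code>/<code>rgt</code> comparisons stand in for <code>MBRContains</code>. As in the query, the level of a node is the number of nodes whose range contains it (the node itself included), and the children get the level of the hit plus one.</p>

```python
# Hypothetical nested-set tree: id -> (parent, lft, rgt, tagged).
# Illustrative data only; it is not taken from t_hierarchy.
NODES = {
    1: (None, 1, 10, False),
    2: (1, 2, 7, True),   # the only tagged node (the "hit")
    3: (2, 3, 4, False),
    4: (2, 5, 6, False),
    5: (1, 8, 9, False),
}

def contains(outer, inner):
    """True if the [lft, rgt] range of outer contains that of inner (a node contains itself)."""
    _, outer_lft, outer_rgt, _ = NODES[outer]
    _, inner_lft, inner_rgt, _ = NODES[inner]
    return inner_lft >= outer_lft and outer_rgt >= inner_rgt

def level(node):
    """Depth of a node: how many nodes contain it, itself included (COUNT(*) in the query)."""
    return sum(1 for other in NODES if contains(other, node))

def hit_with_context(hit):
    """Rows (id, level, is_hit) for the hit, its ancestors and its immediate children."""
    rows = [(a, level(a), a == hit) for a in NODES if contains(a, hit)]
    rows += [(c, level(hit) + 1, False) for c, (p, *_) in NODES.items() if p == hit]
    # Same ordering idea as ORDER BY level, lft within one hit
    return sorted(rows, key=lambda row: (row[1], NODES[row[0]][1]))

for node_id, depth, is_hit in hit_with_context(2):
    print("  " * (depth - 1) + "Value %d" % node_id, "*" if is_hit else "")
```

<p>The sketch scans the whole dictionary for every check, which is exactly the work the <code>SPATIAL</code> and <code>BTREE</code> indexes save the real query.</p>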
<p>The query efficiently combines the <code>FULLTEXT</code>, <code>SPATIAL</code> and <code>BTREE</code> indexes and completes in less than <strong>50 ms</strong>.</p>
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/02/16/tags-in-nested-sets-efficient-indexing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Index search time depends on the value being searched</title>
		<link>http://explainextended.com/2010/02/04/index-search-time-depends-on-the-value-being-searched/</link>
		<comments>http://explainextended.com/2010/02/04/index-search-time-depends-on-the-value-being-searched/#comments</comments>
		<pubDate>Thu, 04 Feb 2010 20:00:58 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4183</guid>
		<description><![CDATA[Answering questions asked on the site.
Daniel asks:
I have a table which stores track titles in a VARCHAR(200) field. The field is indexed, but searching for titles beginning with a letter Z is noticeably slower than for those beginning with A, and the closer the initial letter is to Z, the slower is the query.
My understanding [...]]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Daniel</strong> asks:</p>
<blockquote><p>I have a table which stores track titles in a <code>VARCHAR(200)</code> field. The field is indexed, but searching for titles beginning with a letter <strong>Z</strong> is noticeably slower than for those beginning with <strong>A</strong>, and the closer the initial letter is to <strong>Z</strong>, the slower is the query.</p>
<p>My understanding is that a full table scan occurs, but <code>EXPLAIN</code> shows that the index is used. Besides, the table is quite large but the query is still reasonably fast.</p>
<p>Could you please explain this behavior?</p></blockquote>
<p><strong>MySQL</strong> stores its indexes in <a href="http://en.wikipedia.org/wiki/B-Tree"><strong>B-Tree</strong></a> data structures.</p>
<p>The Wikipedia article linked above explains the structure quite well, so I won&#8217;t repeat it. Instead, I&#8217;ll draw a picture similar to the one in the article:</p>
<p><img src="http://explainextended.com/wp-content/uploads/2010/02/index.png" alt="" title="B-Tree" width="600" height="120" class="aligncenter size-full wp-image-4207 noborder" /></p>
<p>The picture is fairly self-explanatory. The records are kept in sorted order, so to search for a certain value, say <strong>11</strong>, you start from the first page and find the link to follow. To do this, you need to find the pair of adjacent values that are less than and greater than <strong>11</strong>. In this case, you follow the link between <strong>8</strong> and <strong>12</strong>. Then you search for the next pair, and so on, until you either find your value or reach the end and know that the value is not there.</p>
<p>Following the links is quite simple, but how does the engine search for the values within one page?</p>
<p>This depends on how you declared the table.</p>
<p><strong>MyISAM</strong> supports two algorithms for storing the index keys in a page: <em>packed keys</em> and <em>unpacked keys</em>.<br />
<span id="more-4183"></span></p>
<p>Unpacked keys are what you see in the picture above: each page just stores the key values and the links to the pages further down the tree. This is very simple.</p>
<p>Packed keys are designed to improve performance on character data. Many words and phrases, especially those that are close to each other, start with the same sequence of characters.</p>
<p>If you are going to store track names like:</p>
<table class="excel">
<tr>
<td>The Man Who Sold The World</td>
</tr>
<tr>
<td>The Man Who Invented Himself</td>
</tr>
<tr>
<td>The Man Who Has Everything</td>
</tr>
</table>
<p>then <strong>MyISAM</strong> can save storage space by storing them like this:</p>
<table class="excel">
<tr>
<td><strong>(The Man Who)</strong> Sold The World</td>
</tr>
<tr>
<td><strong>(&times;11)</strong> Invented Himself</td>
</tr>
<tr>
<td><strong>(&times;11)</strong> Has Everything</td>
</tr>
</table>
<p>This is called key compression: instead of repeating the full key value for each record, <strong>MyISAM</strong> stores the longest common prefix once and, for each subsequent record that shares it, stores only the length of the prefix followed by the rest of the key. This makes the keys shorter and the index more compact.</p>
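<p>The idea can be sketched in a few lines. This is a simplified illustration of prefix compression in general, not <strong>MyISAM</strong>&#8217;s actual on-disk format: the function names are hypothetical, and each key is compressed against the key right before it in sorted order.</p>

```python
import os

def pack_keys(sorted_keys):
    """Replace the prefix each key shares with the previous key by its length."""
    packed, previous = [], ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([previous, key]))
        packed.append((shared, key[shared:]))
        previous = key
    return packed

def unpack_keys(packed):
    """Rebuild the full keys; each one depends on the key decoded just before it."""
    keys, previous = [], ""
    for shared, suffix in packed:
        previous = previous[:shared] + suffix
        keys.append(previous)
    return keys

# An index keeps its keys sorted, so the titles are sorted first
titles = sorted([
    "The Man Who Sold The World",
    "The Man Who Invented Himself",
    "The Man Who Has Everything",
])
packed = pack_keys(titles)
for shared, suffix in packed:
    print(shared, repr(suffix))
```

<p>The sketch counts the trailing space as part of the shared prefix, so it reports <strong>12</strong> shared characters rather than the <strong>11</strong> shown in the table. Note that <code>unpack_keys</code> has to walk the records in order: this is why a packed page cannot be searched by jumping into the middle of it.</p>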
<p>However, this affects the index search time.</p>
<p>With an unpacked index, a binary search is used to find the keys within each page.</p>
<p>With a packed index, this won&#8217;t work: you need to know the value of two keys to compare them and not every record contains full information about the key.</p>
<p>So with a packed index, <strong>MySQL</strong> remembers the value of the prefix and iterates over the records one by one. This is less efficient than a binary search, but because many more records fit on one page, the number of page traversals is kept to a minimum and overall efficiency increases.</p>
<p>But the records on one page still need to be compared and searched. And with a linear search, keys with smaller values tend to require fewer iterations than keys with greater values.</p>
<p>Let&#8217;s look at the picture above again. The keys are searched left to right.</p>
<p>To search for key <strong>1</strong>, we only need two comparisons: compare to <strong>4</strong>, get the next page, compare to <strong>1</strong>.</p>
<p>But to search for key <strong>15</strong>, we need to compare to <strong>4</strong>, <strong>8</strong>, <strong>12</strong> then get to the next page and compare to <strong>13</strong>, <strong>14</strong> and finally to <strong>15</strong>.</p>
<p>That is <strong>6</strong> comparisons versus the <strong>2</strong> required to fetch the first key.</p>
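<p>The same count can be reproduced with a small sketch. The page layout below (a root page with <strong>4</strong>, <strong>8</strong>, <strong>12</strong> and four leaf pages) is an assumption inferred from the description of the picture, and the function names are hypothetical. Each page is scanned left to right, as with packed keys.</p>

```python
# Assumed layout of the pictured tree: one root page and four leaf pages
ROOT = [4, 8, 12]
LEAVES = [[1, 2, 3], [5, 6, 7], [9, 10, 11], [13, 14, 15]]

def linear_scan(page, key):
    """Scan a page left to right.

    Returns (number of comparisons made, index of the first entry >= key).
    If every entry is smaller than the key, the index equals len(page).
    """
    for position, entry in enumerate(page):
        if entry >= key:
            return position + 1, position
    return len(page), len(page)

def search_cost(key):
    """Total comparisons needed to reach a key stored in one of the leaf pages."""
    root_comparisons, child = linear_scan(ROOT, key)
    leaf_comparisons, _ = linear_scan(LEAVES[child], key)
    return root_comparisons + leaf_comparisons

print(search_cost(1))   # compare to 4, then to 1
print(search_cost(15))  # compare to 4, 8, 12, then to 13, 14, 15
```

<p>With a linear scan, the cost grows with the position of the key inside each page, which is the effect measured below.</p>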
<p>Now, let&#8217;s create the sample tables and see some figures:</p>
<p><a href="#" onclick="xcollapse('X9531');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X9531" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_source (
        id INT NOT NULL PRIMARY KEY
) ENGINE=MyISAM;

CREATE TABLE t_packed (
        id INT NOT NULL PRIMARY KEY,
        name CHAR(6) NOT NULL,
        KEY ix_packed_name (name)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 PACK_KEYS=1;

CREATE TABLE t_unpacked (
        id INT NOT NULL PRIMARY KEY,
        name CHAR(6) NOT NULL,
        KEY ix_unpacked_name (name)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 PACK_KEYS=0;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(50000);
COMMIT;

INSERT
INTO    t_source
SELECT  id
FROM    filler;

INSERT
INTO    t_packed
SELECT  id,
        (
        SELECT  GROUP_CONCAT(CHAR(65 + FLOOR(RAND(20100204) * 26)) SEPARATOR &#039;&#039;)
        FROM    (
                SELECT  NULL
                UNION ALL
                SELECT  NULL
                UNION ALL
                SELECT  NULL
                UNION ALL
                SELECT  NULL
                UNION ALL
                SELECT  NULL
                UNION ALL
                SELECT  NULL
                ) q
        )
FROM    filler;

INSERT
INTO    t_unpacked
SELECT  *
FROM    t_packed;
</pre>
</div>
<p>There are two <strong>MyISAM</strong> tables with randomly generated character sequences like this:</p>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>name</th>
</tr>
<tr>
<td class="integer">1</td>
<td class="char">RTUPPH</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="char">RKZQJW</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="char">FKMEKL</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="char">BYOZFE</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="char">GSTRAF</td>
</tr>
<tr>
<td class="integer">6</td>
<td class="char">YBNMSG</td>
</tr>
<tr>
<td class="integer">7</td>
<td class="char">ZEZKCE</td>
</tr>
<tr>
<td class="integer">8</td>
<td class="char">PMPNUJ</td>
</tr>
<tr>
<td class="integer">9</td>
<td class="char">OQMMYH</td>
</tr>
<tr>
<td class="integer">10</td>
<td class="char">OYAFDZ</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">49999</td>
<td class="char">DZMKRC</td>
</tr>
<tr>
<td class="integer">50000</td>
<td class="char">NHYWLR</td>
</tr>
</table>
</div>
<p>The structure of the tables is identical, except that <code>t_packed</code> packs keys and <code>t_unpacked</code> does not.</p>
<h3>Unpacked keys</h3>
<hr/>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
        t_unpacked p
ON      p.name = IF(s.id &gt; -1, _UTF8&#039;ABCDEF&#039;, NULL)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">50000</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.7656s)</td>
</tr>
</table>
</div>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
        t_unpacked p
ON      p.name = IF(s.id &gt; -1, _UTF8&#039;GHIJKL&#039;, NULL)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">50000</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.7331s)</td>
</tr>
</table>
</div>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
        t_unpacked p
ON      p.name = IF(s.id &gt; -1, _UTF8&#039;NOPQRS&#039;, NULL)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">50000</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.7656s)</td>
</tr>
</table>
</div>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
        t_unpacked p
ON      p.name = IF(s.id &gt; -1, _UTF8&#039;ZYXWVU&#039;, NULL)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">50000</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.7712s)</td>
</tr>
</table>
</div>
<p>The queries against the strings beginning with <strong>A</strong>, <strong>G</strong>, <strong>N</strong> and <strong>Z</strong> take practically the same time. All queries have been run several times to populate the cache, and the execution times are consistent.</p>
<p>The <code>LEFT JOIN</code> against a non-existent value was used in the query to avoid stopping on a found key and make the query traverse the index as much as possible. We also put a formal dependency on <code>s.id</code> here so that <code>t_source</code> is always leading in the join and no <code>const</code> optimizations are performed.</p>
<h3>Packed keys</h3>
<hr/>
<p>Let&#8217;s try the same queries on packed keys:</p>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
        t_packed p
ON      p.name = IF(s.id &gt; -1, _UTF8&#039;ABCDEF&#039;, NULL)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">50000</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (1.9531s)</td>
</tr>
</table>
</div>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
        t_packed p
ON      p.name = IF(s.id &gt; -1, _UTF8&#039;GHIJKL&#039;, NULL)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">50000</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (2.3593s)</td>
</tr>
</table>
</div>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
        t_packed p
ON      p.name = IF(s.id &gt; -1, _UTF8&#039;NOPQRS&#039;, NULL)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">50000</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0002s (2.7812s)</td>
</tr>
</table>
</div>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    t_source s
LEFT JOIN
        t_packed p
ON      p.name = IF(s.id &gt; -1, _UTF8&#039;ZYXWVU&#039;, NULL)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">50000</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0002s (2.9375s)</td>
</tr>
</table>
</div>
<p>The query against the string beginning with the letter <strong>Z</strong> takes more than <strong>50%</strong> more time than the query against a string beginning with <strong>A</strong>.</p>
<h3>Summary</h3>
<hr/>
<p>Here&#8217;s a little summary table:</p>
<table class="excel">
<tr>
<th>Search string</th>
<th>Unpacked key</th>
<th>Time, %</th>
<th>Packed key</th>
<th>Time, %</th>
</tr>
<tr>
<td><strong>ABCDEF</strong></td>
<td>0.7656</td>
<td class="double">100.00</td>
<td>1.9531</td>
<td class="double">100.00</td>
</tr>
<tr>
<td><strong>GHIJKL</strong></td>
<td>0.7331</td>
<td class="double">95.75</td>
<td>2.3593</td>
<td class="double">120.79</td>
</tr>
<tr>
<td><strong>NOPQRS</strong></td>
<td>0.7656</td>
<td class="double">100.00</td>
<td>2.7812</td>
<td class="double">142.39</td>
</tr>
<tr>
<td><strong>ZYXWVU</strong></td>
<td>0.7712</td>
<td class="double">100.73</td>
<td>2.9375</td>
<td class="double">150.40</td>
</tr>
</table>
<p>We see that the value being searched for does not affect time to search the index with unpacked keys but seriously affects performance of the indexes with packed keys.</p>
<p>This increase is due to the linear search used to locate the records within a single page.</p>
<p>Note that not every pair of adjacent records in a page (with either packed or unpacked keys) has a corresponding lower-level page containing intermediate values. There can be leaves and branches at the same depth in the tree.</p>
<p>This can lead to some artifacts: certain strings can be found (or proved absent) faster than others. However, on average, all records have the same depth. And on average, with packed indexes, the need for a linear search increases the time required to find keys with greater values.</p>
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/02/04/index-search-time-depends-on-the-value-being-searched/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Join on overlapping date ranges</title>
		<link>http://explainextended.com/2010/02/01/join-on-overlapping-date-ranges/</link>
		<comments>http://explainextended.com/2010/02/01/join-on-overlapping-date-ranges/#comments</comments>
		<pubDate>Mon, 01 Feb 2010 20:00:02 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4050</guid>
		<description><![CDATA[This post is inspired by a discussion with John Didion:

Is there any way to optimize the query for overlapping ranges in MySQL if both ranges are dynamic?
I have two tables, each with integer range columns (specified as LineString), and I want to find rows that overlap.
No matter what I try, the query planner never uses [...]]]></description>
			<content:encoded><![CDATA[<p>This post is inspired by a discussion with <strong>John Didion</strong>:</p>
<blockquote><p>
Is there any way to optimize the query for <a href="http://explainextended.com/2009/07/01/overlapping-ranges-mysql/">overlapping ranges in <strong>MySQL</strong></a> if both ranges are dynamic?</p>
<p>I have two tables, each with integer range columns (specified as <code>LineString</code>), and I want to find rows that overlap.</p>
<p>No matter what I try, the query planner never uses any indexes.
</p></blockquote>
<p>This question addresses the well-known problem of efficiently searching for intersecting intervals. Queries that deal with it need to be able to search for intervals (stored in two distinct columns) that contain a constant scalar value.</p>
<p>Plain <strong>B-Tree</strong> indexes used by most databases do not speed up queries like that. However, <strong>MySQL</strong> supports <code>SPATIAL</code> indexes that can index two-dimensional shapes and efficiently search for the shapes containing a given point.</p>
<p>With a little effort, time intervals can be converted into geometrical objects, indexed with a <code>SPATIAL</code> index and searched for a given point in time (also represented as a geometrical object). This is described in the article about <a href="http://explainextended.com/2009/07/01/overlapping-ranges-mysql/">overlapping ranges in <strong>MySQL</strong></a>.</p>
<p>The query described in that article, however, searches for the intervals overlapping a constant range, provided as a parameter to the query. Now, I will discuss how to adapt the query for a <code>JOIN</code> between two tables.</p>
<p>Let&#8217;s create two sample tables, each containing a set of time ranges stored as geometrical objects, and find all records from both tables whose ranges overlap:<br />
<span id="more-4050"></span><br />
<a href="#" onclick="xcollapse('X7565');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X7565" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_big (
        id INT NOT NULL PRIMARY KEY,
        rg LineString NOT NULL,
        SPATIAL KEY sx_big_rg (rg)
) ENGINE=MyISAM;

CREATE TABLE t_small (
        id INT NOT NULL PRIMARY KEY,
        rg LineString NOT NULL,
        SPATIAL KEY sx_small_rg (rg)
) ENGINE=MyISAM;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(120000);
COMMIT;

INSERT
INTO    t_big (id, rg)
SELECT  id,
        LineString(
        Point(-1, TIMESTAMPDIFF(second, &#039;1970-01-01&#039;, range_start)),
        Point(1, TIMESTAMPDIFF(second, &#039;1970-01-01&#039;, range_end))
        )
FROM    (
        SELECT  id,
                STR_TO_DATE(&#039;2009-02-01&#039;, &#039;%Y-%m-%d&#039;) - INTERVAL id HOUR AS range_start,
                STR_TO_DATE(&#039;2009-02-01&#039;, &#039;%Y-%m-%d&#039;) - INTERVAL id HOUR + INTERVAL RAND(20100201) * 7200 SECOND AS range_end
        FROM    filler
        ) q;

INSERT
INTO    t_small (id, rg)
SELECT  id,
        LineString(
        Point(-1, TIMESTAMPDIFF(second, &#039;1970-01-01&#039;, range_start)),
        Point(1, TIMESTAMPDIFF(second, &#039;1970-01-01&#039;, range_end))
        )
FROM    (
        SELECT  id,
                STR_TO_DATE(&#039;2009-02-01&#039;, &#039;%Y-%m-%d&#039;) - INTERVAL id DAY AS range_start,
                STR_TO_DATE(&#039;2009-02-01&#039;, &#039;%Y-%m-%d&#039;) - INTERVAL id DAY + INTERVAL RAND(20100201) * 172800 SECOND AS range_end
        FROM    filler
        ORDER BY
                id
        LIMIT 5000
        ) q
</pre>
</div>
<p>There are two tables, <code>t_big</code> and <code>t_small</code>.</p>
<p><code>t_big</code> contains <strong>120,000</strong> records with intervals starting each hour and having a random length from <code>0</code> to <code>2</code> hours, <strong>1</strong> hour on average.</p>
<p><code>t_small</code> contains <strong>5,000</strong> records with intervals starting each day and having a random length from <code>0</code> to <code>2</code> days, <strong>1</strong> day on average.</p>
<p>The intervals are represented as <code>LineString</code> records with a single line connecting <code>Point(-1, range_start)</code> and <code>Point(1, range_end)</code>. <code>range_start</code> and <code>range_end</code> are represented as UNIX timestamps (numbers of seconds since <strong>Jan 1st, 1970</strong>). To avoid the <a href="http://en.wikipedia.org/wiki/Year_2038_problem">year 2038 problem</a>, we use <code>TIMESTAMPDIFF</code> rather than <code>UNIX_TIMESTAMP</code> to calculate the value.</p>
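<p>As a minimal sketch of this conversion (the date literal below is a made-up value past 2038, not taken from the sample data), the number of seconds can be computed with <code>TIMESTAMPDIFF</code> and turned back into a datetime by adding it to the epoch:</p>
<pre class="brush: sql">
-- Hypothetical datetime, used only to illustrate the round trip
SELECT  TIMESTAMPDIFF(SECOND, &#039;1970-01-01&#039;, &#039;2040-01-01 12:00:00&#039;) AS seconds_since_epoch,
        &#039;1970-01-01&#039; + INTERVAL TIMESTAMPDIFF(SECOND, &#039;1970-01-01&#039;, &#039;2040-01-01 12:00:00&#039;) SECOND AS round_trip
</pre>
<p>The second column should give back the datetime we started with, which is the same arithmetic used later in this article to read the range bounds out of the <code>LineString</code> values.</p>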
<p>Each table is indexed with a <code>SPATIAL</code> index on the <code>rg</code> field.</p>
<p>First of all, we need to make sure that the index is built well.</p>
<p><strong>MySQL</strong> uses <a href="http://en.wikipedia.org/wiki/R-tree">R-Tree</a> for the <code>SPATIAL</code> indexes, which works according to the principle of <q>least enlargement</q> to the bounding box when adding a value to the index. Each leaf node is covered by a certain minimum bounding box stored in the branch node, and the branch which needs to be increased to the least extent is chosen as a new parent for the element.</p>
<p>Our ranges are one-dimensional, but <strong>MySQL</strong> requires the spatial data to be two-dimensional. That&#8217;s why we need to use one of the dimensions to represent the time ranges and make the other one span some constant range, so that the box areas are proportional to the range lengths.</p>
<p>However, if we just fill both ends of the <code>LineString</code> with a constant, this will make the lines purely horizontal or purely vertical. Since they all reside on one line, the minimal bounding box of any set of the <code>LineString</code>s will have zero area (since all of them have zero height). <strong>MySQL</strong> will not be able to choose the least enlargement: in a two-dimensional space, any enlargement along a one-dimensional line adds zero area. <strong>MySQL</strong> will just append the entry to an arbitrarily chosen branch in no particular order. This makes the index very unbalanced and in fact unusable.</p>
<p>To make the index usable, we need to make the boxes have positive area so that the least enlargement principle could do its job. That&#8217;s why we create the <code>LineString</code> fields diagonally, with the first coordinates being <strong>-1</strong> for the start point and <strong>1</strong> for the end point.</p>
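<p>To see the box that such a diagonal line produces, here is a small sketch with a hypothetical one-hour range starting at second <code>0</code> (the coordinates are illustrative only). <code>Envelope</code> and <code>AsText</code> are the function names of the <strong>MySQL 5.x</strong> series used throughout this article; later versions spell them <code>ST_Envelope</code> and <code>ST_AsText</code>:</p>
<pre class="brush: sql">
-- Bounding box of a diagonal line running from (-1, 0) to (1, 3600)
SELECT  AsText(Envelope(LineString(Point(-1, 0), Point(1, 3600)))) AS bounding_box
</pre>
<p>The box is <strong>2</strong> units wide and <strong>3600</strong> units high, so its area is proportional to the length of the range.</p>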
<p>Now, we should make a query that would find the intersections of these ranges.</p>
<p><strong>MySQL</strong> documentation <a href="http://dev.mysql.com/doc/refman/5.1/en/using-a-spatial-index.html">evasively mentions</a> that the predicates supported by the spatial indexes are <a href="http://dev.mysql.com/doc/refman/5.0/en/relations-on-geometry-mbr.html#function_mbrcontains"><code>MBRContains</code></a> and <a href="http://dev.mysql.com/doc/refman/5.0/en/relations-on-geometry-mbr.html#function_mbrwithin"><code>MBRWithin</code></a> (with the latter being just a synonym for <code>MBRContains</code> with the argument order reversed), with a side note that <q>in future releases, spatial indexes may also be used for optimizing other functions</q>.</p>
<p>If we were to comply with the documentation and limit ourselves to using only these two functions for the spatial indexes, the query would still be possible and quite efficient.</p>
<p><code>MBRContains</code> checks if the bounding box of the second argument lies within the bounding box of the first argument. This is not always true for overlapping ranges. If the ranges overlap only partially, that is, the first range both starts and ends earlier than the second range, then neither range lies completely within the other&#8217;s bounds, and <code>MBRContains</code> will return <strong>0</strong> for either argument order.</p>
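<p>A quick sketch of the partial overlap case, using two hypothetical ranges (seconds <code>0</code> to <code>10</code> and <code>5</code> to <code>15</code>) that are not part of the sample tables:</p>
<pre class="brush: sql">
-- a and b are made-up ranges that overlap between seconds 5 and 10
SELECT  MBRContains(a, b) AS a_contains_b,
        MBRContains(b, a) AS b_contains_a
FROM    (
        SELECT  LineString(Point(-1, 0), Point(1, 10)) AS a,
                LineString(Point(-1, 5), Point(1, 15)) AS b
        ) q
</pre>
<p>The ranges do overlap, yet neither box lies within the other, so both columns should come back as <strong>0</strong>.</p>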
<p>However, there is a simple condition that checks whether the ranges do overlap.</p>
<p>With any pair of ranges, there is a range that starts later than (or at the same time as) the other. If the ranges overlap, then the start of that range should lie within the bounds of its counterpart. Indeed, the start of the later range is equal to or greater than the start of the earlier range (by definition) and equal to or less than the end of the earlier range (or the ranges do not overlap).</p>
<p>This can be checked with a simple condition: <code>MBRContains(rg1, StartPoint(rg2))</code>.</p>
<p>Since there is no sargable condition that would find out which range starts earlier, we&#8217;ll just need to check both ways.</p>
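<p>Stripped of the geometry, the logic can be sketched on plain numbers. The three pairs of ranges below are hypothetical, and the second computed column is the textbook overlap test, added only for comparison:</p>
<pre class="brush: sql">
-- Each row holds two made-up ranges, [s1, e1] and [s2, e2]
SELECT  s1, e1, s2, e2,
        (s2 BETWEEN s1 AND e1) OR (s1 BETWEEN s2 AND e2) AS start_point_check,
        (s1 &lt;= e2 AND s2 &lt;= e1) AS classic_overlap_check
FROM    (
        SELECT  10 AS s1, 20 AS e1, 15 AS s2, 30 AS e2
        UNION ALL
        SELECT  10, 20, 25, 30
        UNION ALL
        SELECT  10, 40, 20, 30
        ) q
</pre>
<p>The two checks agree on every row: the first and the third pairs overlap, the second one does not. Each half of the <code>OR</code> corresponds to one of the two directions we have to check.</p>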
<p>Unfortunately, <strong>MySQL</strong> does not support merging spatial indexes, so an <code>OR</code> condition would make the indexes unusable. We need to use the two <code>MBRContains</code> predicates with two separate queries and merge the results using <code>UNION</code>.</p>
<p>When optimizing a join, <strong>MySQL</strong>&#8217;s optimizer usually makes the smaller table leading in the join. With a usual equijoin condition this is a wise thing to do, since <strong>MySQL</strong> uses nested loops with an index search to join the tables. The duration of a nested loops query is the product of the leading table scan time (the number of records) and the driven table search time. <strong>B-Tree</strong> search is logarithmic, so for tables of <code>m</code> and <code>n</code> records, <code>(m &times; log n)</code> is less than <code>(n &times; log m)</code> as long as <code>m &lt; n</code>. That&#8217;s why the leading table should be the smallest one.</p>
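<p>To put rough numbers on this, here is a back-of-the-envelope sketch that uses the sizes of our sample tables (<strong>5,000</strong> and <strong>120,000</strong> records) and a base-2 logarithm as a stand-in for the cost of one index search. This is an assumption made for illustration; the optimizer&#8217;s real cost model is more involved:</p>
<pre class="brush: sql">
-- Estimated number of steps: leading table rows times the cost of one index search
SELECT  5000 * LOG2(120000) AS small_table_leads,
        120000 * LOG2(5000) AS big_table_leads
</pre>
<p>The first value is roughly <strong>84 thousand</strong> and the second one roughly <strong>1.5 million</strong>, which is why the smaller table normally leads.</p>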
<p>However, <strong>MySQL</strong> does not take into account that the <code>MBRContains</code> predicate is not symmetrical. Reversing the table scan order turns each search operation from a search for the boxes containing a point into a search for the start points contained by a box. Since the individual points are not indexed, <strong>MySQL</strong> will not be able to use an index access path for such a query and will revert to a very inefficient <code>join buffer</code> (which in fact means comparing every record to every other record).</p>
<p>To work around this, we need to force the join order so that the table with the points leads.</p>
<p>In <strong>MySQL</strong>, this is done by replacing <code>JOIN</code> with <code>STRAIGHT_JOIN</code>, which forces the leftmost table to be leading in the join.</p>
<p>Here&#8217;s the query:</p>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    (
        SELECT  s.id AS sid, b.id AS bid
        FROM    t_small s
        STRAIGHT_JOIN
                t_big b
        ON      MBRContains(b.rg, StartPoint(s.rg))
        UNION
        SELECT  s.id AS sid, b.id AS bid
        FROM    t_big b
        STRAIGHT_JOIN
                t_small s
        ON      MBRContains(s.rg, StartPoint(b.rg))
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">124911</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (4.4999s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">Select tables optimized away</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">s</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">5000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">b</td>
<td class="varchar">range</td>
<td class="varchar">sx_big_rg</td>
<td class="varchar">sx_big_rg</td>
<td class="varchar">34</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">12000000.00</td>
<td class="varchar">Range checked for each record (index map: 0&#215;2)</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">UNION</td>
<td class="varchar">b</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">120000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">UNION</td>
<td class="varchar">s</td>
<td class="varchar">range</td>
<td class="varchar">sx_small_rg</td>
<td class="varchar">sx_small_rg</td>
<td class="varchar">34</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">500000.00</td>
<td class="varchar">Range checked for each record (index map: 0&#215;2)</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union2,3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select count(0) AS `COUNT(*)` from (select `20100201_ranges`.`s`.`id` AS `sid`,`20100201_ranges`.`b`.`id` AS `bid` from `20100201_ranges`.`t_small` `s` straight_join `20100201_ranges`.`t_big` `b` union select `20100201_ranges`.`s`.`id` AS `sid`,`20100201_ranges`.`b`.`id` AS `bid` from `20100201_ranges`.`t_big` `b` straight_join `20100201_ranges`.`t_small` `s`) `q`
</pre>
<p>This query finds the total number of pairs of overlapping ranges from both tables and completes in <strong>4.5 seconds</strong>, which is quite efficient.</p>
<h3>MBRIntersects: a more efficient solution</h3>
<p>Normally, this is where I would put the words <q>hope that helps</q> and a link to the question form, which is how I close the articles with answers.</p>
<p>But when testing the solution, <strong>John</strong> found out that the <code>MBRIntersects</code> function (which just checks the intersection of two boxes without splitting them into start and end points) is sargable too, so the query above is overcomplicated.</p>
<p>This function is symmetric, that is <code>MBRIntersects(a, b) &equiv; MBRIntersects(b, a)</code>. However, <strong>MySQL</strong> is not aware of that, and an index is used only against the column in the first argument to <code>MBRIntersects</code>. This means we have to make <code>t_small</code> lead in the join and put its column into the second argument of the function.</p>
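<p>The symmetry of the result (though not of the index usage) can be sketched on two hypothetical partially overlapping ranges, seconds <code>0</code> to <code>10</code> and <code>5</code> to <code>15</code>, which are not part of the sample tables:</p>
<pre class="brush: sql">
-- a and b are made-up ranges; only the argument order differs between the columns
SELECT  MBRIntersects(a, b) AS a_b,
        MBRIntersects(b, a) AS b_a
FROM    (
        SELECT  LineString(Point(-1, 0), Point(1, 10)) AS a,
                LineString(Point(-1, 5), Point(1, 15)) AS b
        ) q
</pre>
<p>Both columns should return <strong>1</strong>: the argument order changes which index can be used, not the answer.</p>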
<p>This way, the column from the driven table (<code>t_big</code>) will be searched for the value provided from the leading table using the index on the driven table, which is exactly what we need.</p>
<p>Here&#8217;s the updated query:</p>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    t_small s
STRAIGHT_JOIN
        t_big b
ON      MBRIntersects(b.rg, s.rg)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">124911</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (1.2656s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">s</td>
<td class="varchar">ALL</td>
<td class="varchar">sx_small_rg</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">5000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">b</td>
<td class="varchar">ALL</td>
<td class="varchar">sx_big_rg</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">120000</td>
<td class="double">100.00</td>
<td class="varchar">Range checked for each record (index map: 0&#215;2)</td>
</tr>
</table>
</div>
<pre>
select count(0) AS `COUNT(*)` from `20100201_ranges`.`t_small` `s` straight_join `20100201_ranges`.`t_big` `b` where intersects(`20100201_ranges`.`b`.`rg`,`20100201_ranges`.`s`.`rg`)
</pre>
<p>As we can see, this query is much simpler and completes more than <strong>3</strong> times as fast.</p>
<p>Finally, let&#8217;s see how we can retrieve the datetime values of the start and end of the ranges from the <code>LineString</code>s:</p>
<pre class="brush: sql">
SELECT  s.id AS small_id,
        &#039;1970-01-01&#039; + INTERVAL Y(StartPoint(s.rg)) SECOND AS small_start,
        &#039;1970-01-01&#039; + INTERVAL Y(EndPoint(s.rg)) SECOND AS small_end,
        b.id AS big_id,
        &#039;1970-01-01&#039; + INTERVAL Y(StartPoint(b.rg)) SECOND AS big_start,
        &#039;1970-01-01&#039; + INTERVAL Y(EndPoint(b.rg)) SECOND AS big_end
FROM    t_small s
STRAIGHT_JOIN
        t_big b
ON      MBRIntersects(b.rg, s.rg)
LIMIT 10
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>small_id</th>
<th>small_start</th>
<th>small_end</th>
<th>big_id</th>
<th>big_start</th>
<th>big_end</th>
</tr>
<tr>
<td class="integer">1</td>
<td class="char">2009-01-31 00:00:00</td>
<td class="char">2009-02-01 20:58:03</td>
<td class="integer">1</td>
<td class="char">2009-01-31 23:00:00</td>
<td class="char">2009-02-01 00:52:25</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="char">2009-01-31 00:00:00</td>
<td class="char">2009-02-01 20:58:03</td>
<td class="integer">2</td>
<td class="char">2009-01-31 22:00:00</td>
<td class="char">2009-01-31 22:01:55</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="char">2009-01-31 00:00:00</td>
<td class="char">2009-02-01 20:58:03</td>
<td class="integer">3</td>
<td class="char">2009-01-31 21:00:00</td>
<td class="char">2009-01-31 21:32:21</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="char">2009-01-31 00:00:00</td>
<td class="char">2009-02-01 20:58:03</td>
<td class="integer">4</td>
<td class="char">2009-01-31 20:00:00</td>
<td class="char">2009-01-31 20:36:01</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="char">2009-01-31 00:00:00</td>
<td class="char">2009-02-01 20:58:03</td>
<td class="integer">5</td>
<td class="char">2009-01-31 19:00:00</td>
<td class="char">2009-01-31 20:23:00</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="char">2009-01-31 00:00:00</td>
<td class="char">2009-02-01 20:58:03</td>
<td class="integer">6</td>
<td class="char">2009-01-31 18:00:00</td>
<td class="char">2009-01-31 19:06:56</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="char">2009-01-31 00:00:00</td>
<td class="char">2009-02-01 20:58:03</td>
<td class="integer">7</td>
<td class="char">2009-01-31 17:00:00</td>
<td class="char">2009-01-31 18:25:41</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="char">2009-01-31 00:00:00</td>
<td class="char">2009-02-01 20:58:03</td>
<td class="integer">8</td>
<td class="char">2009-01-31 16:00:00</td>
<td class="char">2009-01-31 17:47:38</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="char">2009-01-31 00:00:00</td>
<td class="char">2009-02-01 20:58:03</td>
<td class="integer">9</td>
<td class="char">2009-01-31 15:00:00</td>
<td class="char">2009-01-31 15:41:06</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="char">2009-01-31 00:00:00</td>
<td class="char">2009-02-01 20:58:03</td>
<td class="integer">10</td>
<td class="char">2009-01-31 14:00:00</td>
<td class="char">2009-01-31 14:02:38</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0006s (0.0024s)</td>
</tr>
</table>
</div>
<p>This retrieves the range bounds in plain <code>DATETIME</code> format, suitable for output and calculations.</p>
<p>Hope that helps, <strong>John</strong>, and thanks for a nice tip!</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/02/01/join-on-overlapping-date-ranges/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Aggregates: subqueries vs. GROUP BY</title>
		<link>http://explainextended.com/2010/01/30/aggregates-subqueries-vs-group-by/</link>
		<comments>http://explainextended.com/2010/01/30/aggregates-subqueries-vs-group-by/#comments</comments>
		<pubDate>Sat, 30 Jan 2010 20:00:14 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4044</guid>
		<description><![CDATA[From Stack Overflow:
I have a table users and there is a field invited_by_id showing user id of the person who invited this user.
I need to make a MySQL query returning rows with all the fields from users plus a invites_count field showing how many people were invited by each user.
The task seems very simple (and [...]]]></description>
			<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/2155856/how-to-make-a-nested-query"><strong>Stack Overflow</strong></a>:</p>
<blockquote><p>I have a table <code>users</code> and there is a field <code>invited_by_id</code> showing user id of the person who invited this user.</p>
<p>I need to make a <strong>MySQL</strong> query returning rows with all the fields from users plus a <code>invites_count</code> field showing how many people were invited by each user.</p></blockquote>
<p>The task seems very simple (and in fact it is), but there are at least two ways to do it. In this article, I will discuss the benefits and the drawbacks of each approach.</p>
<h3>GROUP BY</h3>
<p>The first approach is using <code>GROUP BY</code>:</p>
<pre class="brush: sql">
SELECT  u.*, COUNT(ui.id)
FROM    users u
LEFT JOIN
        users ui
ON      ui.invited_by = u.id
GROUP BY
        u.id
</pre>
<p>This is a <code>GROUP BY</code> with a self join, very simple. There are only two little things I&#8217;d like to pay some attention to.</p>
<p>First, we need to select all users, even those who invited no other members. An inner join would leave them out, so we use a <code>LEFT JOIN</code>, and use <code>COUNT(ui.id)</code> instead of <code>COUNT(*)</code>, because, due to the very nature of aggregation, <code>COUNT(*)</code> returns at least <strong>1</strong> in a query with <code>GROUP BY</code>, and <code>COUNT(ui.id)</code> skips <code>NULL</code>s (which can only result from a <code>LEFT JOIN</code> miss).</p>
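<p>The difference is easy to see on a tiny hypothetical data set built inline (two users, of whom only user <strong>1</strong> has invited anybody):</p>
<pre class="brush: sql">
-- u and ui are made-up inline tables standing in for users
SELECT  u.id, COUNT(*) AS count_star, COUNT(ui.id) AS count_id
FROM    (
        SELECT  1 AS id
        UNION ALL
        SELECT  2
        ) u
LEFT JOIN
        (
        SELECT  3 AS id, 1 AS invited_by
        UNION ALL
        SELECT  4, 1
        ) ui
ON      ui.invited_by = u.id
GROUP BY
        u.id
</pre>
<p>User <strong>1</strong> gets <strong>2</strong> in both columns. User <strong>2</strong> gets <code>count_star = 1</code> (the single <code>NULL</code>-extended row produced by the join miss) but <code>count_id = 0</code>, which is the value we actually want.</p>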
<p>Second, we group by <code>u.id</code> but use <code>u.*</code> in the <code>SELECT</code> clause. Every other engine would fail in this case, but <strong>MySQL</strong> allows selecting fields that are neither grouped by nor aggregated. These fields will return an arbitrary value from any of the aggregated records (in practice, that is the record first read in its group).</p>
<p>This behavior is often abused, since <strong>MySQL</strong> does not guarantee what exactly it will return, but it&#8217;s perfectly valid for queries like ours. We don&#8217;t need to clutter the <code>GROUP BY</code> clause with all fields from <code>users</code> if we have already grouped by the <code>PRIMARY KEY</code>, which is unique. All other values from <code>users</code> are uniquely defined by the <code>PRIMARY KEY</code>, so it does not matter which arbitrary record the query uses to return ungrouped values: they are all the same within the group.</p>
<h3>Subquery</h3>
<p>This solution involves correlated subqueries:</p>
<pre class="brush: sql">
SELECT  u.*,
        (
        SELECT  COUNT(*)
        FROM    users ui
        WHERE   ui.invited_by = u.id
        )
FROM    users u
</pre>
<p>Here, we calculate the <code>COUNT(*)</code> in a correlated subquery. This query returns exactly the same records as the previous one.</p>
<h3>Comparison</h3>
<p>Let&#8217;s create two sample tables (a <strong>MyISAM</strong> one and an <strong>InnoDB</strong> one) and see which solution is more efficient for different scenarios:<br />
<span id="more-4044"></span><br />
<a href="#" onclick="xcollapse('X2316');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X2316" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_innodb_user (
        id INT NOT NULL PRIMARY KEY,
        invited_by INT NOT NULL,
        username VARCHAR(30) NOT NULL,
        stuffing VARCHAR(100) NOT NULL,
        KEY ix_user_invitedby (invited_by)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE t_myisam_user (
        id INT NOT NULL PRIMARY KEY,
        invited_by INT NOT NULL,
        username VARCHAR(30) NOT NULL,
        stuffing VARCHAR(100) NOT NULL,
        KEY ix_user_invitedby (invited_by)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT
INTO    t_innodb_user
SELECT  id, FLOOR(RAND(20100130) * id), CONCAT(&#039;User &#039;, id), RPAD(&#039;&#039;, 100, &#039;*&#039;)
FROM    filler;

INSERT
INTO    t_myisam_user
SELECT  *
FROM    t_innodb_user;
</pre>
</div>
<p>There are two identical tables, one using <strong>MyISAM</strong> storage engine, the other one using <strong>InnoDB</strong>. Each table contains <strong>100,000</strong> records.</p>
<p>The field <code>invited_by</code> is filled with a random value from <strong>0</strong> (inclusive) to the current <code>id</code> (exclusive). <strong>0</strong> means that the user was not invited (and really, who could invite the first user?).</p>
<h3>All records</h3>
<p>We will calculate the number of members invited by each user. For the sake of brevity we will aggregate the returned records.</p>
<h4>MyISAM</h4>
<p><a href="#" onclick="xcollapse('X3621');return false;"><strong>GROUP BY</strong></a><br />
</p>
<div id="X3621" style="display: none; ">
<pre class="brush: sql">
SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*, COUNT(ui.id) AS cnt
        FROM    t_myisam_user uo
        LEFT JOIN
                t_myisam_user ui
        ON      ui.invited_by = uo.id
        GROUP BY
                uo.id
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
<th>SUM(LENGTH(stuffing))</th>
<th>SUM(cnt)</th>
</tr>
<tr>
<td class="bigint">100000</td>
<td class="decimal">10000000</td>
<td class="decimal">99986</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0004s (13.9529s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">uo</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100000</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">4</td>
<td class="varchar">20100130_aggregate.uo.id</td>
<td class="bigint">2</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` left join `20100130_aggregate`.`t_myisam_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id`) `q`
</pre>
</div>
<p><a href="#" onclick="xcollapse('X9710');return false;"><strong>Subquery</strong></a><br />
</p>
<div id="X9710" style="display: none; ">
<pre class="brush: sql">
SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*,
                (
                SELECT  COUNT(*)
                FROM    t_myisam_user ui
                WHERE   ui.invited_by = uo.id
                ) AS cnt
        FROM    t_myisam_user uo
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
<th>SUM(LENGTH(stuffing))</th>
<th>SUM(cnt)</th>
</tr>
<tr>
<td class="bigint">100000</td>
<td class="decimal">10000000</td>
<td class="decimal">99986</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0004s (2.9531s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">uo</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">4</td>
<td class="varchar">20100130_aggregate.uo.id</td>
<td class="bigint">2</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100130_aggregate.uo.id&#39; of SELECT #3 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_myisam_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo`) `q`
</pre>
</div>
<p>We see that the subquery solution is <strong>much</strong> more efficient: <strong>3 seconds</strong> against <strong>14 seconds</strong>, almost <strong>5</strong> times as fast.</p>
<p>What is the reason?</p>
<p>If we look into the query plans we will see that <strong>MySQL</strong> uses <code>temporary</code> and <code>filesort</code> for the <code>GROUP BY</code> query.</p>
<p><strong>MySQL</strong> knows only one way to aggregate the records, namely, sorting. To execute any <code>GROUP BY</code> statement it should first order the records according to <code>GROUP BY</code> conditions. This naturally returns records sorted by <code>GROUP BY</code> expressions and <strong>MySQL</strong> even cared to document this behavior.</p>
<p>There is a <code>PRIMARY KEY</code> index on <code>id</code> field, but traversing the indexes is quite slow in <strong>MySQL</strong>, so the optimizer preferred to do a <code>filesort</code>.</p>
<p>The subqueries, on the other hand, do not require any additional sorting, so retrieving the values aggregated in the subqueries is much faster. The records, of course, will not be returned in any specific order, but that wasn&#8217;t required anyway.</p>
<h4>InnoDB</h4>
<p><a href="#" onclick="xcollapse('X4269');return false;"><strong>GROUP BY</strong></a><br />
</p>
<div id="X4269" style="display: none; ">
<pre class="brush: sql">
SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*, COUNT(ui.id) AS cnt
        FROM    t_innodb_user uo
        LEFT JOIN
                t_innodb_user ui
        ON      ui.invited_by = uo.id
        GROUP BY
                uo.id
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
<th>SUM(LENGTH(stuffing))</th>
<th>SUM(cnt)</th>
</tr>
<tr>
<td class="bigint">100000</td>
<td class="decimal">10000000</td>
<td class="decimal">99986</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0004s (4.2812s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">uo</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">100076</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">4</td>
<td class="varchar">20100130_aggregate.uo.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` left join `20100130_aggregate`.`t_innodb_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id`) `q`
</pre>
</div>
<p><a href="#" onclick="xcollapse('X9485');return false;"><strong>Subquery</strong></a><br />
</p>
<div id="X9485" style="display: none; ">
<pre class="brush: sql">
SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*,
                (
                SELECT  COUNT(*)
                FROM    t_innodb_user ui
                WHERE   ui.invited_by = uo.id
                ) AS cnt
        FROM    t_innodb_user uo
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
<th>SUM(LENGTH(stuffing))</th>
<th>SUM(cnt)</th>
</tr>
<tr>
<td class="bigint">100000</td>
<td class="decimal">10000000</td>
<td class="decimal">99986</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0004s (4.9530s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">uo</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100191</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">4</td>
<td class="varchar">20100130_aggregate.uo.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100130_aggregate.uo.id&#39; of SELECT #3 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_innodb_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo`) `q`
</pre>
</div>
<p>With <strong>InnoDB</strong>, both queries complete in almost the same time, but the <code>GROUP BY</code> query is still a little bit faster.</p>
<p>Unlike <strong>MyISAM</strong>, with <strong>InnoDB</strong> tables the optimizer chooses the index access path, which avoids the <code>GROUP BY</code> sorting.</p>
<p>This is because <strong>InnoDB</strong> tables are index-organized, and the <code>PRIMARY KEY</code> is the table itself. So the table scan and the <code>PRIMARY KEY</code> scan are in fact the same thing in <strong>InnoDB</strong> and there is no point in additional sorting.</p>
<p>The algorithms behind the <code>LEFT JOIN</code> and the subqueries are in fact the same: just a single index range scan. However, due to some implementation issues, the subquery access requires some additional overhead, which makes the subquery run about <strong>15%</strong> longer than its <code>LEFT JOIN</code> counterpart.</p>
<h3>LIMIT with index</h3>
<p>Let&#8217;s try the same queries, but now just return the first <strong>100</strong> records.</p>
<h4>MyISAM</h4>
<p><a href="#" onclick="xcollapse('X8630');return false;"><strong>GROUP BY</strong></a><br />
</p>
<div id="X8630" style="display: none; ">
<pre class="brush: sql">
SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*, COUNT(ui.id) AS cnt
        FROM    t_myisam_user uo
        LEFT JOIN
                t_myisam_user ui
        ON      ui.invited_by = uo.id
        GROUP BY
                uo.id
        LIMIT 100
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
<th>SUM(LENGTH(stuffing))</th>
<th>SUM(cnt)</th>
</tr>
<tr>
<td class="bigint">100</td>
<td class="decimal">10000</td>
<td class="decimal">811</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0003s (0.0315s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">uo</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">100000</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">4</td>
<td class="varchar">20100130_aggregate.uo.id</td>
<td class="bigint">2</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` left join `20100130_aggregate`.`t_myisam_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id` limit 100) `q`
</pre>
</div>
<p><a href="#" onclick="xcollapse('X9759');return false;"><strong>Subquery</strong></a><br />
</p>
<div id="X9759" style="display: none; ">
<pre class="brush: sql">
SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*,
                (
                SELECT  COUNT(*)
                FROM    t_myisam_user ui
                WHERE   ui.invited_by = uo.id
                ) AS cnt
        FROM    t_myisam_user uo
        ORDER BY
                uo.id
        LIMIT 100
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
<th>SUM(LENGTH(stuffing))</th>
<th>SUM(cnt)</th>
</tr>
<tr>
<td class="bigint">100</td>
<td class="decimal">10000</td>
<td class="decimal">811</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0004s (0.0103s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">uo</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">100</td>
<td class="double">100000.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">4</td>
<td class="varchar">20100130_aggregate.uo.id</td>
<td class="bigint">2</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100130_aggregate.uo.id&#39; of SELECT #3 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_myisam_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` order by `20100130_aggregate`.`uo`.`id` limit 100) `q`
</pre>
</div>
<p>In this case, the optimizer uses the index even though the table is <strong>MyISAM</strong>, because <code>LIMIT</code> makes index traversal cheaper than sorting.</p>
<p>However, for some strange reason, the engine still sorts the final resultset (which is already sorted by the <code>GROUP BY</code>), and the <code>GROUP BY</code> query is about <strong>3</strong> times slower than the subquery.</p>
<h4>InnoDB</h4>
<p><a href="#" onclick="xcollapse('X8189');return false;"><strong>GROUP BY</strong></a><br />
</p>
<div id="X8189" style="display: none; ">
<pre class="brush: sql">
SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*, COUNT(ui.id) AS cnt
        FROM    t_innodb_user uo
        LEFT JOIN
                t_innodb_user ui
        ON      ui.invited_by = uo.id
        GROUP BY
                uo.id
        LIMIT 100
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
<th>SUM(LENGTH(stuffing))</th>
<th>SUM(cnt)</th>
</tr>
<tr>
<td class="bigint">100</td>
<td class="decimal">10000</td>
<td class="decimal">811</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0004s (0.0091s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">uo</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">100076</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">4</td>
<td class="varchar">20100130_aggregate.uo.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` left join `20100130_aggregate`.`t_innodb_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id` limit 100) `q`
</pre>
</div>
<p><a href="#" onclick="xcollapse('X2608');return false;"><strong>Subquery</strong></a><br />
</p>
<div id="X2608" style="display: none; ">
<pre class="brush: sql">
SELECT  COUNT(*), SUM(LENGTH(stuffing)), SUM(cnt)
FROM    (
        SELECT  uo.*,
                (
                SELECT  COUNT(*)
                FROM    t_innodb_user ui
                WHERE   ui.invited_by = uo.id
                ) AS cnt
        FROM    t_innodb_user uo
        ORDER BY
                uo.id
        LIMIT 100
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
<th>SUM(LENGTH(stuffing))</th>
<th>SUM(cnt)</th>
</tr>
<tr>
<td class="bigint">100</td>
<td class="decimal">10000</td>
<td class="decimal">811</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0004s (0.0097s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">uo</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">100</td>
<td class="double">100076.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">4</td>
<td class="varchar">20100130_aggregate.uo.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100130_aggregate.uo.id&#39; of SELECT #3 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(length(`q`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,sum(`q`.`cnt`) AS `SUM(cnt)` from (select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`invited_by` AS `invited_by`,`20100130_aggregate`.`uo`.`username` AS `username`,`20100130_aggregate`.`uo`.`stuffing` AS `stuffing`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_innodb_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` order by `20100130_aggregate`.`uo`.`id` limit 100) `q`
</pre>
</div>
<p>With <strong>InnoDB</strong> tables, the optimizer does no final <code>filesort</code>, so both solutions take almost the same time (although the subqueries are again several percent less efficient).</p>
<h3>LIMIT without index</h3>
<p>Now, let&#8217;s run the queries that limit the resultsets ordered by an unindexed expression. Usually, this means ordering by <code>RAND()</code> to show, say, <strong>10</strong> random users.</p>
<h4>MyISAM</h4>
<p><a href="#" onclick="xcollapse('X10339');return false;"><strong>GROUP BY</strong></a><br />
</p>
<div id="X10339" style="display: none; ">
<pre class="brush: sql">
SELECT  uo.id, uo.username, COUNT(ui.id) AS cnt
FROM    t_myisam_user uo
LEFT JOIN
        t_myisam_user ui
ON      ui.invited_by = uo.id
GROUP BY
        uo.id
ORDER BY
        RAND(20100130)
LIMIT 10
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>username</th>
<th>cnt</th>
</tr>
<tr>
<td class="integer">34429</td>
<td class="varchar">User 34429</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">7402</td>
<td class="varchar">User 7402</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">7109</td>
<td class="varchar">User 7109</td>
<td class="bigint">3</td>
</tr>
<tr>
<td class="integer">99555</td>
<td class="varchar">User 99555</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">38015</td>
<td class="varchar">User 38015</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">33796</td>
<td class="varchar">User 33796</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">10230</td>
<td class="varchar">User 10230</td>
<td class="bigint">4</td>
</tr>
<tr>
<td class="integer">87928</td>
<td class="varchar">User 87928</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">83541</td>
<td class="varchar">User 83541</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">68714</td>
<td class="varchar">User 68714</td>
<td class="bigint">0</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0013s (4.8593s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">uo</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">5</td>
<td class="double">2000000.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">4</td>
<td class="varchar">20100130_aggregate.uo.id</td>
<td class="bigint">2</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`username` AS `username`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` left join `20100130_aggregate`.`t_myisam_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id` order by rand(20100130) limit 10
</pre>
</div>
<p><a href="#" onclick="xcollapse('X617');return false;"><strong>Subquery</strong></a><br />
</p>
<div id="X617" style="display: none; ">
<pre class="brush: sql">
SELECT  id, username,
        (
        SELECT  COUNT(*)
        FROM    t_myisam_user ui
        WHERE   ui.invited_by = uo.id
        ) AS cnt
FROM    t_myisam_user uo
ORDER BY
        RAND(20100130)
LIMIT 10
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>username</th>
<th>cnt</th>
</tr>
<tr>
<td class="integer">34429</td>
<td class="varchar">User 34429</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">7402</td>
<td class="varchar">User 7402</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">7109</td>
<td class="varchar">User 7109</td>
<td class="bigint">3</td>
</tr>
<tr>
<td class="integer">99555</td>
<td class="varchar">User 99555</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">38015</td>
<td class="varchar">User 38015</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">33796</td>
<td class="varchar">User 33796</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">10230</td>
<td class="varchar">User 10230</td>
<td class="bigint">4</td>
</tr>
<tr>
<td class="integer">87928</td>
<td class="varchar">User 87928</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">83541</td>
<td class="varchar">User 83541</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">68714</td>
<td class="varchar">User 68714</td>
<td class="bigint">0</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0011s (2.0000s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">uo</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100000</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">4</td>
<td class="varchar">20100130_aggregate.uo.id</td>
<td class="bigint">2</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100130_aggregate.uo.id&#39; of SELECT #2 was resolved in SELECT #1
select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`username` AS `username`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_myisam_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_myisam_user` `uo` order by rand(20100130) limit 10
</pre>
</div>
<p>The subqueries are notably faster (<strong>2 seconds</strong> for the subqueries against almost <strong>5 seconds</strong> for the <code>GROUP BY</code>).</p>
<p>This demonstrates an interesting flaw in the <strong>MySQL</strong> optimizer algorithm.</p>
<p>We remember that <strong>MySQL</strong> used the table scan with a sort for a query without a <code>LIMIT</code>, and an index scan for a query using a <code>LIMIT</code>.</p>
<p>In this case, <strong>MySQL</strong>&#8217;s optimizer sees two clauses: <code>GROUP BY</code> and <code>LIMIT</code>, and there is an index available for <code>GROUP BY id</code>.</p>
<p>Taken together, they should make the index access path more efficient, and as we already saw in the previous section, they do, since normally <code>LIMIT</code> just takes the first <strong>10</strong> records from the index.</p>
<p>But there is a third clause here, <code>ORDER BY</code>.</p>
<p><code>LIMIT</code> of course gets applied to the <code>ORDER BY</code> sorting, not the <code>GROUP BY</code> one. <strong>MySQL</strong>&#8217;s optimizer, however, does not take this into account. Despite the fact that the <code>ORDER BY</code> will need the whole recordset, the index access path is still used. This makes <code>GROUP BY</code> traverse the whole index. Of course no sorting is done (for <code>GROUP BY</code>, that is), but the traversal itself is quite slow in <strong>MyISAM</strong>.</p>
<p>The subquery solution, on the other hand, has only the <code>ORDER BY</code> and <code>LIMIT</code>. This makes the optimizer choose the table scan, which is much faster when we need all records.</p>
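<p>Since the <code>ORDER BY RAND()</code> needs the whole recordset anyway, one possible variation is to pick the random rows first and count the invitees only for those rows. This is a sketch that was not benchmarked for this article; the alias <code>rnd</code> is made up for the example, and the order of the outer resultset is not guaranteed:</p>
<pre class="brush: sql">
-- Untested sketch: choose 10 random users in a derived table,
-- then run the aggregate subquery for those 10 rows only
SELECT  rnd.id, rnd.username,
        (
        SELECT  COUNT(*)
        FROM    t_myisam_user ui
        WHERE   ui.invited_by = rnd.id
        ) AS cnt
FROM    (
        SELECT  id, username
        FROM    t_myisam_user
        ORDER BY
                RAND(20100130)
        LIMIT 10
        ) rnd
</pre>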
<h4>InnoDB</h4>
<p><a href="#" onclick="xcollapse('X9048');return false;"><strong>GROUP BY</strong></a><br />
</p>
<div id="X9048" style="display: none; ">
<pre class="brush: sql">
SELECT  uo.id, uo.username, COUNT(ui.id) AS cnt
FROM    t_innodb_user uo
LEFT JOIN
        t_innodb_user ui
ON      ui.invited_by = uo.id
GROUP BY
        uo.id
ORDER BY
        RAND(20100130)
LIMIT 10
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>username</th>
<th>cnt</th>
</tr>
<tr>
<td class="integer">34429</td>
<td class="varchar">User 34429</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">7402</td>
<td class="varchar">User 7402</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">7109</td>
<td class="varchar">User 7109</td>
<td class="bigint">3</td>
</tr>
<tr>
<td class="integer">99555</td>
<td class="varchar">User 99555</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">38015</td>
<td class="varchar">User 38015</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">33796</td>
<td class="varchar">User 33796</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">10230</td>
<td class="varchar">User 10230</td>
<td class="bigint">4</td>
</tr>
<tr>
<td class="integer">87928</td>
<td class="varchar">User 87928</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">83541</td>
<td class="varchar">User 83541</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">68714</td>
<td class="varchar">User 68714</td>
<td class="bigint">0</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0011s (3.2968s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">uo</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">1000760.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">4</td>
<td class="varchar">20100130_aggregate.uo.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`username` AS `username`,count(`20100130_aggregate`.`ui`.`id`) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` left join `20100130_aggregate`.`t_innodb_user` `ui` on((`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) where 1 group by `20100130_aggregate`.`uo`.`id` order by rand(20100130) limit 10
</pre>
</div>
<p><a href="#" onclick="xcollapse('X57');return false;"><strong>Subquery</strong></a><br />
</p>
<div id="X57" style="display: none; ">
<pre class="brush: sql">
SELECT  id, username,
        (
        SELECT  COUNT(*)
        FROM    t_innodb_user ui
        WHERE   ui.invited_by = uo.id
        ) AS cnt
FROM    t_innodb_user uo
ORDER BY
        RAND(20100130)
LIMIT 10
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>username</th>
<th>cnt</th>
</tr>
<tr>
<td class="integer">34429</td>
<td class="varchar">User 34429</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">7402</td>
<td class="varchar">User 7402</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">7109</td>
<td class="varchar">User 7109</td>
<td class="bigint">3</td>
</tr>
<tr>
<td class="integer">99555</td>
<td class="varchar">User 99555</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">38015</td>
<td class="varchar">User 38015</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">33796</td>
<td class="varchar">User 33796</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">10230</td>
<td class="varchar">User 10230</td>
<td class="bigint">4</td>
</tr>
<tr>
<td class="integer">87928</td>
<td class="varchar">User 87928</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">83541</td>
<td class="varchar">User 83541</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">68714</td>
<td class="varchar">User 68714</td>
<td class="bigint">0</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0010s (4.0155s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">uo</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">100076</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">ix_user_invitedby</td>
<td class="varchar">4</td>
<td class="varchar">20100130_aggregate.uo.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100130_aggregate.uo.id&#39; of SELECT #2 was resolved in SELECT #1
select `20100130_aggregate`.`uo`.`id` AS `id`,`20100130_aggregate`.`uo`.`username` AS `username`,(select count(0) AS `COUNT(*)` from `20100130_aggregate`.`t_innodb_user` `ui` where (`20100130_aggregate`.`ui`.`invited_by` = `20100130_aggregate`.`uo`.`id`)) AS `cnt` from `20100130_aggregate`.`t_innodb_user` `uo` order by rand(20100130) limit 10
</pre>
</div>
<p>Formally, <strong>InnoDB</strong> is subject to the same optimizer mistake as the one described above for <strong>MyISAM</strong>. For the <code>GROUP BY</code>, we see the <code>index</code> access path in the query plan; for the subqueries, we see <code>ALL</code>.</p>
<p>But for <strong>InnoDB</strong>, these two access paths are in fact the same, since an <strong>InnoDB</strong> table <em>is</em> a <code>PRIMARY KEY</code> and the index traversal over a <code>PRIMARY KEY</code> is a table traversal.</p>
<p>So unlike <strong>MyISAM</strong>, nothing bad happens in the first case here, and the <code>GROUP BY</code> query runs a little bit faster than the subqueries.</p>
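<p>The access paths can be compared on another server as well. The plans and rewritten query texts in this article look like the output of <code>EXPLAIN EXTENDED</code> followed by <code>SHOW WARNINGS</code>. That is an assumption, since the commands are not named here. <code>EXPLAIN EXTENDED</code> is <strong>MySQL 5.x</strong> syntax; it was deprecated in <strong>5.7</strong> and removed in <strong>8.0</strong>, where a plain <code>EXPLAIN</code> serves the same purpose:</p>
<pre class="brush: sql">
-- Shows the access path (index vs. ALL) chosen for each table
EXPLAIN EXTENDED
SELECT  uo.id, uo.username, COUNT(ui.id) AS cnt
FROM    t_innodb_user uo
LEFT JOIN
        t_innodb_user ui
ON      ui.invited_by = uo.id
GROUP BY
        uo.id
ORDER BY
        RAND(20100130)
LIMIT 10;

-- Prints the query as rewritten by the optimizer
SHOW WARNINGS;
</pre>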
<h3>Summary</h3>
<p>Generally speaking, aggregates over a left-joined table grouped by the main table&#8217;s primary key, and the aggregate subqueries over the same table yield the same results and use almost the same algorithms in <strong>MySQL</strong>.</p>
<p>However, they still differ performance-wise.</p>
<p>On the one hand, the subqueries require some overhead and are executed several percent more slowly than the joins.</p>
<p>On the other hand, <code>GROUP BY</code> in <strong>MySQL</strong> requires sorting the joined recordset on the <code>GROUP BY</code> expressions.</p>
<p>For <strong>InnoDB</strong>, the optimizer mostly makes the optimal decisions and does not sort the recordset, since an <strong>InnoDB</strong> table is always ordered by the <code>PRIMARY KEY</code> and the records naturally come in that order out of the table. The optimizer is aware of that.</p>
<p>For <strong>MyISAM</strong>, the <code>PRIMARY KEY</code> index does not contain the table values, so the table needs to be looked up, which requires extra work. That&#8217;s why in <strong>MyISAM</strong> the optimizer often makes incorrect decisions about whether to use the index sort order or to sort the records taken from the table. These incorrect decisions lead to the subqueries being more efficient.</p>
<p>For <strong>MyISAM</strong> tables, the subqueries are often a better alternative to the <code>GROUP BY</code>.</p>
<p>For <strong>InnoDB</strong> tables, the subqueries and the <code>GROUP BY</code> complete in almost the same time, but <code>GROUP BY</code> is still several percent more efficient.</p>
<p>With <strong>InnoDB</strong>, the <code>GROUP BY</code> queries over a left-joined table should be preferred over running the aggregate subqueries.</p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/01/30/aggregates-subqueries-vs-group-by/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Latest visitors</title>
		<link>http://explainextended.com/2010/01/12/latest-visitors/</link>
		<comments>http://explainextended.com/2010/01/12/latest-visitors/#comments</comments>
		<pubDate>Tue, 12 Jan 2010 20:00:26 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=3972</guid>
		<description><![CDATA[From Stack Overflow:

Say you want to display the latest visitors on a users profile page.
How would you structure this?
Perhaps a table called uservisitors:

uservisitors

userid 
visitorid 
time


And how would you select this with MySQL without any duplicates?
What I mean is if user 1 visits user 2&#8217;s profile, then 5 minutes later visits it again, I don&#8217;t want [...]]]></description>
			<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/2050955/latest-visitors"><strong>Stack Overflow</strong></a>:</p>
<blockquote><p>
Say you want to display the latest visitors on a users profile page.</p>
<p>How would you structure this?</p>
<p>Perhaps a table called <code>uservisitors</code>:</p>
<table class="excel">
<caption>uservisitors</caption>
<tr>
<th>userid </th>
<th>visitorid </th>
<th>time</th>
</tr>
</table>
<p>And how would you select this with <strong>MySQL</strong> without any duplicates?</p>
<p>What I mean is if <strong>user 1</strong> visits <strong>user 2</strong>&#8217;s profile, then <strong>5</strong> minutes later visits it again, I don&#8217;t want to show both entries: only the latest.</p></blockquote>
<p>There are two approaches to this.</p>
<p>The first one would be to simply aggregate the visits, find the maximum time and order on it. Something like this:</p>
<pre class="brush: sql">
SELECT  visitorid, MAX(time) AS lastvisit
FROM    uservisitors
WHERE   userid = 1
GROUP BY
        userid, visitorid
ORDER BY
        lastvisit DESC
LIMIT 5
</pre>
<p>However, there is a little problem with this solution.</p>
<p>Despite the fact that <strong>MySQL</strong> (with proper indexing) uses the <code>INDEX FOR GROUP-BY</code> optimization for this query, it will still have to sort on <code>MAX(time)</code> to find the <strong>5</strong> latest records.</p>
<p>This will require sorting the whole resultset which will be huge if the service is heavily loaded.</p>
<p>Let&#8217;s test it on a sample table:<br />
<span id="more-3972"></span><br />
<a href="#" onclick="xcollapse('X1258');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X1258" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE uservisitors (
        id INT NOT NULL PRIMARY KEY,
        userid INT NOT NULL,
        visitorid INT NOT NULL,
        time DATETIME NOT NULL,
        KEY ix_uservisitors_userid_visitorid_time (userid, visitorid, time),
        KEY ix_uservisitors_userid_time (userid, time)
) ENGINE=InnoDB;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    uservisitors
SELECT  id, (id - 1) % 2 + 1, (id - 1) % 100000 + 1, &#039;2009-01-12&#039; - INTERVAL id MINUTE
FROM    filler;
</pre>
</div>
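<p>This table has <strong>1,000,000</strong> records for <strong>2</strong> different users and <strong>100,000</strong> unique visitors. Since the number of visitors is even, the modulo expressions assign every visitor to exactly one of the two users: each user gets <strong>50,000</strong> distinct visitors with <strong>10</strong> visits each.</p>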
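<p>A quick way to check how the generated rows are distributed is to count the visits and the distinct visitors per user (a minimal sketch that uses only the table and columns defined above):</p>
<pre class="brush: sql">
-- One row per user: total visits and number of distinct visitors
SELECT  userid,
        COUNT(*) AS visits,
        COUNT(DISTINCT visitorid) AS visitors
FROM    uservisitors
GROUP BY
        userid
</pre>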
<p>This reflects real distribution of the data on an actual site quite well.</p>
<p>The query we initially came up with gives the following results:</p>
<pre class="brush: sql">
SELECT  visitorid, MAX(time) AS lastvisit
FROM    uservisitors
WHERE   userid = 2
GROUP BY
        userid, visitorid
ORDER BY
        lastvisit DESC
LIMIT 5
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>visitorid</th>
<th>lastvisit</th>
</tr>
<tr>
<td class="integer">2</td>
<td class="timestamp">2009-01-11 23:58:00</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="timestamp">2009-01-11 23:56:00</td>
</tr>
<tr>
<td class="integer">6</td>
<td class="timestamp">2009-01-11 23:54:00</td>
</tr>
<tr>
<td class="integer">8</td>
<td class="timestamp">2009-01-11 23:52:00</td>
</tr>
<tr>
<td class="integer">10</td>
<td class="timestamp">2009-01-11 23:50:00</td>
</tr>
<tr class="statusbar">
<td colspan="100">5 rows fetched in 0.0002s (1.5469s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">uservisitors</td>
<td class="varchar">range</td>
<td class="varchar">ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time</td>
<td class="varchar">ix_uservisitors_userid_visitorid_time</td>
<td class="varchar">8</td>
<td class="varchar"></td>
<td class="bigint">125068</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index for group-by; Using temporary; Using filesort</td>
</tr>
</table>
</div>
<pre>
select `20100112_visitors`.`uservisitors`.`visitorid` AS `visitorid`,max(`20100112_visitors`.`uservisitors`.`time`) AS `lastvisit` from `20100112_visitors`.`uservisitors` where (`20100112_visitors`.`uservisitors`.`userid` = 2) group by `20100112_visitors`.`uservisitors`.`userid`,`20100112_visitors`.`uservisitors`.`visitorid` order by max(`20100112_visitors`.`uservisitors`.`time`) desc limit 5
</pre>
<p>This query uses <code>INDEX FOR GROUP-BY</code>, but it needs to sort <em>twice</em>: first to calculate the aggregates, then to satisfy the <code>ORDER BY</code> clause.</p>
<p>An index scan can be used instead of the first sorting but the second sorting still needs to be done.</p>
<p>That&#8217;s why there is a <code>filesort</code> in the plan.</p>
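<p>The plan table and the rewritten query above look like the output of <code>EXPLAIN EXTENDED</code> followed by <code>SHOW WARNINGS</code>. Assuming that is how they were obtained, a similar listing can be reproduced with the sketch below. In <strong>MySQL 8.0</strong> the <code>EXTENDED</code> keyword no longer exists, and a plain <code>EXPLAIN</code> followed by <code>SHOW WARNINGS</code> gives the same information.</p>
<pre class="brush: sql">
-- Shows the access plan, including the "filtered" column
EXPLAIN EXTENDED
SELECT  visitorid, MAX(time) AS lastvisit
FROM    uservisitors
WHERE   userid = 2
GROUP BY
        userid, visitorid
ORDER BY
        lastvisit DESC
LIMIT 5;

-- Shows the query as rewritten by the optimizer
SHOW WARNINGS;
</pre>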
<p>The query runs for <strong>1.5 seconds</strong>, which is way too much for a busy service.</p>
<p>But since we only need the latest <strong>5</strong> visits to each user, we can just iterate the records using an index on <code>time</code> and filter out the duplicates until we return <strong>5</strong> records.</p>
<p>This can be done by adding a simple condition to the <code>WHERE</code> clause that checks whether a more recent record with the same <code>(userid, visitorid)</code> exists. Since we iterate from the latest record backwards, such a record would already have been returned.</p>
<p>This is of course a variation of Schlemiel the painter&#8217;s algorithm.</p>
<p>However, since we only need <strong>5</strong> records and the duplicate visitors are very improbable, this should work alright.</p>
<p>Let&#8217;s check it:</p>
<pre class="brush: sql">
SELECT  visitorid
FROM    uservisitors uo
WHERE   userid = 1
        AND NOT EXISTS
        (
        SELECT  NULL
        FROM    uservisitors ui
        WHERE   ui.userid = uo.userid
                AND (ui.time, ui.id) &gt; (uo.time, uo.id)
                AND ui.visitorid = uo.visitorid
        )
ORDER BY
        userid DESC, time DESC
LIMIT 5
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>visitorid</th>
</tr>
<tr>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">5</td>
</tr>
<tr>
<td class="integer">7</td>
</tr>
<tr>
<td class="integer">9</td>
</tr>
<tr class="statusbar">
<td colspan="100">5 rows fetched in 0.0001s (0.0018s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">uo</td>
<td class="varchar">ref</td>
<td class="varchar">ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time</td>
<td class="varchar">ix_uservisitors_userid_time</td>
<td class="varchar">4</td>
<td class="varchar">const</td>
<td class="bigint">22940</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time</td>
<td class="varchar">ix_uservisitors_userid_visitorid_time</td>
<td class="varchar">8</td>
<td class="varchar">20100112_visitors.uo.userid,20100112_visitors.uo.visitorid</td>
<td class="bigint">4</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100112_visitors.uo.userid&#39; of SELECT #2 was resolved in SELECT #1
Field or reference &#39;20100112_visitors.uo.time&#39; of SELECT #2 was resolved in SELECT #1
Field or reference &#39;20100112_visitors.uo.id&#39; of SELECT #2 was resolved in SELECT #1
Field or reference &#39;20100112_visitors.uo.visitorid&#39; of SELECT #2 was resolved in SELECT #1
select `20100112_visitors`.`uo`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `uo` where ((`20100112_visitors`.`uo`.`userid` = 1) and (not(exists(select NULL AS `NULL` from `20100112_visitors`.`uservisitors` `ui` where ((`20100112_visitors`.`ui`.`userid` = `20100112_visitors`.`uo`.`userid`) and ((`20100112_visitors`.`ui`.`time`,`20100112_visitors`.`ui`.`id`) &gt; (`20100112_visitors`.`uo`.`time`,`20100112_visitors`.`uo`.`id`)) and (`20100112_visitors`.`ui`.`visitorid` = `20100112_visitors`.`uo`.`visitorid`)))))) order by `20100112_visitors`.`uo`.`userid` desc,`20100112_visitors`.`uo`.`time` desc limit 5
</pre>
<p>Here we use a <code>NOT EXISTS</code> clause to check whether the current visitor has a more recent visit (and therefore was already returned by the query). This eliminates duplicates.</p>
<p><strong>MySQL</strong> is not very good at optimizing the index access for correlated subqueries: for each <code>visitorid</code> it checks <em>all</em> records for the given <code>(userid, visitorid)</code>, despite the fact that we have an additional range condition on <code>time</code> which could be employed too. That&#8217;s why we see <code>Using where</code> in the plan for the <code>DEPENDENT SUBQUERY</code>.</p>
<p>However, since every visitor has only a handful of visits to any given <code>userid</code>, this is not much of an issue.</p>
<p>The query completes in under <strong>2 ms</strong>, which is next to instant.</p>
<h3>Update of Jan 13th, 2010</h3>
<p>As <strong>johno</strong> correctly pointed out, it is possible to use <code>LEFT JOIN / IS NULL</code> or <code>NOT IN</code> instead of <code>NOT EXISTS</code> to get the same results:</p>
<h4>NOT IN</h4>
<pre class="brush: sql">
SELECT  visitorid
FROM    uservisitors uo
WHERE   userid = 1
        AND (userid, visitorid) NOT IN
        (
        SELECT  userid, visitorid
        FROM    uservisitors ui
        WHERE   (ui.time, ui.id) &gt; (uo.time, uo.id)
        )
ORDER BY
        userid DESC, time DESC
LIMIT 5
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>visitorid</th>
</tr>
<tr>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">5</td>
</tr>
<tr>
<td class="integer">7</td>
</tr>
<tr>
<td class="integer">9</td>
</tr>
<tr class="statusbar">
<td colspan="100">5 rows fetched in 0.0001s (0.0017s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">uo</td>
<td class="varchar">ref</td>
<td class="varchar">ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time</td>
<td class="varchar">ix_uservisitors_userid_time</td>
<td class="varchar">4</td>
<td class="varchar">const</td>
<td class="bigint">22940</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time</td>
<td class="varchar">ix_uservisitors_userid_visitorid_time</td>
<td class="varchar">8</td>
<td class="varchar">func,func</td>
<td class="bigint">4</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100112_visitors.uo.time&#39; of SELECT #2 was resolved in SELECT #1
Field or reference &#39;20100112_visitors.uo.id&#39; of SELECT #2 was resolved in SELECT #1
select `20100112_visitors`.`uo`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `uo` where ((`20100112_visitors`.`uo`.`userid` = 1) and (not(&lt;in_optimizer&gt;((`20100112_visitors`.`uo`.`userid`,`20100112_visitors`.`uo`.`visitorid`),&lt;exists&gt;(select `20100112_visitors`.`ui`.`userid` AS `userid`,`20100112_visitors`.`ui`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `ui` where (((`20100112_visitors`.`ui`.`time`,`20100112_visitors`.`ui`.`id`) &gt; (`20100112_visitors`.`uo`.`time`,`20100112_visitors`.`uo`.`id`)) and (&lt;cache&gt;(`20100112_visitors`.`uo`.`userid`) = `20100112_visitors`.`ui`.`userid`) and (&lt;cache&gt;(`20100112_visitors`.`uo`.`visitorid`) = `20100112_visitors`.`ui`.`visitorid`)) having (&lt;is_not_null_test&gt;(`20100112_visitors`.`ui`.`userid`) and &lt;is_not_null_test&gt;(`20100112_visitors`.`ui`.`visitorid`))))))) order by `20100112_visitors`.`uo`.`userid` desc,`20100112_visitors`.`uo`.`time` desc limit 5
</pre>
<h4>LEFT JOIN / IS NULL</h4>
<pre class="brush: sql">
SELECT  uo.visitorid
FROM    uservisitors uo
LEFT JOIN
        uservisitors ui
ON      ui.userid = uo.userid
        AND (ui.time, ui.id) &gt; (uo.time, uo.id)
        AND ui.visitorid = uo.visitorid
WHERE   uo.userid = 1
        AND ui.userid IS NULL
ORDER BY
        uo.userid DESC, uo.time DESC
LIMIT 5
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>visitorid</th>
</tr>
<tr>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">5</td>
</tr>
<tr>
<td class="integer">7</td>
</tr>
<tr>
<td class="integer">9</td>
</tr>
<tr class="statusbar">
<td colspan="100">5 rows fetched in 0.0001s (0.0018s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">uo</td>
<td class="varchar">ref</td>
<td class="varchar">ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time</td>
<td class="varchar">ix_uservisitors_userid_time</td>
<td class="varchar">4</td>
<td class="varchar">const</td>
<td class="bigint">22940</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">ui</td>
<td class="varchar">ref</td>
<td class="varchar">ix_uservisitors_userid_visitorid_time,ix_uservisitors_userid_time</td>
<td class="varchar">ix_uservisitors_userid_visitorid_time</td>
<td class="varchar">8</td>
<td class="varchar">const,20100112_visitors.uo.visitorid</td>
<td class="bigint">4</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index; Not exists</td>
</tr>
</table>
</div>
<pre>
select `20100112_visitors`.`uo`.`visitorid` AS `visitorid` from `20100112_visitors`.`uservisitors` `uo` left join `20100112_visitors`.`uservisitors` `ui` on(((`20100112_visitors`.`ui`.`visitorid` = `20100112_visitors`.`uo`.`visitorid`) and (`20100112_visitors`.`ui`.`userid` = 1) and ((`20100112_visitors`.`ui`.`time`,`20100112_visitors`.`ui`.`id`) &gt; (`20100112_visitors`.`uo`.`time`,`20100112_visitors`.`uo`.`id`)))) where ((`20100112_visitors`.`uo`.`userid` = 1) and isnull(`20100112_visitors`.`ui`.`userid`)) order by `20100112_visitors`.`uo`.`userid` desc,`20100112_visitors`.`uo`.`time` desc limit 5
</pre>
<p>As I mentioned in one of the previous articles:</p>
<ul>
<li>
<a href="http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/"><strong>NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL</strong></a>
</li>
</ul>
<p><strong>MySQL</strong> optimizes all three of these predicates to use the same algorithm (return <code>FALSE</code> on the first match), but the tests show that <code>LEFT JOIN / IS NULL</code> and <code>NOT IN</code> implement this algorithm a little bit more efficiently than <code>NOT EXISTS</code>.</p>
<p>Though on such a small resultset the difference is negligible, it is always better to use more efficient solutions.</p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/01/12/latest-visitors/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Linked lists in MySQL: multiple ordering</title>
		<link>http://explainextended.com/2009/12/29/linked-lists-in-mysql-multiple-ordering/</link>
		<comments>http://explainextended.com/2009/12/29/linked-lists-in-mysql-multiple-ordering/#comments</comments>
		<pubDate>Tue, 29 Dec 2009 20:00:37 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=3894</guid>
		<description><![CDATA[Answering questions asked on the site.
Rick McIntosh asks:

I have two tables, one with a list of spreadsheets (a) and the other with a list of the column headings that show in the spreadsheets (b).

a

id
parent


1
2


2
0



b

id
parent
aid


1
1
1


2
0
1


3
2
1


4
6
2


5
4
2


6
0
2


I want to bring the columns back in first the order of the spreadsheets as defined by the a.parent_id then ordered as [...]]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Rick McIntosh</strong> asks:</p>
<blockquote><p>
I have two tables, one with a list of spreadsheets (<code>a</code>) and the other with a list of the column headings that show in the spreadsheets (<code>b</code>).</p>
<table class="excel">
<caption>a</caption>
<tr>
<th>id</th>
<th>parent</th>
</tr>
<tr>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
</tr>
</table>
<table class="excel">
<caption>b</caption>
<tr>
<th>id</th>
<th>parent</th>
<th>aid</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>3</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<td>4</td>
<td>6</td>
<td>2</td>
</tr>
<tr>
<td>5</td>
<td>4</td>
<td>2</td>
</tr>
<tr>
<td>6</td>
<td>0</td>
<td>2</td>
</tr>
</table>
<p>I want to bring the columns back in first the order of the spreadsheets as defined by the <code>a.parent_id</code> then ordered as <code>b.parent_id</code>:</p>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>parent</th>
<th>aid</th>
<th>aparent</th>
</tr>
<tr>
<td>6</td>
<td>0</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>4</td>
<td>6</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>5</td>
<td>4</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>3</td>
<td>2</td>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>2</td>
</tr>
</table>
</div>
</blockquote>
<p>This can be done using the same recursion technique as the one that was used to build a simple linked list.</p>
<p><strong>MySQL</strong> does not support recursion directly, but it can be emulated using subquery calls in the <code>SELECT</code> clause of the query, using session variables to store the recursion state.</p>
<p>In this case we need to do the following:</p>
<ol>
<li>
<p>Generate a dummy recordset for recursion that would contain as many rows as the resulting recordset. This is best done by issuing a <code>JOIN</code> on the <code>FOREIGN KEY</code>, without any ordering. The values of the recordset will not be used in the actual query.</p>
</li>
<li>
<p>Initialize <code>@a</code> and <code>@b</code> to the first value of <code>a</code> and to <strong>0</strong>, respectively.</p>
</li>
<li>
<p>In the loop, make a query that would return the next item of <code>b</code> for the current value of <code>@a</code>, or, should it fail, return the first item of <code>b</code> for the next <code>@a</code>. This is best done using a <code>UNION ALL</code> with a <code>LIMIT</code>.</p>
</li>
<li>
<p>Adjust <code>@a</code> so that it points to the correct item: just select the appropriate value from <code>b</code>.</p>
</li>
</ol>
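<p>For comparison only: <strong>MySQL 8.0</strong> and later support recursive common table expressions, which were not available in the version used for this article. Assuming such a version, the inner step (walking the items of <code>b</code> within a single spreadsheet) could be sketched as below. The names <code>walk</code> and <code>pos</code> are made up for the illustration, and the spreadsheet id <strong>1</strong> is an arbitrary choice:</p>
<pre class="brush: sql">
WITH RECURSIVE walk (id, parent, aid, pos) AS
        (
        -- Anchor: the head of the list has no parent
        SELECT  id, parent, aid, 1
        FROM    b
        WHERE   aid = 1
                AND parent = 0
        UNION ALL
        -- Step: the next item is the one whose parent is the current item
        SELECT  b.id, b.parent, b.aid, walk.pos + 1
        FROM    walk
        JOIN    b
        ON      b.aid = walk.aid
                AND b.parent = walk.id
        )
SELECT  id, parent, aid
FROM    walk
ORDER BY
        pos
</pre>
<p>The session variable technique described in the steps above achieves the same walk on versions without this feature, and it also handles the switch from one spreadsheet to the next.</p>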
<p>Let&#8217;s create the sample tables:<br />
<span id="more-3894"></span><br />
<a href="#" onclick="xcollapse('X8921');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X8921" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE a (
        id INT NOT NULL PRIMARY KEY,
        parent INT NOT NULL,
        UNIQUE KEY ux_a_parent (parent)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE b (
        id INT NOT NULL PRIMARY KEY,
        parent INT NOT NULL,
        aid INT NOT NULL,
        UNIQUE KEY ux_b_aid_parent (aid, parent)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(5);
COMMIT;

SET @r := 0;

INSERT
INTO    a (parent, id)
SELECT  @r, @r := id
FROM    (
        SELECT  id
        FROM    filler
        ORDER BY
                RAND(20091229)
        ) q;

INSERT
INTO    b (id, parent, aid)
SELECT  id, parent, aid
FROM    (
        SELECT  aid,
                IF((@aid &lt;&gt; aid), 0, @bid) AS parent,
                @bid := id AS id,
                @aid := aid
        FROM    (
                SELECT  @aid := 0,
                        @bid := 0,
                        @parent := 0
                ) vars,
                (
                SELECT  a.id AS aid, (a.id - 1) * 3 + q.id AS id
                FROM    a
                JOIN    (
                        SELECT  id
                        FROM    a
                        LIMIT   3
                        ) q
                ORDER BY
                        aid, RAND(20091228 &lt;&lt; 1)
                ) q2
        ) q3;
</pre>
</div>
<p>Here&#8217;s what these tables contain (in <code>id</code> order):</p>
<pre class="brush: sql">
SELECT  *
FROM    a
JOIN    b
ON      b.aid = a.id
ORDER BY
        a.id, b.id
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>parent</th>
<th>id</th>
<th>parent</th>
<th>aid</th>
</tr>
<tr>
<td class="integer">1</td>
<td class="integer">0</td>
<td class="integer">1</td>
<td class="integer">2</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="integer">0</td>
<td class="integer">2</td>
<td class="integer">3</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="integer">0</td>
<td class="integer">3</td>
<td class="integer">0</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="integer">4</td>
<td class="integer">4</td>
<td class="integer">5</td>
<td class="integer">2</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="integer">4</td>
<td class="integer">5</td>
<td class="integer">0</td>
<td class="integer">2</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="integer">4</td>
<td class="integer">6</td>
<td class="integer">4</td>
<td class="integer">2</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">1</td>
<td class="integer">7</td>
<td class="integer">8</td>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">1</td>
<td class="integer">8</td>
<td class="integer">9</td>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">1</td>
<td class="integer">9</td>
<td class="integer">0</td>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="integer">3</td>
<td class="integer">10</td>
<td class="integer">12</td>
<td class="integer">4</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="integer">3</td>
<td class="integer">11</td>
<td class="integer">0</td>
<td class="integer">4</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="integer">3</td>
<td class="integer">12</td>
<td class="integer">11</td>
<td class="integer">4</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="integer">2</td>
<td class="integer">13</td>
<td class="integer">0</td>
<td class="integer">5</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="integer">2</td>
<td class="integer">14</td>
<td class="integer">15</td>
<td class="integer">5</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="integer">2</td>
<td class="integer">15</td>
<td class="integer">13</td>
<td class="integer">5</td>
</tr>
<tr class="statusbar">
<td colspan="100">15 rows fetched in 0.0006s (0.0029s)</td>
</tr>
</table>
</div>
<p>There are <strong>5</strong> records in <code>a</code> and <strong>15</strong> records in <code>b</code> (<strong>3</strong> records per item).</p>
<p>Here&#8217;s the query:</p>
<pre class="brush: sql">
SELECT  b.id, b.parent, a.id AS aid, a.parent AS aparent
FROM    (
        SELECT  @a AS _a,
                @b AS _b,
                @b :=
                (
                SELECT  id
                FROM    b
                WHERE   b.aid = _a
                        AND b.parent = _b
                UNION ALL
                SELECT  b.id
                FROM    a
                JOIN    b
                ON      b.aid = a.id
                WHERE   b.parent = 0
                        AND a.parent = _a
                LIMIT 1
                ) AS bid,
                @b AS _nb,
                @a :=
                (
                SELECT  aid
                FROM    b
                WHERE   b.id = _nb
                ) AS aid
        FROM    (
                SELECT  @a := a.id, @b := 0
                FROM    a
                WHERE   parent = 0
                ) vars,
                a
        JOIN    b
        ON      b.aid = a.id
) q
JOIN    a
ON      a.id = aid
JOIN    b
ON      b.id = bid
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>parent</th>
<th>aid</th>
<th>aparent</th>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">0</td>
<td class="integer">1</td>
<td class="integer">0</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="integer">3</td>
<td class="integer">1</td>
<td class="integer">0</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="integer">2</td>
<td class="integer">1</td>
<td class="integer">0</td>
</tr>
<tr>
<td class="integer">9</td>
<td class="integer">0</td>
<td class="integer">3</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">8</td>
<td class="integer">9</td>
<td class="integer">3</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">7</td>
<td class="integer">8</td>
<td class="integer">3</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">11</td>
<td class="integer">0</td>
<td class="integer">4</td>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">12</td>
<td class="integer">11</td>
<td class="integer">4</td>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">10</td>
<td class="integer">12</td>
<td class="integer">4</td>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="integer">0</td>
<td class="integer">2</td>
<td class="integer">4</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="integer">5</td>
<td class="integer">2</td>
<td class="integer">4</td>
</tr>
<tr>
<td class="integer">6</td>
<td class="integer">4</td>
<td class="integer">2</td>
<td class="integer">4</td>
</tr>
<tr>
<td class="integer">13</td>
<td class="integer">0</td>
<td class="integer">5</td>
<td class="integer">2</td>
</tr>
<tr>
<td class="integer">15</td>
<td class="integer">13</td>
<td class="integer">5</td>
<td class="integer">2</td>
</tr>
<tr>
<td class="integer">14</td>
<td class="integer">15</td>
<td class="integer">5</td>
<td class="integer">2</td>
</tr>
<tr class="statusbar">
<td colspan="100">15 rows fetched in 0.0006s (0.0073s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">a</td>
<td class="varchar">index</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">ux_a_parent</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">5</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">15</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using join buffer</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">b</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">q.bid</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived6&gt;</td>
<td class="varchar">system</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">a</td>
<td class="varchar">index</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">5</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">b</td>
<td class="varchar">ref</td>
<td class="varchar">ux_b_aid_parent</td>
<td class="varchar">ux_b_aid_parent</td>
<td class="varchar">4</td>
<td class="varchar">20091229_linked.a.id</td>
<td class="bigint">7</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">6</td>
<td class="varchar">DERIVED</td>
<td class="varchar">a</td>
<td class="varchar">const</td>
<td class="varchar">ux_a_parent</td>
<td class="varchar">ux_a_parent</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">5</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">b</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">b</td>
<td class="varchar">eq_ref</td>
<td class="varchar">ux_b_aid_parent</td>
<td class="varchar">ux_b_aid_parent</td>
<td class="varchar">8</td>
<td class="varchar">func,func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DEPENDENT UNION</td>
<td class="varchar">a</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY,ux_a_parent</td>
<td class="varchar">ux_a_parent</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DEPENDENT UNION</td>
<td class="varchar">b</td>
<td class="varchar">eq_ref</td>
<td class="varchar">ux_b_aid_parent</td>
<td class="varchar">ux_b_aid_parent</td>
<td class="varchar">8</td>
<td class="varchar">20091229_linked.a.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union3,4&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;_a&#39; of SELECT #3 was resolved in SELECT #2
Field or reference &#39;_b&#39; of SELECT #3 was resolved in SELECT #2
Field or reference &#39;_a&#39; of SELECT #4 was resolved in SELECT #2
Field or reference &#39;_nb&#39; of SELECT #5 was resolved in SELECT #2
select `20091229_linked`.`b`.`id` AS `id`,`20091229_linked`.`b`.`parent` AS `parent`,`20091229_linked`.`a`.`id` AS `aid`,`20091229_linked`.`a`.`parent` AS `aparent` from (select (@a) AS `_a`,(@b) AS `_b`,(@b:=(select `20091229_linked`.`b`.`id` AS `id` from `20091229_linked`.`b` where ((`20091229_linked`.`b`.`aid` = `_a`) and (`20091229_linked`.`b`.`parent` = `_b`)) union all select `20091229_linked`.`b`.`id` AS `id` from `20091229_linked`.`a` join `20091229_linked`.`b` where () limit 1)) AS `bid`,(@b) AS `_nb`,(@a:=(select `20091229_linked`.`b`.`aid` AS `aid` from `20091229_linked`.`b` where (`20091229_linked`.`b`.`id` = `_nb`))) AS `aid` from (select (@a:=&#39;1&#39;) AS `@a := a.id`,(@b:=0) AS `@b := 0` from `20091229_linked`.`a` where 1) `vars` join `20091229_linked`.`a` join `20091229_linked`.`b` where (`20091229_linked`.`b`.`aid` = `20091229_linked`.`a`.`id`)) `q` join `20091229_linked`.`a` join `20091229_linked`.`b` where ((`20091229_linked`.`b`.`id` = `q`.`bid`) and (`20091229_linked`.`a`.`id` = `q`.`aid`))
</pre>
<p>This returns all rows in correct order.</p>
<p>Note that I used aliases (<code>_a</code>, <code>_b</code> and <code>_nb</code>) instead of variables in the subqueries. This helps <strong>MySQL</strong> to pick correct execution plans and use the indexes: a predicate containing a variable is an <code>UNCACHEABLE SUBQUERY</code>, while a predicate with an alias is a <code>DEPENDENT SUBQUERY</code> which is optimized much better.</p>
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2009/12/29/linked-lists-in-mysql-multiple-ordering/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MySQL: Selecting records holding group-wise maximum (resolving ties)</title>
		<link>http://explainextended.com/2009/11/25/mysql-selecting-records-holding-group-wise-maximum-resolving-ties/</link>
		<comments>http://explainextended.com/2009/11/25/mysql-selecting-records-holding-group-wise-maximum-resolving-ties/#comments</comments>
		<pubDate>Wed, 25 Nov 2009 20:00:46 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=3759</guid>
		<description><![CDATA[Continuing the series on selecting records holding group-wise maximums.
The approach shown in the previous article is quite efficient, however, it can only be relied upon if the column being maximized is unique.
If the column is not unique (even along with the grouping column), the ties are possible: multiple records can satisfy the condition of holding [...]]]></description>
			<content:encoded><![CDATA[<p>Continuing the series on <a href="/2009/11/24/mysql-selecting-records-holding-group-wise-maximum-on-a-unique-column/">selecting records holding group-wise maximums</a>.</p>
<p>The approach shown in the previous article is quite efficient; however, it can only be relied upon if the column being maximized is unique.</p>
<p>If the column is not unique (even along with the grouping column), ties are possible: multiple records can satisfy the condition of <q>holding a group-wise maximum</q>.</p>
<p>In this article, I&#8217;ll show how to resolve the ties.</p>
<p>Let&#8217;s recreate the table we used in the previous article:<br />
<span id="more-3759"></span><br />
<a href="#" onclick="xcollapse('X1682');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X1682" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_distinct (
      id INT NOT NULL PRIMARY KEY,
      orderer INT NOT NULL,
      glow INT NOT NULL,
      ghigh INT NOT NULL,
      stuffing VARCHAR(200) NOT NULL,
      KEY ix_distinct_glow_id (glow, id),
      KEY ix_distinct_ghigh_id (ghigh, id),
      KEY ix_distinct_glow_orderer (glow, orderer),
      KEY ix_distinct_ghigh_orderer (ghigh, orderer)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_distinct (id, orderer, glow, ghigh, stuffing)
SELECT  id, FLOOR(RAND(20091125) * 9) + 1,
        (id - 1) % 10 + 1,
        (id - 1) % 10000 + 1,
        LPAD(&#039;&#039;, 200, &#039;*&#039;)
FROM    filler;
</pre>
</div>
<p>This table has <strong>1,000,000</strong> records:</p>
<ul>
<li><code>id</code> is the <code>PRIMARY KEY</code></li>
<li><code>orderer</code> is filled with random values from <strong>1</strong> to <strong>9</strong></li>
<li><code>glow</code> is a low cardinality grouping field (<strong>10</strong> distinct values)</li>
<li><code>ghigh</code> is a high cardinality grouping field (<strong>10,000</strong> distinct values)</li>
<li><code>stuffing</code> is an asterisk-filled <code>VARCHAR(200)</code> column added to emulate payload of the actual tables</li>
</ul>
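<p>As a quick sanity check of this description, the distribution can be inspected with a query like the one below. This is only a sketch: it assumes the table was filled by the script above, and the column aliases are made up for illustration.</p>
<pre class="brush: sql">
-- Summarize the generated data (the aliases are illustrative)
SELECT  COUNT(*) AS total_rows,
        COUNT(DISTINCT glow) AS glow_values,
        COUNT(DISTINCT ghigh) AS ghigh_values,
        MIN(orderer) AS min_orderer,
        MAX(orderer) AS max_orderer
FROM    t_distinct;
</pre>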
<p>Now, let&#8217;s see how we deal with the different methods to resolve ties:</p>
<h3>Returning all ties</h3>
<p>Some tasks require all ties to be returned (that is, we need to return every record that holds the group-wise maximum or minimum).</p>
<p>This is simple and requires just a small modification to the query we used before. Instead of joining on the <code>id</code>, we should join on the grouping column and the <code>MAX</code> or <code>MIN</code> value returned by the grouping query. This guarantees that every tie will be returned.</p>
<p>Here&#8217;s the query to return all group-wise maximums with ties for a high cardinality grouper:</p>
<pre class="brush: sql">
SELECT  COUNT(*) AS cnt, SUM(id) AS idsum
FROM    (
        SELECT  ghigh, MAX(orderer) AS orderer
        FROM    t_distinct d
        GROUP BY
                ghigh
        ) dd
JOIN    t_distinct di
ON      di.ghigh = dd.ghigh
        AND di.orderer = dd.orderer
</pre>
<p><a href="#" onclick="xcollapse('X7623');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X7623" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>cnt</th>
<th>idsum</th>
</tr>
<tr>
<td class="bigint">111431</td>
<td class="decimal">55649850278</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.4429s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">di</td>
<td class="varchar">ref</td>
<td class="varchar">ix_distinct_ghigh_id,ix_distinct_ghigh_orderer</td>
<td class="varchar">ix_distinct_ghigh_orderer</td>
<td class="varchar">8</td>
<td class="varchar">dd.ghigh,dd.orderer</td>
<td class="bigint">5</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">d</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_distinct_ghigh_orderer</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">19609</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
select count(0) AS `cnt`,sum(`20091125_groupwise`.`di`.`id`) AS `idsum` from (select `20091125_groupwise`.`d`.`ghigh` AS `ghigh`,max(`20091125_groupwise`.`d`.`orderer`) AS `orderer` from `20091125_groupwise`.`t_distinct` `d` group by `20091125_groupwise`.`d`.`ghigh`) `dd` join `20091125_groupwise`.`t_distinct` `di` where ((`20091125_groupwise`.`di`.`orderer` = `dd`.`orderer`) and (`20091125_groupwise`.`di`.`ghigh` = `dd`.`ghigh`))
</pre>
</div>
<p>And here is the same query for the low cardinality grouper:</p>
<pre class="brush: sql">
SELECT  COUNT(*) AS cnt, SUM(id) AS idsum
FROM    (
        SELECT  glow, MAX(orderer) AS orderer
        FROM    t_distinct d
        GROUP BY
                glow
        ) dd
JOIN    t_distinct di
ON      di.glow = dd.glow
        AND di.orderer = dd.orderer
</pre>
<p><a href="#" onclick="xcollapse('X1241');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X1241" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>cnt</th>
<th>idsum</th>
</tr>
<tr>
<td class="bigint">111431</td>
<td class="decimal">55649850278</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0002s (0.1075s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">di</td>
<td class="varchar">ref</td>
<td class="varchar">ix_distinct_glow_id,ix_distinct_glow_orderer</td>
<td class="varchar">ix_distinct_glow_orderer</td>
<td class="varchar">8</td>
<td class="varchar">dd.glow,dd.orderer</td>
<td class="bigint">3289</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">d</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_distinct_glow_orderer</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">19</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
select count(0) AS `cnt`,sum(`20091125_groupwise`.`di`.`id`) AS `idsum` from (select `20091125_groupwise`.`d`.`glow` AS `glow`,max(`20091125_groupwise`.`d`.`orderer`) AS `orderer` from `20091125_groupwise`.`t_distinct` `d` group by `20091125_groupwise`.`d`.`glow`) `dd` join `20091125_groupwise`.`t_distinct` `di` where ((`20091125_groupwise`.`di`.`orderer` = `dd`.`orderer`) and (`20091125_groupwise`.`di`.`glow` = `dd`.`glow`))
</pre>
</div>
<p>The final results are aggregated to avoid the post bloating, but the principle is the same.</p>
<p>We see that the queries return the same results but have different execution times.</p>
<p>Due to the way the data are distributed in the table, each grouper value (both for <code>ghigh</code> and <code>glow</code>) has at least one record with <code>orderer = 1</code>. The group-wise minimum for each group is, therefore, <strong>1</strong>, and the queries above, in fact, just return all records with <code>orderer = 1</code>.</p>
<p><strong>MySQL</strong> performs all joins using nested loops.</p>
<p>To return the values for the high cardinality grouper (one with many distinct values in the table), it has to do <strong>10,000</strong> loose index scans to find the group-wise minimums, each followed by a tight index scan (<strong>10,000</strong> in total) returning all records with <code>orderer = 1</code> (about <strong>11</strong> records per scan on average). This results in <strong>111,431</strong> records being returned.</p>
<p>The query against the low cardinality grouper does the same, but it performs only <strong>10</strong> loose index scans to find the minimums and <strong>10</strong> tight index scans (one per group) to find the records that hold them, returning about <strong>11,111</strong> records per group.</p>
<p>Tight index scans are sequential, while loose index scans require searching the <strong>B-Tree</strong>. Finding lots of sequential values in few groups is much faster than finding few sequential values in lots of groups, though both algorithms ultimately return the same records.</p>
<p>That&#8217;s why the query against the high cardinality grouper (lots of groups, few records per group) takes almost <strong>450 ms</strong>, while the query against the low cardinality grouper (few groups, lots of records per group) takes as little as <strong>107 ms</strong>, which is about <strong>4</strong> times faster.</p>
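<p>To check whether the grouping step really uses a loose index scan, the derived query can be explained on its own. This is a sketch which assumes that the indexes from the table creation script are in place:</p>
<pre class="brush: sql">
-- The grouping subquery from the queries above, explained separately
EXPLAIN
SELECT  ghigh, MAX(orderer) AS orderer
FROM    t_distinct
GROUP BY
        ghigh;
</pre>
<p>If the loose index scan is used, the <code>Extra</code> column should read <code>Using index for group-by</code>, as it does in the plans above.</p>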
<h3>Returning arbitrary record</h3>
<p>On some occasions, we are not interested in the order of records within the tie: any single record would do as long as it holds the group-wise minimum or maximum.</p>
<p>In that case, we need to return an arbitrary record holding the group-wise maximum.</p>
<p>To do this, we can use the <a href="http://dev.mysql.com/doc/refman/5.1/en/group-by-hidden-columns.html"><strong>MySQL</strong>&#8217;s extension</a> which allows using non-grouped and non-aggregated columns in a <code>GROUP BY</code> query.</p>
<p>When used this way, the expression will return an <em>arbitrary</em> value taken from any record within the group. This was designed to simplify <code>GROUP BY</code> queries applied to joins, when both joining and grouping are done on the <code>PRIMARY KEY</code> of a table participating in the query. The join will duplicate the rows from the table and the <code>GROUP BY</code> will split them across the groups according to the value of the <code>PRIMARY KEY</code>. Since the <code>PRIMARY KEY</code> uniquely defines a row, all columns of the table are guaranteed to have the same values across the group, and there is no need to select any special one: any will do.</p>
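<p>A minimal sketch of the kind of query this extension was meant for is shown below. The tables <code>customer</code> and <code>purchase</code> and their columns are hypothetical and are not part of the test schema:</p>
<pre class="brush: sql">
-- Hypothetical schema: customer (id PRIMARY KEY, name), purchase (id, customer_id)
SELECT  c.id, c.name, COUNT(*) AS purchases
FROM    customer c
JOIN    purchase p
ON      p.customer_id = c.id
GROUP BY
        c.id;
</pre>
<p>Here <code>c.name</code> is neither grouped nor aggregated, but since <code>c.id</code> is the <code>PRIMARY KEY</code>, every row within a group carries the same <code>name</code>.</p>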
<p>This could be applied to our task as well:</p>
<pre class="brush: sql">
SELECT  COUNT(*), SUM(id), MAX(orderer), MIN(orderer)
FROM    (
        SELECT  di.*
        FROM    (
                SELECT  ghigh, MIN(orderer) AS orderer
                FROM    t_distinct d
                GROUP BY
                        ghigh
                ) dd
        JOIN    t_distinct di
        ON      di.ghigh = dd.ghigh
                AND di.orderer = dd.orderer
        GROUP BY
                di.ghigh, di.orderer
        ) q
</pre>
<p><a href="#" onclick="xcollapse('X5695');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X5695" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
<th>SUM(id)</th>
<th>MAX(orderer)</th>
<th>MIN(orderer)</th>
</tr>
<tr>
<td class="bigint">10000</td>
<td class="decimal">850675000</td>
<td class="integer">1</td>
<td class="integer">1</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0002s (0.5600s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10000</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">di</td>
<td class="varchar">ref</td>
<td class="varchar">ix_distinct_ghigh_id,ix_distinct_ghigh_orderer</td>
<td class="varchar">ix_distinct_ghigh_orderer</td>
<td class="varchar">8</td>
<td class="varchar">dd.ghigh,dd.orderer</td>
<td class="bigint">5</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar">d</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_distinct_ghigh_orderer</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">19609</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
select count(0) AS `COUNT(*)`,sum(`q`.`id`) AS `SUM(id)`,max(`q`.`orderer`) AS `MAX(orderer)`,min(`q`.`orderer`) AS `MIN(orderer)` from (select `20091125_groupwise`.`di`.`id` AS `id`,`20091125_groupwise`.`di`.`orderer` AS `orderer`,`20091125_groupwise`.`di`.`glow` AS `glow`,`20091125_groupwise`.`di`.`ghigh` AS `ghigh`,`20091125_groupwise`.`di`.`stuffing` AS `stuffing` from (select `20091125_groupwise`.`d`.`ghigh` AS `ghigh`,min(`20091125_groupwise`.`d`.`orderer`) AS `orderer` from `20091125_groupwise`.`t_distinct` `d` group by `20091125_groupwise`.`d`.`ghigh`) `dd` join `20091125_groupwise`.`t_distinct` `di` where ((`20091125_groupwise`.`di`.`orderer` = `dd`.`orderer`) and (`20091125_groupwise`.`di`.`ghigh` = `dd`.`ghigh`)) group by `20091125_groupwise`.`di`.`ghigh`,`20091125_groupwise`.`di`.`orderer`) `q`
</pre>
</div>
<p>This solution is quite efficient for the high-cardinality groupers, but has a serious drawback.</p>
<p>The expressions returned by the <code>GROUP BY</code> extension are not guaranteed to originate from one record.</p>
<p>Imagine this query:</p>
<pre class="brush: sql">
SELECT  grouper, column1, column2
FROM    mytable
GROUP BY
         grouper
</pre>
<p>against this table:</p>
<table class="excel">
<tr>
<th>grouper</th>
<th>column1</th>
<th>column2</th>
</tr>
<tr>
<td>1</td>
<td>10</td>
<td>100</td>
</tr>
<tr>
<td>1</td>
<td>20</td>
<td>200</td>
</tr>
</table>
<p>This query is free to return <em>any</em> value of <code>column1</code> and <code>column2</code> within the group.</p>
<p>That means that this resultset:</p>
<div class="terminal">
<table class="terminal">
<tr>
<th>grouper</th>
<th>column1</th>
<th>column2</th>
</tr>
<tr>
<td>1</td>
<td>10</td>
<td>200</td>
</tr>
</table>
</div>
<p>would be a correct output for this query, with the values of <code>column1</code> and <code>column2</code> being taken from two different records (within a single group).</p>
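<p>To experiment with this, the sample table can be recreated as follows. This is a sketch: the column types are assumed, and the query relies on the extension being enabled, that is on <code>ONLY_FULL_GROUP_BY</code> not being set in <code>sql_mode</code>:</p>
<pre class="brush: sql">
-- Column types are an assumption; the values are those from the table above
CREATE TABLE mytable (
        grouper INT NOT NULL,
        column1 INT NOT NULL,
        column2 INT NOT NULL
);

INSERT
INTO    mytable (grouper, column1, column2)
VALUES  (1, 10, 100),
        (1, 20, 200);

-- column1 and column2 are neither grouped nor aggregated
SELECT  grouper, column1, column2
FROM    mytable
GROUP BY
        grouper;
</pre>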
<p>Though in reality <strong>MySQL</strong> returns the values from a single record, it&#8217;s a bad practice to rely on this behavior which can be changed with any new version or even a bugfix patch.</p>
<p>However, we can use a cleaner version of the query which would do just the same.</p>
<p>Instead of grouping on the non-primary key values, we can just select the primary key of an arbitrary record and join the table back on it.</p>
<p>To select a primary key of an arbitrary record we can just use a subquery with a <code>LIMIT 1</code> applied to it.</p>
<p>Here&#8217;s the query for the high cardinality grouper:</p>
<pre class="brush: sql">
SELECT  COUNT(*), SUM(id), MAX(orderer), MIN(orderer)
FROM    (
        SELECT  di.*
        FROM    (
                SELECT  ghigh, MIN(orderer) AS orderer
                FROM    t_distinct d
                GROUP BY
                        ghigh
                ) dd
        JOIN    t_distinct di
        ON      di.id =
                (
                SELECT  id
                FROM    t_distinct ds
                WHERE   ds.ghigh = dd.ghigh
                        AND ds.orderer = dd.orderer
                LIMIT 1
                )
        ) q
</pre>
<p><a href="#" onclick="xcollapse('X3112');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X3112" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
<th>SUM(id)</th>
<th>MAX(orderer)</th>
<th>MIN(orderer)</th>
</tr>
<tr>
<td class="bigint">10000</td>
<td class="decimal">850675000</td>
<td class="integer">1</td>
<td class="integer">1</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.5781s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">di</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ds</td>
<td class="varchar">ref</td>
<td class="varchar">ix_distinct_ghigh_id,ix_distinct_ghigh_orderer</td>
<td class="varchar">ix_distinct_ghigh_orderer</td>
<td class="varchar">8</td>
<td class="varchar">dd.ghigh,dd.orderer</td>
<td class="bigint">5</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar">d</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_distinct_ghigh_orderer</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">19609</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;dd.ghigh&#39; of SELECT #4 was resolved in SELECT #2
Field or reference &#39;dd.orderer&#39; of SELECT #4 was resolved in SELECT #2
select count(0) AS `COUNT(*)`,sum(`q`.`id`) AS `SUM(id)`,max(`q`.`orderer`) AS `MAX(orderer)`,min(`q`.`orderer`) AS `MIN(orderer)` from (select `20091125_groupwise`.`di`.`id` AS `id`,`20091125_groupwise`.`di`.`orderer` AS `orderer`,`20091125_groupwise`.`di`.`glow` AS `glow`,`20091125_groupwise`.`di`.`ghigh` AS `ghigh`,`20091125_groupwise`.`di`.`stuffing` AS `stuffing` from (select `20091125_groupwise`.`d`.`ghigh` AS `ghigh`,min(`20091125_groupwise`.`d`.`orderer`) AS `orderer` from `20091125_groupwise`.`t_distinct` `d` group by `20091125_groupwise`.`d`.`ghigh`) `dd` join `20091125_groupwise`.`t_distinct` `di` where (`20091125_groupwise`.`di`.`id` = (select `20091125_groupwise`.`ds`.`id` AS `id` from `20091125_groupwise`.`t_distinct` `ds` where ((`20091125_groupwise`.`ds`.`ghigh` = `dd`.`ghigh`) and (`20091125_groupwise`.`ds`.`orderer` = `dd`.`orderer`)) limit 1))) `q`
</pre>
</div>
<p>and the same query for the low cardinality grouper:</p>
<pre class="brush: sql">
SELECT  di.id, di.orderer, di.ghigh, di.glow
FROM    (
        SELECT  glow, MIN(orderer) AS orderer
        FROM    t_distinct d
        GROUP BY
                glow
        ) dd
JOIN    t_distinct di
ON      di.id =
        (
        SELECT  id
        FROM    t_distinct ds
        WHERE   ds.glow = dd.glow
                AND ds.orderer = dd.orderer
        LIMIT 1
        )
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>orderer</th>
<th>ghigh</th>
<th>glow</th>
</tr>
<tr>
<td class="integer">11</td>
<td class="integer">1</td>
<td class="integer">11</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">382</td>
<td class="integer">1</td>
<td class="integer">382</td>
<td class="integer">2</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">1</td>
<td class="integer">3</td>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">44</td>
<td class="integer">1</td>
<td class="integer">44</td>
<td class="integer">4</td>
</tr>
<tr>
<td class="integer">55</td>
<td class="integer">1</td>
<td class="integer">55</td>
<td class="integer">5</td>
</tr>
<tr>
<td class="integer">26</td>
<td class="integer">1</td>
<td class="integer">26</td>
<td class="integer">6</td>
</tr>
<tr>
<td class="integer">7</td>
<td class="integer">1</td>
<td class="integer">7</td>
<td class="integer">7</td>
</tr>
<tr>
<td class="integer">118</td>
<td class="integer">1</td>
<td class="integer">118</td>
<td class="integer">8</td>
</tr>
<tr>
<td class="integer">19</td>
<td class="integer">1</td>
<td class="integer">19</td>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">150</td>
<td class="integer">1</td>
<td class="integer">150</td>
<td class="integer">10</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0004s (0.0032s)</td>
</tr>
</table>
</div>
<p><a href="#" onclick="xcollapse('X1062');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X1062" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">di</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ds</td>
<td class="varchar">ref</td>
<td class="varchar">ix_distinct_glow_id,ix_distinct_glow_orderer</td>
<td class="varchar">ix_distinct_glow_orderer</td>
<td class="varchar">8</td>
<td class="varchar">dd.glow,dd.orderer</td>
<td class="bigint">3289</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">d</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_distinct_glow_orderer</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">19</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;dd.glow&#39; of SELECT #3 was resolved in SELECT #1
Field or reference &#39;dd.orderer&#39; of SELECT #3 was resolved in SELECT #1
select `20091125_groupwise`.`di`.`id` AS `id`,`20091125_groupwise`.`di`.`orderer` AS `orderer`,`20091125_groupwise`.`di`.`ghigh` AS `ghigh`,`20091125_groupwise`.`di`.`glow` AS `glow` from (select `20091125_groupwise`.`d`.`glow` AS `glow`,min(`20091125_groupwise`.`d`.`orderer`) AS `orderer` from `20091125_groupwise`.`t_distinct` `d` group by `20091125_groupwise`.`d`.`glow`) `dd` join `20091125_groupwise`.`t_distinct` `di` where (`20091125_groupwise`.`di`.`id` = (select `20091125_groupwise`.`ds`.`id` AS `id` from `20091125_groupwise`.`t_distinct` `ds` where ((`20091125_groupwise`.`ds`.`glow` = `dd`.`glow`) and (`20091125_groupwise`.`ds`.`orderer` = `dd`.`orderer`)) limit 1))
</pre>
</div>
<p>The low cardinality query returns far fewer rows and is almost instant (<strong>3 ms</strong>).</p>
<p>The high cardinality query requires more index lookups than the previous <code>GROUP BY</code> version (to find the group-wise minimums; to find the <code>id</code> of an arbitrary record; to join the table back on this <code>id</code>). However, it does not use <code>filesort</code>.</p>
<p>This makes it just a little bit less efficient than the <code>GROUP BY</code> query (<strong>578 ms</strong> vs <strong>560 ms</strong>), but it is guaranteed to return correct results and, hence, should be used instead of the <code>GROUP BY</code> version.</p>
<h3>Returning a record with maximum <code>id</code></h3>
<p>Finally, let&#8217;s make a query that would resolve the ties by returning the record with the maximum or the minimum <code>id</code> (or another unique column).</p>
<p>In this case we cannot get away with only using a <code>MIN</code> or a <code>MAX</code>, since we do need to sort on a composite condition, and <code>MAX</code> and <code>MIN</code> only accept and return a single value.</p>
<p>To do this, we need to use a subquery returning an <code>id</code>, just like we did in the previous case. The only thing we need to change is to add a correct <code>ORDER BY</code> condition to this subquery.</p>
<p>To make it more interesting, let&#8217;s create a query that would return the records holding the group-wise <strong>minimums</strong> of <code>orderer</code>, but would resolve ties by selecting a <strong>maximal</strong> <code>id</code>.</p>
<pre class="brush: sql">
SELECT  di.id, di.orderer, di.ghigh, di.glow
FROM    (
        SELECT  glow, MIN(orderer) AS orderer
        FROM    t_distinct d
        GROUP BY
                glow
        ) dd
JOIN    t_distinct di
ON      di.id =
        (
        SELECT  id
        FROM    t_distinct ds
        WHERE   ds.glow = dd.glow
                AND ds.orderer = dd.orderer
        ORDER BY
                id DESC
        LIMIT 1
        )
</pre>
<p><a href="#" onclick="xcollapse('X10260');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X10260" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>orderer</th>
<th>ghigh</th>
<th>glow</th>
</tr>
<tr>
<td class="integer">999881</td>
<td class="integer">1</td>
<td class="integer">9881</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">999872</td>
<td class="integer">1</td>
<td class="integer">9872</td>
<td class="integer">2</td>
</tr>
<tr>
<td class="integer">999853</td>
<td class="integer">1</td>
<td class="integer">9853</td>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">999854</td>
<td class="integer">1</td>
<td class="integer">9854</td>
<td class="integer">4</td>
</tr>
<tr>
<td class="integer">999785</td>
<td class="integer">1</td>
<td class="integer">9785</td>
<td class="integer">5</td>
</tr>
<tr>
<td class="integer">999986</td>
<td class="integer">1</td>
<td class="integer">9986</td>
<td class="integer">6</td>
</tr>
<tr>
<td class="integer">999997</td>
<td class="integer">1</td>
<td class="integer">9997</td>
<td class="integer">7</td>
</tr>
<tr>
<td class="integer">999928</td>
<td class="integer">1</td>
<td class="integer">9928</td>
<td class="integer">8</td>
</tr>
<tr>
<td class="integer">999999</td>
<td class="integer">1</td>
<td class="integer">9999</td>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">999950</td>
<td class="integer">1</td>
<td class="integer">9950</td>
<td class="integer">10</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0004s (0.3807s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">di</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ds</td>
<td class="varchar">ref</td>
<td class="varchar">ix_distinct_glow_id,ix_distinct_glow_orderer</td>
<td class="varchar">ix_distinct_glow_orderer</td>
<td class="varchar">8</td>
<td class="varchar">dd.glow,dd.orderer</td>
<td class="bigint">3289</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index; Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">d</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_distinct_glow_orderer</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">19</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;dd.glow&#39; of SELECT #3 was resolved in SELECT #1
Field or reference &#39;dd.orderer&#39; of SELECT #3 was resolved in SELECT #1
select `20091125_groupwise`.`di`.`id` AS `id`,`20091125_groupwise`.`di`.`orderer` AS `orderer`,`20091125_groupwise`.`di`.`ghigh` AS `ghigh`,`20091125_groupwise`.`di`.`glow` AS `glow` from (select `20091125_groupwise`.`d`.`glow` AS `glow`,min(`20091125_groupwise`.`d`.`orderer`) AS `orderer` from `20091125_groupwise`.`t_distinct` `d` group by `20091125_groupwise`.`d`.`glow`) `dd` join `20091125_groupwise`.`t_distinct` `di` where (`20091125_groupwise`.`di`.`id` = (select `20091125_groupwise`.`ds`.`id` AS `id` from `20091125_groupwise`.`t_distinct` `ds` where ((`20091125_groupwise`.`ds`.`glow` = `dd`.`glow`) and (`20091125_groupwise`.`ds`.`orderer` = `dd`.`orderer`)) order by `20091125_groupwise`.`ds`.`id` desc limit 1))
</pre>
</div>
<p>This works, but is quite inefficient (<strong>380 ms</strong>).</p>
<p>This is due to a flaw in <strong>MySQL</strong>&#8217;s optimizer.</p>
<p>As you can see, the subquery has a filtering equality condition on two columns (<code>glow</code> and <code>orderer</code>) and an ordering condition on the third column, <code>id</code>.</p>
<p>There is an index that covers all three columns: <code>ix_distinct_glow_orderer (glow, orderer)</code>. The first two columns are explicitly provided in the index definition, and the third column, <code>id</code>, being a <code>PRIMARY KEY</code>, is implicitly included by the <strong>InnoDB</strong> engine as a row pointer.</p>
<p>However, <strong>MySQL</strong> is able to use this index for ordering in a plain query but not in a subquery. It does use the index for filtering but orders using a <code>filesort</code>.</p>
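<p>For comparison, here is a sketch of the same lookup written as a plain query. The literal values <code>1</code> and <code>1</code> are an assumption made for illustration only: they stand in for <code>dd.glow</code> and <code>dd.orderer</code>. Per the above, its plan would be expected to show no <code>Using filesort</code>:</p>
<pre class="brush: sql">
-- Plain (non-correlated) version of the subquery; the constants are illustrative
EXPLAIN
SELECT  id
FROM    t_distinct ds
WHERE   ds.glow = 1
        AND ds.orderer = 1
ORDER BY
        id DESC
LIMIT 1
</pre>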
<p>To work around this, we need to modify the ordering condition so that it would include <strong>all three</strong> columns constituting the index, all in the same direction:</p>
<pre class="brush: sql">
SELECT  di.id, di.orderer, di.ghigh, di.glow
FROM    (
        SELECT  glow, MIN(orderer) AS orderer
        FROM    t_distinct d
        GROUP BY
                glow
        ) dd
JOIN    t_distinct di
ON      di.id =
        (
        SELECT  id
        FROM    t_distinct ds
        WHERE   ds.glow = dd.glow
                AND ds.orderer = dd.orderer
        ORDER BY
                glow DESC, orderer DESC, id DESC
        LIMIT 1
        )
</pre>
<p><a href="#" onclick="xcollapse('X5648');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X5648" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>orderer</th>
<th>ghigh</th>
<th>glow</th>
</tr>
<tr>
<td class="integer">999881</td>
<td class="integer">1</td>
<td class="integer">9881</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">999872</td>
<td class="integer">1</td>
<td class="integer">9872</td>
<td class="integer">2</td>
</tr>
<tr>
<td class="integer">999853</td>
<td class="integer">1</td>
<td class="integer">9853</td>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">999854</td>
<td class="integer">1</td>
<td class="integer">9854</td>
<td class="integer">4</td>
</tr>
<tr>
<td class="integer">999785</td>
<td class="integer">1</td>
<td class="integer">9785</td>
<td class="integer">5</td>
</tr>
<tr>
<td class="integer">999986</td>
<td class="integer">1</td>
<td class="integer">9986</td>
<td class="integer">6</td>
</tr>
<tr>
<td class="integer">999997</td>
<td class="integer">1</td>
<td class="integer">9997</td>
<td class="integer">7</td>
</tr>
<tr>
<td class="integer">999928</td>
<td class="integer">1</td>
<td class="integer">9928</td>
<td class="integer">8</td>
</tr>
<tr>
<td class="integer">999999</td>
<td class="integer">1</td>
<td class="integer">9999</td>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">999950</td>
<td class="integer">1</td>
<td class="integer">9950</td>
<td class="integer">10</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0004s (0.0033s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">di</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">func</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">ds</td>
<td class="varchar">ref</td>
<td class="varchar">ix_distinct_glow_id,ix_distinct_glow_orderer</td>
<td class="varchar">ix_distinct_glow_orderer</td>
<td class="varchar">8</td>
<td class="varchar">dd.glow,dd.orderer</td>
<td class="bigint">3289</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">d</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_distinct_glow_orderer</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">19</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;dd.glow&#39; of SELECT #3 was resolved in SELECT #1
Field or reference &#39;dd.orderer&#39; of SELECT #3 was resolved in SELECT #1
select `20091125_groupwise`.`di`.`id` AS `id`,`20091125_groupwise`.`di`.`orderer` AS `orderer`,`20091125_groupwise`.`di`.`ghigh` AS `ghigh`,`20091125_groupwise`.`di`.`glow` AS `glow` from (select `20091125_groupwise`.`d`.`glow` AS `glow`,min(`20091125_groupwise`.`d`.`orderer`) AS `orderer` from `20091125_groupwise`.`t_distinct` `d` group by `20091125_groupwise`.`d`.`glow`) `dd` join `20091125_groupwise`.`t_distinct` `di` where (`20091125_groupwise`.`di`.`id` = (select `20091125_groupwise`.`ds`.`id` AS `id` from `20091125_groupwise`.`t_distinct` `ds` where ((`20091125_groupwise`.`ds`.`glow` = `dd`.`glow`) and (`20091125_groupwise`.`ds`.`orderer` = `dd`.`orderer`)) order by `20091125_groupwise`.`ds`.`glow` desc,`20091125_groupwise`.`ds`.`orderer` desc,`20091125_groupwise`.`ds`.`id` desc limit 1))
</pre>
</div>
<p>Now there is no <code>filesort</code> in the <code>DEPENDENT SUBQUERY</code>, and the whole query completes in only <strong>3 ms</strong> (<strong>100</strong> times as fast).</p>
<p>The use of <code>orderer DESC</code> in the ordering condition of the subquery may seem counterintuitive, given that we need to return the group-wise minimum, not the maximum. However, this is only a workaround to make the ordering condition eligible for the index. The actual value of <code>orderer</code> used in the subquery is defined by the value of <code>MAX</code> or <code>MIN</code> returned by the outer query, and the records are filtered on it anyway. Logically, <code>orderer ASC</code>, <code>orderer DESC</code> and even the absence of <code>orderer</code> in the <code>ORDER BY</code> are the same. The full list is only required to make <strong>MySQL</strong> use the index: to do this we need to sort on all columns constituting the index, even though the first two ordering conditions are made constant by the filter and could otherwise be omitted.</p>
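<p>By the same reasoning, a variant that resolves ties by the <strong>minimal</strong> <code>id</code> would sort all three columns in ascending order. This is a sketch that has not been timed here; it assumes the optimizer treats the ascending order the same way as the descending one:</p>
<pre class="brush: sql">
SELECT  di.id, di.orderer, di.ghigh, di.glow
FROM    (
        SELECT  glow, MIN(orderer) AS orderer
        FROM    t_distinct d
        GROUP BY
                glow
        ) dd
JOIN    t_distinct di
ON      di.id =
        (
        SELECT  id
        FROM    t_distinct ds
        WHERE   ds.glow = dd.glow
                AND ds.orderer = dd.orderer
        -- all three index columns, all in the same (ascending) direction
        ORDER BY
                glow, orderer, id
        LIMIT 1
        )
</pre>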
<p><strong>To be continued.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2009/11/25/mysql-selecting-records-holding-group-wise-maximum-resolving-ties/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MySQL: Selecting records holding group-wise maximum (on a unique column)</title>
		<link>http://explainextended.com/2009/11/24/mysql-selecting-records-holding-group-wise-maximum-on-a-unique-column/</link>
		<comments>http://explainextended.com/2009/11/24/mysql-selecting-records-holding-group-wise-maximum-on-a-unique-column/#comments</comments>
		<pubDate>Tue, 24 Nov 2009 20:00:51 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=3736</guid>
		<description><![CDATA[Answering questions asked on the site.
Juan asks:
Regarding this question:
I would like to know what database is the most efficient when performing this type of query and if the solutions proposed are already the most efficient.
I am using MySQL but I would like to know if PostgreSQL, Oracle, Microsoft SQL Server or DB2 are much more [...]]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Juan</strong> asks:</p>
<blockquote><p>Regarding <a href="http://stackoverflow.com/questions/612231/sql-select-rows-with-maxcolumn-value-distinct-by-another-column">this question</a>:</p>
<p>I would like to know what database is the most efficient when performing this type of query and if the solutions proposed are already the most efficient.</p>
<p>I am using <strong>MySQL</strong> but I would like to know if <strong>PostgreSQL</strong>, <strong>Oracle</strong>, <strong>Microsoft SQL Server</strong> or <strong>DB2</strong> are much more efficient.
</p></blockquote>
<p>A nice question: this is a common problem for most developers.</p>
<p>For those too lazy to follow the link, the problem in one sentence:</p>
<blockquote><p>How do I select the <em>whole</em> records, grouped on <code>grouper</code> and holding a group-wise maximum (or minimum) on another column?</p></blockquote>
<p>I&#8217;ve already covered this in passing in various previous articles in my blog, and now I&#8217;ll try to make a little summary.</p>
<p>According to the <strong>SQL</strong> standards, every expression used in the <code>SELECT</code> list of a query that deals with <code>GROUP BY</code> should be either grouped by (that is, used in the <code>GROUP BY</code>) or aggregated.</p>
<p>The reason behind this is quite logical: a grouping query takes the records that have something in common and shrinks them, making a single record out of the whole set constituting the group.</p>
<p>The grouping expressions, of course, share their values across all the records constituting the group (since they <strong>are</strong> what makes the group a group), and hence can be used as is.</p>
<p>Other expressions can vary, and therefore some algorithm should be applied to shrink all existing values into a single value. This is what an aggregate function does.</p>
<p>The problem is that the aggregate functions are not interdependent: you cannot use the result of one function as an input to another one. Hence, you cannot select a <strong>record</strong> in an aggregate query: you can only select a tuple of aggregated values. It&#8217;s possible that some of these aggregated values cannot even be found among the values of the original records. Say, with all values of <code>column</code> being strictly positive and more than one record in the group, you can never find a record holding a <code>SUM(column)</code>: it will be greater than any of the values of <code>column</code>.</p>
<p>However, with <code>MIN</code> or <code>MAX</code>, it is guaranteed that at least one record holds the non-<code>NULL</code> value returned by these aggregate functions. The problem is to return the whole record without aggregating the other columns.</p>
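<p>As an illustration, assume a hypothetical table <code>mytable (grouper, col, payload)</code>; the table and its column names are made up for this example and are not used elsewhere in the article:</p>
<pre class="brush: sql">
-- Valid: every selected expression is either grouped by or aggregated
SELECT  grouper, MAX(col) AS max_col
FROM    mytable
GROUP BY
        grouper;

-- Not valid by the rule above: payload is neither grouped by nor aggregated.
-- MySQL accepts this unless ONLY_FULL_GROUP_BY is enabled, but the payload
-- it returns is not guaranteed to come from the record holding MAX(col)
SELECT  grouper, MAX(col) AS max_col, payload
FROM    mytable
GROUP BY
        grouper;
</pre>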
<p>As <strong>SQL</strong> evolved, different solutions have been proposed to solve this problem, and I&#8217;ll try to cover these solutions in this series of articles.</p>
<p>First, I&#8217;ll cover <strong>MySQL</strong> (as per request).</p>
<p><strong>MySQL</strong>&#8217;s dialect of <strong>SQL</strong> is much poorer than those used by the other systems. That&#8217;s why most of the solutions used by the other systems, like analytic functions, <code>DISTINCT ON</code> etc., cannot be used in <strong>MySQL</strong>.</p>
<p>However, <strong>MySQL</strong> still provides some ways to make the queries that solve this task more efficient.</p>
<p>Let&#8217;s create a sample table:<br />
<span id="more-3736"></span><br />
<a href="#" onclick="xcollapse('X1276');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X1276" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_distinct (
      id INT NOT NULL PRIMARY KEY,
      orderer INT NOT NULL,
      glow INT NOT NULL,
      ghigh INT NOT NULL,
      stuffing VARCHAR(200) NOT NULL,
      KEY ix_distinct_glow_id (glow, id),
      KEY ix_distinct_ghigh_id (ghigh, id),
      KEY ix_distinct_glow_orderer (glow, orderer),
      KEY ix_distinct_ghigh_orderer (ghigh, orderer)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_distinct (id, orderer, glow, ghigh, stuffing)
SELECT  id, FLOOR(RAND(20091124) * 9) + 1,
        (id - 1) % 10 + 1,
        (id - 1) % 10000 + 1,
        LPAD(&#039;&#039;, 200, &#039;*&#039;)
FROM    filler;
</pre>
</div>
<p>This table has <strong>1,000,000</strong> records:</p>
<ul>
<li><code>id</code> is the <strong>PRIMARY KEY</strong></li>
<li><code>orderer</code> is filled with random values from <strong>1</strong> to <strong>9</strong></li>
<li><code>glow</code> is a low cardinality grouping field (<strong>10</strong> distinct values)</li>
<li><code>ghigh</code> is a high cardinality grouping field (<strong>10,000</strong> distinct values)</li>
</ul>
<p>There are composite indexes on grouping columns: those including the <code>id</code> as a secondary field and those including the <code>orderer</code>.</p>
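<p>A quick sanity check of the generated data could look like the sketch below (the column aliases are arbitrary). With the script above, it should report <strong>1,000,000</strong> records, <strong>10</strong> distinct values of <code>glow</code> and <strong>10,000</strong> distinct values of <code>ghigh</code>, along with the bounds of <code>orderer</code>:</p>
<pre class="brush: sql">
SELECT  COUNT(*) AS cnt,
        COUNT(DISTINCT glow) AS glow_values,
        COUNT(DISTINCT ghigh) AS ghigh_values,
        MIN(orderer) AS min_orderer,
        MAX(orderer) AS max_orderer
FROM    t_distinct
</pre>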
<p>We will split the task into four large categories, which depend on the properties of the column we maximize (or minimize) on and on how we should process ties.</p>
<h3>The column being maximized is unique</h3>
<p>This means that the ties are impossible: the value of the column is guaranteed to identify the row (even outside the group).</p>
<p>All we have to do is to select the greatest value of the column and join the table back on this value.</p>
<p>Here&#8217;s how we do it in <strong>MySQL</strong>:</p>
<pre class="brush: sql">
SELECT  di.id, di.orderer, di.ghigh, di.glow
FROM    (
        SELECT  glow, MAX(id) AS id
        FROM    t_distinct d
        GROUP BY
                glow
        ) dd
JOIN    t_distinct di
ON      di.id = dd.id
</pre>
<p><a href="#" onclick="xcollapse('X1033');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X1033" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>orderer</th>
<th>ghigh</th>
<th>glow</th>
</tr>
<tr>
<td class="integer">999991</td>
<td class="integer">8</td>
<td class="integer">9991</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">999992</td>
<td class="integer">9</td>
<td class="integer">9992</td>
<td class="integer">2</td>
</tr>
<tr>
<td class="integer">999993</td>
<td class="integer">5</td>
<td class="integer">9993</td>
<td class="integer">3</td>
</tr>
<tr>
<td class="integer">999994</td>
<td class="integer">7</td>
<td class="integer">9994</td>
<td class="integer">4</td>
</tr>
<tr>
<td class="integer">999995</td>
<td class="integer">9</td>
<td class="integer">9995</td>
<td class="integer">5</td>
</tr>
<tr>
<td class="integer">999996</td>
<td class="integer">6</td>
<td class="integer">9996</td>
<td class="integer">6</td>
</tr>
<tr>
<td class="integer">999997</td>
<td class="integer">2</td>
<td class="integer">9997</td>
<td class="integer">7</td>
</tr>
<tr>
<td class="integer">999998</td>
<td class="integer">1</td>
<td class="integer">9998</td>
<td class="integer">8</td>
</tr>
<tr>
<td class="integer">999999</td>
<td class="integer">8</td>
<td class="integer">9999</td>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">1000000</td>
<td class="integer">1</td>
<td class="integer">10000</td>
<td class="integer">10</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0004s (0.0029s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">di</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">dd.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">d</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_distinct_glow_id</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">19</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
select `20091124_groupwise`.`di`.`id` AS `id`,`20091124_groupwise`.`di`.`orderer` AS `orderer`,`20091124_groupwise`.`di`.`ghigh` AS `ghigh`,`20091124_groupwise`.`di`.`glow` AS `glow` from (select `20091124_groupwise`.`d`.`glow` AS `glow`,max(`20091124_groupwise`.`d`.`id`) AS `id` from `20091124_groupwise`.`t_distinct` `d` group by `20091124_groupwise`.`d`.`glow`) `dd` join `20091124_groupwise`.`t_distinct` `di` where (`20091124_groupwise`.`di`.`id` = `dd`.`id`)
</pre>
</div>
<p>In the query plan we see <code>Using index for group-by</code>.</p>
<p>This access method is used by <strong>MySQL</strong> to serve <code>DISTINCT</code> and <code>GROUP BY</code> queries, or, being more exact, to build the list of distinct groupers using an index.</p>
<p>The idea behind the method is very simple:</p>
<ol>
<li>Find the first key for the index and return it</li>
<li>Iteratively find the next key (greater than the one previously found) traversing the index tree and return it</li>
</ol>
<p>Whenever the next index key is found, the aggregates are applied to that key.</p>
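<p>The two steps can be emulated by hand with plain queries. This is only a sketch of the idea, not of what the engine actually executes; the literal <code>1</code> in the second query is the key returned by the first one:</p>
<pre class="brush: sql">
-- Step 1: the first key of ix_distinct_glow_id (glow, id)
SELECT  glow, id
FROM    t_distinct
ORDER BY
        glow, id
LIMIT 1;

-- Step 2: the next key greater than the one previously found
SELECT  glow, id
FROM    t_distinct
WHERE   glow &gt; 1
ORDER BY
        glow, id
LIMIT 1;
</pre>
<p>Repeating the second step with each newly found <code>glow</code>, until no record is returned, yields one index entry per group.</p>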
<p>The indexes in <strong>MySQL</strong> are built using a <strong>B-Tree</strong>. This means that the iterative search has to restart from the root on each key. A <strong>B-Tree</strong> is <code>O(log(n))</code> levels deep on average, so the <code>index scan for group by</code> takes <code>O(log(n))</code> times as long as a plain index scan over these values would take were the values unique.</p>
<p>Since the aggregates normally require all values from the group, there is no point in using this access method to calculate the aggregates: a sequential index scan would be way more efficient (without <code>O(log(n))</code> in the formula).</p>
<p><code>MIN</code> and <code>MAX</code>, however, are somewhat special aggregates, since aggregating a column with them means just taking the first (or the last) value from the appropriate index. <strong>MySQL</strong> is aware of this and uses the <code>index for group-by</code> when calculating <code>MIN</code> and <code>MAX</code> on secondary index columns. This is called a <a href="http://dev.mysql.com/doc/refman/5.1/en/loose-index-scan.html">loose index scan</a>.</p>
<p>The <code>index for group-by</code>, when used over a composite index, naturally stops on the lowest group-wise value of the second column. The <code>MIN</code>, therefore, is free: the value of the second column (which constitutes the <code>MIN</code>) should be just returned along with the value of the grouping column.</p>
<p>A <code>MAX</code>, on the other hand, requires an additional index search to find the <em>last</em> value of the group.</p>
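<p>In terms of plain queries, the difference for a single group can be pictured as follows. This is a sketch; the group <code>glow = 1</code> is picked arbitrarily:</p>
<pre class="brush: sql">
-- The first index entry of the group holds the MIN
SELECT  id
FROM    t_distinct
WHERE   glow = 1
ORDER BY
        glow, id
LIMIT 1;

-- The last index entry of the group holds the MAX
SELECT  id
FROM    t_distinct
WHERE   glow = 1
ORDER BY
        glow DESC, id DESC
LIMIT 1;
</pre>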
<p>Overall query time, thus, depends on the grouping column cardinality and on the aggregate function (<code>MIN</code> or <code>MAX</code> being used). The lower the cardinality is, the better; and <code>MIN</code> is faster than <code>MAX</code>.</p>
<p>We see that on a low cardinality column, <code>glow</code>, which has only <strong>10</strong> distinct values, the query takes only <strong>3 ms</strong>, which is next to instant. We can hardly notice the difference between the <code>MIN</code> and <code>MAX</code> on such a fast query.</p>
<p>Let&#8217;s see the difference between two queries being grouped on a high cardinality column. Since the query would return <strong>10,000</strong> records, I&#8217;ll make it aggregate the final results as well to avoid bloating the post.</p>
<p>Here&#8217;s the <code>MIN</code> query:</p>
<pre class="brush: sql">
SELECT  SUM(di.id)
FROM    (
        SELECT  ghigh, MIN(id) AS id
        FROM    t_distinct d
        GROUP BY
                ghigh
        ) dd
JOIN    t_distinct di
ON      di.id = dd.id
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>SUM(di.id)</th>
</tr>
<tr>
<td class="decimal">50005000</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.1638s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">di</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">dd.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">d</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_distinct_ghigh_id</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">20835</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
select sum(`20091124_groupwise`.`di`.`id`) AS `SUM(di.id)` from (select `20091124_groupwise`.`d`.`ghigh` AS `ghigh`,min(`20091124_groupwise`.`d`.`id`) AS `id` from `20091124_groupwise`.`t_distinct` `d` group by `20091124_groupwise`.`d`.`ghigh`) `dd` join `20091124_groupwise`.`t_distinct` `di` where (`20091124_groupwise`.`di`.`id` = `dd`.`id`)
</pre>
<p>And here&#8217;s the <code>MAX</code> one:</p>
<pre class="brush: sql">
SELECT  SUM(di.id)
FROM    (
        SELECT  ghigh, MAX(id) AS id
        FROM    t_distinct d
        GROUP BY
                ghigh
        ) dd
JOIN    t_distinct di
ON      di.id = dd.id
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>SUM(di.id)</th>
</tr>
<tr>
<td class="decimal">9950005000</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.2800s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">di</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">dd.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">d</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_distinct_ghigh_id</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">20835</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
select sum(`20091124_groupwise`.`di`.`id`) AS `SUM(di.id)` from (select `20091124_groupwise`.`d`.`ghigh` AS `ghigh`,max(`20091124_groupwise`.`d`.`id`) AS `id` from `20091124_groupwise`.`t_distinct` `d` group by `20091124_groupwise`.`d`.`ghigh`) `dd` join `20091124_groupwise`.`t_distinct` `di` where (`20091124_groupwise`.`di`.`id` = `dd`.`id`)
</pre>
<p>With the grouping column having high cardinality, this query needs many more <q>next key lookups</q> to complete and takes significantly more time. However, on a table of <strong>1,000,000</strong> records, it&#8217;s still a matter of fractions of a second. The <code>MAX</code> query takes <strong>280 ms</strong>, which is almost twice the time of the <code>MIN</code> query (<strong>160 ms</strong>).</p>
<p>As was just described, the reason for that is that the <code>MAX</code> query needs an extra lookup, since the records being iterated by <code>index for group-by</code> hold the <code>MIN</code>, not the <code>MAX</code>.</p>
<p>Unfortunately, <strong>MySQL</strong> cannot iterate the index backwards in a loose index scan, nor does it support the <code>DESC</code> clause in the indexes. That means that taking the group-wise maximum is less efficient than taking a group-wise minimum (with a loose index scan being used).</p>
<p>However, with the grouping columns having sufficiently low cardinality, both queries are very fast and the difference negligible.</p>
<p><strong>To be continued</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2009/11/24/mysql-selecting-records-holding-group-wise-maximum-on-a-unique-column/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
