<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>EXPLAIN EXTENDED &#187; MySQL</title>
	<atom:link href="http://explainextended.com/category/mysql/feed/" rel="self" type="application/rss+xml" />
	<link>http://explainextended.com</link>
	<description>How to create fast database queries</description>
	<lastBuildDate>Wed, 25 Aug 2010 13:29:38 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>20 latest unique records</title>
		<link>http://explainextended.com/2010/08/24/20-latest-unique-records/</link>
		<comments>http://explainextended.com/2010/08/24/20-latest-unique-records/#comments</comments>
		<pubDate>Tue, 24 Aug 2010 19:00:38 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4884</guid>
		<description><![CDATA[From Stack Overflow: I have a logfile which logs the insert/delete/updates from all kinds of tables. I would like to get an overview of for example the last 20 people which records where updated, ordered by the last update (datetime DESC) A common solution for such a task would be writing an aggregate query with [...]]]></description>
			<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/3555118/mysql-select-20-latest-entries-in-logfile-from-unique-persons"><strong>Stack Overflow</strong></a>:</p>
<blockquote>
<p>I have a logfile which logs the insert/delete/updates from all kinds of tables.</p>
<p>I would like to get an overview of for example the last 20 people which records where updated, ordered by the last update (<code>datetime DESC</code>)</p>
</blockquote>
<p>A common solution for such a task would be writing an aggregate query with <code>ORDER BY</code> and <code>LIMIT</code>:</p>
<pre class="brush: sql">
SELECT  person, MAX(ts) AS last_update
FROM    logfile
GROUP BY
        person
ORDER BY
        last_update DESC
LIMIT 20
</pre>
<p>What&#8217;s wrong with this solution? Performance, as usual.</p>
<p>Since <code>last_update</code> is an aggregate, it cannot be indexed. And <code>ORDER BY</code> on unindexed fields results in our good old friend, <code>filesort</code>.</p>
<p>Note that even in this case the indexes can be used and the full table scan can be avoided: if there is an index on <code>(person, ts)</code>, <code>MySQL</code> will tend to use a <a href="http://dev.mysql.com/doc/refman/5.5/en/loose-index-scan.html">loose index scan</a> on this index, which can save this query if there are relatively few persons in the table. However, if there are many (which is what we can expect for a log table), loose index scan can even degrade performance and generally will be avoided by <code>MySQL</code>.</p>
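<p>A quick way to see whether the loose index scan is actually chosen is to look at the plan. The sketch below assumes a hypothetical <code>person</code> column and an index named <code>ix_logfile_person_ts</code> (neither is part of the sample table created further down); a loose index scan shows up as <code>Using index for group-by</code> in the <code>Extra</code> column:</p>
<pre class="brush: sql">
-- Hypothetical index, for illustration only
CREATE INDEX ix_logfile_person_ts ON logfile (person, ts);

-- Look for "Using index for group-by" in the Extra column
EXPLAIN
SELECT  person, MAX(ts) AS last_update
FROM    logfile
GROUP BY
        person;
</pre>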
<p>We should use another approach here. Let&#8217;s create a sample table to test it:<br />
<span id="more-4884"></span><br />
<a href="#" onclick="xcollapse('X8484');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X8484" style="display: none; background: transparent;">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE logfile (
        id INT NOT NULL PRIMARY KEY,
        sparse INT NOT NULL,
        dense INT NOT NULL,
        ts DATETIME NOT NULL,
        stuffing VARCHAR(100) NOT NULL,
        KEY ix_logfile_ts_id (ts, id),
        KEY ix_logfile_sparse_ts_id (sparse, ts, id),
        KEY ix_logfile_dense_ts_id (dense, ts, id)
) ENGINE=InnoDB;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(500000);
COMMIT;

INSERT
INTO    logfile
SELECT  id,
        CEILING(RAND(20100824) * 30),
        CEILING(RAND(20100824 &lt;&lt; 1) * 30000),
        &#039;2010-08-24&#039; - INTERVAL RAND(20100824 &lt;&lt; 2) * 10000000 SECOND,
        LPAD(&#039;&#039;, 100, &#039;*&#039;)
FROM    filler;
</pre>
</div>
<p>This table has <strong>500,000</strong> records.</p>
<p>Instead of a single field, <code>person</code>, I created two different fields: <code>sparse</code> and <code>dense</code>. The first one has <strong>30</strong> distinct values, while the second one has <strong>30,000</strong>. This will help us see how data distribution affects the performance of different queries.</p>
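<p>The distribution can be double-checked with a plain aggregate. This is a sanity check only: it reads the whole table, so it is not meant to be fast, and the column aliases are arbitrary:</p>
<pre class="brush: sql">
-- Total number of rows and number of distinct values per column
SELECT  COUNT(*) AS total_rows,
        COUNT(DISTINCT sparse) AS sparse_values,
        COUNT(DISTINCT dense) AS dense_values
FROM    logfile;
</pre>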
<p>Let&#8217;s run our original queries. We&#8217;ll adjust them a little to help <code>MySQL</code> pick the correct plans:</p>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  sparse, MAX(ts) AS last_update
        FROM    logfile
        GROUP BY
                sparse
        ) q
ORDER BY
        last_update DESC
LIMIT 20;
</pre>
<p><a href="#" onclick="xcollapse('X4382');return false;"><strong>View query results</strong></a><br />
</p>
<div id="X4382" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>sparse</th>
<th>last_update</th>
</tr>
<tr>
<td class="integer">15</td>
<td class="timestamp">2010-08-23 23:59:58</td>
</tr>
<tr>
<td class="integer">26</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">11</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">30</td>
<td class="timestamp">2010-08-23 23:59:42</td>
</tr>
<tr>
<td class="integer">29</td>
<td class="timestamp">2010-08-23 23:59:32</td>
</tr>
<tr>
<td class="integer">13</td>
<td class="timestamp">2010-08-23 23:58:54</td>
</tr>
<tr>
<td class="integer">27</td>
<td class="timestamp">2010-08-23 23:58:53</td>
</tr>
<tr>
<td class="integer">7</td>
<td class="timestamp">2010-08-23 23:58:46</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="timestamp">2010-08-23 23:58:00</td>
</tr>
<tr>
<td class="integer">12</td>
<td class="timestamp">2010-08-23 23:57:44</td>
</tr>
<tr>
<td class="integer">14</td>
<td class="timestamp">2010-08-23 23:57:24</td>
</tr>
<tr>
<td class="integer">6</td>
<td class="timestamp">2010-08-23 23:56:58</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="timestamp">2010-08-23 23:56:48</td>
</tr>
<tr>
<td class="integer">24</td>
<td class="timestamp">2010-08-23 23:56:13</td>
</tr>
<tr>
<td class="integer">17</td>
<td class="timestamp">2010-08-23 23:56:12</td>
</tr>
<tr>
<td class="integer">23</td>
<td class="timestamp">2010-08-23 23:55:08</td>
</tr>
<tr>
<td class="integer">19</td>
<td class="timestamp">2010-08-23 23:55:07</td>
</tr>
<tr>
<td class="integer">20</td>
<td class="timestamp">2010-08-23 23:53:44</td>
</tr>
<tr>
<td class="integer">10</td>
<td class="timestamp">2010-08-23 23:51:52</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="timestamp">2010-08-23 23:50:53</td>
</tr>
<tr class="statusbar">
<td colspan="100">20 rows fetched in 0.0005s (0.0026s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">30</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">logfile</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_logfile_sparse_ts_id</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">500133</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
select `q`.`sparse` AS `sparse`,`q`.`last_update` AS `last_update` from (select `20100824_latest`.`logfile`.`sparse` AS `sparse`,max(`20100824_latest`.`logfile`.`ts`) AS `last_update` from `20100824_latest`.`logfile` group by `20100824_latest`.`logfile`.`sparse`) `q` order by `q`.`last_update` desc limit 20
</pre>
</div>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  dense, MAX(ts) AS last_update
        FROM    logfile
        GROUP BY
                dense
        ) q
ORDER BY
        last_update DESC
LIMIT 20;
</pre>
<p><a href="#" onclick="xcollapse('X2066');return false;"><strong>View query results</strong></a><br />
</p>
<div id="X2066" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>dense</th>
<th>last_update</th>
</tr>
<tr>
<td class="integer">25324</td>
<td class="timestamp">2010-08-23 23:59:58</td>
</tr>
<tr>
<td class="integer">13060</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">3268</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">2327</td>
<td class="timestamp">2010-08-23 23:59:42</td>
</tr>
<tr>
<td class="integer">23968</td>
<td class="timestamp">2010-08-23 23:59:32</td>
</tr>
<tr>
<td class="integer">1622</td>
<td class="timestamp">2010-08-23 23:58:54</td>
</tr>
<tr>
<td class="integer">29693</td>
<td class="timestamp">2010-08-23 23:58:53</td>
</tr>
<tr>
<td class="integer">655</td>
<td class="timestamp">2010-08-23 23:58:46</td>
</tr>
<tr>
<td class="integer">5802</td>
<td class="timestamp">2010-08-23 23:58:07</td>
</tr>
<tr>
<td class="integer">11843</td>
<td class="timestamp">2010-08-23 23:58:00</td>
</tr>
<tr>
<td class="integer">18894</td>
<td class="timestamp">2010-08-23 23:57:44</td>
</tr>
<tr>
<td class="integer">6180</td>
<td class="timestamp">2010-08-23 23:57:26</td>
</tr>
<tr>
<td class="integer">9398</td>
<td class="timestamp">2010-08-23 23:57:24</td>
</tr>
<tr>
<td class="integer">18012</td>
<td class="timestamp">2010-08-23 23:56:58</td>
</tr>
<tr>
<td class="integer">25758</td>
<td class="timestamp">2010-08-23 23:56:49</td>
</tr>
<tr>
<td class="integer">2379</td>
<td class="timestamp">2010-08-23 23:56:48</td>
</tr>
<tr>
<td class="integer">821</td>
<td class="timestamp">2010-08-23 23:56:39</td>
</tr>
<tr>
<td class="integer">4186</td>
<td class="timestamp">2010-08-23 23:56:13</td>
</tr>
<tr>
<td class="integer">20198</td>
<td class="timestamp">2010-08-23 23:56:12</td>
</tr>
<tr>
<td class="integer">18615</td>
<td class="timestamp">2010-08-23 23:56:01</td>
</tr>
<tr class="statusbar">
<td colspan="100">20 rows fetched in 0.0005s (0.5000s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">30000</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">logfile</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_logfile_dense_ts_id</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">500133</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
select `q`.`dense` AS `dense`,`q`.`last_update` AS `last_update` from (select `20100824_latest`.`logfile`.`dense` AS `dense`,max(`20100824_latest`.`logfile`.`ts`) AS `last_update` from `20100824_latest`.`logfile` group by `20100824_latest`.`logfile`.`dense`) `q` order by `q`.`last_update` desc limit 20
</pre>
</div>
<p>We see that both queries use the same plan and return <strong>20</strong> records, but the first one is instant, while the second one runs for <strong>500 ms</strong>. Both queries use <strong>filesort</strong>, but in the second case it has to sort <strong>30,000</strong> records (compared to <strong>30</strong> in the first case).</p>
<p>In this case, it is better to use another approach.</p>
<p>With our original query, we take each person and see which record is the latest for this person. But we can just as well do it the other way round: take the records in descending order, one by one, and for each record see if it&#8217;s the latest for its person. If it is, we return it; if it&#8217;s not, the latest record for this person has already been returned (remember, we take them in descending order).</p>
<p>It&#8217;s easy to see that <strong>20</strong> records returned this way will, first, belong to <strong>20</strong> different people, and, second, be the latest records of their respective persons. This is exactly what we need.</p>
<p>The records can easily be scanned in the descending order using the index on <code>(ts, id)</code>. But how do we check if the record is the latest? It&#8217;s simple: we just take the last record for the given person from the index on <code>(person, ts, id)</code> and compare its <code>id</code>. It takes but a single index seek per record and is almost instant.</p>
<p>Here&#8217;s the query to do it:</p>
<pre class="brush: sql">
SELECT  id, sparse, dense, ts
FROM    logfile lf
WHERE   id =
        (
        SELECT  id
        FROM    logfile lfi
        WHERE   lfi.sparse = lf.sparse
        ORDER BY
                sparse DESC, ts DESC, id DESC
        LIMIT 1
        )
ORDER BY
        ts DESC, id DESC
LIMIT 20
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>sparse</th>
<th>dense</th>
<th>ts</th>
</tr>
<tr>
<td class="integer">121946</td>
<td class="integer">15</td>
<td class="integer">25324</td>
<td class="timestamp">2010-08-23 23:59:58</td>
</tr>
<tr>
<td class="integer">276499</td>
<td class="integer">11</td>
<td class="integer">3268</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">62419</td>
<td class="integer">26</td>
<td class="integer">13060</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">254750</td>
<td class="integer">30</td>
<td class="integer">2327</td>
<td class="timestamp">2010-08-23 23:59:42</td>
</tr>
<tr>
<td class="integer">96079</td>
<td class="integer">29</td>
<td class="integer">23968</td>
<td class="timestamp">2010-08-23 23:59:32</td>
</tr>
<tr>
<td class="integer">290657</td>
<td class="integer">13</td>
<td class="integer">1622</td>
<td class="timestamp">2010-08-23 23:58:54</td>
</tr>
<tr>
<td class="integer">278842</td>
<td class="integer">27</td>
<td class="integer">29693</td>
<td class="timestamp">2010-08-23 23:58:53</td>
</tr>
<tr>
<td class="integer">329318</td>
<td class="integer">7</td>
<td class="integer">655</td>
<td class="timestamp">2010-08-23 23:58:46</td>
</tr>
<tr>
<td class="integer">384956</td>
<td class="integer">5</td>
<td class="integer">11843</td>
<td class="timestamp">2010-08-23 23:58:00</td>
</tr>
<tr>
<td class="integer">386333</td>
<td class="integer">12</td>
<td class="integer">18894</td>
<td class="timestamp">2010-08-23 23:57:44</td>
</tr>
<tr>
<td class="integer">260404</td>
<td class="integer">14</td>
<td class="integer">9398</td>
<td class="timestamp">2010-08-23 23:57:24</td>
</tr>
<tr>
<td class="integer">471000</td>
<td class="integer">6</td>
<td class="integer">18012</td>
<td class="timestamp">2010-08-23 23:56:58</td>
</tr>
<tr>
<td class="integer">172079</td>
<td class="integer">2</td>
<td class="integer">2379</td>
<td class="timestamp">2010-08-23 23:56:48</td>
</tr>
<tr>
<td class="integer">112653</td>
<td class="integer">24</td>
<td class="integer">4186</td>
<td class="timestamp">2010-08-23 23:56:13</td>
</tr>
<tr>
<td class="integer">291683</td>
<td class="integer">17</td>
<td class="integer">20198</td>
<td class="timestamp">2010-08-23 23:56:12</td>
</tr>
<tr>
<td class="integer">144673</td>
<td class="integer">23</td>
<td class="integer">25055</td>
<td class="timestamp">2010-08-23 23:55:08</td>
</tr>
<tr>
<td class="integer">172118</td>
<td class="integer">19</td>
<td class="integer">29039</td>
<td class="timestamp">2010-08-23 23:55:07</td>
</tr>
<tr>
<td class="integer">198913</td>
<td class="integer">20</td>
<td class="integer">9887</td>
<td class="timestamp">2010-08-23 23:53:44</td>
</tr>
<tr>
<td class="integer">491436</td>
<td class="integer">10</td>
<td class="integer">17752</td>
<td class="timestamp">2010-08-23 23:51:52</td>
</tr>
<tr>
<td class="integer">346651</td>
<td class="integer">4</td>
<td class="integer">10951</td>
<td class="timestamp">2010-08-23 23:50:53</td>
</tr>
<tr class="statusbar">
<td colspan="100">20 rows fetched in 0.0007s (0.0034s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">lf</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">ix_logfile_ts_id</td>
<td class="varchar">12</td>
<td class="varchar"></td>
<td class="bigint">20</td>
<td class="double">2500660.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">lfi</td>
<td class="varchar">ref</td>
<td class="varchar">ix_logfile_sparse_ts_id</td>
<td class="varchar">ix_logfile_sparse_ts_id</td>
<td class="varchar">4</td>
<td class="varchar">20100824_latest.lf.sparse</td>
<td class="bigint">27785</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100824_latest.lf.sparse&#39; of SELECT #2 was resolved in SELECT #1
select `20100824_latest`.`lf`.`id` AS `id`,`20100824_latest`.`lf`.`sparse` AS `sparse`,`20100824_latest`.`lf`.`dense` AS `dense`,`20100824_latest`.`lf`.`ts` AS `ts` from `20100824_latest`.`logfile` `lf` where (`20100824_latest`.`lf`.`id` = (select `20100824_latest`.`lfi`.`id` from `20100824_latest`.`logfile` `lfi` where (`20100824_latest`.`lfi`.`sparse` = `20100824_latest`.`lf`.`sparse`) order by `20100824_latest`.`lfi`.`sparse` desc,`20100824_latest`.`lfi`.`ts` desc,`20100824_latest`.`lfi`.`id` desc limit 1)) order by `20100824_latest`.`lf`.`ts` desc,`20100824_latest`.`lf`.`id` desc limit 20
</pre>
<p>As we can see, this query uses two different indexes. The first one, on <code>(ts, id)</code>, is used to scan all records according to the overall timeline; the second one, on <code>(sparse, ts, id)</code>, is used to find the <code>id</code> of the latest entry for a person and check whether it&#8217;s the same as the record selected from the general timeline.</p>
<p>The query is instant: <strong>3 ms</strong>.</p>
<p>Let&#8217;s check the same query on a column with lots of values:</p>
<pre class="brush: sql">
SELECT  id, sparse, dense, ts
FROM    logfile lf
WHERE   id =
        (
        SELECT  id
        FROM    logfile lfi
        WHERE   lfi.dense = lf.dense
        ORDER BY
                dense DESC, ts DESC, id DESC
        LIMIT 1
        )
ORDER BY
        ts DESC, id DESC
LIMIT 20
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>sparse</th>
<th>dense</th>
<th>ts</th>
</tr>
<tr>
<td class="integer">121946</td>
<td class="integer">15</td>
<td class="integer">25324</td>
<td class="timestamp">2010-08-23 23:59:58</td>
</tr>
<tr>
<td class="integer">276499</td>
<td class="integer">11</td>
<td class="integer">3268</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">62419</td>
<td class="integer">26</td>
<td class="integer">13060</td>
<td class="timestamp">2010-08-23 23:59:56</td>
</tr>
<tr>
<td class="integer">254750</td>
<td class="integer">30</td>
<td class="integer">2327</td>
<td class="timestamp">2010-08-23 23:59:42</td>
</tr>
<tr>
<td class="integer">96079</td>
<td class="integer">29</td>
<td class="integer">23968</td>
<td class="timestamp">2010-08-23 23:59:32</td>
</tr>
<tr>
<td class="integer">290657</td>
<td class="integer">13</td>
<td class="integer">1622</td>
<td class="timestamp">2010-08-23 23:58:54</td>
</tr>
<tr>
<td class="integer">278842</td>
<td class="integer">27</td>
<td class="integer">29693</td>
<td class="timestamp">2010-08-23 23:58:53</td>
</tr>
<tr>
<td class="integer">329318</td>
<td class="integer">7</td>
<td class="integer">655</td>
<td class="timestamp">2010-08-23 23:58:46</td>
</tr>
<tr>
<td class="integer">277612</td>
<td class="integer">15</td>
<td class="integer">5802</td>
<td class="timestamp">2010-08-23 23:58:07</td>
</tr>
<tr>
<td class="integer">384956</td>
<td class="integer">5</td>
<td class="integer">11843</td>
<td class="timestamp">2010-08-23 23:58:00</td>
</tr>
<tr>
<td class="integer">386333</td>
<td class="integer">12</td>
<td class="integer">18894</td>
<td class="timestamp">2010-08-23 23:57:44</td>
</tr>
<tr>
<td class="integer">201899</td>
<td class="integer">7</td>
<td class="integer">6180</td>
<td class="timestamp">2010-08-23 23:57:26</td>
</tr>
<tr>
<td class="integer">260404</td>
<td class="integer">14</td>
<td class="integer">9398</td>
<td class="timestamp">2010-08-23 23:57:24</td>
</tr>
<tr>
<td class="integer">471000</td>
<td class="integer">6</td>
<td class="integer">18012</td>
<td class="timestamp">2010-08-23 23:56:58</td>
</tr>
<tr>
<td class="integer">451808</td>
<td class="integer">26</td>
<td class="integer">25758</td>
<td class="timestamp">2010-08-23 23:56:49</td>
</tr>
<tr>
<td class="integer">172079</td>
<td class="integer">2</td>
<td class="integer">2379</td>
<td class="timestamp">2010-08-23 23:56:48</td>
</tr>
<tr>
<td class="integer">367042</td>
<td class="integer">11</td>
<td class="integer">821</td>
<td class="timestamp">2010-08-23 23:56:39</td>
</tr>
<tr>
<td class="integer">112653</td>
<td class="integer">24</td>
<td class="integer">4186</td>
<td class="timestamp">2010-08-23 23:56:13</td>
</tr>
<tr>
<td class="integer">291683</td>
<td class="integer">17</td>
<td class="integer">20198</td>
<td class="timestamp">2010-08-23 23:56:12</td>
</tr>
<tr>
<td class="integer">127839</td>
<td class="integer">11</td>
<td class="integer">18615</td>
<td class="timestamp">2010-08-23 23:56:01</td>
</tr>
<tr class="statusbar">
<td colspan="100">20 rows fetched in 0.0007s (0.0031s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">lf</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">ix_logfile_ts_id</td>
<td class="varchar">12</td>
<td class="varchar"></td>
<td class="bigint">20</td>
<td class="double">2500660.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">lfi</td>
<td class="varchar">ref</td>
<td class="varchar">ix_logfile_dense_ts_id</td>
<td class="varchar">ix_logfile_dense_ts_id</td>
<td class="varchar">4</td>
<td class="varchar">20100824_latest.lf.dense</td>
<td class="bigint">8</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100824_latest.lf.dense&#39; of SELECT #2 was resolved in SELECT #1
select `20100824_latest`.`lf`.`id` AS `id`,`20100824_latest`.`lf`.`sparse` AS `sparse`,`20100824_latest`.`lf`.`dense` AS `dense`,`20100824_latest`.`lf`.`ts` AS `ts` from `20100824_latest`.`logfile` `lf` where (`20100824_latest`.`lf`.`id` = (select `20100824_latest`.`lfi`.`id` from `20100824_latest`.`logfile` `lfi` where (`20100824_latest`.`lfi`.`dense` = `20100824_latest`.`lf`.`dense`) order by `20100824_latest`.`lfi`.`dense` desc,`20100824_latest`.`lfi`.`ts` desc,`20100824_latest`.`lfi`.`id` desc limit 1)) order by `20100824_latest`.`lf`.`ts` desc,`20100824_latest`.`lf`.`id` desc limit 20
</pre>
<p>We see that the query is instant again, despite the data distribution being completely different. This is because the query only skips the records which are not the latest of their persons, and the total number of records to scan is defined by how many records we browse before we encounter the <strong>20th</strong> unique value in our scan. This number drops quickly as the number of distinct persons in the table grows, and even with only <strong>20</strong> evenly distributed persons in the table it takes about <strong>72</strong> records on average and, with about <strong>99%</strong> probability, no more than <strong>150</strong>.</p>
<p>The only problem arises when the number of distinct persons in the table is <em>less</em> than the <code>LIMIT</code> we set. In this case, once the latest record of every person has been returned, no further record can qualify, but the limit is never reached, so a full index scan (accompanied by an index seek once per record) will ultimately be performed.</p>
<p>To work around this, the following simple query can be run in advance:</p>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    (
        SELECT  DISTINCT sparse
        FROM    logfile
        LIMIT 20
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">20</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.0015s)</td>
</tr>
</table>
</div>
<p><a href="#" onclick="xcollapse('X1238');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X1238" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">Select tables optimized away</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">logfile</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">ix_logfile_sparse_ts_id</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">19</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by; Using temporary</td>
</tr>
</table>
</div>
<pre>
select count(0) AS `COUNT(*)` from (select distinct `20100824_latest`.`logfile`.`sparse` AS `sparse` from `20100824_latest`.`logfile` limit 20) `q`
</pre>
</div>
<p>This query will return the actual number of distinct persons in the table if there are fewer than <strong>20</strong> (or <strong>20</strong> if there are more).</p>
<p>This query is instant even for the dense data:</p>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    (
        SELECT  DISTINCT dense
        FROM    logfile
        LIMIT 20
        ) q
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">20</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.0024s)</td>
</tr>
</table>
</div>
<p><a href="#" onclick="xcollapse('X1940');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X1940" style="display: none; background: transparent;">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">Select tables optimized away</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">logfile</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">ix_logfile_dense_ts_id</td>
<td class="varchar">16</td>
<td class="varchar"></td>
<td class="bigint">500132</td>
<td class="double">12.50</td>
<td class="varchar">Using index; Using temporary</td>
</tr>
</table>
</div>
<pre>
select count(0) AS `COUNT(*)` from (select distinct `20100824_latest`.`logfile`.`dense` AS `dense` from `20100824_latest`.`logfile` limit 20) `q`
</pre>
</div>
<p>This needs to be run as a separate query because <strong>MySQL</strong> does not accept expressions or subqueries in the <code>LIMIT</code> clause, only constants (or placeholders in a prepared statement). The result of this query should be substituted into the <code>LIMIT</code> clause on the client or in a dynamically composed query on the server.</p>
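<p>One way to compose the query on the server is a prepared statement, which accepts a <code>?</code> placeholder in <code>LIMIT</code>. This is a minimal sketch rather than the only way to do it; the variable name <code>@lim</code> and the statement name <code>stmt</code> are arbitrary:</p>
<pre class="brush: sql">
-- Number of distinct persons, capped at 20
SELECT  COUNT(*)
INTO    @lim
FROM    (
        SELECT  DISTINCT sparse
        FROM    logfile
        LIMIT 20
        ) q;

-- The main query, with the limit left as a placeholder
PREPARE stmt FROM &#039;
SELECT  id, sparse, dense, ts
FROM    logfile lf
WHERE   id =
        (
        SELECT  id
        FROM    logfile lfi
        WHERE   lfi.sparse = lf.sparse
        ORDER BY
                sparse DESC, ts DESC, id DESC
        LIMIT 1
        )
ORDER BY
        ts DESC, id DESC
LIMIT ?&#039;;

EXECUTE stmt USING @lim;
DEALLOCATE PREPARE stmt;
</pre>
<p>With fewer than <strong>20</strong> persons in the table, the scan stops as soon as the last of them is found instead of running to the end of the index.</p>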
<h3>Summary</h3>
<p>To select a number of latest unique records from a table, one can use aggregate functions; however, this can hurt query performance.</p>
<p>This can be done more efficiently by creating two different indexes on the table and checking the records taken from the general timeline against the end of the index on the person&#8217;s timeline.</p>
<p>To avoid performance degradation in marginal cases (when the total number of persons in the table is less than <code>LIMIT</code>), it is possible to make an additional check for the total number of distinct records and adjust the <code>LIMIT</code> clause if there are not enough records.</p>
<p><strong>P. S.</strong> I decided to enable comments for the technical posts as well. You are welcome to comment.</p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/08/24/20-latest-unique-records/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Indexing for ORDER BY / LIMIT</title>
		<link>http://explainextended.com/2010/06/30/indexing-for-order-by-limit/</link>
		<comments>http://explainextended.com/2010/06/30/indexing-for-order-by-limit/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 19:00:34 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4831</guid>
		<description><![CDATA[Answering questions asked on the site. Frode Underhill asks: I have some applications that are logging to a MySQL database table. The table is pretty standard on the form: time BIGINT(20), source TINYINT(4), severity ENUM, text VARCHAR(255), where source identifies the system that generated the log entry. There are very many entries in the table (>100 million), of [...]]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Frode Underhill</strong> asks:</p>
<blockquote><p>I have some applications that are logging to a <strong>MySQL</strong> database table.</p>
<p>The table is pretty standard on the form:</p>
<table class="excel">
<tr>
<th>time<br/><code>BIGINT(20)</code></th>
<th>source<br/><code>TINYINT(4)</code></th>
<th>severity<br/><code>ENUM</code></th>
<th>text<br/><code>VARCHAR(255)</code></th>
</tr>
</table>
<p>, where <code>source</code> identifies the system that generated the log entry.</p>
<p>There are very many entries in the table (<strong>>100 million</strong>), of which <strong>99.9999%</strong> are debug or info messages.</p>
<p>I&#8217;m making an interface for browsing this log, which means I&#8217;ll be doing queries like</p>
<pre class="brush: sql">
SELECT  *
FROM    log
WHERE   source = 2
        AND severity IN (1,2)
        AND time &gt; 12345
ORDER BY
        time ASC
LIMIT 30
</pre>
<p>, if I want to find debug or info log entries from a certain point in time, or </p>
<pre class="brush: sql">
SELECT  *
FROM    log
WHERE   source = 2
        AND severity IN (1,2)
        AND time &lt; 12345
ORDER BY
        time DESC
LIMIT 30
</pre>
<p>for finding entries right before a certain time.</p>
<p>How would one go about indexing &#038; querying such a table?</p>
<p>I thought I had it figured out (I pretty much just tried every different combination of columns in an index), but there&#8217;s always some set of parameters that results in a really slow query.
</p></blockquote>
<p>The problem is that you cannot use a single index both for filtering and ordering if you have a range condition (<code>severity IN (1, 2)</code> in this case).</p>
<p>Recently I wrote an article with a proposal to improve the <strong>SQL</strong> optimizers to handle these conditions. If a range has low cardinality (that is, there are few values that can possibly satisfy the range), then the query could be improved by rewriting the range as a series of individual queries, each one using one of the values constituting the range in an equijoin:</p>
<ul>
<li><a href="/2010/05/19/things-sql-needs-determining-range-cardinality/"><strong>Things SQL needs: determining range cardinality</strong></a></li>
</ul>
<p>No optimizers can handle this condition automatically yet, so we&#8217;ll need to emulate it.</p>
<p>Since the <code>severity</code> field is defined as an <code>enum</code> with only <strong>5</strong> values possible, any range condition on this field can be satisfied by no more than <strong>5</strong> distinct values, thus making this table ideal for rewriting the query.</p>
<p>Let&#8217;s create a sample table:<br />
<span id="more-4831"></span><br />
<a href="#" onclick="xcollapse('X1733');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X1733" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_log (
        id INT NOT NULL,
        ts BIGINT NOT NULL,
        source TINYINT(4) NOT NULL,
        severity ENUM(&#039;DEBUG&#039;,&#039;INFO&#039;,&#039;WARNING&#039;,&#039;ERROR&#039;,&#039;FATAL&#039;) NOT NULL,
        tx VARCHAR(255)
) ENGINE=MyISAM;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(3500);
COMMIT;

INSERT
INTO    t_log
SELECT  (f1.id - 1) * 3000 + f2.id,
        UNIX_TIMESTAMP(&#039;2010-06-29&#039; - INTERVAL (f1.id - 1) * 3000 + f2.id SECOND),
        CEILING(RAND(20100629) * 10),
        5 - FLOOR(LOG10(CEILING(RAND(20100629 &lt;&lt; 1) * 99999))),
        CONCAT(&#039;Message &#039;, (f1.id - 1) * 3000 + f2.id)
FROM    filler f1
CROSS JOIN
        filler f2;

CREATE INDEX ix_log_source_ts ON t_log (source, ts);

CREATE INDEX ix_log_source_severity_ts ON t_log (source, severity, ts);
</pre>
</div>
<p>This <strong>MyISAM</strong> table has <strong>12,250,000</strong> records, with <strong>10</strong> random sources (distributed evenly) and <strong>5</strong> random severities (distributed logarithmically):</p>
<pre class="brush: sql">
SELECT  severity, COUNT(*)
FROM    t_log
GROUP BY
        severity;
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>severity</th>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="char">DEBUG</td>
<td class="bigint">11024646</td>
</tr>
<tr>
<td class="char">INFO</td>
<td class="bigint">1102668</td>
</tr>
<tr>
<td class="char">WARNING</td>
<td class="bigint">110557</td>
</tr>
<tr>
<td class="char">ERROR</td>
<td class="bigint">10948</td>
</tr>
<tr>
<td class="char">FATAL</td>
<td class="bigint">1181</td>
</tr>
</table>
</div>
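<p>The claim about the sources can be checked the same way. This is only a sketch against the sample table above; its output is not reproduced here because it depends on the seeded <code>RAND</code> sequence:</p>
<pre class="brush: sql">
-- Count the records per source; with an even distribution,
-- each of the 10 sources should hold roughly one tenth of the rows
SELECT  source, COUNT(*)
FROM    t_log
GROUP BY
        source;
</pre>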
<p>We also created two indexes (one on <code>(source, ts)</code>, the other one on <code>(source, severity, ts)</code>).</p>
<p>Now, let&#8217;s try to run some queries as is:</p>
<pre class="brush: sql">
SELECT  *
FROM    t_log
WHERE   source = 2
        AND severity IN (1, 2)
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>ts</th>
<th>source</th>
<th>severity</th>
<th>tx</th>
</tr>
<tr>
<td class="integer">1205</td>
<td class="bigint">1277753995</td>
<td class="tinyint">2</td>
<td class="char">DEBUG</td>
<td class="varchar">Message 1205</td>
</tr>
<tr>
<td class="integer">1227</td>
<td class="bigint">1277753973</td>
<td class="tinyint">2</td>
<td class="char">DEBUG</td>
<td class="varchar">Message 1227</td>
</tr>
<tr>
<td class="integer">1243</td>
<td class="bigint">1277753957</td>
<td class="tinyint">2</td>
<td class="char">DEBUG</td>
<td class="varchar">Message 1243</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">1546</td>
<td class="bigint">1277753654</td>
<td class="tinyint">2</td>
<td class="char">DEBUG</td>
<td class="varchar">Message 1546</td>
</tr>
<tr>
<td class="integer">1575</td>
<td class="bigint">1277753625</td>
<td class="tinyint">2</td>
<td class="char">DEBUG</td>
<td class="varchar">Message 1575</td>
</tr>
<tr class="statusbar">
<td colspan="100">30 rows fetched in 0.0013s (0.0027s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">t_log</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_ts</td>
<td class="varchar">9</td>
<td class="varchar"></td>
<td class="bigint">997923</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
</table>
</div>
<pre>
select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` in (1,2)) and (`20100630_range`.`t_log`.`ts` &lt;= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30
</pre>
<p>This is very fast. It uses the index which does not include <code>severity</code>: since the values <strong>1</strong> and <strong>2</strong> are very frequent, it&#8217;s much more efficient to scan this index in order and just filter on <code>severity</code>, because the first <strong>30</strong> matching records are found almost immediately. The index preserves the order, which is why there is no <code>filesort</code> in the plan.</p>
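<p>The plan and the rewritten statement shown above can presumably be reproduced as follows. This is a sketch assuming <strong>MySQL 5.1</strong>, where <code>EXPLAIN EXTENDED</code> followed by <code>SHOW WARNINGS</code> prints the query as the optimizer sees it; in <strong>MySQL 8.0</strong> the <code>EXTENDED</code> keyword is gone and plain <code>EXPLAIN</code> provides the same information:</p>
<pre class="brush: sql">
-- Show the execution plan, including the "filtered" column
EXPLAIN EXTENDED
SELECT  *
FROM    t_log
WHERE   source = 2
        AND severity IN (1, 2)
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30;

-- Show the statement after the optimizer has rewritten it
SHOW WARNINGS;
</pre>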
<pre class="brush: sql">
SELECT  *
FROM    t_log
WHERE   source = 2
        AND severity IN (4, 5)
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>ts</th>
<th>source</th>
<th>severity</th>
<th>tx</th>
</tr>
<tr>
<td class="integer">2333</td>
<td class="bigint">1277752867</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 2333</td>
</tr>
<tr>
<td class="integer">6139</td>
<td class="bigint">1277749061</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 6139</td>
</tr>
<tr>
<td class="integer">6369</td>
<td class="bigint">1277748831</td>
<td class="tinyint">2</td>
<td class="char">FATAL</td>
<td class="varchar">Message 6369</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">297128</td>
<td class="bigint">1277458072</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 297128</td>
</tr>
<tr>
<td class="integer">298729</td>
<td class="bigint">1277456471</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 298729</td>
</tr>
<tr class="statusbar">
<td colspan="100">30 rows fetched in 0.0013s (0.0093s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">t_log</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">10</td>
<td class="varchar"></td>
<td class="bigint">1182</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using filesort</td>
</tr>
</table>
</div>
<pre>
select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` in (4,5)) and (`20100630_range`.`t_log`.`ts` &lt;= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30
</pre>
<p>This is very fast too. The index which includes <code>severity</code> is used (along with the <code>filesort</code> of course, because the order cannot be preserved with multiple values of <code>severity</code>), but the total number of records evaluated is so small that the <code>filesort</code> is not much of a problem.</p>
<p>Now, let&#8217;s try to include <strong>3</strong> into the query above:</p>
<pre class="brush: sql">
SELECT  *
FROM    t_log
WHERE   source = 2
        AND severity IN (3, 4)
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>ts</th>
<th>source</th>
<th>severity</th>
<th>tx</th>
</tr>
<tr>
<td class="integer">1507</td>
<td class="bigint">1277753693</td>
<td class="tinyint">2</td>
<td class="char">WARNING</td>
<td class="varchar">Message 1507</td>
</tr>
<tr>
<td class="integer">2333</td>
<td class="bigint">1277752867</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 2333</td>
</tr>
<tr>
<td class="integer">4154</td>
<td class="bigint">1277751046</td>
<td class="tinyint">2</td>
<td class="char">WARNING</td>
<td class="varchar">Message 4154</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">30118</td>
<td class="bigint">1277725082</td>
<td class="tinyint">2</td>
<td class="char">WARNING</td>
<td class="varchar">Message 30118</td>
</tr>
<tr>
<td class="integer">31321</td>
<td class="bigint">1277723879</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 31321</td>
</tr>
<tr class="statusbar">
<td colspan="100">30 rows fetched in 0.0013s (0.2496s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">t_log</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">10</td>
<td class="varchar"></td>
<td class="bigint">12168</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using filesort</td>
</tr>
</table>
</div>
<pre>
select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` in (3,4)) and (`20100630_range`.`t_log`.`ts` &lt;= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30
</pre>
<p>Now, this runs for almost <strong>250 ms</strong>. Why?</p>
<p>There are <strong>110,557</strong> records with <code>severity = 'WARNING'</code>. This is too many for a filesort but too few for <code>using where</code> (filtering the records with the index that preserves the order). There will be too many records that will need to be skipped.</p>
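<p>To see both halves of this trade-off, one can force each index in turn with an index hint and compare the timings. This is a sketch only; no timings are shown here, since they depend on the hardware and the cache state:</p>
<pre class="brush: sql">
-- Order-preserving index: no filesort, but every DEBUG and INFO record
-- met during the scan has to be read and skipped
SELECT  *
FROM    t_log FORCE INDEX (ix_log_source_ts)
WHERE   source = 2
        AND severity IN (3, 4)
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30;

-- Index including severity: only the matching records are read,
-- but all of them have to be sorted before the LIMIT is applied
SELECT  *
FROM    t_log FORCE INDEX (ix_log_source_severity_ts)
WHERE   source = 2
        AND severity IN (3, 4)
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30;
</pre>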
<p>To work around this, we could split the query into one query per severity and combine them using <code>UNION ALL</code>. Since the original query uses <code>ORDER BY</code> and <code>LIMIT</code>, we can put them into each of the two queries (which will yield at most <strong>60</strong> records in total) and then apply them once more to the combined resultset (the <strong>30</strong> records we need are guaranteed to be among these <strong>60</strong>):</p>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  *
        FROM    t_log
        WHERE   source = 2
                AND severity = 3
                AND ts &lt;= 1277754000
        ORDER BY
                source DESC, ts DESC
        LIMIT 30
        ) q
UNION ALL
SELECT  *
FROM    (
        SELECT  *
        FROM    t_log
        WHERE   source = 2
                AND severity = 4
                AND ts &lt;= 1277754000
        ORDER BY
                source DESC, ts DESC
        LIMIT 30
        ) q
ORDER BY
        ts DESC
LIMIT 30
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>ts</th>
<th>source</th>
<th>severity</th>
<th>tx</th>
</tr>
<tr>
<td class="integer">1507</td>
<td class="bigint">1277753693</td>
<td class="tinyint">2</td>
<td class="varchar">WARNING</td>
<td class="varchar">Message 1507</td>
</tr>
<tr>
<td class="integer">2333</td>
<td class="bigint">1277752867</td>
<td class="tinyint">2</td>
<td class="varchar">ERROR</td>
<td class="varchar">Message 2333</td>
</tr>
<tr>
<td class="integer">4154</td>
<td class="bigint">1277751046</td>
<td class="tinyint">2</td>
<td class="varchar">WARNING</td>
<td class="varchar">Message 4154</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">30118</td>
<td class="bigint">1277725082</td>
<td class="tinyint">2</td>
<td class="varchar">WARNING</td>
<td class="varchar">Message 30118</td>
</tr>
<tr>
<td class="integer">31321</td>
<td class="bigint">1277723879</td>
<td class="tinyint">2</td>
<td class="varchar">ERROR</td>
<td class="varchar">Message 31321</td>
</tr>
<tr class="statusbar">
<td colspan="100">30 rows fetched in 0.0013s (0.0037s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">30</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">t_log</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">10</td>
<td class="varchar"></td>
<td class="bigint">11094</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">UNION</td>
<td class="varchar">&lt;derived4&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">30</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DERIVED</td>
<td class="varchar">t_log</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">10</td>
<td class="varchar"></td>
<td class="bigint">1074</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union1,3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">Using filesort</td>
</tr>
</table>
</div>
<pre>
select `q`.`id` AS `id`,`q`.`ts` AS `ts`,`q`.`source` AS `source`,`q`.`severity` AS `severity`,`q`.`tx` AS `tx` from (select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` = 3) and (`20100630_range`.`t_log`.`ts` &lt;= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30) `q` union all select `q`.`id` AS `id`,`q`.`ts` AS `ts`,`q`.`source` AS `source`,`q`.`severity` AS `severity`,`q`.`tx` AS `tx` from (select `20100630_range`.`t_log`.`id` AS `id`,`20100630_range`.`t_log`.`ts` AS `ts`,`20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity`,`20100630_range`.`t_log`.`tx` AS `tx` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` = 4) and (`20100630_range`.`t_log`.`ts` &lt;= 1277754000)) order by `20100630_range`.`t_log`.`source` desc,`20100630_range`.`t_log`.`ts` desc limit 30) `q` order by `ts` desc limit 30
</pre>
<p>This is much faster.</p>
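<p>The same idea can also be written without the derived tables, since <strong>MySQL</strong> lets each parenthesized <code>SELECT</code> in a <code>UNION</code> carry its own <code>ORDER BY</code> and <code>LIMIT</code>. This variant is a sketch: it should return the same records (up to the order of ties on <code>ts</code>), but its plan was not measured here:</p>
<pre class="brush: sql">
-- Each branch picks its own 30 latest records using the index
(
SELECT  *
FROM    t_log
WHERE   source = 2
        AND severity = 3
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30
)
UNION ALL
(
SELECT  *
FROM    t_log
WHERE   source = 2
        AND severity = 4
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, ts DESC
LIMIT 30
)
-- The final ORDER BY / LIMIT applies to the combined resultset
ORDER BY
        ts DESC
LIMIT 30;
</pre>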
<p>However, this solution requires composing the query dynamically, depending on the number of the severities in the condition. Is it possible to make this all in one static query that will accept the parameters in the <code>IN</code> list?</p>
<p>We can do it by applying the solution used to retrieve <q>greatest-n-per-group</q> in <strong>MySQL</strong>.</p>
<p>To do this, we will just select the <strong>30</strong>th latest timestamp of each <code>severity</code> and find all records whose timestamps are not lower than it.</p>
<p>This can be done using a join:</p>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  l.*
        FROM    (
                SELECT  source,
                        severity,
                        (
                        SELECT  ts
                        FROM    t_log li
                        WHERE   li.source = ss.source
                                AND li.severity = ss.severity
                                AND ts &lt;= 1277754000
                        ORDER BY
                                li.source DESC, li.severity DESC, li.ts DESC
                        LIMIT 29, 1
                        ) AS mts
                FROM    (
                        SELECT  DISTINCT source, severity
                        FROM    t_log
                        WHERE   source = 2
                                AND severity IN (3, 4)
                        ) ss
                ) s
        JOIN    t_log l
        ON      l.source &gt;= s.source
                AND l.source &lt;= s.source
                AND l.severity = s.severity
                AND l.ts &gt;= s.mts
                AND l.ts &lt;= 1277754000
        ) q
ORDER BY
        ts DESC
LIMIT 30;
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>ts</th>
<th>source</th>
<th>severity</th>
<th>tx</th>
</tr>
<tr>
<td class="integer">1507</td>
<td class="bigint">1277753693</td>
<td class="tinyint">2</td>
<td class="char">WARNING</td>
<td class="varchar">Message 1507</td>
</tr>
<tr>
<td class="integer">2333</td>
<td class="bigint">1277752867</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 2333</td>
</tr>
<tr>
<td class="integer">4154</td>
<td class="bigint">1277751046</td>
<td class="tinyint">2</td>
<td class="char">WARNING</td>
<td class="varchar">Message 4154</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">30118</td>
<td class="bigint">1277725082</td>
<td class="tinyint">2</td>
<td class="char">WARNING</td>
<td class="varchar">Message 30118</td>
</tr>
<tr>
<td class="integer">31321</td>
<td class="bigint">1277723879</td>
<td class="tinyint">2</td>
<td class="char">ERROR</td>
<td class="varchar">Message 31321</td>
</tr>
<tr class="statusbar">
<td colspan="100">30 rows fetched in 0.0014s (0.0040s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">60</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">2</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">l</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">10</td>
<td class="varchar"></td>
<td class="bigint">30</td>
<td class="double">40833332.00</td>
<td class="varchar">Range checked for each record (index map: 0x3)</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived5&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">2</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">5</td>
<td class="varchar">DERIVED</td>
<td class="varchar">t_log</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">2</td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index for group-by</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">li</td>
<td class="varchar">ref</td>
<td class="varchar">ix_log_source_ts,ix_log_source_severity_ts</td>
<td class="varchar">ix_log_source_severity_ts</td>
<td class="varchar">2</td>
<td class="varchar">ss.source,ss.severity</td>
<td class="bigint">245000</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;ss.source&#39; of SELECT #4 was resolved in SELECT #3
Field or reference &#39;ss.severity&#39; of SELECT #4 was resolved in SELECT #3
select `q`.`id` AS `id`,`q`.`ts` AS `ts`,`q`.`source` AS `source`,`q`.`severity` AS `severity`,`q`.`tx` AS `tx` from (select `20100630_range`.`l`.`id` AS `id`,`20100630_range`.`l`.`ts` AS `ts`,`20100630_range`.`l`.`source` AS `source`,`20100630_range`.`l`.`severity` AS `severity`,`20100630_range`.`l`.`tx` AS `tx` from (select `ss`.`source` AS `source`,`ss`.`severity` AS `severity`,(select `20100630_range`.`li`.`ts` from `20100630_range`.`t_log` `li` where ((`20100630_range`.`li`.`source` = `ss`.`source`) and (`20100630_range`.`li`.`severity` = `ss`.`severity`) and (`20100630_range`.`li`.`ts` &lt;= 1277754000)) order by `20100630_range`.`li`.`source` desc,`20100630_range`.`li`.`severity` desc,`20100630_range`.`li`.`ts` desc limit 29,1) AS `mts` from (select distinct `20100630_range`.`t_log`.`source` AS `source`,`20100630_range`.`t_log`.`severity` AS `severity` from `20100630_range`.`t_log` where ((`20100630_range`.`t_log`.`source` = 2) and (`20100630_range`.`t_log`.`severity` in (3,4)))) `ss`) `s` join `20100630_range`.`t_log` `l` where ((`20100630_range`.`l`.`severity` = `s`.`severity`) and (`20100630_range`.`l`.`source` &gt;= `s`.`source`) and (`20100630_range`.`l`.`source` &lt;= `s`.`source`) and (`20100630_range`.`l`.`ts` &gt;= `s`.`mts`) and (`20100630_range`.`l`.`ts` &lt;= 1277754000))) `q` order by `q`.`ts` desc limit 30
</pre>
<p>All possible values of <code>source</code> and <code>severity</code> are selected using a loose scan (which is instant since there are few of them). Each pair of values is then used as a join condition. A single index range satisfies each pair of values, so each join iteration uses an index efficiently (actually, the access path is reevaluated for each iteration, as shown by <code>Range checked for each record (index map: 0x3)</code>).</p>
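<p>The two building blocks can be run on their own to see what each step contributes. This is a sketch that simply extracts the inner parts of the query above, with the second query fixed to one pair of values for illustration:</p>
<pre class="brush: sql">
-- Step 1: the distinct (source, severity) pairs matching the filter
SELECT  DISTINCT source, severity
FROM    t_log
WHERE   source = 2
        AND severity IN (3, 4);

-- Step 2: the 30th latest timestamp for one such pair
-- (LIMIT 29, 1 skips 29 records and returns the next one)
SELECT  ts
FROM    t_log
WHERE   source = 2
        AND severity = 3
        AND ts &lt;= 1277754000
ORDER BY
        source DESC, severity DESC, ts DESC
LIMIT 29, 1;
</pre>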
<p>The total number of records that would be returned by this query, were there no <code>LIMIT</code>, is <strong>60</strong> or maybe even more (in case of ties on <code>ts</code>). However, we don&#8217;t need to resolve the ties in the subqueries, since the final <code>ORDER BY / LIMIT</code> does this for us.</p>
<p>The query completes in <strong>4 ms</strong> which is instant. More than that, it does not need to be rewritten to handle different combinations of values: they could be provided in a single <code>IN</code> clause.</p>
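<p>One marginal case is worth noting: if a <code>severity</code> has fewer than <strong>30</strong> matching records, the correlated subquery returns <code>NULL</code>, and the condition <code>l.ts &gt;= s.mts</code> then rejects every record of that severity. A possible fix, sketched here under the assumption that <code>ts</code> is never negative, is to wrap the subquery in <code>COALESCE</code> so that such groups fall back to a lower bound of <strong>0</strong>. The plan of this variant was not measured:</p>
<pre class="brush: sql">
SELECT  *
FROM    (
        SELECT  l.*
        FROM    (
                SELECT  source,
                        severity,
                        -- Fall back to 0 when the group has fewer than 30 records
                        COALESCE(
                        (
                        SELECT  ts
                        FROM    t_log li
                        WHERE   li.source = ss.source
                                AND li.severity = ss.severity
                                AND ts &lt;= 1277754000
                        ORDER BY
                                li.source DESC, li.severity DESC, li.ts DESC
                        LIMIT 29, 1
                        ), 0) AS mts
                FROM    (
                        SELECT  DISTINCT source, severity
                        FROM    t_log
                        WHERE   source = 2
                                AND severity IN (3, 4)
                        ) ss
                ) s
        JOIN    t_log l
        ON      l.source &gt;= s.source
                AND l.source &lt;= s.source
                AND l.severity = s.severity
                AND l.ts &gt;= s.mts
                AND l.ts &lt;= 1277754000
        ) q
ORDER BY
        ts DESC
LIMIT 30;
</pre>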
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/06/30/indexing-for-order-by-limit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>LEFT JOIN / IS NULL vs. NOT IN vs. NOT EXISTS: nullable columns</title>
		<link>http://explainextended.com/2010/05/27/left-join-is-null-vs-not-in-vs-not-exists-nullable-columns/</link>
		<comments>http://explainextended.com/2010/05/27/left-join-is-null-vs-not-in-vs-not-exists-nullable-columns/#comments</comments>
		<pubDate>Thu, 27 May 2010 19:00:15 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4793</guid>
		<description><![CDATA[In one of the previous articles I discussed performance of the three methods to implement an anti-join in MySQL. Just a quick reminder: an anti-join is an operation that returns all records from one table which share a value of a certain column with no records from another table. In SQL, there are at least [...]]]></description>
			<content:encoded><![CDATA[<p>In one of the previous articles I discussed performance of the <a href="/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/">three methods to implement an anti-join in <strong>MySQL</strong></a>.</p>
<p>Just a quick reminder: an anti-join is an operation that returns all records from one table which share a value of a certain column with no records from another table.</p>
<p>In <strong>SQL</strong>, there are at least three methods to implement it:</p>
<h3>LEFT JOIN / IS NULL</h3>
<pre class="brush: sql">
SELECT  o.*
FROM    outer o
LEFT JOIN
        inner i
ON      i.value = o.value
WHERE   i.value IS NULL
</pre>
<h3>NOT IN</h3>
<pre class="brush: sql">
SELECT  o.*
FROM    outer o
WHERE   o.value NOT IN
        (
        SELECT  value
        FROM    inner
        )
</pre>
<h3>NOT EXISTS</h3>
<pre class="brush: sql">
SELECT  o.*
FROM    outer o
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    inner i
        WHERE   i.value = o.value
        )
</pre>
<p>When <code>inner.value</code> is marked as <code>NOT NULL</code>, all these queries are semantically equivalent and with proper indexing have similarly optimized execution plans in <strong>MySQL</strong>.</p>
<p>Now, what if <code>inner.value</code> is nullable and does contain some <code>NULL</code> values?</p>
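<p>In that case the three methods no longer have to agree, because of three-valued logic. This can be seen even without real tables. A minimal sketch (the inline derived table <code>i</code> is only an illustration):</p>
<pre class="brush: sql">
-- NOT IN evaluates to NULL (not TRUE) when the list contains a NULL
-- and no value matches, so such a record would be filtered out
SELECT  1 NOT IN (2, NULL) AS not_in_result;

-- NOT EXISTS only checks whether the subquery returns any records:
-- NULL = 1 is never true, so nothing is returned and the result is 1
SELECT  NOT EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  NULL AS val
                ) i
        WHERE   i.val = 1
        ) AS not_exists_result;
</pre>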
<p>Let&#8217;s create some sample tables:<br />
<span id="more-4793"></span><br />
<a href="#" onclick="xcollapse('X2583');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X2583" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=MyISAM;

CREATE TABLE t_inner (
        id INT NOT NULL PRIMARY KEY,
        val INT,
        stuffing VARCHAR(200) NOT NULL,
        KEY ix_inner_val (val)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

CREATE TABLE t_outer (
        id INT NOT NULL PRIMARY KEY,
        val INT,
        stuffing VARCHAR(200) NOT NULL,
        KEY ix_outer_val (val)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_inner
SELECT  id,
        NULLIF(CEILING(RAND(20100527) * 100000), 100000),
        RPAD(&#039;&#039;, 200, &#039;*&#039;)
FROM    filler;

INSERT
INTO    t_outer
SELECT  id,
        NULLIF(CEILING(RAND(20100527 &lt;&lt; 1) * 100000), 100000),
        RPAD(&#039;&#039;, 200, &#039;*&#039;)
FROM    filler;
</pre>
</div>
<p>There are two identical <strong>MyISAM</strong> tables. Each of the tables contains <strong>1,000,000</strong> random values from <strong>1</strong> to <strong>99,999</strong> and also some <code>NULL</code> values. There is an index on <code>val</code> in both tables.</p>
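<p>How many <code>NULL</code> values ended up in each table can be checked with queries like these (a sketch; the counts are not shown here, since they depend on the seeded <code>RAND</code> sequence):</p>
<pre class="brush: sql">
-- COUNT(val) ignores NULLs, so the difference is the number of NULL values
SELECT  COUNT(*) AS total,
        COUNT(*) - COUNT(val) AS nulls
FROM    t_inner;

SELECT  COUNT(*) AS total,
        COUNT(*) - COUNT(val) AS nulls
FROM    t_outer;
</pre>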
<p>Now, let&#8217;s check the queries.</p>
<h3>NOT EXISTS</h3>
<pre class="brush: sql">
SELECT  SUM(LENGTH(stuffing)), COUNT(*)
FROM    t_outer o
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_inner i
        WHERE   i.val = o.val
        )
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>SUM(LENGTH(stuffing))</th>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="decimal">14600</td>
<td class="bigint">73</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (9.9061s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">o</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">1000000</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">i</td>
<td class="varchar">ref</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">5</td>
<td class="varchar">20100527_anti.o.val</td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100527_anti.o.val&#39; of SELECT #2 was resolved in SELECT #1
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(exists(select NULL from `20100527_anti`.`t_inner` `i` where (`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`))))
</pre>
<p>The query completes in <strong>9.9</strong> seconds. As we can see, it is optimized to use the index on <code>t_inner.val</code> and return on the first match.</p>
<h3>LEFT JOIN / IS NULL</h3>
<pre class="brush: sql">
SELECT  SUM(LENGTH(o.stuffing)), COUNT(*)
FROM    t_outer o
LEFT JOIN
        t_inner i
ON      i.val = o.val
WHERE   i.id IS NULL
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>SUM(LENGTH(o.stuffing))</th>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="decimal">14600</td>
<td class="bigint">73</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (13.5154s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">o</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">1000000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">i</td>
<td class="varchar">ref</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">5</td>
<td class="varchar">20100527_anti.o.val</td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Not exists</td>
</tr>
</table>
</div>
<pre>
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(o.stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` left join `20100527_anti`.`t_inner` `i` on((`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`)) where isnull(`20100527_anti`.`i`.`id`)
</pre>
<p>The query semantics are the same as those of <code>NOT EXISTS</code>, and we even see the <code>Not exists</code> optimization in the plan. However, this query performs noticeably worse than <code>NOT EXISTS</code>: <strong>13.5</strong> seconds. Why?</p>
<p><strong>MySQL</strong> documentation on <code>EXPLAIN</code> <a href="http://dev.mysql.com/doc/refman/5.5/en/using-explain.html">states</a> that <code>Not exists</code> is used to optimize the queries similar to the one we have just run: <code>LEFT JOIN</code> with <code>IS NULL</code> predicate applied to a non-nullable column.</p>
<p><strong>MySQL</strong> is aware that such a predicate can only be satisfied by a record resulting from a <code>JOIN</code> miss (i.e. when no matching record was found in the right-hand table) and stops reading records after the first index hit.</p>
<p>However, this optimization is implemented in a way that is far from perfect. Despite the fact that no actual value of <code>id</code> can be returned by such a query, the engine still looks up <code>id</code> in the table (since it&#8217;s not a part of the index). We can see it in the plan: unlike the <code>NOT EXISTS</code> query, there is no <code>Using index</code> for <code>t_inner</code>. This means that a table lookup is performed.</p>
<p>Even if we replace <code>id</code> with <code>val</code> in the query, it still performs poorly:</p>
<pre class="brush: sql">
SELECT  SUM(LENGTH(o.stuffing)), COUNT(*)
FROM    t_outer o
LEFT JOIN
        t_inner i
ON      i.val = o.val
WHERE   i.val IS NULL
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>SUM(LENGTH(o.stuffing))</th>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="decimal">14600</td>
<td class="bigint">73</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (14.4997s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">o</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">1000000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">i</td>
<td class="varchar">ref</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">5</td>
<td class="varchar">20100527_anti.o.val</td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(o.stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` left join `20100527_anti`.`t_inner` `i` on((`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`)) where isnull(`20100527_anti`.`i`.`val`)
</pre>
<p>This time, no table lookups are made but there is no <code>Not exists</code> optimization either.</p>
<p>Despite the fact that the join condition eliminates the possibility of an actual <code>NULL</code> being returned by the query, and any <code>val IS NULL</code> reaching the <code>WHERE</code> clause is the result of a join miss, <strong>MySQL</strong> still examines all matching records in <code>t_inner</code>, not stopping after the first hit.</p>
<p>This has been submitted as a <a href="http://bugs.mysql.com/bug.php?id=47454">bug</a>.</p>
<p>Now, what about <code>NOT IN</code>?</p>
<h3>NOT IN</h3>
<p>Unlike the previous two queries, which differ only in implementation and not in semantics, <code>NOT IN</code>, applied as is, would yield different results.</p>
<p><code>NOT EXISTS</code> and <code>IS NULL</code> are two-state predicates: they can only return <code>TRUE</code> or <code>FALSE</code>. <code>NOT IN</code> is a <em>three-state</em> predicate: it can return <code>TRUE</code>, <code>FALSE</code> or <code>NULL</code>.</p>
<p><code>NULL</code> value is returned in two cases:</p>
<ul>
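<li>When the <code>t_outer.val</code> being tested is <code>NULL</code> (and the subquery returns at least one record)</li>
<li>When <em>at least one</em> of the <code>t_inner.val</code> values is <code>NULL</code> (and no matching non-<code>NULL</code> value is found)</li>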
</ul>
<p>This means that having but a single <code>NULL</code> in <code>t_inner</code> would prevent the query from returning anything.</p>
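<p>The three states are easy to observe on literal lists, without any tables. The following is a minimal sketch; the column aliases are made up for illustration, and the comments show the value each expression evaluates to in <strong>MySQL</strong>:</p>
<pre class="brush: sql">
SELECT  1 NOT IN (2, 3)      AS no_nulls,       -- 1 (TRUE): no match, no NULLs
        1 NOT IN (1, NULL)   AS matched,        -- 0 (FALSE): a match wins over the NULL
        1 NOT IN (2, NULL)   AS null_in_list,   -- NULL: no match, but a NULL in the list
        NULL NOT IN (2, 3)   AS null_tested     -- NULL: the tested value itself is NULL
</pre>
<p>Only the first expression would let a record pass a <code>WHERE</code> clause, since <code>WHERE</code> filters out both <code>FALSE</code> and <code>NULL</code>.</p>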
<h4>Naive approach</h4>
<p>Let&#8217;s see what happens if we just substitute <code>NOT IN</code> for <code>NOT EXISTS</code>:</p>
<pre class="brush: sql">
SELECT  SUM(LENGTH(stuffing)), COUNT(*)
FROM    t_outer o
WHERE   val NOT IN
        (
        SELECT  val
        FROM    t_inner i
        )
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>SUM(LENGTH(stuffing))</th>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="decimal"></td>
<td class="bigint">0</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (10.3748s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">o</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">1000000</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">i</td>
<td class="varchar">index_subquery</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">5</td>
<td class="varchar">func</td>
<td class="bigint">20</td>
<td class="double">100.00</td>
<td class="varchar">Using index; Full scan on NULL key</td>
</tr>
</table>
</div>
<pre>
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(&lt;in_optimizer&gt;(`20100527_anti`.`o`.`val`,&lt;exists&gt;(&lt;index_lookup&gt;(&lt;cache&gt;(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL having trigcond(&lt;is_not_null_test&gt;(`20100527_anti`.`i`.`val`)))))))
</pre>
<p>Since there are <code>NULL</code>s in <code>t_inner</code>, no record in <code>t_outer</code> can satisfy the predicate.</p>
<p><strong>MySQL</strong> does not optimize this very well. It takes but a single index seek to find out whether there are <code>NULL</code> values in <code>t_inner</code> and to return immediately if there are, but for some reason <strong>MySQL</strong> still applies the condition to each record in <code>t_outer</code>.</p>
<h4>Naive approach, improved</h4>
<p>With a little help from our side, this can be improved:</p>
<pre class="brush: sql">
SELECT  SUM(LENGTH(stuffing)), COUNT(*)
FROM    t_outer o
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_inner i
        WHERE   val IS NULL
        )
        AND val NOT IN
        (
        SELECT  val
        FROM    t_inner i
        )
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>SUM(LENGTH(stuffing))</th>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="decimal"></td>
<td class="bigint">0</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.0014s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">Impossible WHERE</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">i</td>
<td class="varchar">index_subquery</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">5</td>
<td class="varchar">func</td>
<td class="bigint">20</td>
<td class="double">100.00</td>
<td class="varchar">Using index; Full scan on NULL key</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">SUBQUERY</td>
<td class="varchar">i</td>
<td class="varchar">ref</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">5</td>
<td class="varchar"></td>
<td class="bigint">4</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where 0
</pre>
<p>We added an explicit check for <code>NULL</code> values. Since this check is not correlated, <strong>MySQL</strong> can evaluate it once, find the condition false, cache the result and avoid the table scan altogether.</p>
<h4>Ignoring right side NULLs</h4>
<p>Now, let&#8217;s make a <code>NOT IN</code> query that does not take the <code>NULL</code> values in <code>t_inner</code> into account:</p>
<pre class="brush: sql">
SELECT  SUM(LENGTH(stuffing)), COUNT(*)
FROM    t_outer o
WHERE   val NOT IN
        (
        SELECT  val
        FROM    t_inner i
        WHERE   val IS NOT NULL
        )
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>SUM(LENGTH(stuffing))</th>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="decimal">13400</td>
<td class="bigint">67</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (10.4060s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">o</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">1000000</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">i</td>
<td class="varchar">index_subquery</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">5</td>
<td class="varchar">func</td>
<td class="bigint">20</td>
<td class="double">100.00</td>
<td class="varchar">Using index; Using where; Full scan on NULL key</td>
</tr>
</table>
</div>
<pre>
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(&lt;in_optimizer&gt;(`20100527_anti`.`o`.`val`,&lt;exists&gt;(&lt;index_lookup&gt;(&lt;cache&gt;(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL where (`20100527_anti`.`i`.`val` is not null) having trigcond(&lt;is_not_null_test&gt;(`20100527_anti`.`i`.`val`)))))))
</pre>
<p>This time, the query returns records, but not as many as the previous queries did.</p>
<p>We made an additional check for <code>NULL</code> in <code>t_inner</code> but not in <code>t_outer</code>. There are some records in <code>t_outer</code> that have a <code>NULL</code> in <code>val</code>. For these records, both <code>IN</code> and <code>NOT IN</code> evaluate to <code>NULL</code>, and <code>WHERE</code> filters them out.</p>
<p>We see another glitch in the <strong>MySQL</strong> optimizer here: a <code>Full scan on NULL key</code> is applied. Since <code>NOT IN</code> should always return <code>TRUE</code> when the subquery returns no records (even if the value checked is a <code>NULL</code>), on correlated queries a full scan has to be performed to check for the records and find out whether to return <code>TRUE</code> or <code>NULL</code>. However, in this case the <code>IN</code> subquery is not correlated, so the check could be performed only once and cached, like with the <code>LEFT JOIN</code>.</p>
<p>In our case the overhead would be negligible, since the subquery would return on the first match, but it could matter if we had more <code>NULL</code> values in <code>t_outer</code>.</p>
<p>Now, what if we want <code>NULL</code> records on <code>t_outer</code> to be returned as well? We just need to add an additional check for <code>NULL</code>s.</p>
<h4>Ignoring all <code>NULL</code>s</h4>
<pre class="brush: sql">
SELECT  SUM(LENGTH(stuffing)), COUNT(*)
FROM    t_outer o
WHERE   val IS NULL
        OR val NOT IN
        (
        SELECT  val
        FROM    t_inner i
        WHERE   val IS NOT NULL
        )
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>SUM(LENGTH(stuffing))</th>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="decimal">14600</td>
<td class="bigint">73</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0002s (10.4842s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">o</td>
<td class="varchar">ALL</td>
<td class="varchar">ix_outer_val</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">1000000</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">i</td>
<td class="varchar">index_subquery</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">ix_inner_val</td>
<td class="varchar">5</td>
<td class="varchar">func</td>
<td class="bigint">20</td>
<td class="double">100.00</td>
<td class="varchar">Using index; Using where; Full scan on NULL key</td>
</tr>
</table>
</div>
<pre>
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (isnull(`20100527_anti`.`o`.`val`) or (not(&lt;in_optimizer&gt;(`20100527_anti`.`o`.`val`,&lt;exists&gt;(&lt;index_lookup&gt;(&lt;cache&gt;(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL where (`20100527_anti`.`i`.`val` is not null) having trigcond(&lt;is_not_null_test&gt;(`20100527_anti`.`i`.`val`))))))))
</pre>
<p>Here, the query returns the same results as <code>NOT EXISTS</code>.</p>
<p><code>Full scan on NULL key</code> is still present in the plan but will never actually be executed because it will be short circuited by the previous <code>IS NULL</code> check.</p>
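<p>The difference between the four variants can also be reproduced on a handful of rows. The following is a minimal sketch: the tables <code>demo_outer</code> and <code>demo_inner</code> are hypothetical and not part of the test setup above, and the comments list the rows each query should return.</p>
<pre class="brush: sql">
CREATE TEMPORARY TABLE demo_outer (val INT NULL);
CREATE TEMPORARY TABLE demo_inner (val INT NULL);

INSERT INTO demo_outer VALUES (1), (2), (NULL);
INSERT INTO demo_inner VALUES (2), (NULL);

-- NOT EXISTS: returns 1 and NULL
SELECT  o.val
FROM    demo_outer o
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    demo_inner i
        WHERE   i.val = o.val
        );

-- Plain NOT IN: returns no rows, because demo_inner contains a NULL
SELECT  o.val
FROM    demo_outer o
WHERE   o.val NOT IN
        (
        SELECT  i.val
        FROM    demo_inner i
        );

-- NOT IN ignoring right side NULLs: returns 1 only
SELECT  o.val
FROM    demo_outer o
WHERE   o.val NOT IN
        (
        SELECT  i.val
        FROM    demo_inner i
        WHERE   i.val IS NOT NULL
        );

-- NOT IN ignoring all NULLs: returns 1 and NULL, same as NOT EXISTS
SELECT  o.val
FROM    demo_outer o
WHERE   o.val IS NULL
        OR o.val NOT IN
        (
        SELECT  i.val
        FROM    demo_inner i
        WHERE   i.val IS NOT NULL
        );
</pre>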
<h3>Summary</h3>
<p>As was shown in the <a href="/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/">earlier article</a>, <code>LEFT JOIN / IS NULL</code> and <code>NOT IN</code> are best used to implement an anti-join in <strong>MySQL</strong> if the columns on both sides are not nullable.</p>
<p>The situation is different when the columns are nullable:</p>
<ul>
<li><code>NOT EXISTS</code> works in the most straightforward way: it just checks equality and returns <code>TRUE</code> or <code>FALSE</code> on the first hit / miss.</li>
<li><code>LEFT JOIN / IS NULL</code> either makes an additional table lookup or does not return on the first match, and performs more poorly in both cases.</li>
<li><code>NOT IN</code>, having different semantics, requires additional checks for <code>NULL</code> values. These checks should be coded into the query.</li>
</ul>
<p>With nullable columns, <code>NOT EXISTS</code> and <code>NOT IN</code> (with additional checks for <code>NULL</code>s) are the most efficient methods to implement an anti-join in <strong>MySQL</strong>.</p>
<p><code>LEFT JOIN / IS NULL</code> performs poorly.</p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/05/27/left-join-is-null-vs-not-in-vs-not-exists-nullable-columns/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MAX and MIN on a composite index</title>
		<link>http://explainextended.com/2010/05/08/max-and-min-on-a-composite-index/</link>
		<comments>http://explainextended.com/2010/05/08/max-and-min-on-a-composite-index/#comments</comments>
		<pubDate>Sat, 08 May 2010 19:00:47 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4733</guid>
		<description><![CDATA[Answering questions asked on the site. Ivo Radev asks: I am trying to make a very simple query. We have a log table which different machines write to. Given the machine list, I need to find the latest log timestamp. Currently, the query looks like this: SELECT MAX(log_time) FROM log_table WHERE log_machine IN ($machines) , [...]]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Ivo Radev</strong> asks:</p>
<blockquote><p>I am trying to make a very simple query.</p>
<p>We have a log table which different machines write to. Given the machine list, I need to find the latest log timestamp.</p>
<p>Currently, the query looks like this:</p>
<pre class="brush: sql">
SELECT  MAX(log_time)
FROM    log_table
WHERE   log_machine IN ($machines)
</pre>
<p>, and I pass the comma-separated list of <code>$machines</code> from <strong>PHP</strong>.</p>
<p>The weird thing is that the query is literally instant when there is only one machine (any) in the list but slow when there are multiple machines.</p>
<p>I&#8217;m considering doing it in separate queries and then process the results in PHP. However I&#8217;d like to know if there is a fast solution in MySQL.</p></blockquote>
<p>Most probably, there is a composite index on <code>(log_machine, log_time)</code> which is being used for the query.</p>
<p>Usually, a query like this:</p>
<pre class="brush: sql">
SELECT  MAX(log_time)
FROM    log_table
</pre>
<p>on the indexed field <code>log_time</code> can be served with a single index seek on the index.</p>
<p>Indeed, the <code>MAX(log_time)</code>, by definition, is the latest entry in the index order, and can be fetched merely by finding the trailing index entry. It&#8217;s a matter of several page reads in the <code>B-Tree</code>, each one following the rightmost link to the lower-level page.</p>
<p>Similarly, this query:</p>
<pre class="brush: sql">
SELECT  MAX(log_time)
FROM    log_table
WHERE   log_machine = $my_machine
</pre>
<p>can be served with a single index seek too. However, the index should include <code>log_machine</code> as a leading column.</p>
<p>In this case, a set of records satisfying the <code>WHERE</code> clause of the query is represented by a single logically continuous block of records in the index, each one sharing the same value of <code>log_machine</code>. <code>MAX(log_time)</code> will of course be held by the last record in this block. <strong>MySQL</strong> just finds that last record and takes the <code>log_time</code> out of it.</p>
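<p>To illustrate the single-machine case, here is a minimal sketch of an equivalent formulation that makes the access pattern explicit: order the machine&#8217;s block by <code>log_time</code> in descending order and take the first record. The machine name is just an example value. The two forms agree whenever the machine has at least one record; with no matching records, <code>MAX</code> returns a single <code>NULL</code>, while this query returns no rows.</p>
<pre class="brush: sql">
-- Intended to read only the last entry of the &#039;Machine 3&#039; block
-- in an index on (log_machine, log_time)
SELECT  log_time
FROM    log_table
WHERE   log_machine = &#039;Machine 3&#039;
ORDER BY
        log_time DESC
LIMIT 1
</pre>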
<p>Now, what if the condition lists several values of <code>log_machine</code>?<br />
<span id="more-4733"></span><br />
The index remains the same, but the record holding <code>MAX(log_time)</code> is not the last record in a single continuous block anymore. Instead, there are multiple blocks, each having its own <code>MAX(log_time)</code>. The overall maximum cannot be found merely by taking the last record of one index block: it is not known in advance which block holds it.</p>
<p>On composite indexes, however, <strong>MySQL</strong> offers <a href="http://dev.mysql.com/doc/refman/5.5/en/loose-index-scan.html"><strong>loose index scan</strong></a>. This means that it jumps over the distinct values of the leading column, doing an index seek (instead of index scan) to retrieve each next value.</p>
<p>As stated in the documentation, this method is ideal for queries like this:</p>
<pre class="brush: sql">
SELECT  log_machine, MAX(log_time)
FROM    log_table
WHERE   log_machine IN ($my_machine_list)
GROUP BY
        log_machine
</pre>
<p>As we said earlier, for each <code>log_machine</code>, its <code>MAX(log_time)</code> can be returned very fast, and the list of the <code>log_machines</code> could be obtained with a loose index scan, by seeking the keys in the index.</p>
<p>This query, however, will not produce a single <code>MAX(log_time)</code>: instead, it will return as many maximums as there are values in the list (which are found in the table, of course).</p>
<p>But this can be easily worked around: we just select the greatest one of these records. Since the grouped query will only return several records, the greatest one of them can be found almost instantly.</p>
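<p>One way to spell this idea out, given as a sketch rather than a measured solution, is to wrap the grouped query into a derived table and take the maximum of the per-machine maximums. The alias <code>q</code> and the column name <code>machine_max</code> are illustrative, and whether the inner query is actually served by a loose index scan should be verified with <code>EXPLAIN</code>:</p>
<pre class="brush: sql">
SELECT  MAX(machine_max) AS maxtime
FROM    (
        -- One row per machine found in the table
        SELECT  log_machine, MAX(log_time) AS machine_max
        FROM    log_table
        WHERE   log_machine IN ($my_machine_list)
        GROUP BY
                log_machine
        ) q
</pre>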
<p>Let&#8217;s create a sample table:</p>
<p><a href="#" onclick="xcollapse('X3511');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X3511" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE log_table (
        id INT NOT NULL PRIMARY KEY,
        log_machine VARCHAR(20) NOT NULL,
        log_time DATETIME NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    log_table
SELECT  id,
        CONCAT(&#039;Machine &#039; , CEILING(RAND(20100508) * 10)),
        &#039;2010-05-08&#039; - INTERVAL CEILING(RAND(20100508 &lt;&lt; 1) * 10000000) SECOND
FROM    filler f1;

CREATE INDEX ix_log_machine_time ON log_table (log_machine, log_time);
</pre>
</div>
<p>The table has <strong>1,000,000</strong> records.</p>
<p>Now, let&#8217;s see how the original query performs:</p>
<pre class="brush: sql">
SELECT  MAX(log_time) AS maxtime
FROM    log_table
WHERE   log_machine IN (&#039;Machine 3&#039;, &#039;Machine 5&#039;, &#039;Machine 7&#039;, &#039;Machine 9&#039;)
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>maxtime</th>
</tr>
<tr>
<td class="timestamp">2010-05-07 23:59:49</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.6406s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">log_table</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_machine_time</td>
<td class="varchar">ix_log_machine_time</td>
<td class="varchar">62</td>
<td class="varchar"></td>
<td class="bigint">826326</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
select max(`20100508_max`.`log_table`.`log_time`) AS `maxtime` from `20100508_max`.`log_table` where (`20100508_max`.`log_table`.`log_machine` in (&#39;Machine 3&#39;,&#39;Machine 5&#39;,&#39;Machine 7&#39;,&#39;Machine 9&#39;))
</pre>
<p>The query uses <code>range</code> access to retrieve the records and browses all of them to find the maximum. It takes <strong>640 ms</strong> on a table of <strong>1,000,000</strong> log records (which is about a day&#8217;s output of a single web server under a decent but not extreme load).</p>
<p>Now, let&#8217;s try to select the greatest of the group-wise maximums:</p>
<pre class="brush: sql">
SELECT  MAX(log_time) AS maxtime
FROM    log_table
WHERE   log_machine IN (&#039;Machine 3&#039;, &#039;Machine 5&#039;, &#039;Machine 7&#039;, &#039;Machine 9&#039;)
GROUP BY
        log_machine
ORDER BY
        1 DESC
LIMIT 1
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>maxtime</th>
</tr>
<tr>
<td class="timestamp">2010-05-07 23:59:49</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.0020s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">log_table</td>
<td class="varchar">range</td>
<td class="varchar">ix_log_machine_time</td>
<td class="varchar">ix_log_machine_time</td>
<td class="varchar">62</td>
<td class="varchar"></td>
<td class="bigint">16</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index for group-by; Using temporary; Using filesort</td>
</tr>
</table>
</div>
<pre>
select max(`20100508_max`.`log_table`.`log_time`) AS `maxtime` from `20100508_max`.`log_table` where (`20100508_max`.`log_table`.`log_machine` in (&#39;Machine 3&#39;,&#39;Machine 5&#39;,&#39;Machine 7&#39;,&#39;Machine 9&#39;)) group by `20100508_max`.`log_table`.`log_machine` order by 1 desc limit 1
</pre>
<p>Now, it&#8217;s instant, as it should be.</p>
<p>As often happens, by appending three seemingly redundant clauses to a query we made <strong>MySQL</strong> choose a more efficient plan, and the query is now instant even with multiple machines in the list.</p>
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/05/08/max-and-min-on-a-composite-index/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Groups holding highest ranked items</title>
		<link>http://explainextended.com/2010/04/22/groups-holding-highest-ranked-items/</link>
		<comments>http://explainextended.com/2010/04/22/groups-holding-highest-ranked-items/#comments</comments>
		<pubDate>Thu, 22 Apr 2010 19:00:51 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4700</guid>
		<description><![CDATA[Answering questions asked on the site. Nate asks: I know you&#8217;ve addressed similar issues related to the greatest-per-group query but this seems to be a different take on that. Example table: t_group item_id group_id score 100 1 2 100 2 3 200 1 1 300 1 4 300 2 2 Each item may be in [...]]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Nate</strong> asks:</p>
<blockquote><p>I know you&#8217;ve addressed similar issues related to the <a href="/2009/11/24/mysql-selecting-records-holding-group-wise-maximum-on-a-unique-column/"><q>greatest-per-group</q> query</a> but this seems to be a different take on that.</p>
<p>Example table:</p>
<table class="excel">
<caption>t_group</caption>
<tr>
<th>item_id</th>
<th>group_id</th>
<th>score</th>
</tr>
<tr>
<td>100</td>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>100</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>200</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>300</td>
<td>1</td>
<td>4</td>
</tr>
<tr>
<td>300</td>
<td>2</td>
<td>2</td>
</tr>
</table>
<p>Each item may be in multiple groups. Each instance of an item in a group is given a score (how relevant it is to the group).</p>
<p>So given the data above, when querying for group <strong>1</strong> it should return items <strong>200</strong> and <strong>300</strong> (item <strong>100</strong>&#8217;s highest score is for group <strong>2</strong>, so it&#8217;s excluded).
</p></blockquote>
<p>The classical <q>greatest-n-per-group</q> problem requires selecting a single record from each group holding a group-wise maximum. This case is a little bit different: for a given group, we need to select all records holding an item-wise maximum.</p>
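<p>Before building a large sample, the requirement can be checked on the five rows from the question. This is a minimal sketch: the table name <code>t_group_demo</code> is hypothetical, and the query uses a plain correlated <code>MAX(score)</code> subquery, which keeps an item in every group that ties for its top score (the question does not say how ties should be handled).</p>
<pre class="brush: sql">
CREATE TABLE t_group_demo (
        item_id INT NOT NULL,
        group_id INT NOT NULL,
        score INT NOT NULL,
        PRIMARY KEY (group_id, item_id)
);

INSERT
INTO    t_group_demo (item_id, group_id, score)
VALUES  (100, 1, 2),
        (100, 2, 3),
        (200, 1, 1),
        (300, 1, 4),
        (300, 2, 2);

-- Items of group 1 whose score in group 1 is their highest score overall.
-- Returns items 200 and 300; item 100 scores higher in group 2.
SELECT  item_id, group_id, score
FROM    t_group_demo g
WHERE   group_id = 1
        AND score =
        (
        SELECT  MAX(score)
        FROM    t_group_demo gi
        WHERE   gi.item_id = g.item_id
        );

DROP TABLE t_group_demo;
</pre>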
<p>Let&#8217;s create a sample table:<br />
<span id="more-4700"></span><br />
<a href="#" onclick="xcollapse('X3879');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X3879" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_groups (
        item_id INT NOT NULL,
        group_id INT NOT NULL,
        score INT NOT NULL,
        PRIMARY KEY (group_id, item_id),
        KEY ix_groups_gsi (item_id, score, group_id)
) ENGINE=InnoDB;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_groups
SELECT  (id - 1) % 1000 + 1,
        (id - 1) div 1000 + 1,
        CEILING(RAND(20100422) * 10000)
FROM    filler;
</pre>
</div>
<p>This table contains <strong>1,000,000</strong> records: <strong>1,000</strong> items in <strong>1,000</strong> groups with random scores.</p>
<p>Let&#8217;s write a query that returns all items whose largest score is in group <strong>1</strong>.</p>
<p>To do this, we need to select all items from group <strong>1</strong> and check that no other group has a greater value of <code>score</code> for that item. The most intuitive query for this looks like this:</p>
<pre class="brush: sql">
SELECT  *
FROM    t_groups go
WHERE   group_id = 1
        AND NOT EXISTS
        (
        SELECT  group_id
        FROM    t_groups gi
        WHERE   gi.item_id = go.item_id
                AND (gi.score, gi.group_id) &gt; (go.score, go.group_id)
        )
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>item_id</th>
<th>group_id</th>
<th>score</th>
</tr>
<tr>
<td class="integer">288</td>
<td class="integer">1</td>
<td class="integer">9997</td>
</tr>
<tr>
<td class="integer">778</td>
<td class="integer">1</td>
<td class="integer">9995</td>
</tr>
<tr>
<td class="integer">970</td>
<td class="integer">1</td>
<td class="integer">9999</td>
</tr>
<tr class="statusbar">
<td colspan="100">3 rows fetched in 0.0002s (0.4210s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">go</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">const</td>
<td class="bigint">1496</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">gi</td>
<td class="varchar">ref</td>
<td class="varchar">ix_groups_gsi</td>
<td class="varchar">ix_groups_gsi</td>
<td class="varchar">4</td>
<td class="varchar">20100422_rank.go.item_id</td>
<td class="bigint">366</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100422_rank.go.item_id&#39; of SELECT #2 was resolved in SELECT #1
Field or reference &#39;20100422_rank.go.score&#39; of SELECT #2 was resolved in SELECT #1
Field or reference &#39;20100422_rank.go.group_id&#39; of SELECT #2 was resolved in SELECT #1
select `20100422_rank`.`go`.`item_id` AS `item_id`,`20100422_rank`.`go`.`group_id` AS `group_id`,`20100422_rank`.`go`.`score` AS `score` from `20100422_rank`.`t_groups` `go` where ((`20100422_rank`.`go`.`group_id` = 1) and (not(exists(select `20100422_rank`.`gi`.`group_id` AS `group_id` from `20100422_rank`.`t_groups` `gi` where ((`20100422_rank`.`gi`.`item_id` = `20100422_rank`.`go`.`item_id`) and ((`20100422_rank`.`gi`.`score`,`20100422_rank`.`gi`.`group_id`) &gt; (`20100422_rank`.`go`.`score`,`20100422_rank`.`go`.`group_id`)))))))
</pre>
<p>This query returns <strong>3</strong> items and their scores. We see that these items are not ranked higher in any other group.</p>
<p>However, this query is quite inefficient: it takes almost half a second. This is way too long for <strong>1K</strong> items. Let&#8217;s see how it can be optimized.</p>
<p>If we look at the query plan, we will see that no <code>range</code> access path is used for the subquery, even though the condition could easily be served by one. Instead, <strong>MySQL</strong> uses only the <code>ref</code> access on <code>item_id</code>, combined with <code>Using where; Using index</code>.</p>
<p>What does it all mean?</p>
<p>We have a composite index on <code>(item_id, score, group_id)</code>. This means that within the index, the records are ordered first by <code>item_id</code>, then by <code>score</code> and then, in case of a tie, by <code>group_id</code>.</p>
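<p>To see this ordering on the sample data, we can list the index records for a single item. This is just an illustrative sketch: item <strong>288</strong> is taken from the result set above, and the <code>FORCE INDEX</code> hint is added only to make sure the composite index is the one being read:</p>
<pre class="brush: sql">
-- List the index records of one item in index order:
-- item_id first, then score, then group_id as the tie-breaker.
SELECT  item_id, score, group_id
FROM    t_groups FORCE INDEX (ix_groups_gsi)
WHERE   item_id = 288
ORDER BY
        item_id, score, group_id;
</pre>
<p>The record with the greatest score for the item comes last in this list. For item <strong>288</strong> that should be the record for group <strong>1</strong> with the score of <strong>9997</strong>, which is why the item shows up in the result above.</p>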
<p>For group <strong>1</strong>, <strong>MySQL</strong> has to perform <strong>1,000</strong> checks: for each item within the group, the engine has to make sure that no other group has a higher score for the same item.</p>
<p>Ideally, <strong>MySQL</strong> would find each given set of <code>(item_id, score, group_id)</code> in the index and then make a single next-key search to check whether this record is the last one in the index within the given <code>item_id</code>. That would show up as a <code>range</code> search in the query plan, since we are actually checking the values between <code>(item_id, score, group_id)</code> and <code>(item_id, +INF, +INF)</code>.</p>
<p>However, <strong>MySQL</strong> cannot do such things in a correlated subquery. Instead, it uses the <code>ref</code> access path: it takes all records with the given <code>item_id</code> (i.e. those between <code>(item_id, -INF, -INF)</code> and <code>(item_id, +INF, +INF)</code>) and traverses them, applying the <code>WHERE</code> filter to each record.</p>
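<p>For comparison, here is a sketch of the same check for a single item with the outer values plugged in as constants. The literal values are those of item <strong>288</strong> from the result above, and the row comparison is expanded by hand into an equivalent <code>OR</code> condition. With constants, the optimizer is generally able to consider a <code>range</code> access on the index, though the plan actually chosen may differ between versions, so it is worth checking with <code>EXPLAIN</code>:</p>
<pre class="brush: sql">
-- Same check as the correlated subquery, but for one item and with constants:
-- is there any record for item 288 that sorts after (score 9997, group 1)?
EXPLAIN
SELECT  group_id
FROM    t_groups
WHERE   item_id = 288
        AND (score &gt; 9997 OR (score = 9997 AND group_id &gt; 1))
LIMIT 1;
</pre>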
<p>For group <strong>1</strong>, only <strong>3</strong> items are returned. This means that for the majority of items (<strong>997</strong> items), the whole index range for the item (<strong>1,000</strong> records per item) had to be scanned.</p>
<p>And since the matching record is always the last one in the index range (because it&#8217;s the one with the greatest score), this means that even for the returned items, the whole range had to be scanned too. The only difference is that the last record in the range satisfies the <code>WHERE</code> condition. No wonder it takes so long.</p>
<p>To speed up the query we need to trick <strong>MySQL</strong> a little.</p>
<p>Since it&#8217;s always the last record in the index range that satisfies the <code>WHERE</code> condition, why don&#8217;t we just take it and compare? If we see that it holds our group, we return <code>TRUE</code>; if it doesn&#8217;t, we return <code>FALSE</code> right away.</p>
<p>Here&#8217;s how we can do this:</p>
<pre class="brush: sql">
SELECT  *
FROM    t_groups go
WHERE   group_id = 1
        AND group_id =
        (
        SELECT  group_id
        FROM    t_groups gi
        WHERE   gi.item_id = go.item_id
        ORDER BY
                item_id DESC, score DESC, group_id DESC
        LIMIT 1
        )
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>item_id</th>
<th>group_id</th>
<th>score</th>
</tr>
<tr>
<td class="integer">288</td>
<td class="integer">1</td>
<td class="integer">9997</td>
</tr>
<tr>
<td class="integer">778</td>
<td class="integer">1</td>
<td class="integer">9995</td>
</tr>
<tr>
<td class="integer">970</td>
<td class="integer">1</td>
<td class="integer">9999</td>
</tr>
<tr class="statusbar">
<td colspan="100">3 rows fetched in 0.0002s (0.0100s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">go</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">const</td>
<td class="bigint">1496</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">gi</td>
<td class="varchar">ref</td>
<td class="varchar">ix_groups_gsi</td>
<td class="varchar">ix_groups_gsi</td>
<td class="varchar">4</td>
<td class="varchar">20100422_rank.go.item_id</td>
<td class="bigint">366</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100422_rank.go.item_id&#39; of SELECT #2 was resolved in SELECT #1
select `20100422_rank`.`go`.`item_id` AS `item_id`,`20100422_rank`.`go`.`group_id` AS `group_id`,`20100422_rank`.`go`.`score` AS `score` from `20100422_rank`.`t_groups` `go` where ((`20100422_rank`.`go`.`group_id` = 1) and (1 = (select `20100422_rank`.`gi`.`group_id` AS `group_id` from `20100422_rank`.`t_groups` `gi` where (`20100422_rank`.`gi`.`item_id` = `20100422_rank`.`go`.`item_id`) order by `20100422_rank`.`gi`.`item_id` desc,`20100422_rank`.`gi`.`score` desc,`20100422_rank`.`gi`.`group_id` desc limit 1)))
</pre>
<p>Instead of checking the <em>existence</em> of the last row in a range, we check its <em>value</em>. This means that exactly one index record will be evaluated for each item within the group.</p>
<p>Note that the execution plan used by this query looks <em>exactly</em> the same as the first query&#8217;s one. Same <code>ref</code> condition on <code>item_id</code>, same <code>Using where; Using index</code>. The only difference (not shown in the <code>EXPLAIN</code> output) is that the index is now traversed in <em>descending</em> order. Instead of fetching <strong>1,000</strong> records from the beginning to check the existence of the last one, we just take one record from the end and check its value.</p>
<p>Those who read my blog already know why we use the seemingly redundant <code>item_id DESC</code> in the <code>ORDER BY</code> clause here. To make <strong>MySQL</strong> use a descending index access path instead of a filesort in an ordered query, we should list <em>all</em> columns that constitute the index, even if some of them are filtered by the <code>WHERE</code> condition.</p>
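<p>One way to check this on a particular server is to run the subquery on its own for a single item and compare the plans of the two <code>ORDER BY</code> variants. This is only a sketch for experimenting: item <strong>288</strong> is taken from the result above, and the plans reported may differ between versions:</p>
<pre class="brush: sql">
-- Variant 1: every column of the index is listed in the ORDER BY
EXPLAIN
SELECT  group_id
FROM    t_groups
WHERE   item_id = 288
ORDER BY
        item_id DESC, score DESC, group_id DESC
LIMIT 1;

-- Variant 2: the leading column (already fixed by the WHERE) is omitted
EXPLAIN
SELECT  group_id
FROM    t_groups
WHERE   item_id = 288
ORDER BY
        score DESC, group_id DESC
LIMIT 1;
</pre>
<p>If the second plan shows <code>Using filesort</code> in the <code>Extra</code> column and the first one does not, this is the behavior described above. Run without <code>EXPLAIN</code>, both variants should return group <strong>1</strong> for this item.</p>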
<p>The second query completes in only <strong>10 ms</strong>, which is about <strong>40</strong> times as fast as the original query.</p>
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/04/22/groups-holding-highest-ranked-items/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hierarchical query in MySQL: limiting parents</title>
		<link>http://explainextended.com/2010/04/18/hierarchical-query-in-mysql-limiting-parents/</link>
		<comments>http://explainextended.com/2010/04/18/hierarchical-query-in-mysql-limiting-parents/#comments</comments>
		<pubDate>Sun, 18 Apr 2010 19:00:08 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4691</guid>
		<description><![CDATA[Answering questions asked on the site. James asks: Your series on hierarchical queries in MySQL is tremendous! I&#8217;m using it to create a series of threaded conversations. I&#8217;m wondering if there is a way to paginate these results. Specifically, let&#8217;s say I want to limit the conversations to return 10 root nodes (parent=0) and all [...]]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>James</strong> asks:</p>
<blockquote><p>Your series on <a href="/2009/03/17/hierarchical-queries-in-mysql/">hierarchical queries in MySQL</a> is tremendous! I&#8217;m using it to create a series of threaded conversations.</p>
<p>I&#8217;m wondering if there is a way to paginate these results.</p>
<p>Specifically, let&#8217;s say I want to limit the conversations to return <strong>10</strong> root nodes (<code>parent=0</code>) and all of their children in a query.</p>
<p>I can&#8217;t just limit the final query, because that will clip off children. I&#8217;ve tried to add <code>LIMIT</code>s to your stored functions, but I&#8217;m not getting the magic just right.</p>
<p>How would you go about doing this?
</p></blockquote>
<p>A quick reminder: <strong>MySQL</strong> does not support recursion (either <code>CONNECT BY</code> style or recursive <strong>CTE</strong> style), so using an adjacency list model is a somewhat complicated task.</p>
<p>However, it is still possible. The main idea is to store the recursion state in a session variable and call a user-defined function repeatedly to iterate over the tree, thus emulating recursion. The article mentioned in the question shows how to do that.</p>
<p>Normally, reading and assigning session variables in the same query is discouraged in <strong>MySQL</strong>, since the order of evaluation is not guaranteed. However, in this case we only use the table as a dummy recordset and no values of the records are actually used in the function, so the actual values returned by the function are completely defined by the function itself. The table is only used to ensure that the function is called enough times, and to present its results in the form of a native resultset (which can be returned or joined with).</p>
<p>Therefore, to change the logic of the function (say, to impose a limit on the parent nodes without doing the same on the child nodes), we should tweak the function code, not the query that calls the function. The only thing that matters in such a query is the number of records returned, and we don&#8217;t know it at design time.</p>
<p>Limiting the parent nodes is quite simple: we just use another session variable to track the number of parent branches yet to be returned and stop processing as soon as the limit is hit, that is, when the variable becomes zero.</p>
<p>Let&#8217;s create a sample table and see how to do this:<br />
<span id="more-4691"></span><br />
<a href="#" onclick="xcollapse('X1510');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X1510" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_hierarchy (
        id int(10) unsigned NOT NULL AUTO_INCREMENT,
        parent int(10) unsigned NOT NULL,
        root INT NOT NULL,
        PRIMARY KEY (id),
        KEY ix_hierarchy_parent (parent, id),
        KEY ix_hierarchy_root (root)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT
INTO    t_hierarchy
SELECT  id,
        CASE (id - 1) % 8
        WHEN 0 THEN 0
        ELSE FLOOR((((id - 1) % 8) + 1) / 2) + ((id - 1) div 8) * 8
        END,
        ((id - 1) div 8) * 8 + 1
FROM    filler;
</pre>
</div>
<p>There are <strong>100,000</strong> hierarchical records in multiple trees (<strong>8</strong> records in each tree).</p>
<p>To limit the number of trees returned, we create a function similar to the one created in the earlier posts and add a little condition that decreases the session variable, <code>@parent_limit</code>, each time a parent entry is returned. When this variable hits zero, it&#8217;s a signal to stop processing the records:</p>
<pre class="brush: sql">
CREATE FUNCTION hierarchy_connect_by_parent_eq_prior_id(value INT) RETURNS INT
NOT DETERMINISTIC
READS SQL DATA
BEGIN
        DECLARE _id INT;
        DECLARE _parent INT;
        DECLARE _next INT;
        DECLARE CONTINUE HANDLER FOR NOT FOUND SET @id = NULL;

        SET _parent = @id;
        SET _id = -1;

        IF @id IS NULL THEN
                RETURN NULL;
        END IF;

        LOOP
                SELECT  MIN(id)
                INTO    @id
                FROM    t_hierarchy
                WHERE   parent = _parent
                        AND id &gt; _id;
                IF @id IS NOT NULL OR _parent = @start_with THEN
                        SET @level = @level + 1;
                        IF _parent = @start_with AND @parent_limit &gt; 0 THEN
                                SET @parent_limit = @parent_limit - 1;
                        END IF;
                        IF @parent_limit = 0 THEN
                                SET @id = NULL;
                        END IF;
                        RETURN @id;
                END IF;
                SET @level := @level - 1;
                SELECT  id, parent
                INTO    _id, _parent
                FROM    t_hierarchy
                WHERE   id = _parent;
        END LOOP;
END;
</pre>
<p>Let&#8217;s check it:</p>
<pre class="brush: sql">
SELECT  CONCAT(REPEAT(&#039;    &#039;, level - 1), CAST(hi.id AS CHAR)) AS treeitem, parent, level, lmt
FROM    (
        SELECT  hierarchy_connect_by_parent_eq_prior_id(id) AS id, @level AS level, @parent_limit as lmt
        FROM    (
                SELECT  @start_with := 0,
                        @parent_limit := 4,
                        @id := @start_with,
                        @level := 0
                ) vars, t_hierarchy
        WHERE   @id IS NOT NULL
        ) ho
JOIN    t_hierarchy hi
ON      hi.id = ho.id
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>treeitem</th>
<th>parent</th>
<th>level</th>
<th>lmt</th>
</tr>
<tr>
<td class="blob">1</td>
<td class="integer">0</td>
<td class="blob">1</td>
<td class="blob">3</td>
</tr>
<tr>
<td class="blob">    2</td>
<td class="integer">1</td>
<td class="blob">2</td>
<td class="blob">3</td>
</tr>
<tr>
<td class="blob">        4</td>
<td class="integer">2</td>
<td class="blob">3</td>
<td class="blob">3</td>
</tr>
<tr>
<td class="blob">            8</td>
<td class="integer">4</td>
<td class="blob">4</td>
<td class="blob">3</td>
</tr>
<tr>
<td class="blob">        5</td>
<td class="integer">2</td>
<td class="blob">3</td>
<td class="blob">3</td>
</tr>
<tr>
<td class="blob">    3</td>
<td class="integer">1</td>
<td class="blob">2</td>
<td class="blob">3</td>
</tr>
<tr>
<td class="blob">        6</td>
<td class="integer">3</td>
<td class="blob">3</td>
<td class="blob">3</td>
</tr>
<tr>
<td class="blob">        7</td>
<td class="integer">3</td>
<td class="blob">3</td>
<td class="blob">3</td>
</tr>
<tr>
<td class="blob">9</td>
<td class="integer">0</td>
<td class="blob">1</td>
<td class="blob">2</td>
</tr>
<tr>
<td class="blob">    10</td>
<td class="integer">9</td>
<td class="blob">2</td>
<td class="blob">2</td>
</tr>
<tr>
<td class="blob">        12</td>
<td class="integer">10</td>
<td class="blob">3</td>
<td class="blob">2</td>
</tr>
<tr>
<td class="blob">            16</td>
<td class="integer">12</td>
<td class="blob">4</td>
<td class="blob">2</td>
</tr>
<tr>
<td class="blob">        13</td>
<td class="integer">10</td>
<td class="blob">3</td>
<td class="blob">2</td>
</tr>
<tr>
<td class="blob">    11</td>
<td class="integer">9</td>
<td class="blob">2</td>
<td class="blob">2</td>
</tr>
<tr>
<td class="blob">        14</td>
<td class="integer">11</td>
<td class="blob">3</td>
<td class="blob">2</td>
</tr>
<tr>
<td class="blob">        15</td>
<td class="integer">11</td>
<td class="blob">3</td>
<td class="blob">2</td>
</tr>
<tr>
<td class="blob">17</td>
<td class="integer">0</td>
<td class="blob">1</td>
<td class="blob">1</td>
</tr>
<tr>
<td class="blob">    18</td>
<td class="integer">17</td>
<td class="blob">2</td>
<td class="blob">1</td>
</tr>
<tr>
<td class="blob">        20</td>
<td class="integer">18</td>
<td class="blob">3</td>
<td class="blob">1</td>
</tr>
<tr>
<td class="blob">            24</td>
<td class="integer">20</td>
<td class="blob">4</td>
<td class="blob">1</td>
</tr>
<tr>
<td class="blob">        21</td>
<td class="integer">18</td>
<td class="blob">3</td>
<td class="blob">1</td>
</tr>
<tr>
<td class="blob">    19</td>
<td class="integer">17</td>
<td class="blob">2</td>
<td class="blob">1</td>
</tr>
<tr>
<td class="blob">        22</td>
<td class="integer">19</td>
<td class="blob">3</td>
<td class="blob">1</td>
</tr>
<tr>
<td class="blob">        23</td>
<td class="integer">19</td>
<td class="blob">3</td>
<td class="blob">1</td>
</tr>
<tr class="statusbar">
<td colspan="100">24 rows fetched in 0.0008s (0.0711s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">25</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">hi</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">ho.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">system</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">t_hierarchy</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">100650</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
</table>
</div>
<pre>
select concat(repeat(&#39;    &#39;,(`ho`.`level` - 1)),cast(`20100418_limit`.`hi`.`id` as char charset latin1)) AS `treeitem`,`20100418_limit`.`hi`.`parent` AS `parent`,`ho`.`level` AS `level`,`ho`.`lmt` AS `lmt` from (select `hierarchy_connect_by_parent_eq_prior_id`(`20100418_limit`.`t_hierarchy`.`id`) AS `id`,(@level) AS `level`,(@parent_limit) AS `lmt` from (select (@start_with:=0) AS `@start_with := 0`,(@parent_limit:=4) AS `@parent_limit := 4`,(@id:=(@start_with)) AS `@id := @start_with`,(@level:=0) AS `@level := 0`) `vars` join `20100418_limit`.`t_hierarchy` where ((@id) is not null)) `ho` join `20100418_limit`.`t_hierarchy` `hi` where (`20100418_limit`.`hi`.`id` = `ho`.`id`)
</pre>
<p>We get the first <strong>3</strong> branches in a proper hierarchy, almost instantly. Note that we need to set <code>@parent_limit</code> to <strong>4</strong>, i.e. one greater than the number of branches we need.</p>
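<p>The extra unit comes from the way the function counts: <code>@parent_limit</code> is decreased every time a root entry is fetched, and the entry that brings it down to zero is discarded rather than returned. Here is a small sketch that hides this off-by-one behind a helper variable. The name <code>@wanted</code> is hypothetical (it is not used by the function), and the rest of the query is the same as above:</p>
<pre class="brush: sql">
-- @wanted is a hypothetical helper variable: the number of root branches to return
SET @wanted := 3;

SELECT  CONCAT(REPEAT(&#039;    &#039;, level - 1), CAST(hi.id AS CHAR)) AS treeitem, parent, level
FROM    (
        SELECT  hierarchy_connect_by_parent_eq_prior_id(id) AS id, @level AS level
        FROM    (
                SELECT  @start_with := 0,
                        -- one more than requested: the root that zeroes the counter is dropped
                        @parent_limit := @wanted + 1,
                        @id := @start_with,
                        @level := 0
                ) vars, t_hierarchy
        WHERE   @id IS NOT NULL
        ) ho
JOIN    t_hierarchy hi
ON      hi.id = ho.id;
</pre>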
<p>Note that it is possible to achieve similar behavior without using any recursive functions at all.</p>
<p>On most forums the filtering is performed on the topic starters, so it&#8217;s often a good idea to store the id of the topic starter along with each reply. In the sample table I did that as well (the <code>root</code> column).</p>
<p>With this, the filtering required becomes very simple:</p>
<pre class="brush: sql">
SELECT  h.*
FROM    (
        SELECT  id
        FROM    t_hierarchy
        WHERE   parent = 0
        ORDER BY
                id
        LIMIT 3
        ) q
JOIN    t_hierarchy h
ON      h.root = q.id
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>parent</th>
<th>root</th>
</tr>
<tr>
<td class="integer">1</td>
<td class="integer">0</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="integer">1</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="integer">1</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">4</td>
<td class="integer">2</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">5</td>
<td class="integer">2</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">6</td>
<td class="integer">3</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">7</td>
<td class="integer">3</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">8</td>
<td class="integer">4</td>
<td class="integer">1</td>
</tr>
<tr>
<td class="integer">9</td>
<td class="integer">0</td>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">10</td>
<td class="integer">9</td>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">11</td>
<td class="integer">9</td>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">12</td>
<td class="integer">10</td>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">13</td>
<td class="integer">10</td>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">14</td>
<td class="integer">11</td>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">15</td>
<td class="integer">11</td>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">16</td>
<td class="integer">12</td>
<td class="integer">9</td>
</tr>
<tr>
<td class="integer">17</td>
<td class="integer">0</td>
<td class="integer">17</td>
</tr>
<tr>
<td class="integer">18</td>
<td class="integer">17</td>
<td class="integer">17</td>
</tr>
<tr>
<td class="integer">19</td>
<td class="integer">17</td>
<td class="integer">17</td>
</tr>
<tr>
<td class="integer">20</td>
<td class="integer">18</td>
<td class="integer">17</td>
</tr>
<tr>
<td class="integer">21</td>
<td class="integer">18</td>
<td class="integer">17</td>
</tr>
<tr>
<td class="integer">22</td>
<td class="integer">19</td>
<td class="integer">17</td>
</tr>
<tr>
<td class="integer">23</td>
<td class="integer">19</td>
<td class="integer">17</td>
</tr>
<tr>
<td class="integer">24</td>
<td class="integer">20</td>
<td class="integer">17</td>
</tr>
<tr class="statusbar">
<td colspan="100">24 rows fetched in 0.0007s (0.0024s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">3</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">h</td>
<td class="varchar">ref</td>
<td class="varchar">ix_hierarchy_root</td>
<td class="varchar">ix_hierarchy_root</td>
<td class="varchar">4</td>
<td class="varchar">q.id</td>
<td class="bigint">4</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">t_hierarchy</td>
<td class="varchar">ref</td>
<td class="varchar">ix_hierarchy_parent</td>
<td class="varchar">ix_hierarchy_parent</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">16830</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
</table>
</div>
<pre>
select `20100418_limit`.`h`.`id` AS `id`,`20100418_limit`.`h`.`parent` AS `parent`,`20100418_limit`.`h`.`root` AS `root` from (select `20100418_limit`.`t_hierarchy`.`id` AS `id` from `20100418_limit`.`t_hierarchy` where (`20100418_limit`.`t_hierarchy`.`parent` = 0) order by `20100418_limit`.`t_hierarchy`.`id` limit 3) `q` join `20100418_limit`.`t_hierarchy` `h` where (`20100418_limit`.`h`.`root` = `q`.`id`)
</pre>
<p>This solution (which has been used by numerous forum engines for ages) is more efficient, since no function calls are involved, and simpler too.</p>
<p>However, it does not preserve the hierarchical order and does not allow sorting on anything but the topic starter, so if you need any of these, the recursive function is still the way to go.</p>
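<p>Since the original question was about pagination, it is worth noting that the <code>root</code> column makes paging over the topic starters straightforward too. Below is a sketch of fetching the second page with a page size of <strong>3</strong>. The offset and the outer <code>ORDER BY</code> are my additions: the latter only makes the output order stable and does not restore the hierarchical order:</p>
<pre class="brush: sql">
-- Second page of root nodes (skip 3, take 3) with all of their descendants
SELECT  h.*
FROM    (
        SELECT  id
        FROM    t_hierarchy
        WHERE   parent = 0
        ORDER BY
                id
        LIMIT 3, 3
        ) q
JOIN    t_hierarchy h
ON      h.root = q.id
ORDER BY
        h.root, h.id;
</pre>
<p>On the sample table this should return the trees rooted at <strong>25</strong>, <strong>33</strong> and <strong>41</strong>, <strong>8</strong> records each.</p>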
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/04/18/hierarchical-query-in-mysql-limiting-parents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multiple attributes in a EAV table: GROUP BY vs. NOT EXISTS</title>
		<link>http://explainextended.com/2010/04/02/multiple-attributes-in-a-eav-table-group-by-vs-not-exists/</link>
		<comments>http://explainextended.com/2010/04/02/multiple-attributes-in-a-eav-table-group-by-vs-not-exists/#comments</comments>
		<pubDate>Fri, 02 Apr 2010 19:00:15 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4647</guid>
		<description><![CDATA[Answering questions asked on the site. Andrew Stillard asks: I have a store which will hold around 50,000 products in a products table. Each product will have 14 options, giving 700,000 options in total. These are held in an options table which is joined via the product id. Users search for products based on the [...]]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Andrew Stillard</strong> asks:</p>
<blockquote>
<p>I have a store which will hold around <strong>50,000</strong> products in a products table. Each product will have <strong>14</strong> options, giving <strong>700,000</strong> options in total. These are held in an options table which is joined via the product id. </p>
<p>Users search for products based on the options via an Advanced Search menu. </p>
<p>The users need to be able to select multiple options upon which to query. I would normally use a <code>JOIN</code> if it was just the one option to select upon, but because its a variable number i thought it would be best to loop through the <code>WHERE EXISTS</code> statement. </p>
<p>The issue i have currently is that the query is taking a minimum of 18 seconds (And that was a query when the tables only had a fraction of the total products in).<br />
If you can help us speed this up, or suggest an alternative idea that would be greatly appreciated.</p>
</blockquote>
<p>The option table mentioned here is in fact an implementation of the <a href="http://en.wikipedia.org/wiki/Entity-attribute-value_model">EAV</a> model in a relational database.</p>
<p>Each record basically contains <strong>3</strong> things: the <code>id</code> of the product it describes, the <code>id</code> of the option, and the value of the given option for the given product. These fields represent <strong>entity</strong>, <strong>attribute</strong> and <strong>value</strong>, respectively.</p>
<p>This model is very easy to maintain and expand should the need arise: all we have to do to define an extra attribute is to add a record to the <strong>EAV</strong> table with the name and the value of the attribute. This is a <strong>DML</strong> operation rather than a <strong>DDL</strong> one.</p>
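<p>To illustrate the difference, here is a sketch that uses the sample tables created further down in this post. The column name <code>weight</code> and the option id <strong>41</strong> are hypothetical and serve only as an example:</p>
<pre class="brush: sql">
-- Conventional model: a new attribute requires a schema change (DDL).
-- The column name is hypothetical.
ALTER TABLE t_product ADD COLUMN weight INT NULL;

-- EAV model: a new attribute is just another record (DML).
-- Option 41 is a hypothetical new attribute, here set for product 1.
INSERT
INTO    t_option (product_id, option_id, value)
VALUES  (1, 41, 55);
</pre>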
<p>However, this model has a serious drawback: we cannot efficiently search for two or more options at once. An index can only cover fields from a single record, and each option lives in its own record, so a single index lookup can only search for one option at a time.</p>
<p>There are two approaches to writing a query that searches for the entities satisfying certain conditions on several attributes at once:</p>
<ol>
<li>For each attribute, find all entities for which the conditions on the given attribute hold, then aggregate the resulting entities, using <code>COUNT(*)</code> as a filter. The number of entity entries should be equal to the number of the attributes.</li>
<li>Take each entity and, for each attribute, check if the condition holds.</li>
</ol>
<p>The first approach uses a <code>GROUP BY</code>, the second one uses <code>EXISTS</code>.</p>
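<p>As a rough sketch, here is what the two shapes look like for just two attributes. The table and column names are those of the sample tables created further down, and the ranges (attribute <strong>1</strong> from <strong>0</strong> to <strong>100</strong>, attribute <strong>2</strong> from <strong>10</strong> to <strong>90</strong>) are only examples:</p>
<pre class="brush: sql">
-- Approach 1: find the matching records per attribute, then keep the products
-- that matched on all attributes. This relies on (product_id, option_id)
-- being unique, so a product can contribute at most one record per attribute.
SELECT  product_id
FROM    t_option
WHERE   (option_id = 1 AND value BETWEEN 0 AND 100)
        OR (option_id = 2 AND value BETWEEN 10 AND 90)
GROUP BY
        product_id
HAVING  COUNT(*) = 2;

-- Approach 2: take each product and probe each attribute separately.
SELECT  p.id
FROM    t_product p
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    t_option o
        WHERE   o.product_id = p.id
                AND o.option_id = 1
                AND o.value BETWEEN 0 AND 100
        )
        AND EXISTS
        (
        SELECT  NULL
        FROM    t_option o
        WHERE   o.product_id = p.id
                AND o.option_id = 2
                AND o.value BETWEEN 10 AND 90
        );
</pre>
<p>Both queries should return the same set of products. The difference lies in how much of the <strong>EAV</strong> table each of them has to read.</p>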
<p>Let&#8217;s create a sample table and see which one is more efficient:<br />
<span id="more-4647"></span><br />
<a href="#" onclick="xcollapse('X7780');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X7780" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_product (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE t_option (
        product_id INT NOT NULL,
        option_id INT NOT NULL,
        value INT NOT NULL,
        PRIMARY KEY pk_option_o_p (product_id, option_id),
        KEY ix_option_o_v (option_id, value)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(500000);
COMMIT;

INSERT
INTO    t_product
SELECT  id, CONCAT(&#039;Product &#039;, id)
FROM    filler
ORDER BY
        id
LIMIT 50000;

INSERT
INTO    t_option
SELECT  p.id, f.id, CEILING(RAND(20100402 &lt;&lt; 1) * 100)
FROM    t_product p
CROSS JOIN
        (
        SELECT  id
        FROM    filler
        ORDER BY
                id
        LIMIT 40
        ) f;
</pre>
</div>
<p>Table <code>t_product</code> contains <strong>50,000</strong> products; table <code>t_option</code> (an <strong>EAV</strong> table) contains the values of <strong>40</strong> attributes for each product, randomly filled.</p>
<p>Assume we have a complex query involving attributes from <strong>1</strong> to <strong>6</strong>. The value of attribute <strong>1</strong> should be from <strong>0</strong> to <strong>100</strong>; that of attribute <strong>2</strong> should be from <strong>10</strong> to <strong>90</strong>, etc.; finally, the value of attribute <strong>6</strong> should be from <strong>50</strong> to <strong>50</strong> (that is, exactly <strong>50</strong>).</p>
<h3>GROUP BY</h3>
<p>Using the first approach, we should find all <strong>product_id</strong>&#8217;s in the <strong>EAV</strong> table (<strong>t_option</strong>) that satisfy the ranges for each attribute, then aggregate these products and filter those whose <code>COUNT(*)</code> is <strong>6</strong>. This will mean that the product was found in all <strong>6</strong> ranges and hence satisfies all conditions.</p>
<p>Here&#8217;s the query to do that:</p>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    (
        SELECT  product_id
        FROM    (
                SELECT  product_id
                FROM    (
                        SELECT  1 AS opt, 0 AS l, 100 AS h
                        UNION ALL
                        SELECT  2 AS opt, 10 AS l, 90 AS h
                        UNION ALL
                        SELECT  3 AS opt, 20 AS l, 80 AS h
                        UNION ALL
                        SELECT  4 AS opt, 30 AS l, 70 AS h
                        UNION ALL
                        SELECT  5 AS opt, 40 AS l, 60 AS h
                        UNION ALL
                        SELECT  6 AS opt, 50 AS l, 50 AS h
                        ) v
                JOIN    t_option o
                ON      o.option_id &gt;= opt
                        AND o.option_id &lt;= opt
                        AND o.value BETWEEN l AND h
                ) o
        GROUP BY
                o.product_id
        HAVING  COUNT(*) = 6
        ) q;
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">22</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.3291s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">Select tables optimized away</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">152657</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived4&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">6</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar">o</td>
<td class="varchar">range</td>
<td class="varchar">ix_option_o_v</td>
<td class="varchar">ix_option_o_v</td>
<td class="varchar">8</td>
<td class="varchar"></td>
<td class="bigint">1090</td>
<td class="double">183532.11</td>
<td class="varchar">Range checked for each record (index map: 0x2)</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DERIVED</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">5</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">6</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">7</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">8</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">9</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union4,5,6,7,8,9&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
select count(0) AS `COUNT(*)` from (select `o`.`product_id` AS `product_id` from (select `20100402_options`.`o`.`product_id` AS `product_id` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h` union all select 6 AS `opt`,50 AS `l`,50 AS `h`) `v` join `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`option_id` &gt;= `v`.`opt`) and (`20100402_options`.`o`.`option_id` &lt;= `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`))) `o` group by `o`.`product_id` having (count(0) = 6)) `q`
</pre>
<p>Note that there are two quite counterintuitive things in the query:</p>
<ul>
<li><code>o.option_id &gt;= opt AND o.option_id &lt;= opt</code> instead of mere <code>o.option_id = opt</code></li>
<li><code>GROUP BY</code> is applied to a nested query rather than being put immediately after the <code>WHERE</code> clause</li>
</ul>
<p>As many readers of my blog will remember, these tricks are intended to make <strong>MySQL</strong> use <code>Range checked for each record</code> that we can spot in the plan. If not for these tricks, <strong>MySQL</strong> would use an index scan on only a part of the composite index on <code>(attribute, value, product)</code>, since the joins on the mixed equality/range conditions are not optimized well in the current releases (at least up to <strong>5.1.42</strong>).</p>
<p>This query is quite fast: only about <strong>330 ms</strong>.</p>
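<p>For comparison, this is a sketch of the straightforward form one would write without the two tricks: a plain equality on <code>option_id</code> and the <code>GROUP BY</code> placed directly after the join. It is logically equivalent and should return the same count; the plan <strong>MySQL</strong> picks for it depends on the version, so it is worth checking with <code>EXPLAIN</code> rather than taking the timing for granted:</p>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    (
        SELECT  o.product_id
        FROM    (
                SELECT  1 AS opt, 0 AS l, 100 AS h
                UNION ALL
                SELECT  2 AS opt, 10 AS l, 90 AS h
                UNION ALL
                SELECT  3 AS opt, 20 AS l, 80 AS h
                UNION ALL
                SELECT  4 AS opt, 30 AS l, 70 AS h
                UNION ALL
                SELECT  5 AS opt, 40 AS l, 60 AS h
                UNION ALL
                SELECT  6 AS opt, 50 AS l, 50 AS h
                ) v
        JOIN    t_option o
        ON      o.option_id = v.opt             -- plain equality instead of the >= / <= pair
                AND o.value BETWEEN v.l AND v.h
        GROUP BY                                -- aggregation right after the join, no extra nesting
                o.product_id
        HAVING  COUNT(*) = 6
        ) q;
</pre>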
<h3>NOT EXISTS</h3>
<p>Using this approach, the engine takes each product and checks if all of the relevant attributes satisfy the conditions. As soon as the first attribute failing the check is found, the condition is considered <strong>FALSE</strong> and further evaluation ceases.</p>
<p>This approach does not use aggregation (which is good), but it needs to browse all products and do random index seeks for each of the products (which is bad).</p>
<p>Here&#8217;s the query:</p>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    t_product p
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  1 AS opt, 0 AS l, 100 AS h
                UNION ALL
                SELECT  2 AS opt, 10 AS l, 90 AS h
                UNION ALL
                SELECT  3 AS opt, 20 AS l, 80 AS h
                UNION ALL
                SELECT  4 AS opt, 30 AS l, 70 AS h
                UNION ALL
                SELECT  5 AS opt, 40 AS l, 60 AS h
                UNION ALL
                SELECT  6 AS opt, 50 AS l, 50 AS h
                ) v
        WHERE   NOT EXISTS
                (
                SELECT  NULL
                FROM    t_option o
                WHERE   o.product_id = p.id
                        AND o.option_id = opt
                        AND o.value BETWEEN l AND h
                )
        );
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">22</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.8438s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">p</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">50115</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">6</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">9</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">o</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY,ix_option_o_v</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">8</td>
<td class="varchar">20100402_options.p.id,v.opt</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">5</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">6</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">7</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">8</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union3,4,5,6,7,8&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100402_options.p.id&#39; of SELECT #9 was resolved in SELECT #1
Field or reference &#39;v.opt&#39; of SELECT #9 was resolved in SELECT #2
Field or reference &#39;v.l&#39; of SELECT #9 was resolved in SELECT #2
Field or reference &#39;v.h&#39; of SELECT #9 was resolved in SELECT #2
select count(0) AS `COUNT(*)` from `20100402_options`.`t_product` `p` where (not(exists(select NULL AS `NULL` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h` union all select 6 AS `opt`,50 AS `l`,50 AS `h`) `v` where (not(exists(select NULL AS `NULL` from `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`product_id` = `20100402_options`.`p`.`id`) and (`20100402_options`.`o`.`option_id` = `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`))))))))
</pre>
<p>This query runs for <strong>843 ms</strong>, or roughly <strong>2.5</strong> times as long as its <code>GROUP BY</code> counterpart.</p>
<p>The <strong>GROUP BY</strong> version seems to be the more efficient of the two, but can we optimize it somehow?</p>
<h3>Mixing two approaches</h3>
<p>The main problem with the <code>GROUP BY</code> approach is, well, <code>GROUP BY</code>. In <strong>MySQL</strong>, it requires sorting or materialization (or both) which are quite expensive operations. Also, the recordset for each of the attributes is selected independently: no filter on one of the attributes affects the other ones. The filtering is only done after the aggregation.</p>
<p>The main problem with the <code>NOT EXISTS</code> approach is that every product has to be checked for the conditions, though only a tiny fraction of all products satisfies at least several of them. Checking the conditions also requires index seeks in blocks that are quite far away from each other, which is not good for performance (especially if the index does not fit completely into the cache).</p>
<p>However, we can mix the two approaches.</p>
<p>Instead of using the product table as a source for all possible entities to check against the <strong>EAV</strong> table, we can take the condition on only one attribute that is the least likely to be satisfied.</p>
<p>Since the entity/attribute pair forms a natural primary key in the <strong>EAV</strong> table, filtering on an attribute is guaranteed to result in a set of unique entities. Each of these entities should then be further checked using the <code>NOT EXISTS</code> approach, but this time far fewer checks need to be made, since with a properly chosen primary condition (which should be the most selective one) most of the non-matching entities are already filtered out.</p>
<p>This approach, on the one hand, does not require aggregation; on the other hand, the conditions are sieved rather than added up.</p>
<p>Let&#8217;s build the query:</p>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    t_option oo
WHERE   oo.option_id = 6
        AND oo.value BETWEEN 50 AND 50
        AND NOT EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  1 AS opt, 0 AS l, 100 AS h
                UNION ALL
                SELECT  2 AS opt, 10 AS l, 90 AS h
                UNION ALL
                SELECT  3 AS opt, 20 AS l, 80 AS h
                UNION ALL
                SELECT  4 AS opt, 30 AS l, 70 AS h
                UNION ALL
                SELECT  5 AS opt, 40 AS l, 60 AS h
                ) v
        WHERE   NOT EXISTS
                (
                SELECT  NULL
                FROM    t_option o
                WHERE   o.product_id = oo.product_id
                        AND o.option_id = opt
                        AND o.value BETWEEN l AND h
                )
        );
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>COUNT(*)</th>
</tr>
<tr>
<td class="bigint">22</td>
</tr>
<tr class="statusbar">
<td colspan="100">1 row fetched in 0.0001s (0.0105s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">oo</td>
<td class="varchar">ref</td>
<td class="varchar">ix_option_o_v</td>
<td class="varchar">ix_option_o_v</td>
<td class="varchar">8</td>
<td class="varchar">const,const</td>
<td class="bigint">1090</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">5</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">8</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">o</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY,ix_option_o_v</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">8</td>
<td class="varchar">20100402_options.oo.product_id,v.opt</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where</td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">5</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">6</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">7</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union3,4,5,6,7&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar"></td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100402_options.oo.product_id&#39; of SELECT #8 was resolved in SELECT #1
Field or reference &#39;v.opt&#39; of SELECT #8 was resolved in SELECT #2
Field or reference &#39;v.l&#39; of SELECT #8 was resolved in SELECT #2
Field or reference &#39;v.h&#39; of SELECT #8 was resolved in SELECT #2
select count(0) AS `COUNT(*)` from `20100402_options`.`t_option` `oo` where ((`20100402_options`.`oo`.`option_id` = 6) and (`20100402_options`.`oo`.`value` between 50 and 50) and (not(exists(select NULL AS `NULL` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h`) `v` where (not(exists(select NULL AS `NULL` from `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`product_id` = `20100402_options`.`oo`.`product_id`) and (`20100402_options`.`o`.`option_id` = `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`)))))))))
</pre>
<p>As we can see, using the most selective condition as an initial row source for the further checks greatly improved the query speed.</p>
<p>The query time is now only <strong>10 ms</strong>, or <strong>30</strong> times as fast as the <code>GROUP BY</code> approach.</p>
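<p>The condition used as the initial row source was picked by hand here. One possible way to find it, sketched below, is to count the matching rows per condition and take the smallest count. This is only an illustration: the counting itself costs time, so in practice the choice would more likely come from what is already known about the data distribution.</p>
<pre class="brush: sql">
SELECT  v.opt, COUNT(o.product_id) AS matching
FROM    (
        SELECT  1 AS opt, 0 AS l, 100 AS h
        UNION ALL
        SELECT  2 AS opt, 10 AS l, 90 AS h
        UNION ALL
        SELECT  3 AS opt, 20 AS l, 80 AS h
        UNION ALL
        SELECT  4 AS opt, 30 AS l, 70 AS h
        UNION ALL
        SELECT  5 AS opt, 40 AS l, 60 AS h
        UNION ALL
        SELECT  6 AS opt, 50 AS l, 50 AS h
        ) v
LEFT JOIN                               -- keeps conditions that match nothing, with a count of 0
        t_option o
ON      o.option_id = v.opt
        AND o.value BETWEEN v.l AND v.h
GROUP BY
        v.opt
ORDER BY
        matching                        -- fewest matching rows first
LIMIT 1;
</pre>
<p>With the sample data this should point at attribute <strong>6</strong>, whose range covers a single value.</p>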
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/04/02/multiple-attributes-in-a-eav-table-group-by-vs-not-exists/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bayesian classification</title>
		<link>http://explainextended.com/2010/03/25/bayesian-classification/</link>
		<comments>http://explainextended.com/2010/03/25/bayesian-classification/#comments</comments>
		<pubDate>Thu, 25 Mar 2010 20:00:58 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4598</guid>
		<description><![CDATA[From Stack Overflow: Suppose you&#8217;ve visited sites S0 … S50. All except S0 are 48% female; S0 is 100% male. I&#8217;m guessing your gender, and I want to have a value close to 100%, not just the 49% that a straight average would give. Also, consider that most demographics (i.e. everything other than gender) does [...]]]></description>
			<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/2448522/mysql-stats-weighting-an-average-to-accentuate-differences-from-the-mean"><strong>Stack Overflow</strong></a>:</p>
<blockquote><p>Suppose you&#8217;ve visited sites <strong>S0 … S50</strong>. All except <strong>S0</strong> are <strong>48%</strong> female; <strong>S0</strong> is <strong>100%</strong> male.</p>
<p>I&#8217;m guessing your gender, and I want to have a value close to <strong>100%</strong>, not just the <strong>49%</strong> that a straight average would give.</p>
<p>Also, consider that most demographics (i.e. everything other than gender) does not have the average at <strong>50%</strong>. For example, the average probability of having kids <strong>0-17</strong> is <strong>~37%</strong>.</p>
<p>The more a given site&#8217;s demographics are different from this average (e.g. maybe it&#8217;s a site for parents, or for child-free people), the more it should count in my guess of your status.</p>
<p>What&#8217;s the best way to calculate this?</p></blockquote>
<p>This is a classical application of <a href="http://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes&#8217; Theorem</a>.</p>
<p>The formula to calculate the posterior probability is:</p>
<p><code>P(A|B) = P(B|A) &times; P(A) / P(B) = P(B|A) &times; P(A) / (P(B|A) &times; P(A) + P(B|A<sup>*</sup>) &times; P(A<sup>*</sup>))</code></p>
<p>, where:</p>
<ul>
<li><code>P(A|B)</code> is the posterior probability of the visitor being a male (given that he visited the site)</li>
<li><code>P(A)</code> is the prior probability of the visitor being a male (initially, <strong>50%</strong>)</li>
<li><code>P(B)</code> is the probability of (any Internet user) visiting the site</li>
<li><code>P(B|A)</code> is the probability of a user visiting the site, given that he is a male</li>
<li><code>P(A<sup>*</sup>)</code> is the prior probability of the visitor not being a male (initially, <strong>50%</strong>)</li>
<li><code>P(B|A<sup>*</sup>)</code> is the probability of a user visiting the site, given that she is not a male.</li>
</ul>
<p><span id="more-4598"></span><br />
Since a user can only be male or female:</p>
<p><code>P(A|B) = P(B|A)&times;P(A)/P(B) = P(B|A)&times;P(A) / (P(B|A)&times;P(A) + (1 - P(B|A))&times;(1 - P(A)))</code></p>
<p><code>P(B|A)</code> is the number stored in the database (probability of the user being a male).</p>
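<p>As a quick worked example of a single update, take a site with the stored number <strong>0.8</strong> and the prior of <strong>0.5</strong> (both values are made up for this sketch; <code>s</code> stands for the site&#8217;s number):</p>
<pre class="brush: sql">
-- 0.8 * 0.5 / (0.8 * 0.5 + 0.2 * 0.5) = 0.4 / 0.5 = 0.8
SELECT  s * prior / (s * prior + (1 - s) * (1 - prior)) AS posterior
FROM    (
        SELECT  0.8 AS s, 0.5 AS prior
        ) q;
</pre>
<p>With a prior of exactly <strong>0.5</strong> the two prior terms cancel out, so after a single site the posterior is simply the site&#8217;s own number. The prior only starts to matter when it differs from <strong>0.5</strong> or when several sites are combined.</p>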
<p>We consider the events of visiting the different sites to be independent (a fact that the user visited site <strong>A</strong> neither influences nor is influenced by the fact that the user also visited site <strong>B</strong>). This is of course not so, since the sites may exchange links etc., but we make this assumption for the sake of simplicity.</p>
<p>So, given a series of the sites, we take the initial probability (<code>P<sub>0</sub> = 0.5</code>) and recursively substitute it into the following formula:</p>
<p><code>P<sub>n</sub> = S<sub>n</sub>&times;P<sub>n-1</sub> / (S<sub>n</sub>&times;P<sub>n-1</sub> + (1 - S<sub>n</sub>)&times;(1 - P<sub>n-1</sub>))</code></p>
<p>Simple calculations show us that the recursion (which <strong>MySQL</strong> is not good at) can be replaced with an aggregate formula:</p>
<p><code>P = P0 * PROD(S) / (P0 * PROD(S) + (1 - P0) * PROD(1 - S))</code></p>
<p>, where <code>PROD</code> is the aggregate product of the sites&#8217; probabilities of their visitors being a male.</p>
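<p>The reason the recursion collapses is that each step multiplies the <q>male</q> term by <code>S</code> and the <q>not male</q> term by <code>1 - S</code>, while the common denominators cancel out. A small sanity check with two made-up sites (<strong>0.8</strong> and <strong>0.6</strong>) compares the step-by-step result with the aggregate formula:</p>
<pre class="brush: sql">
SELECT  p2 AS recursive_form,
        prior * s1 * s2 / (prior * s1 * s2 + (1 - prior) * (1 - s1) * (1 - s2)) AS aggregate_form
FROM    (
        -- second step of the recursion: substitute p1 as the new prior
        SELECT  q.*,
                s2 * p1 / (s2 * p1 + (1 - s2) * (1 - p1)) AS p2
        FROM    (
                -- first step of the recursion
                SELECT  v.*,
                        s1 * prior / (s1 * prior + (1 - s1) * (1 - prior)) AS p1
                FROM    (
                        SELECT  0.5 AS prior, 0.8 AS s1, 0.6 AS s2
                        ) v
                ) q
        ) r;
</pre>
<p>Both columns come out at about <strong>0.857</strong> (that is, <strong>6/7</strong>): the recursion gives <code>0.48 / 0.56</code> and the aggregate formula gives <code>0.24 / 0.28</code>.</p>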
<p><strong>SQL</strong> does not have a built-in aggregate product function, but it can be easily replaced by the aggregate sum on the logarithmic scale:</p>
<p><code>P = P0 * EXP(SUM(LN(S))) / (P0 * EXP(SUM(LN(S))) + (1 - P0) * EXP(SUM(LN(1 - S))))</code></p>
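<p>Here is a minimal sketch of the product-through-logarithms trick on three literal values, unrelated to the sample data:</p>
<pre class="brush: sql">
-- EXP(LN(2) + LN(3) + LN(4)) = 2 * 3 * 4
SELECT  EXP(SUM(LN(x))) AS product
FROM    (
        SELECT  2 AS x
        UNION ALL
        SELECT  3 AS x
        UNION ALL
        SELECT  4 AS x
        ) q;
</pre>
<p>This returns <strong>24</strong>, up to floating-point rounding. Since the logarithm is only defined for positive numbers, the trick requires both <code>S</code> and <code>1 - S</code> to be strictly positive; a site with a value of exactly <strong>0</strong> or <strong>1</strong> would need separate handling.</p>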
<p>Given all that, let&#8217;s create some sample tables:</p>
<p><a href="#" onclick="xcollapse('X3179');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X3179" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_user (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(20) NOT NULL,
        gender CHAR(1) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE t_site (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(20) NOT NULL,
        male DOUBLE NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE t_visit (
        u_id INT NOT NULL,
        s_id INT NOT NULL,
        PRIMARY KEY (u_id, s_id)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000);
COMMIT;

INSERT
INTO    t_user
SELECT  id, CONCAT(&#039;User &#039;, id),
        CASE WHEN RAND(20100325) &gt; 0.5 THEN &#039;M&#039; ELSE &#039;F&#039; END
FROM    filler;

INSERT
INTO    t_site
SELECT  id, CONCAT(&#039;Site &#039;, id),
        RAND(20100325 &lt;&lt; 1) * 0.94 + 0.03
FROM    filler;

INSERT
INTO    t_visit
SELECT  u_id, s_id
FROM    (
        SELECT  u.id AS u_id, s.id AS s_id, u.gender, s.male,
                RAND(20100325 &lt;&lt; 2) AS rnds,
                RAND(20100325 &lt;&lt; 3) AS rndm
        FROM    t_user u
        CROSS JOIN
                t_site s
        ) q
WHERE   rnds &lt; 0.05
        AND rndm &lt; CASE gender WHEN &#039;M&#039; THEN male ELSE 1 - male END;
</pre>
</div>
<p>There are <strong>1,000</strong> users and <strong>1,000</strong> sites. Each site is assigned a <q>maleness</q> from <strong>0.03</strong> to <strong>0.97</strong>.</p>
<p>Users randomly visit the sites according to their gender and the site gender distribution. There are <strong>25</strong> visits per user on average.</p>
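<p>The figure of <strong>25</strong> follows from the generating script: each of the <strong>1,000</strong> sites is kept for a given user with probability <strong>0.05</strong> times the gender match, and the gender match averages about <strong>0.5</strong>, so <code>1000 &times; 0.05 &times; 0.5 = 25</code>. A sketch of a query to check this against the generated data:</p>
<pre class="brush: sql">
-- Average number of visited sites per user; expected to be close to 25
SELECT  COUNT(*) / (SELECT COUNT(*) FROM t_user) AS visits_per_user
FROM    t_visit;
</pre>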
<p>Let&#8217;s try to guess the users&#8217; gender and return only wrong guesses.</p>
<p>We will assume that the user is male when the posterior probability of the user being male is more than <strong>0.99</strong>, female if it is less than <strong>0.01</strong>, and undefined if it is between <strong>0.01</strong> and <strong>0.99</strong>:</p>
<pre class="brush: sql">
SELECT  *, CASE WHEN posterior &lt; 0.01 THEN &#039;F&#039; WHEN posterior &gt; 0.99 THEN &#039;M&#039; ELSE &#039;U&#039; END AS guessed
FROM    (
        SELECT  u.*,
                prior * EXP(SUM(LN(male))) / (prior * EXP(SUM(LN(male))) + (1 - prior) * EXP(SUM(LN(1 - male)))) AS posterior
        FROM    (
                SELECT  0.5 AS prior
                ) vars
        CROSS JOIN
                t_user u
        LEFT JOIN
                t_visit v
        ON      v.u_id = u.id
        LEFT JOIN
                t_site s
        ON      s.id = v.s_id
        GROUP BY
                u.id
        ) q
HAVING  guessed &lt;&gt; gender
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>name</th>
<th>gender</th>
<th>posterior</th>
<th>guessed</th>
</tr>
<tr>
<td class="integer">51</td>
<td class="varchar">User 51</td>
<td class="char">F</td>
<td class="double">0.652234131074669</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">53</td>
<td class="varchar">User 53</td>
<td class="char">F</td>
<td class="double">0.87625067361204</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">94</td>
<td class="varchar">User 94</td>
<td class="char">M</td>
<td class="double">0.732238662361337</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">264</td>
<td class="varchar">User 264</td>
<td class="char">F</td>
<td class="double">0.0520209347475727</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">475</td>
<td class="varchar">User 475</td>
<td class="char">M</td>
<td class="double">0.974230285094509</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">497</td>
<td class="varchar">User 497</td>
<td class="char">M</td>
<td class="double">0.966568719694869</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">542</td>
<td class="varchar">User 542</td>
<td class="char">F</td>
<td class="double">0.0685609699288645</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">595</td>
<td class="varchar">User 595</td>
<td class="char">M</td>
<td class="double">0.984478426560255</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">742</td>
<td class="varchar">User 742</td>
<td class="char">F</td>
<td class="double">0.0334681988009631</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">768</td>
<td class="varchar">User 768</td>
<td class="char">M</td>
<td class="double">0.960799229888108</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">800</td>
<td class="varchar">User 800</td>
<td class="char">F</td>
<td class="double">0.0181411777994256</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">867</td>
<td class="varchar">User 867</td>
<td class="char">F</td>
<td class="double">0.0401728770664721</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">882</td>
<td class="varchar">User 882</td>
<td class="char">M</td>
<td class="double">0.884671868426923</td>
<td class="varchar">U</td>
</tr>
<tr>
<td class="integer">902</td>
<td class="varchar">User 902</td>
<td class="char">F</td>
<td class="double">0.802525467489821</td>
<td class="varchar">U</td>
</tr>
<tr class="statusbar">
<td colspan="100">14 rows fetched in 0.0006s (0.1989s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">1000</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">system</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">u</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">871</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">v</td>
<td class="varchar">ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20100325_bayes.u.id</td>
<td class="bigint">12</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">s</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">20100325_bayes.v.s_id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
</table>
</div>
<pre>
select `q`.`id` AS `id`,`q`.`name` AS `name`,`q`.`gender` AS `gender`,`q`.`posterior` AS `posterior`,(case when (`q`.`posterior` &lt; 0.01) then &#39;F&#39; when (`q`.`posterior` &gt; 0.99) then &#39;M&#39; else &#39;U&#39; end) AS `guessed` from (select `20100325_bayes`.`u`.`id` AS `id`,`20100325_bayes`.`u`.`name` AS `name`,`20100325_bayes`.`u`.`gender` AS `gender`,((&#39;0.5&#39; * exp(sum(ln(`20100325_bayes`.`s`.`male`)))) / ((&#39;0.5&#39; * exp(sum(ln(`20100325_bayes`.`s`.`male`)))) + ((1 - &#39;0.5&#39;) * exp(sum(ln((1 - `20100325_bayes`.`s`.`male`))))))) AS `posterior` from (select 0.5 AS `prior`) `vars` join `20100325_bayes`.`t_user` `u` left join `20100325_bayes`.`t_visit` `v` on((`20100325_bayes`.`v`.`u_id` = `20100325_bayes`.`u`.`id`)) left join `20100325_bayes`.`t_site` `s` on((`20100325_bayes`.`s`.`id` = `20100325_bayes`.`v`.`s_id`)) where 1 group by `20100325_bayes`.`u`.`id`) `q` having (convert(`guessed` using utf8) &lt;&gt; `q`.`gender`)
</pre>
<p>Out of <strong>1,000</strong> users, only <strong>14</strong> fall outside the <strong>99%</strong> credible interval, and all of these are undefined rather than wrong.</p>
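<p>To see the formula at work on a small scale, here is a minimal sketch for a single hypothetical user who visited three sites with male audience shares of <strong>0.9</strong>, <strong>0.8</strong> and <strong>0.7</strong>. These values are made up for the illustration and are not taken from the sample tables. The prior of <strong>0.5</strong> is inlined, and the product of probabilities is computed as <code>EXP(SUM(LN(…)))</code>, since <strong>MySQL</strong> has no aggregate product function:</p>
<pre class="brush: sql">
-- Hypothetical visits: one row per visited site, male = share of male visitors
SELECT  0.5 * EXP(SUM(LN(male))) /
        (0.5 * EXP(SUM(LN(male))) + (1 - 0.5) * EXP(SUM(LN(1 - male)))) AS posterior
FROM    (
        SELECT  0.9 AS male
        UNION ALL
        SELECT  0.8
        UNION ALL
        SELECT  0.7
        ) visits
</pre>
<p>The two products are 0.9 &#215; 0.8 &#215; 0.7 = 0.504 and 0.1 &#215; 0.2 &#215; 0.3 = 0.006, so the posterior is 0.504 / 0.510, or about <strong>0.988</strong>. This is just below the <strong>0.99</strong> threshold, so such a user would be reported as undefined.</p>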
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/03/25/bayesian-classification/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Greatest N per group: dealing with aggregates</title>
		<link>http://explainextended.com/2010/03/18/greatest-n-per-group-dealing-with-aggregates/</link>
		<comments>http://explainextended.com/2010/03/18/greatest-n-per-group-dealing-with-aggregates/#comments</comments>
		<pubDate>Thu, 18 Mar 2010 20:00:16 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4574</guid>
		<description><![CDATA[Answering questions asked on the site. Vlad Enache asks: In MySQL I have a table called meanings with three columns: meanings person word meaning 1 1 4 1 2 19 1 2 7 1 3 5 word has 16 possible values, meaning has 26. A person assigns one or more meanings to each word. In [...]]]></description>
			<content:encoded><![CDATA[<p>Answering questions asked on the site.</p>
<p><strong>Vlad Enache</strong> asks:</p>
<blockquote><p>In <strong>MySQL</strong> I have a table called <code>meanings</code> with three columns:</p>
<table class="excel">
<caption>meanings</caption>
<tr>
<th>person</th>
<th>word</th>
<th>meaning</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>4</td>
</tr>
<tr>
<td>1</td>
<td>2</td>
<td>19</td>
</tr>
<tr>
<td>1</td>
<td>2</td>
<td>7</td>
</tr>
<tr>
<td>1</td>
<td>3</td>
<td>5</td>
</tr>
</table>
<p><code>word</code> has <strong>16</strong> possible values, <code>meaning</code> has <strong>26</strong>.</p>
<p>A person assigns one or more meanings to each word. In the sample above, person <strong>1</strong> assigned two meanings to word <strong>2</strong>.</p>
<p>There will be thousands of persons.</p>
<p>I need to find the top three meanings for each of the <strong>16</strong> words, with their frequencies. Something like:</p>
<ul>
<li><strong>word 1</strong>: meaning 5 (35% of people), meaning 19 (22% of people), meaning 2 (13% of people)</li>
<li><strong>word 2</strong>: meaning 8 (57%), meaning 1 (18%), meaning 22 (7%)</li>
</ul>
<p>etc.</p>
<p>Is it possible to solve this with a single <strong>MySQL</strong> query?
</p></blockquote>
<p>This task is a typical greatest-n-per-group problem.</p>
<p>Earlier I described some solutions to it, one <a href="/2009/03/05/row-sampling/">using session variables</a>, and another one using <a href="/2009/03/06/advanced-row-sampling/"><code>LIMIT</code> in a subquery</a>. However, these solutions imply that the records are taken from a single table, while this task needs to retrieve the three greatest <em>aggregates</em>, not the three greatest <em>records</em>. It is not recommended to mix session variables with <code>JOINs</code>, and the <code>LIMIT</code> solution would be inefficient, since the aggregates cannot be indexed.</p>
<p>Some databases, namely, <strong>PostgreSQL</strong>, used to exploit the array functionality for this task (before the window functions were introduced in <strong>8.4</strong>).</p>
<p>Unfortunately, <strong>MySQL</strong> does not support arrays, but we can emulate this behavior using string functions and <code>GROUP_CONCAT</code>.</p>
<p>Let&#8217;s create a sample table:<br />
<span id="more-4574"></span><br />
<a href="#" onclick="xcollapse('X4311');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X4311" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_meaning (
        person INT NOT NULL,
        word INT NOT NULL,
        meaning INT NOT NULL,
        PRIMARY KEY (person, word, meaning),
        KEY ix_meaning_word_meaning (word, meaning)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_meaning (person, word, meaning)
SELECT  (id - 1) div 32 + 1,
        (id - 1) % 16 + 1,
        ((((id - 1) div 16) % 2) * 13) + CEILING(RAND(20100318 &lt;&lt; 1) * 13)
FROM    filler;
</pre>
</div>
<p>The table contains <strong>1,000,000</strong> records for <strong>31,250</strong> people, each giving two random meanings to each of the <strong>16</strong> words.</p>
<p>All three fields comprise a <code>PRIMARY KEY</code>, and there is an additional index on <code>(word, meaning)</code>.</p>
<p>Here&#8217;s what we need to do to build the query:</p>
<ol>
<li>
<p>Calculate the count of records with each meaning for each word, using <code>GROUP BY</code>.</p>
<p>This will use the secondary index on <code>(word, meaning)</code> that we created above to avoid sorting (<code>using filesort</code>) or materialization (<code>using temporary</code>).</p>
</li>
<li>
<p><code>GROUP_CONCAT</code> the concatenated meaning and count from the previous recordset, ordering the values by count in descending order.</p>
<p>Note that theoretically this could result in very large strings which would not fit into <code>@@group_concat_max_len</code> and get truncated. But we can use it nevertheless: first, only <strong>26</strong> meanings are possible per word; second, of these <strong>26</strong> meanings we only need the first three.</p>
</li>
<li>
<p>Join the resultset with a dummy table containing <strong>3</strong> records. To do this, we will just concatenate <strong>3</strong> one-record <code>SELECT</code> statements using <code>UNION ALL</code>.</p>
<p>This is not a very elegant solution, since it does not allow parameterizing the <code>n</code> for the query, but <strong>MySQL</strong> does not support dummy recordset generation functions like <code>generate_series</code>.</p>
<p>Alternatively, one could build a dummy table and use a <code>LIMIT</code> to select top values from it. But for the sake of simplicity, I will go with the <code>UNION ALL</code> solution.</p>
</li>
<li>Extract the individual values from the concatenated string using <code>SUBSTRING_INDEX</code></li>
<li>
<p>Calculate the ratios, using the number of distinct persons as a total.</p>
<p>This way, ratio of each word&#8217;s meaning will be computed correctly, though the sum of ratios can exceed <strong>100%</strong> (since one person can give more than one meaning to a word).</p>
</li>
</ol>
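<p>As a minimal sketch of steps <strong>3</strong> and <strong>4</strong>, here is how the nested <code>SUBSTRING_INDEX</code> calls split a concatenated string. The string <code>5:35,19:22,2:13</code> is a made-up value used only for illustration:</p>
<pre class="brush: sql">
-- tops is a hypothetical GROUP_CONCAT result: meaning:count pairs, best first
SELECT  lvl,
        4 - lvl AS position,
        SUBSTRING_INDEX(SUBSTRING_INDEX(tops, &#039;,&#039;, -lvl), &#039;,&#039;, 1) AS rec
FROM    (
        SELECT  &#039;5:35,19:22,2:13&#039; AS tops
        ) q
CROSS JOIN
        (
        SELECT  1 AS lvl
        UNION ALL
        SELECT  2 AS lvl
        UNION ALL
        SELECT  3 AS lvl
        ) fields
ORDER BY
        position
</pre>
<p>The inner call with a negative count keeps the last <code>lvl</code> items, and the outer call takes the first of those. So <code>lvl = 3</code> yields <code>5:35</code>, <code>lvl = 2</code> yields <code>19:22</code> and <code>lvl = 1</code> yields <code>2:13</code>, which is why the position is computed as <code>4 - lvl</code>. Note that this relies on the string containing at least three items.</p>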
<p>And here&#8217;s the query:</p>
<pre class="brush: sql">
SELECT  word, 4 - lvl AS position, SUBSTRING_INDEX(rec, &#039;:&#039;, 1) AS meaning, SUBSTRING_INDEX(rec, &#039;:&#039;, -1) / total AS rate
FROM    (
        SELECT  word, total, SUBSTRING_INDEX(SUBSTRING_INDEX(tops, &#039;,&#039;, -lvl), &#039;,&#039;, 1) AS rec, lvl
        FROM    (
                SELECT  word, total, SUBSTRING_INDEX(ms, &#039;,&#039;, 3) AS tops, lvl
                FROM    (
                        SELECT  word,
                                (
                                SELECT  COUNT(*)
                                FROM    (
                                        SELECT  DISTINCT person
                                        FROM    t_meaning
                                        ) dp
                                ) AS total,
                                CAST(GROUP_CONCAT(CONCAT_WS(&#039;:&#039;, meaning, cnt) ORDER BY cnt DESC, meaning DESC) AS CHAR) AS ms
                        FROM    (
                                SELECT  word, meaning, COUNT(*) AS cnt
                                FROM    t_meaning
                                GROUP BY
                                        word, meaning
                                ) q
                        GROUP BY
                                word
                        ) q2
                CROSS JOIN
                        (
                        SELECT  1 AS lvl
                        UNION ALL
                        SELECT  2 AS lvl
                        UNION ALL
                        SELECT  3 AS lvl
                        ) fields
                ) q3
        ) q4
ORDER BY
        word, position
</pre>
<div class="terminal">
<table class="terminal">
<tr>
<th>word</th>
<th>position</th>
<th>meaning</th>
<th>rate</th>
</tr>
<tr>
<td class="integer">1</td>
<td class="bigint">1</td>
<td class="blob">16</td>
<td class="double">0.079456</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="bigint">2</td>
<td class="blob">17</td>
<td class="double">0.079296</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="bigint">3</td>
<td class="blob">12</td>
<td class="double">0.079264</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="bigint">1</td>
<td class="blob">9</td>
<td class="double">0.079584</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="bigint">2</td>
<td class="blob">10</td>
<td class="double">0.079264</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="bigint">3</td>
<td class="blob">20</td>
<td class="double">0.078656</td>
</tr>
<tr class="break">
<td colspan="100"/></tr>
<tr>
<td class="integer">16</td>
<td class="bigint">1</td>
<td class="blob">14</td>
<td class="double">0.080064</td>
</tr>
<tr>
<td class="integer">16</td>
<td class="bigint">2</td>
<td class="blob">18</td>
<td class="double">0.079424</td>
</tr>
<tr>
<td class="integer">16</td>
<td class="bigint">3</td>
<td class="blob">4</td>
<td class="double">0.078976</td>
</tr>
<tr class="statusbar">
<td colspan="100">48 rows fetched in 0.0017s (1.1094s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">48</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">48</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived8&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">3</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived4&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">16</td>
<td class="double">100.00</td>
<td class="varchar">Using join buffer</td>
</tr>
<tr>
<td class="bigint">8</td>
<td class="varchar">DERIVED</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">9</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint">10</td>
<td class="varchar">UNION</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">No tables used</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union8,9,10&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived7&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">416</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">7</td>
<td class="varchar">DERIVED</td>
<td class="varchar">t_meaning</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">ix_meaning_word_meaning</td>
<td class="varchar">8</td>
<td class="varchar"></td>
<td class="bigint">1000443</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
<tr>
<td class="bigint">5</td>
<td class="varchar">SUBQUERY</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">Select tables optimized away</td>
</tr>
<tr>
<td class="bigint">6</td>
<td class="varchar">DERIVED</td>
<td class="varchar">t_meaning</td>
<td class="varchar">range</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">58850</td>
<td class="double">100.00</td>
<td class="varchar">Using index for group-by</td>
</tr>
</table>
</div>
<pre>
select `q4`.`word` AS `word`,(4 - `q4`.`lvl`) AS `position`,substring_index(`q4`.`rec`,&#39;:&#39;,1) AS `meaning`,(substring_index(`q4`.`rec`,&#39;:&#39;,-(1)) / `q4`.`total`) AS `rate` from (select `q3`.`word` AS `word`,`q3`.`total` AS `total`,substring_index(substring_index(`q3`.`tops`,&#39;,&#39;,-(`q3`.`lvl`)),&#39;,&#39;,1) AS `rec`,`q3`.`lvl` AS `lvl` from (select `q2`.`word` AS `word`,`q2`.`total` AS `total`,substring_index(`q2`.`ms`,&#39;,&#39;,3) AS `tops`,`fields`.`lvl` AS `lvl` from (select `q`.`word` AS `word`,(select count(0) AS `COUNT(*)` from (select distinct `20100318_word`.`t_meaning`.`person` AS `person` from `20100318_word`.`t_meaning`) `dp`) AS `total`,cast(group_concat(concat_ws(&#39;:&#39;,`q`.`meaning`,`q`.`cnt`) order by `q`.`cnt` DESC,`q`.`meaning` DESC separator &#39;,&#39;) as char charset latin1) AS `ms` from (select `20100318_word`.`t_meaning`.`word` AS `word`,`20100318_word`.`t_meaning`.`meaning` AS `meaning`,count(0) AS `cnt` from `20100318_word`.`t_meaning` group by `20100318_word`.`t_meaning`.`word`,`20100318_word`.`t_meaning`.`meaning`) `q` group by `q`.`word`) `q2` join (select 1 AS `lvl` union all select 2 AS `lvl` union all select 3 AS `lvl`) `fields`) `q3`) `q4` order by `q4`.`word`,(4 - `q4`.`lvl`)
</pre>
<p>Note that we used</p>
<pre class="brush: sql">
SELECT  COUNT(*)
FROM    (
        SELECT  DISTINCT person
        FROM    t_meaning
        ) q
</pre>
<p>instead of the simpler</p>
<pre class="brush: sql">
SELECT  COUNT(DISTINCT person)
FROM    t_meaning
</pre>
<p>The latter looks more elegant, but <strong>MySQL</strong> is not able to use <code>INDEX FOR GROUP BY</code> for it.</p>
<p>That&#8217;s why we used the former subquery to calculate the totals, and we can see in the query plan that the <code>INDEX FOR GROUP BY</code> is being used.</p>
<p>The overall query time is <strong>1.10 sec</strong> which is just a little more than the time required to iterate all records in the table.</p>
<p>Hope that helps.</p>
<hr/>
<p>I&#8217;m always glad to answer the questions regarding database queries.</p>
<p><a href="/ask-a-question"><strong>Ask me a question</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/03/18/greatest-n-per-group-dealing-with-aggregates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Aggregates and LEFT JOIN</title>
		<link>http://explainextended.com/2010/03/05/aggregates-and-left-join/</link>
		<comments>http://explainextended.com/2010/03/05/aggregates-and-left-join/#comments</comments>
		<pubDate>Fri, 05 Mar 2010 20:00:09 +0000</pubDate>
		<dc:creator>Quassnoi</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://explainextended.com/?p=4548</guid>
		<description><![CDATA[From Stack Overflow: I have a table product with products and table sale with all sale operations that were done on these products. I would like to get 10 most often sold products today and what I did is this: SELECT p.*, COUNT(s.id) AS sumsell FROM product p LEFT JOIN sale s ON s.product_id = [...]]]></description>
			<content:encoded><![CDATA[<p>From <a href="http://stackoverflow.com/questions/2388122/how-to-increase-last-day-count-query-performance"><strong>Stack Overflow</strong></a>:</p>
<blockquote>
<p>I have a table <code>product</code> with products and table <code>sale</code> with all sale operations that were done on these products.</p>
<p>I would like to get <strong>10</strong> most often sold products today and what I did is this:</p>
<pre class="brush: sql">
SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
        sale s
ON      s.product_id = p.id
        AND s.dt &gt;= &#039;2010-01-01&#039;
        AND s.dt &lt; &#039;2010-01-02&#039;
GROUP BY
        p.id
ORDER BY
        sumsell DESC
LIMIT 10
</pre>
<p>, but performance of it is very slow.</p>
<p>What can I do to increase performance of this particular query?<br />
</p></blockquote>
<p>The query involves a <code>LEFT JOIN</code> which in the <strong>MySQL</strong> world means that <code>product</code> will be made leading in the query. Each record of <code>product</code> will be taken and checked against the <code>sale</code> table to find out the number of matching records. If no matching records are found, <strong>0</strong> is returned.</p>
<p>Let&#8217;s create the sample tables:<br />
<span id="more-4548"></span><br />
<a href="#" onclick="xcollapse('X9754');return false;"><strong>Table creation details</strong></a><br />
</p>
<div id="X9754" style="display: none; ">
<pre class="brush: sql">
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE product (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE sale (
        id INT NOT NULL PRIMARY KEY,
        product_id INT NOT NULL,
        amount FLOAT NOT NULL,
        dt DATETIME NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(500000);
COMMIT;

INSERT
INTO    product
SELECT  id, CONCAT(&#039;Product &#039;, id)
FROM    filler;

INSERT
INTO    sale (id, product_id, amount, dt)
SELECT  id,
        FLOOR(RAND(20100305) * 500000) + 1,
        RAND(20100305 &lt;&lt; 1) * 100 + 1,
        &#039;2010-03-06&#039; - INTERVAL id MINUTE
FROM    filler;

CREATE INDEX ix_sale_product_dt ON sale (product_id, dt);
CREATE INDEX ix_sale_dt_product ON sale (dt, product_id);
</pre>
</div>
<p>The tables contain <strong>500,000</strong> products and <strong>500,000</strong> random sales (<strong>1,440</strong> sales per day).</p>
<p>Now, let&#8217;s run a query similar to the author&#8217;s. I adjusted the period so that fewer than <strong>10</strong> actual sales were made during it and the <code>LEFT JOIN</code> records can be seen in the results:</p>
<pre class="brush: sql">
SELECT  p.*, COUNT(s.id) AS sumsell
FROM    product p
LEFT JOIN
        sale s
ON      s.product_id = p.id
        AND s.dt &gt;= &#039;2010-01-01&#039;
        AND s.dt &lt; &#039;2010-01-01 00:07:00&#039;
GROUP BY
        p.id
ORDER BY
        sumsell DESC, p.id
LIMIT 10
</pre>
<p><a href="#" onclick="xcollapse('X3182');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X3182" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>name</th>
<th>sumsell</th>
</tr>
<tr>
<td class="integer">50630</td>
<td class="varchar">Product 50630</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">143395</td>
<td class="varchar">Product 143395</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">222114</td>
<td class="varchar">Product 222114</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">322966</td>
<td class="varchar">Product 322966</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">340133</td>
<td class="varchar">Product 340133</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">345789</td>
<td class="varchar">Product 345789</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">462937</td>
<td class="varchar">Product 462937</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="varchar">Product 1</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="varchar">Product 2</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="varchar">Product 3</td>
<td class="bigint">0</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0009s (3.0312s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">p</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">5007270.00</td>
<td class="varchar">Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">SIMPLE</td>
<td class="varchar">s</td>
<td class="varchar">ref</td>
<td class="varchar">ix_sale_product_dt,ix_sale_dt_product</td>
<td class="varchar">ix_sale_product_dt</td>
<td class="varchar">4</td>
<td class="varchar">20100305_left.p.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using index</td>
</tr>
</table>
</div>
<pre>
select `20100305_left`.`p`.`id` AS `id`,`20100305_left`.`p`.`name` AS `name`,count(`20100305_left`.`s`.`id`) AS `sumsell` from `20100305_left`.`product` `p` left join `20100305_left`.`sale` `s` on(((`20100305_left`.`s`.`product_id` = `20100305_left`.`p`.`id`) and (`20100305_left`.`s`.`dt` &gt;= &#39;2010-01-01&#39;) and (`20100305_left`.`s`.`dt` &lt; &#39;2010-01-01 00:07:00&#39;))) where 1 group by `20100305_left`.`p`.`id` order by count(`20100305_left`.`s`.`id`) desc,`20100305_left`.`p`.`id` limit 10
</pre>
</div>
<p>The query runs for <strong>3 seconds</strong>.</p>
<p>We see that, first, <code>product</code> is made leading, and, second, only a part of the index on <code>sale (product_id, dt)</code> is used: each sale is only filtered on product, not on date.</p>
<p>Since there were only <strong>7</strong> sales during the period we have chosen, it would be a wise decision to make <code>sale</code> leading in the join so that it could be filtered on date and the resulting recordset was then joined to <code>product</code>. This would result in at most <strong>7</strong> <code>PRIMARY KEY</code> seeks instead of <strong>500,000</strong> range scans and would be much more efficient.</p>
<p>However, this is only possible with the <code>INNER JOIN</code>, and if there are fewer than <strong>10</strong> products sold within the time period, we will not see the rest.</p>
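<p>For illustration, an <code>INNER JOIN</code> version of the query could look like the sketch below. It was not run for this article, and the optimizer is free to choose the join order, so treat it as an assumption rather than a measured result:</p>
<pre class="brush: sql">
-- Sketch only: sale can be filtered on dt first, then joined to product by PRIMARY KEY
SELECT  p.*, COUNT(*) AS sumsell
FROM    sale s
JOIN    product p
ON      p.id = s.product_id
WHERE   s.dt &gt;= &#039;2010-01-01&#039;
        AND s.dt &lt; &#039;2010-01-01 00:07:00&#039;
GROUP BY
        p.id
ORDER BY
        sumsell DESC, p.id
LIMIT 10
</pre>
<p>On the sample data this should return only the <strong>7</strong> products sold during the period, not the <strong>10</strong> rows requested.</p>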
<p>To work around this, we need to emulate the <code>LEFT JOIN</code>:</p>
<ol>
<li>
<p>Find the products sold within the time period, using an <code>INNER JOIN</code> of <code>product</code> with the resultset containing aggregated sales.</p>
</li>
<li>
<p>Find the products <strong>not</strong> sold within the time period, using <code>NOT EXISTS</code> predicate.</p>
</li>
<li>
<p>Concatenate the two resultsets using <code>UNION ALL</code>.</p>
</li>
</ol>
<p>Step <strong>2</strong> implies that <code>product</code> is leading again, so normally it would not be much of an improvement. But in our case, we don&#8217;t need the whole recordset, we only need the top <strong>10</strong> sales.</p>
<p>So we can just order and limit the recordsets retrieved on steps <strong>1</strong> and <strong>2</strong> to ten records each, concatenate them, then order and limit the resulting recordset again to ten records.</p>
<p>The second resultset will contain a hardcoded <strong>0</strong> in the <code>sumsell</code>, so we just need to order it on <code>product.id</code>. Since <code>product</code> is an <strong>InnoDB</strong> table and <code>product.id</code> is a clustered <code>PRIMARY KEY</code>, this is not a problem.</p>
<p>Here&#8217;s the query:</p>
<pre class="brush: sql">
SELECT  p.*, sumsell
FROM    (
        SELECT  *
        FROM    (
                SELECT  product_id, sumsell
                FROM    (
                        SELECT  product_id, COUNT(*) AS sumsell
                        FROM    sale si
                        WHERE   dt &gt;= &#039;2010-01-01&#039;
                                AND dt &lt; &#039;2010-01-01 00:07:00&#039;
                        GROUP BY
                                product_id
                        ) si
                ORDER BY
                        sumsell DESC, product_id
                LIMIT 10
                ) q1
        UNION ALL
        SELECT  *
        FROM    (
                SELECT  p.id, 0
                FROM    product p
                WHERE   NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    sale si
                        WHERE   product_id = p.id
                                AND dt &gt;= &#039;2010-01-01&#039;
                                AND dt &lt; &#039;2010-01-01 00:07:00&#039;
                        )
                ORDER BY
                        p.id
                LIMIT 10
                ) q2
        ORDER BY
                sumsell DESC, product_id
        LIMIT 10
        ) q
JOIN    product p
ON      p.id = q.product_id
</pre>
<p><a href="#" onclick="xcollapse('X1041');return false;"><strong>View query details</strong></a><br />
</p>
<div id="X1041" style="display: none; ">
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>name</th>
<th>sumsell</th>
</tr>
<tr>
<td class="integer">50630</td>
<td class="varchar">Product 50630</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">143395</td>
<td class="varchar">Product 143395</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">222114</td>
<td class="varchar">Product 222114</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">322966</td>
<td class="varchar">Product 322966</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">340133</td>
<td class="varchar">Product 340133</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">345789</td>
<td class="varchar">Product 345789</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">462937</td>
<td class="varchar">Product 462937</td>
<td class="bigint">1</td>
</tr>
<tr>
<td class="integer">1</td>
<td class="varchar">Product 1</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">2</td>
<td class="varchar">Product 2</td>
<td class="bigint">0</td>
</tr>
<tr>
<td class="integer">3</td>
<td class="varchar">Product 3</td>
<td class="bigint">0</td>
</tr>
<tr class="statusbar">
<td colspan="100">10 rows fetched in 0.0012s (0.0064s)</td>
</tr>
</table>
</div>
<div class="terminal">
<table class="terminal">
<tr>
<th>id</th>
<th>select_type</th>
<th>table</th>
<th>type</th>
<th>possible_keys</th>
<th>key</th>
<th>key_len</th>
<th>ref</th>
<th>rows</th>
<th>filtered</th>
<th>Extra</th>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">&lt;derived2&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">1</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">p</td>
<td class="varchar">eq_ref</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar">q.product_id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">2</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived3&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">7</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">3</td>
<td class="varchar">DERIVED</td>
<td class="varchar">&lt;derived4&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">7</td>
<td class="double">100.00</td>
<td class="varchar">Using filesort</td>
</tr>
<tr>
<td class="bigint">4</td>
<td class="varchar">DERIVED</td>
<td class="varchar">si</td>
<td class="varchar">range</td>
<td class="varchar">ix_sale_dt_product</td>
<td class="varchar">ix_sale_dt_product</td>
<td class="varchar">8</td>
<td class="varchar"></td>
<td class="bigint">6</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index; Using temporary; Using filesort</td>
</tr>
<tr>
<td class="bigint">5</td>
<td class="varchar">UNION</td>
<td class="varchar">&lt;derived6&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">100.00</td>
<td class="varchar"></td>
</tr>
<tr>
<td class="bigint">6</td>
<td class="varchar">DERIVED</td>
<td class="varchar">p</td>
<td class="varchar">index</td>
<td class="varchar"></td>
<td class="varchar">PRIMARY</td>
<td class="varchar">4</td>
<td class="varchar"></td>
<td class="bigint">10</td>
<td class="double">5007270.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint">7</td>
<td class="varchar">DEPENDENT SUBQUERY</td>
<td class="varchar">si</td>
<td class="varchar">ref</td>
<td class="varchar">ix_sale_product_dt,ix_sale_dt_product</td>
<td class="varchar">ix_sale_product_dt</td>
<td class="varchar">4</td>
<td class="varchar">20100305_left.p.id</td>
<td class="bigint">1</td>
<td class="double">100.00</td>
<td class="varchar">Using where; Using index</td>
</tr>
<tr>
<td class="bigint"></td>
<td class="varchar">UNION RESULT</td>
<td class="varchar">&lt;union2,5&gt;</td>
<td class="varchar">ALL</td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="varchar"></td>
<td class="bigint"></td>
<td class="double"></td>
<td class="varchar">Using filesort</td>
</tr>
</table>
</div>
<pre>
Field or reference &#39;20100305_left.p.id&#39; of SELECT #7 was resolved in SELECT #6
select `20100305_left`.`p`.`id` AS `id`,`20100305_left`.`p`.`name` AS `name`,`q`.`sumsell` AS `sumsell` from (select `q1`.`product_id` AS `product_id`,`q1`.`sumsell` AS `sumsell` from (select `si`.`product_id` AS `product_id`,`si`.`sumsell` AS `sumsell` from (select `20100305_left`.`si`.`product_id` AS `product_id`,count(0) AS `sumsell` from `20100305_left`.`sale` `si` where ((`20100305_left`.`si`.`dt` &gt;= &#39;2010-01-01&#39;) and (`20100305_left`.`si`.`dt` &lt; &#39;2010-01-01 00:07:00&#39;)) group by `20100305_left`.`si`.`product_id`) `si` order by `si`.`sumsell` desc,`si`.`product_id` limit 10) `q1` union all select `q2`.`id` AS `id`,`q2`.`0` AS `0` from (select `20100305_left`.`p`.`id` AS `id`,0 AS `0` from `20100305_left`.`product` `p` where (not(exists(select NULL AS `NULL` from `20100305_left`.`sale` `si` where ((`20100305_left`.`si`.`product_id` = `20100305_left`.`p`.`id`) and (`20100305_left`.`si`.`dt` &gt;= &#39;2010-01-01&#39;) and (`20100305_left`.`si`.`dt` &lt; &#39;2010-01-01 00:07:00&#39;))))) order by `20100305_left`.`p`.`id` limit 10) `q2` order by `sumsell` desc,`product_id` limit 10) `q` join `20100305_left`.`product` `p` where (`20100305_left`.`p`.`id` = `q`.`product_id`)
</pre>
</div>
<p>This query completes in less than <strong>7 ms</strong>, which is comparable to the measurement error.</p>
]]></content:encoded>
			<wfw:commentRss>http://explainextended.com/2010/03/05/aggregates-and-left-join/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
