Selecting non-unique rows
Sometimes we need to select all rows of a table that have duplicate values in some of the columns. For instance, we may want to select all user comments for all posts commented on by more than one user.
If there are two or more comments for a post, we select all comments for this post; if there is only one comment, we select none.
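To illustrate the idea before we move to the benchmark tables, here is a minimal sketch of such a query. The comments table and its columns are hypothetical and are only there to show the pattern:

-- Hypothetical table: comments (id, post_id, user_id, body)
-- Select every comment belonging to a post that has two or more comments
-- (to require more than one distinct commenter, use HAVING COUNT(DISTINCT user_id) > 1)
SELECT  c.*
FROM    comments c
JOIN    (
        SELECT  post_id
        FROM    comments
        GROUP BY
                post_id
        HAVING  COUNT(*) > 1
        ) dup
ON      dup.post_id = c.post_id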
Let's create the sample tables to illustrate our needs:
Table creation details
CREATE TABLE t_dup (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        value INT NOT NULL,
        KEY ix_dup_value (value)
);

CREATE TABLE t_dup2 (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        value INT NOT NULL,
        KEY ix_dup_value (value)
);

DELIMITER $$

CREATE PROCEDURE prc_fill_dup (cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt := 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    t_dup (id, value)
                VALUES  (_cnt, _cnt % 3000000);
                SET _cnt := _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_fill_dup (10000000);
COMMIT;

INSERT
INTO    t_dup2
SELECT  id, id % 3
FROM    t_dup;

We have two tables with 10,000,000 records in each of them. The first table has 3,000,000 distinct values of value, while the second one has only 3.

We can use COUNT(*) to filter out the values that don't have duplicates. To avoid selecting millions of rows, we will measure the speed by selecting the SUM() of the id's from the rows that fit our criteria:

SELECT  SUM(id)
FROM    (
        SELECT  id
        FROM    (
                SELECT  value, COUNT(*)
                FROM    t_dup
                GROUP BY
                        value
                HAVING  COUNT(*) > 1
                ) dpd
        JOIN    t_dup dp
        ON      dp.value = dpd.value
        ) q
SUM(id)
--------------
50000005000000

1 row fetched in 0.0000s (235.9501s)
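Before we start optimizing, we can double-check the data distribution described above. This is just a quick sanity check and is not part of the timed queries; it should return 3,000,000 and 3, respectively:

SELECT  COUNT(DISTINCT value)
FROM    t_dup;

SELECT  COUNT(DISTINCT value)
FROM    t_dup2;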
Of course, ten million rows is a lot of rows, but 235 seconds is way too much.
How can we improve the query?
First of all, we don't need COUNT(*): it's enough to know that a duplicate exists, not the exact number of duplicates. An EXISTS subquery with LIMIT 1, 1 does exactly that: it skips the first matching row and succeeds only if a second one is found. To filter out the rows, we can just use the following query:
SELECT  SUM(id)
FROM    (
        SELECT  id
        FROM    (
                SELECT  DISTINCT value
                FROM    t_dup
                ) dpd
        JOIN    t_dup dp
        ON      dp.value = dpd.value
        WHERE   EXISTS
                (
                SELECT  1
                FROM    t_dup dpi
                WHERE   dpi.value = dpd.value
                LIMIT 1, 1
                )
        ) q
SUM(id)
--------------
50000005000000

1 row fetched in 0.0000s (141.0598s)
Much nicer.
But can we improve the query even more?
But looking at the output of EXPLAIN EXTENDED, we see that the EXISTS clause is executed as a DEPENDENT SUBQUERY.

DEPENDENT SUBQUERY here means that the subquery cannot be cached and has to be reevaluated on each row. This is not too bad in itself, since it will be called exactly once for each distinct value anyway. But we still have a join with another instance of the table, which is not very efficient.
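The plan can be reproduced by prepending EXPLAIN EXTENDED to the query (the exact row estimates will, of course, depend on your server and data; on recent MySQL versions a plain EXPLAIN shows the same information):

EXPLAIN EXTENDED
SELECT  SUM(id)
FROM    (
        SELECT  id
        FROM    (
                SELECT  DISTINCT value
                FROM    t_dup
                ) dpd
        JOIN    t_dup dp
        ON      dp.value = dpd.value
        WHERE   EXISTS
                (
                SELECT  1
                FROM    t_dup dpi
                WHERE   dpi.value = dpd.value
                LIMIT 1, 1
                )
        ) q;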
If we just got rid of the join and called the subquery for every row instead of calling it once per distinct value, that would kill all the performance benefits gained from eliminating the join. But we can cache the result of the subquery in a session variable and call it just once for each distinct value.
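As a minimal sketch of the mechanism we are about to rely on: a session variable keeps its value from row to row within a single statement, so it can carry state from the previous row to the current one. Note that MySQL does not formally guarantee the evaluation order of user variables within a statement; this is the behavior the trick depends on:

SELECT  dp.id,
        dp.value,
        @prev             AS prev_value, -- value seen on the previous row (NULL on the first row)
        @prev := dp.value AS this_value  -- remember the current value for the next row
FROM    (
        SELECT  @prev := NULL            -- initialize the variable in the same statement
        ) vars,
        t_dup dp
LIMIT 10;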
We do it in the following steps:
- Select all rows from the table, sorting them by value
- Check whether value has changed from the previous row by comparing it with the previous value stored in the session variable @s
- If value has changed, check the row for duplicates with a subquery and store the result in the session variable @r. Therefore, @r changes if and only if value changes
- Return all rows with a non-NULL @r
The query looks like this:
SELECT  SUM(id)
FROM    (
        SELECT  id
        FROM    (
                SELECT  @s := NULL, @r := NULL
                ) vars,
                t_dup dpo
        WHERE   CASE
                WHEN @s IS NULL OR @s <> value THEN
                        @r :=
                        (
                        SELECT  1
                        FROM    t_dup dpi
                        WHERE   dpi.value = dpo.value
                        LIMIT 1, 1
                        )
                ELSE
                        @r
                END IS NOT NULL
                AND (@s := value) IS NOT NULL
        ORDER BY
                value
        ) q
SUM(id)
--------------
50000005000000

1 row fetched in 0.0000s (87.4676s)
This is 3 times as fast as the original query, and 1.5 times as fast as the optimized one using the join.
Written by Quassnoi
March 22nd, 2009 at 11:00 pm
Posted in MySQL
One Response to 'Selecting non-unique rows'
I really like your articles and hope that you will continue this blog in the future. Thank you for sharing your knowledge.

One question on the second version of the query.

I assume there is no point in using a derived query (… SELECT DISTINCT value FROM t_dup … ), because we join the results with t_dup anyway and then apply the subquery in the WHERE clause to all rows in t_dup. I just removed the derived query and got the same results. Am I right, or maybe I am missing something?
The corrected variant of the second query:
SELECT SUM(id)
FROM (
SELECT id, value
FROM t_dup dpd
WHERE EXISTS (
SELECT 1
FROM t_dup dpi
WHERE dpi.value = dpd.value
LIMIT 1, 1
)
) q
Max
11 Jan 17 at 14:18