Selecting non-unique rows
Sometimes we need to select all rows of a table that have duplicate values in some of the columns. For instance, we may want to select all user comments for all posts commented on by more than one user.
If there are two or more comments for a post, we select all comments for this post; if there is only one comment, we select none.
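To illustrate the idea before we move to the benchmark tables, here is a minimal sketch of such a query. The comments table and its columns are hypothetical and are only there to show the pattern:

-- Hypothetical table: comments (id, post_id, user_id, body)
-- Select every comment belonging to a post that has two or more comments
-- (to require more than one distinct commenter, use HAVING COUNT(DISTINCT user_id) > 1)
SELECT  c.*
FROM    comments c
JOIN    (
        SELECT  post_id
        FROM    comments
        GROUP BY
                post_id
        HAVING  COUNT(*) > 1
        ) dup
ON      dup.post_id = c.post_id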
Let's create the sample tables to illustrate our needs:
Table creation details
CREATE TABLE t_dup (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        value INT NOT NULL,
        KEY ix_dup_value (value)
);

CREATE TABLE t_dup2 (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        value INT NOT NULL,
        KEY ix_dup_value (value)
);

DELIMITER $$

CREATE PROCEDURE prc_fill_dup (cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt := 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    t_dup (id, value)
                VALUES  (_cnt, _cnt % 3000000);
                SET _cnt := _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_fill_dup (10000000);
COMMIT;

INSERT
INTO    t_dup2
SELECT  id, id % 3
FROM    t_dup;

We have two tables with 10,000,000 records in each of them. The first table has 3,000,000 distinct values of value, while the second one has only 3.

We can use COUNT(*) to filter out the values that don't have duplicates. To avoid selecting millions of rows, we will measure the speed by selecting the SUM() of the id's from the rows that fit our criteria:

SELECT  SUM(id)
FROM    (
        SELECT  id
        FROM    (
                SELECT  value, COUNT(*)
                FROM    t_dup
                GROUP BY
                        value
                HAVING  COUNT(*) > 1
                ) dpd
        JOIN    t_dup dp
        ON      dp.value = dpd.value
        ) q
SUM(id)
--------------
50000005000000

1 row fetched in 0.0000s (235.9501s)
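Before we start optimizing, we can double-check the data distribution described above. This is just a quick sanity check and is not part of the timed queries; it should return 3,000,000 and 3, respectively:

SELECT  COUNT(DISTINCT value)
FROM    t_dup;

SELECT  COUNT(DISTINCT value)
FROM    t_dup2;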
Of course, ten million rows is a lot of rows, but 235 seconds is way too much.
How can we improve the query?
First of all, we don't need COUNT(*): it's enough to know that a duplicate exists, not the exact number of duplicates. An EXISTS subquery with LIMIT 1, 1 does exactly that: it skips the first matching row and succeeds only if a second one is found. To filter out the rows, we can just use the following query:
SELECT  SUM(id)
FROM    (
        SELECT  id
        FROM    (
                SELECT  DISTINCT value
                FROM    t_dup
                ) dpd
        JOIN    t_dup dp
        ON      dp.value = dpd.value
        WHERE   EXISTS
                (
                SELECT  1
                FROM    t_dup dpi
                WHERE   dpi.value = dpd.value
                LIMIT 1, 1
                )
        ) q
SUM(id)
--------------
50000005000000

1 row fetched in 0.0000s (141.0598s)
Much nicer.
But can we improve the query even more?
But looking at the output of EXPLAIN EXTENDED, we see that the EXISTS clause is executed as a DEPENDENT SUBQUERY.

DEPENDENT SUBQUERY here means that the subquery cannot be cached and has to be reevaluated on each row. This is not too bad in itself, since it will be called exactly once for each distinct value anyway. But we still have a join with another instance of the table, which is not very efficient.
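The plan can be reproduced by prepending EXPLAIN EXTENDED to the query (the exact row estimates will, of course, depend on your server and data; on recent MySQL versions a plain EXPLAIN shows the same information):

EXPLAIN EXTENDED
SELECT  SUM(id)
FROM    (
        SELECT  id
        FROM    (
                SELECT  DISTINCT value
                FROM    t_dup
                ) dpd
        JOIN    t_dup dp
        ON      dp.value = dpd.value
        WHERE   EXISTS
                (
                SELECT  1
                FROM    t_dup dpi
                WHERE   dpi.value = dpd.value
                LIMIT 1, 1
                )
        ) q;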
If we just got rid of the join and called the subquery for every row instead of calling it once per distinct value, that would kill all the performance benefits gained from eliminating the join. But we can cache the result of the subquery in a session variable and call it just once for each distinct value.
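As a minimal sketch of the mechanism we are about to rely on: a session variable keeps its value from row to row within a single statement, so it can carry state from the previous row to the current one. Note that MySQL does not formally guarantee the evaluation order of user variables within a statement; this is the behavior the trick depends on:

SELECT  dp.id,
        dp.value,
        @prev             AS prev_value, -- value seen on the previous row (NULL on the first row)
        @prev := dp.value AS this_value  -- remember the current value for the next row
FROM    (
        SELECT  @prev := NULL            -- initialize the variable in the same statement
        ) vars,
        t_dup dp
LIMIT 10;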
We do it in the following steps:
- Select all rows from the table, sorting them by value
- Check whether value has changed from the previous row by comparing it with the previous value stored in the session variable @s
- If value has changed, check the row for duplicates with a subquery and store the result in the session variable @r. Therefore, @r changes if and only if value changes
- Return all rows with a non-NULL @r
The query looks like this:
SELECT  SUM(id)
FROM    (
        SELECT  id
        FROM    (
                SELECT  @s := NULL, @r := NULL
                ) vars,
                t_dup dpo
        WHERE   CASE
                WHEN @s IS NULL OR @s <> value THEN
                        @r :=
                        (
                        SELECT  1
                        FROM    t_dup dpi
                        WHERE   dpi.value = dpo.value
                        LIMIT 1, 1
                        )
                ELSE
                        @r
                END IS NOT NULL
                AND (@s := value) IS NOT NULL
        ORDER BY
                value
        ) q
SUM(id)
--------------
50000005000000

1 row fetched in 0.0000s (87.4676s)
This is 3 times as fast as the original query, and 1.5 times as fast as the optimized one using the join.
Written by Quassnoi
March 22nd, 2009 at 11:00 pm
Posted in MySQL
One Response to 'Selecting non-unique rows'
I really like your articles and hope that you will continue this blog in the future. Thank you for sharing your knowledge.

One question on the second version of the query.

I assume there is no point in using a derived query (… SELECT DISTINCT value FROM t_dup … ), because we join the results with t_dup anyway and then apply the subquery in the WHERE clause to all rows in t_dup. I just removed the derived query and got the same results. Am I right, or maybe I am missing something?
The corrected variant of the second query:
SELECT SUM(id)
FROM (
SELECT id, value
FROM t_dup dpd
WHERE EXISTS (
SELECT 1
FROM t_dup dpi
WHERE dpi.value = dpd.value
LIMIT 1, 1
)
) q
Max
11 Jan 17 at 14:18