EXPLAIN EXTENDED

How to create fast database queries

Selecting non-unique rows

with one comment

Sometimes we need to select all rows for a table that have duplicate values in some of the columns. Like, we want to select all user comments for all posts commented by more than one user.

If there are two or more comments for a post, we select all comments for this post; if there is only one comment, we select none.

Let's create the sample tables to illustrate our needs:

Table creation details

We have two tables with 10,000,000 records in each of them. The first table has 3,000,000 distinct value's, while the seconds one has only 3.

We can use COUNT(*) to filter out the values that don't have duplicates. To avoid selecting millions of rows, we will measure the speed by selecting the SUM() if id's from rows that fit our criteria:

SELECT  SUM(id)
FROM    (
        SELECT  id
        FROM    (
                SELECT  value, COUNT(*)
                FROM    t_dup
                GROUP BY value
                HAVING  COUNT(*) > 1
                ) dpd
        JOIN    t_dup dp
        ON      dp.value = dpd.value
        ) q
SUM(id)
50000005000000
1 row fetched in 0.0000s (235.9501s)

Of course, ten million rows is a lot of rows, but 235 seconds is way too much.

How can we improve the query?

First of all, we don't need COUNT(*). It's enough to know that there are duplicates, not the exact number of the duplicates. To filter out the rows, we can just use the following query:

SELECT  SUM(id)
FROM    (
        SELECT  id
        FROM    (
                SELECT  DISTINCT value
                FROM    t_dup
                ) dpd
        JOIN    t_dup dp
        ON      dp.value = dpd.value
        WHERE   EXISTS  (
                SELECT  1
                FROM    t_dup dpi
                WHERE   dpi.value = dpd.value
                LIMIT 1, 1
                )
        ) q
SUM(id)
50000005000000
1 row fetched in 0.0000s (141.0598s)

Much nicer.

But can we improve the query even more?

But looking on the output of EXPLAIN EXTENDED, we see:

DEPENDENT SUBQUERY here means that the subquery cannot be cached and should be reevaluated on each row. This is not too bad, as it will be called for each of distinct value's exactly once anyway. But we have a join with another instance of the table, which is not very efficient.

If we just get rid of the join and just call the subquery for all values instead of calling it for distinct values. It will kill all performance benefits resulted from eliminating the join.

But we can cache the results of the subquery in a session variable, and call the just once for each of the distinct values.

We do it in the following steps:

  1. Select all rows from the table, sorting them by value
  2. Check if value has changed from the previous row, comparing it with the previous result stored in session variable @s
  3. If value has changed, then check the row for duplicates with a subquery, and store the result in the session variable @r. Therefore, @r changes if and only if value changes
  4. Return all rows with non-NULL @r

The query looks like this:

SELECT  SUM(id)
FROM    (
        SELECT  id
        FROM    (
                SELECT  @s := NULL,
                        @r := NULL
                ) vars, t_dup dpo
        WHERE   CASE
                    WHEN @s IS NULL OR @s <> value THEN
                        @r := (
                        SELECT  1
                        FROM    t_dup dpi
                        WHERE   dpi.value = dpo.value
                        LIMIT 1, 1
                        )
                    ELSE
                        @r
                END IS NOT NULL
                AND (@s := value) IS NOT NULL
        ORDER BY
                value
        ) q
SUM(id)
50000005000000
1 row fetched in 0.0000s (87.4676s)

3 times as fast as the the original query, 1.5 times as fast as optimized one using the join.

Written by Quassnoi

March 22nd, 2009 at 11:00 pm

Posted in MySQL

One Response to 'Selecting non-unique rows'

Subscribe to comments with RSS

  1. I really like your articles and hope that you will continue this blog in a future. Thank you for sharing your knowledge.
    One question on the second version of the query.
    I assume that there are no sense to use a derrived query
    (… SELECT DISTINCT value FROM t_dup… ) because we anyway join the results with t_dup and after that use the subquery in WHERE clause to all rows in t_dup. I just removed the derrived query and got the same results. Am I right or may be I miss something?

    The corrected variant of the second query:

    SELECT SUM(id)
    FROM (
    SELECT id, value
    FROM t_dup dpd
    WHERE EXISTS (
    SELECT 1
    FROM t_dup dpi
    WHERE dpi.value = dpd.value
    LIMIT 1, 1
    )
    ) q

    Max

    11 Jan 17 at 14:18

Leave a Reply