Selecting non-unique rows
Sometimes we need to select all rows for a table that have duplicate values in some of the columns. Like, we want to select all user comments for all posts commented by more than one user.
If there are two or more comments for a post, we select all comments for this post; if there is only one comment, we select none.
Let’s create the sample tables to illustrate our needs:
Table creation details
We have two tables with 10,000,000 records in each of them. The first table has 3,000,000 distinct value‘s, while the seconds one has only 3.
We can use COUNT(*) to filter out the values that don’t have duplicates. To avoid selecting millions of rows, we will measure the speed by selecting the SUM() if id‘s from rows that fit our criteria:
SELECT SUM(id)
FROM (
SELECT id
FROM (
SELECT value, COUNT(*)
FROM t_dup
GROUP BY value
HAVING COUNT(*) > 1
) dpd
JOIN t_dup dp
ON dp.value = dpd.value
) q
| SUM(id) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50000005000000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0000s (235.9501s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Of course, ten million rows is a lot of rows, but 235 seconds is way too much.
How can we improve the query?
First of all, we don’t need COUNT(*). It’s enough to know that there are duplicates, not the exact number of the duplicates. To filter out the rows, we can just use the following query:
SELECT SUM(id)
FROM (
SELECT id
FROM (
SELECT DISTINCT value
FROM t_dup
) dpd
JOIN t_dup dp
ON dp.value = dpd.value
WHERE EXISTS (
SELECT 1
FROM t_dup dpi
WHERE dpi.value = dpd.value
LIMIT 1, 1
)
) q
| SUM(id) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50000005000000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0000s (141.0598s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Much nicer.
But can we improve the query even more?
But looking on the output of EXPLAIN EXTENDED, we see:
DEPENDENT SUBQUERY here means that the subquery cannot be cached and should be reevaluated on each row. This is not too bad, as it will be called for each of distinct value‘s exactly once anyway. But we have a join with another instance of the table, which is not very efficient.
If we just get rid of the join and just call the subquery for all values instead of calling it for distinct values. It will kill all performance benefits resulted from eliminating the join.
But we can cache the results of the subquery in a session variable, and call the just once for each of the distinct values.
We do it in the following steps:
- Select all rows from the table, sorting them by
value - Check if
valuehas changed from the previous row, comparing it with the previous result stored in session variable@s - If value has changed, then check the row for duplicates with a subquery, and store the result in the session variable
@r. Therefore,@rchanges if and only ifvaluechanges - Return all rows with non-
NULL@r
The query looks like this:
SELECT SUM(id)
FROM (
SELECT id
FROM (
SELECT @s := NULL,
@r := NULL
) vars, t_dup dpo
WHERE CASE
WHEN @s IS NULL OR @s <> value THEN
@r := (
SELECT 1
FROM t_dup dpi
WHERE dpi.value = dpo.value
LIMIT 1, 1
)
ELSE
@r
END IS NOT NULL
AND (@s := value) IS NOT NULL
ORDER BY
value
) q
| SUM(id) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50000005000000 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 row fetched in 0.0000s (87.4676s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 times as fast as the the original query, 1.5 times as fast as optimized one using the join.
Subscribe in a reader