EXPLAIN EXTENDED

How to create fast database queries

MySQL: GROUP BY in UNION

with 2 comments

From Stack Overflow:

I have a query where I have a custom developed UDF that is used to calculate whether or not certain points are within a polygon (first query in UNION) or circular (second query in UNION) shape.

SELECT  a.geo_boundary_id, …
FROM     geo_boundary_vertex a, …
…
GROUP BY
a.geo_boundary_id
UNION
SELECT  b.geo_boundary_id, …
FROM     geo_boundary b, …
…
GROUP BY
b.geo_boundary_id

When I run an explain for the query I get filesort for both queries within the UNION.

Now, I can split the queries up and use the ORDER BY NULL trick to get rid of the filesort however when I attempt to add that to the end of a UNION it doesn't work.

How do I get rid of the filesort?

In MySQL, GROUP BY also implies ORDER BY on the same set of expressions in the same order. That's why it adds an additional filesort operation to sort the resultset if it does not come out naturally sorted (say, from an index).

This is not always a desired behavior, and MySQL manual suggests adding ORDER BY NULL to the queries where sorting is not required. This can improve performance of the queries.

Let's create a sample table and see:

Table creation details

The table contains 3,000,000 random records with value1 and value2 between 1 and 300,000.

Here's the plan we get with a mere UNION of two GROUP BY queries:

SELECT  value1 AS value
FROM    grouping
GROUP BY
value1
UNION
SELECT  value2 AS value
FROM    grouping
GROUP BY
value2
LIMIT 10

value
1
2
3
4
5
6
7
8
9
10
10 rows fetched in 0.0002s (8.4998s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY grouping ALL 3000279 100.00 Using temporary; Using filesort
2 UNION grouping ALL 3000279 100.00 Using temporary; Using filesort
UNION RESULT <union1,2> ALL
select `20110330_group`.`grouping`.`value1` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value1` union select `20110330_group`.`grouping`.`value2` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value2` limit 10

We see that there is a filesort in each of the queries.

MySQL does allow using ORDER BY in the queries merged with UNION or UNION ALL. To do this, we just need to wrap each query into a set of parentheses:

(
SELECT  value1 AS value
FROM    grouping
GROUP BY
value1
ORDER BY
NULL
)
UNION
(
SELECT  value2 AS value
FROM    grouping
GROUP BY
value2
ORDER BY
NULL
)
LIMIT 10

value
1
2
3
4
5
6
7
8
9
10
10 rows fetched in 0.0002s (8.4792s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY grouping ALL 3000279 100.00 Using temporary; Using filesort
2 UNION grouping ALL 3000279 100.00 Using temporary; Using filesort
UNION RESULT <union1,2> ALL
(select `20110330_group`.`grouping`.`value1` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value1` order by NULL) union (select `20110330_group`.`grouping`.`value2` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value2` order by NULL) limit 10

However, the plan remained the same. Why?

MySQL's documentation says:

Use of ORDER BY for individual SELECT statements implies nothing about the order in which the rows appear in the final result because UNION by default produces an unordered set of rows. Therefore, the use of ORDER BY in this context is typically in conjunction with LIMIT, so that it is used to determine the subset of the selected rows to retrieve for the SELECT, even though it does not necessarily affect the order of those rows in the final UNION result. If ORDER BY appears without LIMIT in a SELECT, it is optimized away because it will have no effect anyway.

This means that the optimizer just removes ORDER BY from the UNION parts if they are not used along with LIMIT.

This is of course a good idea: since individual ORDER BY have no effect on the order of the final query anyway, there is no use in executing them or even taking them into account.

However, this idea would be much better if the same were also true for GROUP BY. Currently, the optimizer does not optimize away the ordering behavior of the GROUP BY queries which are parts of a UNION and they cannot be cured with ORDER BY NULL (whose only goal is not to order) since this is removed by the optimizer.

However, since only ORDER BY clauses not accompanied with a LIMIT are thrown away, we could just add a LIMIT. Of course it should be large enough to guarantee that all record would be returned.

Let's see:

(
SELECT  value1 AS value
FROM    grouping
GROUP BY
value1
ORDER BY
NULL
LIMIT 10000000000
)
UNION
(
SELECT  value2 AS value
FROM    grouping
GROUP BY
value2
ORDER BY
NULL
LIMIT 10000000000
)
LIMIT 10

value
12462
205466
89941
133309
96722
83683
128249
90196
66232
60571
10 rows fetched in 0.0002s (6.9842s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY grouping ALL 3000279 100.00 Using temporary
2 UNION grouping ALL 3000279 100.00 Using temporary
UNION RESULT <union1,2> ALL
(select `20110330_group`.`grouping`.`value1` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value1` order by NULL limit 10000000000) union (select `20110330_group`.`grouping`.`value2` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value2` order by NULL limit 10000000000) limit 10

Now, there are no filesort operations in the plan and the query runs 20% faster.

This is not a very elegant solution of course. More than that, a solution similar to it was used for SQL Server 2000 which does not allow using ORDER BY without a TOP in the inline views. In SQL Server 2000, TOP 100% forced the order of the nested queries and usually made them spooled or otherwise materialized.

It was quite a nasty surprise when SQL Server 2005 has improved its optimizer to detect such tricks and ignore ORDER BY for TOP 100% queries. However, with all improvements introduced in SQL Server 2005, most of these queries could be just rewritten in a more clean and efficient way.

Nevertheless, this solution is still safe to use, because it does not change the semantics of the query (if LIMIT is chosen large enough), but is just an optimizer hack. In the worst case, the query will just become as slow as it initially was, and an extra filesort is not the worst of all things that can happen to a query.

Meanwhile, I've posted it as bug 60702 to MySQL bug database and am hoping they'll fix it in the next release.

Written by Quassnoi

March 30th, 2011 at 11:00 pm

Posted in MySQL

2 Responses to 'MySQL: GROUP BY in UNION'

Subscribe to comments with RSS

  1. I’m not sure what your workaround is doing, but the results for your workaround are different from the original query, so the claim that its faster doesn’t really apply – if the results are not the same, its not the same query :).

    As I understand it, a group by will necessarily have to sort the data in order to group it. I suppose you could group by without sorting, but I doubt it will be as efficient.

    Raj

    1 Apr 11 at 00:05

  2. @Raj: I didn’t append an ORDER BY to the trailing result so that it is obvious that the last query does not sort. You may add an ORDER BY before the trailing LIMIT if you like and see that the results are identical, the latter query still being more efficient. This is not the point of the article.

    Grouping does not necessarily sort the data, and a sort group by is not necessarily the most efficient method. In fact, the filesort you see in the plans is used to sort the data after they were grouped.

    Quassnoi

    1 Apr 11 at 02:07

Leave a Reply