MySQL: GROUP BY in UNION
From Stack Overflow:
I have a query where I have a custom developed UDF that is used to calculate whether or not certain points are within a polygon (first query in
UNION) or circular (second query inUNION) shape.SELECT a.geo_boundary_id, … FROM geo_boundary_vertex a, … … GROUP BY a.geo_boundary_id UNION SELECT b.geo_boundary_id, … FROM geo_boundary b, … … GROUP BY b.geo_boundary_idWhen I run an explain for the query I get
filesortfor both queries within theUNION.Now, I can split the queries up and use the
ORDER BY NULLtrick to get rid of thefilesorthowever when I attempt to add that to the end of aUNIONit doesn't work.How do I get rid of the
filesort?
In MySQL, GROUP BY also implies ORDER BY on the same set of expressions in the same order. That's why it adds an additional filesort operation to sort the resultset if it does not come out naturally sorted (say, from an index).
This is not always a desired behavior, and MySQL manual suggests adding ORDER BY NULL to the queries where sorting is not required. This can improve performance of the queries.
Let's create a sample table and see:
The table contains 3,000,000 random records with value1 and value2 between 1 and 300,000.
Here's the plan we get with a mere UNION of two GROUP BY queries:
SELECT value1 AS value FROM grouping GROUP BY value1 UNION SELECT value2 AS value FROM grouping GROUP BY value2 LIMIT 10
| value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 rows fetched in 0.0002s (8.4998s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | grouping | ALL | 3000279 | 100.00 | Using temporary; Using filesort | ||||
| 2 | UNION | grouping | ALL | 3000279 | 100.00 | Using temporary; Using filesort | ||||
| UNION RESULT | <union1,2> | ALL |
select `20110330_group`.`grouping`.`value1` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value1` union select `20110330_group`.`grouping`.`value2` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value2` limit 10
We see that there is a filesort in each of the queries.
MySQL does allow using ORDER BY in the queries merged with UNION or UNION ALL. To do this, we just need to wrap each query into a set of parentheses:
( SELECT value1 AS value FROM grouping GROUP BY value1 ORDER BY NULL ) UNION ( SELECT value2 AS value FROM grouping GROUP BY value2 ORDER BY NULL ) LIMIT 10
| value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 rows fetched in 0.0002s (8.4792s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | grouping | ALL | 3000279 | 100.00 | Using temporary; Using filesort | ||||
| 2 | UNION | grouping | ALL | 3000279 | 100.00 | Using temporary; Using filesort | ||||
| UNION RESULT | <union1,2> | ALL |
(select `20110330_group`.`grouping`.`value1` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value1` order by NULL) union (select `20110330_group`.`grouping`.`value2` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value2` order by NULL) limit 10
However, the plan remained the same. Why?
MySQL's documentation says:
Use of
ORDER BYfor individualSELECTstatements implies nothing about the order in which the rows appear in the final result because UNION by default produces an unordered set of rows. Therefore, the use ofORDER BYin this context is typically in conjunction withLIMIT, so that it is used to determine the subset of the selected rows to retrieve for theSELECT, even though it does not necessarily affect the order of those rows in the finalUNIONresult. IfORDER BYappears withoutLIMITin aSELECT, it is optimized away because it will have no effect anyway.
This means that the optimizer just removes ORDER BY from the UNION parts if they are not used along with LIMIT.
This is of course a good idea: since individual ORDER BY have no effect on the order of the final query anyway, there is no use in executing them or even taking them into account.
However, this idea would be much better if the same were also true for GROUP BY. Currently, the optimizer does not optimize away the ordering behavior of the GROUP BY queries which are parts of a UNION and they cannot be cured with ORDER BY NULL (whose only goal is not to order) since this is removed by the optimizer.
However, since only ORDER BY clauses not accompanied with a LIMIT are thrown away, we could just add a LIMIT. Of course it should be large enough to guarantee that all record would be returned.
Let's see:
( SELECT value1 AS value FROM grouping GROUP BY value1 ORDER BY NULL LIMIT 10000000000 ) UNION ( SELECT value2 AS value FROM grouping GROUP BY value2 ORDER BY NULL LIMIT 10000000000 ) LIMIT 10
| value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12462 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 205466 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 89941 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 133309 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 96722 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 83683 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 128249 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 90196 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 66232 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 60571 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 rows fetched in 0.0002s (6.9842s) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | grouping | ALL | 3000279 | 100.00 | Using temporary | ||||
| 2 | UNION | grouping | ALL | 3000279 | 100.00 | Using temporary | ||||
| UNION RESULT | <union1,2> | ALL |
(select `20110330_group`.`grouping`.`value1` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value1` order by NULL limit 10000000000) union (select `20110330_group`.`grouping`.`value2` AS `value` from `20110330_group`.`grouping` group by `20110330_group`.`grouping`.`value2` order by NULL limit 10000000000) limit 10
Now, there are no filesort operations in the plan and the query runs 20% faster.
This is not a very elegant solution of course. More than that, a solution similar to it was used for SQL Server 2000 which does not allow using ORDER BY without a TOP in the inline views. In SQL Server 2000, TOP 100% forced the order of the nested queries and usually made them spooled or otherwise materialized.
It was quite a nasty surprise when SQL Server 2005 has improved
its optimizer to detect such tricks and ignore ORDER BY for TOP 100% queries. However, with all improvements introduced in SQL Server 2005, most of these queries could be just rewritten in a more clean and efficient way.
Nevertheless, this solution is still safe to use, because it does not change the semantics of the query (if LIMIT is chosen large enough), but is just an optimizer hack. In the worst case, the query will just become as slow as it initially was, and an extra filesort is not the worst of all things that can happen to a query.
Meanwhile, I've posted it as bug 60702 to MySQL bug database and am hoping they'll fix it in the next release.
Subscribe in a reader
I’m not sure what your workaround is doing, but the results for your workaround are different from the original query, so the claim that its faster doesn’t really apply – if the results are not the same, its not the same query :).
As I understand it, a group by will necessarily have to sort the data in order to group it. I suppose you could group by without sorting, but I doubt it will be as efficient.
Raj
1 Apr 11 at 00:05
@Raj: I didn’t append an
ORDER BYto the trailing result so that it is obvious that the last query does not sort. You may add anORDER BYbefore the trailingLIMITif you like and see that the results are identical, the latter query still being more efficient. This is not the point of the article.Grouping does not necessarily sort the data, and a
sort group byis not necessarily the most efficient method. In fact, thefilesortyou see in the plans is used to sort the data after they were grouped.Quassnoi
1 Apr 11 at 02:07