Multiple attributes in a EAV table: GROUP BY vs. NOT EXISTS
Answering questions asked on the site.
Andrew Stillard asks:
I have a store which will hold around 50,000 products in a products table. Each product will have 14 options, giving 700,000 options in total. These are held in an options table which is joined via the product id.
Users search for products based on the options via an Advanced Search menu.
The users need to be able to select multiple options upon which to query. I would normally use a
JOIN
if it was just the one option to select upon, but because its a variable number i thought it would be best to loop through theWHERE EXISTS
statement.The issue i have currently is that the query is taking a minimum of 18 seconds (And that was a query when the tables only had a fraction of the total products in).
If you can help us speed this up, or suggest an alternative idea that would be greatly appreciated.
The option table mentioned here is in fact an implementation of the EAV model in a relational database.
Each record basically contains 3 things: id
of the product it describes; id
of the option and the value of the given option for the given product. These fields represent entity, attribute and value, respectively.
This model is very easy to maintain and expand should the need arise: all we have to do to define an extra attribute is to add a record to the EAV table with the name and the value of the attribute. This is a DML operation rather than a DDL one.
However, this model has a serious drawback: we cannot efficiently search for two or more options at once. An index can only be defined on several fields from a single record, so we can only search for a single option using an index.
There are two approaches to writing a query which would search for the entities with the certain conditions on several attributes at once:
- For each attribute, find all entities for which the conditions on the given attribute hold, then aggregate the resulting entities, using
COUNT(*)
as a filter. The number of entity entries should be equal to the number of the attributes. - Takes each entity and for each attribute, check if the condition holds.
The first approach uses a GROUP BY
, the second one uses EXISTS
.
Let's create a sample table and see which one is more efficient:
Table creation details
Table t_product
contains 50,000 products; table t_option
(an EAV table) contains the values of 40 attributes for each products, randomly filled.
Assume we got a complex query involving attributes from 1 to 6. The value of attribute 1 should be from 0 to 100; that of attribute 2 should be from 10 to 90, etc., finally, the value of the attribute 6 should be from 50 to 50.
GROUP BY
Using the first approach, we should find all product_id's in the EAV table (t_option) that satisfy the ranges for each attribute, then aggregate these products and filter those whose COUNT(*)
is 6. This will mean that the product was found in all 6 ranges and hence satisfies all conditions.
Here's the query to do that:
SELECT COUNT(*) FROM ( SELECT product_id FROM ( SELECT product_id FROM ( SELECT 1 AS opt, 0 AS l, 100 AS h UNION ALL SELECT 2 AS opt, 10 AS l, 90 AS h UNION ALL SELECT 3 AS opt, 20 AS l, 80 AS h UNION ALL SELECT 4 AS opt, 30 AS l, 70 AS h UNION ALL SELECT 5 AS opt, 40 AS l, 60 AS h UNION ALL SELECT 6 AS opt, 50 AS l, 50 AS h ) v JOIN t_option o ON o.option_id >= opt AND o.option_id <= opt AND o.value BETWEEN l AND h ) o GROUP BY o.product_id HAVING COUNT(*) = 6 ) q;
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.3291s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | Select tables optimized away | ||||||||
2 | DERIVED | <derived3> | ALL | 152657 | 100.00 | Using temporary; Using filesort | ||||
3 | DERIVED | <derived4> | ALL | 6 | 100.00 | |||||
3 | DERIVED | o | range | ix_option_o_v | ix_option_o_v | 8 | 1090 | 183532.11 | Range checked for each record (index map: 0x2) | |
4 | DERIVED | No tables used | ||||||||
5 | UNION | No tables used | ||||||||
6 | UNION | No tables used | ||||||||
7 | UNION | No tables used | ||||||||
8 | UNION | No tables used | ||||||||
9 | UNION | No tables used | ||||||||
UNION RESULT | <union4,5,6,7,8,9> | ALL |
select count(0) AS `COUNT(*)` from (select `o`.`product_id` AS `product_id` from (select `20100402_options`.`o`.`product_id` AS `product_id` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h` union all select 6 AS `opt`,50 AS `l`,50 AS `h`) `v` join `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`option_id` >= `v`.`opt`) and (`20100402_options`.`o`.`option_id` <= `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`))) `o` group by `o`.`product_id` having (count(0) = 6)) `q`
Note that there are two quite counterintuitive things in the query:
o.option_id >= opt AND o.option_id <= opt
instead of mereo.option_id = opt
GROUP BY
is applied to a nested query rather than being put immediately after theWHERE
clause
As many readers of my blog will remember, these tricks are intended to make MySQL use Range checked for each record
that we can spot in the plan. If not for these tricks, MySQL would use an index scan on only a part of the composite index on (attribute, value, product)
, since the joins on the mixed equality/range conditions are not optimized well in the current releases (at least up to 5.1.42).
This query is quite fast: only 320 ms.
NOT EXISTS
Using this approach, the engine takes each product and checks if all of the relevant attributes satisfy the conditions. As soon as the first attribute failing the check is found, the condition is considered FALSE and further evaluation ceases.
This approach does not use aggregation (which is good), but it needs to browse all products and do random index seeks for each of the products (which is bad).
Here's the query:
SELECT COUNT(*) FROM t_product p WHERE NOT EXISTS ( SELECT NULL FROM ( SELECT 1 AS opt, 0 AS l, 100 AS h UNION ALL SELECT 2 AS opt, 10 AS l, 90 AS h UNION ALL SELECT 3 AS opt, 20 AS l, 80 AS h UNION ALL SELECT 4 AS opt, 30 AS l, 70 AS h UNION ALL SELECT 5 AS opt, 40 AS l, 60 AS h UNION ALL SELECT 6 AS opt, 50 AS l, 50 AS h ) v WHERE NOT EXISTS ( SELECT NULL FROM t_option o WHERE o.product_id = p.id AND o.option_id = opt AND o.value BETWEEN l AND h ) );
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.8438s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | p | index | PRIMARY | 4 | 50115 | 100.00 | Using where; Using index | ||
2 | DEPENDENT SUBQUERY | <derived3> | ALL | 6 | 100.00 | Using where | ||||
9 | DEPENDENT SUBQUERY | o | eq_ref | PRIMARY,ix_option_o_v | PRIMARY | 8 | 20100402_options.p.id,v.opt | 1 | 100.00 | Using where |
3 | DERIVED | No tables used | ||||||||
4 | UNION | No tables used | ||||||||
5 | UNION | No tables used | ||||||||
6 | UNION | No tables used | ||||||||
7 | UNION | No tables used | ||||||||
8 | UNION | No tables used | ||||||||
UNION RESULT | <union3,4,5,6,7,8> | ALL |
Field or reference '20100402_options.p.id' of SELECT #9 was resolved in SELECT #1 Field or reference 'v.opt' of SELECT #9 was resolved in SELECT #2 Field or reference 'v.l' of SELECT #9 was resolved in SELECT #2 Field or reference 'v.h' of SELECT #9 was resolved in SELECT #2 select count(0) AS `COUNT(*)` from `20100402_options`.`t_product` `p` where (not(exists(select NULL AS `NULL` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h` union all select 6 AS `opt`,50 AS `l`,50 AS `h`) `v` where (not(exists(select NULL AS `NULL` from `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`product_id` = `20100402_options`.`p`.`id`) and (`20100402_options`.`o`.`option_id` = `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`))))))))
This query runs for 843 ms, or three times as long as its GROUP BY
counterpart.
GROUP BY version seems to be the most efficient, but can we optimize it somehow?
Mixing two approaches
The main problem with the GROUP BY
approach is, well, GROUP BY
. In MySQL, it requires sorting or materialization (or both) which are quite expensive operations. Also, the recordset for each of the attributes is selected independently: no filter on one of the attributes affects the other ones. The filtering is only done after the aggregation.
The main problem with the NOT EXISTS
approach is that every product should be checked for the conditions, though only a tiny fraction of all products satisfies at least several of them. Checking for conditions also requires index seeks in the blocks that are quite far away from each other which is not good for performance (especially if the index does not fit completely into the cache).
However, we can mix the two approaches.
Instead of using the product table as a source for all possible entities to check against the EAV table, we can take the condition on only one attribute that is the least likely to be satisfied.
Since entity/attribute pair form a natural primary key in the EAV table, filtering on an attribute is guaranteed to result in a set of unique entities. Each of these entities should then be further checked using the NOT EXISTS
approach, but this time there are much fewer checks that need to be made, since with a properly chosen primary condition (which should be the most strong one) most of the non-matching entities are already filtered out.
This approach, on the one hand, does not require aggregation; on the other hand, the conditions are sieved rather then added up.
Let's build the query:
SELECT COUNT(*) FROM t_option oo WHERE oo.option_id = 6 AND oo.value BETWEEN 50 AND 50 AND NOT EXISTS ( SELECT NULL FROM ( SELECT 1 AS opt, 0 AS l, 100 AS h UNION ALL SELECT 2 AS opt, 10 AS l, 90 AS h UNION ALL SELECT 3 AS opt, 20 AS l, 80 AS h UNION ALL SELECT 4 AS opt, 30 AS l, 70 AS h UNION ALL SELECT 5 AS opt, 40 AS l, 60 AS h ) v WHERE NOT EXISTS ( SELECT NULL FROM t_option o WHERE o.product_id = oo.product_id AND o.option_id = opt AND o.value BETWEEN l AND h ) );
COUNT(*) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 row fetched in 0.0001s (0.0105s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | oo | ref | ix_option_o_v | ix_option_o_v | 8 | const,const | 1090 | 100.00 | Using where; Using index |
2 | DEPENDENT SUBQUERY | <derived3> | ALL | 5 | 100.00 | Using where | ||||
8 | DEPENDENT SUBQUERY | o | eq_ref | PRIMARY,ix_option_o_v | PRIMARY | 8 | 20100402_options.oo.product_id,v.opt | 1 | 100.00 | Using where |
3 | DERIVED | No tables used | ||||||||
4 | UNION | No tables used | ||||||||
5 | UNION | No tables used | ||||||||
6 | UNION | No tables used | ||||||||
7 | UNION | No tables used | ||||||||
UNION RESULT | <union3,4,5,6,7> | ALL |
Field or reference '20100402_options.oo.product_id' of SELECT #8 was resolved in SELECT #1 Field or reference 'v.opt' of SELECT #8 was resolved in SELECT #2 Field or reference 'v.l' of SELECT #8 was resolved in SELECT #2 Field or reference 'v.h' of SELECT #8 was resolved in SELECT #2 select count(0) AS `COUNT(*)` from `20100402_options`.`t_option` `oo` where ((`20100402_options`.`oo`.`option_id` = 6) and (`20100402_options`.`oo`.`value` between 50 and 50) and (not(exists(select NULL AS `NULL` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h`) `v` where (not(exists(select NULL AS `NULL` from `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`product_id` = `20100402_options`.`oo`.`product_id`) and (`20100402_options`.`o`.`option_id` = `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`)))))))))
As we can see, using the most selective condition as an initial row source for the further checks greatly improved the query speed.
The query time is now only 10 ms, or 30 times as fast as the GROUP BY
approach.
Hope that helps.
I'm always glad to answer the questions regarding database queries.
Thank you very much for the article! But what if i want to select several ranges for 1 option? For example i want to select option 1 from 10 to 10 OR from 14 to 14 OR from 22 to 22. If any of given ranges is found, consider it a valid selection. In real life i mean there are for example colors: white, black, green, orange, yellow. I want to select all products having white OR black OR orange color. Color here is an option and option id’s are for example 10 (white), 14 (black), 22 (orange). So i mean i don’t want to select from 10 to 22. I want specific numbers.
phpcoder
28 Oct 12 at 18:15
@phpcoder: if you want several values for a single option, just use this:
Quassnoi
28 Oct 12 at 19:30
Quassnoi, thanks for your reply.
My bad, i wasn’t clear on what i want. I meant – is it possible to adapt your optimized query to select product ids in such way that i can specify (for example) that option 1 will be in range from 10 to 10 OR in range from 20 to 25 OR in range from 30 to 30 (several ranges, if found at least in one of them – count that product), while leaving other options as they are (or making some other options with several ranges, too).
For example if i have product with different colors (white, black, orange) AND it’s size is `XXL` (option_id=2, value=100) and it’s season is `summer` (option_id=3, value = 200) then i want mysql query to give me back any product with color of white,black or orange AND it’s size is `XXL` AND it’s season is `summer`. So if it’s available in black and it’s `XXL` and it’s `summer` then it’s counted.
On a website it could look something like this: user picks several colors + he selects size + season. Then i want to show him how many products left with given filters (for example count products in all categories having filtered attributes).
Currently query works great if i want option1 in ONE range AND option2 in only ONE range AND … etc
I think something like this could work, but i only understand how to apply it to your first (GROUP BY) example:
SELECT COUNT(*)
FROM (
SELECT product_id
FROM (
SELECT product_id
FROM (
SELECT 1 AS opt, 1 AS l, 1 AS h
UNION ALL
SELECT 1 AS opt, 3 AS l, 3 AS h
UNION ALL
SELECT 5 AS opt, 17 AS l, 17 AS h
UNION ALL
SELECT 6 AS opt, 22 AS l, 22 AS h
) v
JOIN article_attributes o
ON o.attribute_id >= opt
AND o.attribute_id <= opt
AND o.attribute_value BETWEEN l AND h
GROUP BY product_id, attribute_id
) o
GROUP BY
o.product_id
HAVING COUNT(*) = 3
) q;
— here for example we can assume that there is green color (option=1, value=1) OR purple (option=1, value=3) color AND size `XL` (option=5, value=17)
— AND season = `winter` (option=6, value=22)
— So if we `GROUP BY product_id, attribute_id` in subquery, it will count correctly, even if i want at least color1 OR at least color2 for the product_id
The only question is how to apply it to optimized query? (i'm not a pro, so maybe i don't fully understand some details, but i try to get it)
phpcoder
28 Oct 12 at 22:02
@phpcoder: got it. If you want a “sieve” solution you will need to throw in another
EXISTS
.Will try to cover it soon, watch this space.
Quassnoi
29 Oct 12 at 22:55
Thank you! I’m dying to see it :)
phpcoder
30 Oct 12 at 00:16