EXPLAIN EXTENDED

How to create fast database queries

Multiple attributes in a EAV table: GROUP BY vs. NOT EXISTS

with 5 comments

Answering questions asked on the site.

Andrew Stillard asks:

I have a store which will hold around 50,000 products in a products table. Each product will have 14 options, giving 700,000 options in total. These are held in an options table which is joined via the product id.

Users search for products based on the options via an Advanced Search menu.

The users need to be able to select multiple options upon which to query. I would normally use a JOIN if it was just the one option to select upon, but because its a variable number i thought it would be best to loop through the WHERE EXISTS statement.

The issue i have currently is that the query is taking a minimum of 18 seconds (And that was a query when the tables only had a fraction of the total products in).
If you can help us speed this up, or suggest an alternative idea that would be greatly appreciated.

The option table mentioned here is in fact an implementation of the EAV model in a relational database.

Each record basically contains 3 things: id of the product it describes; id of the option and the value of the given option for the given product. These fields represent entity, attribute and value, respectively.

This model is very easy to maintain and expand should the need arise: all we have to do to define an extra attribute is to add a record to the EAV table with the name and the value of the attribute. This is a DML operation rather than a DDL one.

However, this model has a serious drawback: we cannot efficiently search for two or more options at once. An index can only be defined on several fields from a single record, so we can only search for a single option using an index.

There are two approaches to writing a query which would search for the entities with the certain conditions on several attributes at once:

  1. For each attribute, find all entities for which the conditions on the given attribute hold, then aggregate the resulting entities, using COUNT(*) as a filter. The number of entity entries should be equal to the number of the attributes.
  2. Takes each entity and for each attribute, check if the condition holds.

The first approach uses a GROUP BY, the second one uses EXISTS.

Let's create a sample table and see which one is more efficient:

Table creation details

Table t_product contains 50,000 products; table t_option (an EAV table) contains the values of 40 attributes for each products, randomly filled.

Assume we got a complex query involving attributes from 1 to 6. The value of attribute 1 should be from 0 to 100; that of attribute 2 should be from 10 to 90, etc., finally, the value of the attribute 6 should be from 50 to 50.

GROUP BY

Using the first approach, we should find all product_id's in the EAV table (t_option) that satisfy the ranges for each attribute, then aggregate these products and filter those whose COUNT(*) is 6. This will mean that the product was found in all 6 ranges and hence satisfies all conditions.

Here's the query to do that:

SELECT  COUNT(*)
FROM    (
SELECT  product_id
FROM    (
SELECT  product_id
FROM    (
SELECT  1 AS opt, 0 AS l, 100 AS h
UNION ALL
SELECT  2 AS opt, 10 AS l, 90 AS h
UNION ALL
SELECT  3 AS opt, 20 AS l, 80 AS h
UNION ALL
SELECT  4 AS opt, 30 AS l, 70 AS h
UNION ALL
SELECT  5 AS opt, 40 AS l, 60 AS h
UNION ALL
SELECT  6 AS opt, 50 AS l, 50 AS h
) v
JOIN    t_option o
ON      o.option_id >= opt
AND o.option_id <= opt
                        AND o.value BETWEEN l AND h
                ) o
        GROUP BY
                o.product_id
        HAVING  COUNT(*) = 6
        ) q;

COUNT(*)
22
1 row fetched in 0.0001s (0.3291s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY Select tables optimized away
2 DERIVED <derived3> ALL 152657 100.00 Using temporary; Using filesort
3 DERIVED <derived4> ALL 6 100.00
3 DERIVED o range ix_option_o_v ix_option_o_v 8 1090 183532.11 Range checked for each record (index map: 0x2)
4 DERIVED No tables used
5 UNION No tables used
6 UNION No tables used
7 UNION No tables used
8 UNION No tables used
9 UNION No tables used
UNION RESULT <union4,5,6,7,8,9> ALL
select count(0) AS `COUNT(*)` from (select `o`.`product_id` AS `product_id` from (select `20100402_options`.`o`.`product_id` AS `product_id` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h` union all select 6 AS `opt`,50 AS `l`,50 AS `h`) `v` join `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`option_id` >= `v`.`opt`) and (`20100402_options`.`o`.`option_id` <= `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`))) `o` group by `o`.`product_id` having (count(0) = 6)) `q`

Note that there are two quite counterintuitive things in the query:

  • o.option_id >= opt AND o.option_id <= opt instead of mere o.option_id = opt
  • GROUP BY is applied to a nested query rather than being put immediately after the WHERE clause

As many readers of my blog will remember, these tricks are intended to make MySQL use Range checked for each record that we can spot in the plan. If not for these tricks, MySQL would use an index scan on only a part of the composite index on (attribute, value, product), since the joins on the mixed equality/range conditions are not optimized well in the current releases (at least up to 5.1.42).

This query is quite fast: only 320 ms.

NOT EXISTS

Using this approach, the engine takes each product and checks if all of the relevant attributes satisfy the conditions. As soon as the first attribute failing the check is found, the condition is considered FALSE and further evaluation ceases.

This approach does not use aggregation (which is good), but it needs to browse all products and do random index seeks for each of the products (which is bad).

Here's the query:

SELECT  COUNT(*)
FROM    t_product p
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    (
SELECT  1 AS opt, 0 AS l, 100 AS h
UNION ALL
SELECT  2 AS opt, 10 AS l, 90 AS h
UNION ALL
SELECT  3 AS opt, 20 AS l, 80 AS h
UNION ALL
SELECT  4 AS opt, 30 AS l, 70 AS h
UNION ALL
SELECT  5 AS opt, 40 AS l, 60 AS h
UNION ALL
SELECT  6 AS opt, 50 AS l, 50 AS h
) v
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    t_option o
WHERE   o.product_id = p.id
AND o.option_id = opt
AND o.value BETWEEN l AND h
)
);
COUNT(*)
22
1 row fetched in 0.0001s (0.8438s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY p index PRIMARY 4 50115 100.00 Using where; Using index
2 DEPENDENT SUBQUERY <derived3> ALL 6 100.00 Using where
9 DEPENDENT SUBQUERY o eq_ref PRIMARY,ix_option_o_v PRIMARY 8 20100402_options.p.id,v.opt 1 100.00 Using where
3 DERIVED No tables used
4 UNION No tables used
5 UNION No tables used
6 UNION No tables used
7 UNION No tables used
8 UNION No tables used
UNION RESULT <union3,4,5,6,7,8> ALL
Field or reference '20100402_options.p.id' of SELECT #9 was resolved in SELECT #1
Field or reference 'v.opt' of SELECT #9 was resolved in SELECT #2
Field or reference 'v.l' of SELECT #9 was resolved in SELECT #2
Field or reference 'v.h' of SELECT #9 was resolved in SELECT #2
select count(0) AS `COUNT(*)` from `20100402_options`.`t_product` `p` where (not(exists(select NULL AS `NULL` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h` union all select 6 AS `opt`,50 AS `l`,50 AS `h`) `v` where (not(exists(select NULL AS `NULL` from `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`product_id` = `20100402_options`.`p`.`id`) and (`20100402_options`.`o`.`option_id` = `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`))))))))

This query runs for 843 ms, or three times as long as its GROUP BY counterpart.

GROUP BY version seems to be the most efficient, but can we optimize it somehow?

Mixing two approaches

The main problem with the GROUP BY approach is, well, GROUP BY. In MySQL, it requires sorting or materialization (or both) which are quite expensive operations. Also, the recordset for each of the attributes is selected independently: no filter on one of the attributes affects the other ones. The filtering is only done after the aggregation.

The main problem with the NOT EXISTS approach is that every product should be checked for the conditions, though only a tiny fraction of all products satisfies at least several of them. Checking for conditions also requires index seeks in the blocks that are quite far away from each other which is not good for performance (especially if the index does not fit completely into the cache).

However, we can mix the two approaches.

Instead of using the product table as a source for all possible entities to check against the EAV table, we can take the condition on only one attribute that is the least likely to be satisfied.

Since entity/attribute pair form a natural primary key in the EAV table, filtering on an attribute is guaranteed to result in a set of unique entities. Each of these entities should then be further checked using the NOT EXISTS approach, but this time there are much fewer checks that need to be made, since with a properly chosen primary condition (which should be the most strong one) most of the non-matching entities are already filtered out.

This approach, on the one hand, does not require aggregation; on the other hand, the conditions are sieved rather then added up.

Let's build the query:

SELECT  COUNT(*)
FROM    t_option oo
WHERE   oo.option_id = 6
AND oo.value BETWEEN 50 AND 50
AND NOT EXISTS
(
SELECT  NULL
FROM    (
SELECT  1 AS opt, 0 AS l, 100 AS h
UNION ALL
SELECT  2 AS opt, 10 AS l, 90 AS h
UNION ALL
SELECT  3 AS opt, 20 AS l, 80 AS h
UNION ALL
SELECT  4 AS opt, 30 AS l, 70 AS h
UNION ALL
SELECT  5 AS opt, 40 AS l, 60 AS h
) v
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    t_option o
WHERE   o.product_id = oo.product_id
AND o.option_id = opt
AND o.value BETWEEN l AND h
)
);

COUNT(*)
22
1 row fetched in 0.0001s (0.0105s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY oo ref ix_option_o_v ix_option_o_v 8 const,const 1090 100.00 Using where; Using index
2 DEPENDENT SUBQUERY <derived3> ALL 5 100.00 Using where
8 DEPENDENT SUBQUERY o eq_ref PRIMARY,ix_option_o_v PRIMARY 8 20100402_options.oo.product_id,v.opt 1 100.00 Using where
3 DERIVED No tables used
4 UNION No tables used
5 UNION No tables used
6 UNION No tables used
7 UNION No tables used
UNION RESULT <union3,4,5,6,7> ALL
Field or reference '20100402_options.oo.product_id' of SELECT #8 was resolved in SELECT #1
Field or reference 'v.opt' of SELECT #8 was resolved in SELECT #2
Field or reference 'v.l' of SELECT #8 was resolved in SELECT #2
Field or reference 'v.h' of SELECT #8 was resolved in SELECT #2
select count(0) AS `COUNT(*)` from `20100402_options`.`t_option` `oo` where ((`20100402_options`.`oo`.`option_id` = 6) and (`20100402_options`.`oo`.`value` between 50 and 50) and (not(exists(select NULL AS `NULL` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h`) `v` where (not(exists(select NULL AS `NULL` from `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`product_id` = `20100402_options`.`oo`.`product_id`) and (`20100402_options`.`o`.`option_id` = `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`)))))))))

As we can see, using the most selective condition as an initial row source for the further checks greatly improved the query speed.

The query time is now only 10 ms, or 30 times as fast as the GROUP BY approach.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

April 2nd, 2010 at 11:00 pm

Posted in MySQL

5 Responses to 'Multiple attributes in a EAV table: GROUP BY vs. NOT EXISTS'

Subscribe to comments with RSS

  1. Thank you very much for the article! But what if i want to select several ranges for 1 option? For example i want to select option 1 from 10 to 10 OR from 14 to 14 OR from 22 to 22. If any of given ranges is found, consider it a valid selection. In real life i mean there are for example colors: white, black, green, orange, yellow. I want to select all products having white OR black OR orange color. Color here is an option and option id’s are for example 10 (white), 14 (black), 22 (orange). So i mean i don’t want to select from 10 to 22. I want specific numbers.

    phpcoder

    28 Oct 12 at 18:15

  2. @phpcoder: if you want several values for a single option, just use this:

    SELECT  product_id
    FROM    t_option
    WHERE   option_id = 1 -- color
            AND value IN (10, 14, 22) -- white, black, orange
    

    Quassnoi

    28 Oct 12 at 19:30

  3. Quassnoi, thanks for your reply.

    My bad, i wasn’t clear on what i want. I meant – is it possible to adapt your optimized query to select product ids in such way that i can specify (for example) that option 1 will be in range from 10 to 10 OR in range from 20 to 25 OR in range from 30 to 30 (several ranges, if found at least in one of them – count that product), while leaving other options as they are (or making some other options with several ranges, too).

    For example if i have product with different colors (white, black, orange) AND it’s size is `XXL` (option_id=2, value=100) and it’s season is `summer` (option_id=3, value = 200) then i want mysql query to give me back any product with color of white,black or orange AND it’s size is `XXL` AND it’s season is `summer`. So if it’s available in black and it’s `XXL` and it’s `summer` then it’s counted.

    On a website it could look something like this: user picks several colors + he selects size + season. Then i want to show him how many products left with given filters (for example count products in all categories having filtered attributes).

    Currently query works great if i want option1 in ONE range AND option2 in only ONE range AND … etc

    I think something like this could work, but i only understand how to apply it to your first (GROUP BY) example:

    SELECT COUNT(*)
    FROM (
    SELECT product_id
    FROM (
    SELECT product_id
    FROM (
    SELECT 1 AS opt, 1 AS l, 1 AS h
    UNION ALL
    SELECT 1 AS opt, 3 AS l, 3 AS h
    UNION ALL
    SELECT 5 AS opt, 17 AS l, 17 AS h
    UNION ALL
    SELECT 6 AS opt, 22 AS l, 22 AS h
    ) v
    JOIN article_attributes o
    ON o.attribute_id >= opt
    AND o.attribute_id <= opt
    AND o.attribute_value BETWEEN l AND h
    GROUP BY product_id, attribute_id
    ) o
    GROUP BY
    o.product_id
    HAVING COUNT(*) = 3
    ) q;

    — here for example we can assume that there is green color (option=1, value=1) OR purple (option=1, value=3) color AND size `XL` (option=5, value=17)
    — AND season = `winter` (option=6, value=22)
    — So if we `GROUP BY product_id, attribute_id` in subquery, it will count correctly, even if i want at least color1 OR at least color2 for the product_id

    The only question is how to apply it to optimized query? (i'm not a pro, so maybe i don't fully understand some details, but i try to get it)

    phpcoder

    28 Oct 12 at 22:02

  4. @phpcoder: got it. If you want a “sieve” solution you will need to throw in another EXISTS.

    Will try to cover it soon, watch this space.

    Quassnoi

    29 Oct 12 at 22:55

  5. Thank you! I’m dying to see it :)

    phpcoder

    30 Oct 12 at 00:16

Leave a Reply