Multiple attributes in a EAV table: GROUP BY vs. NOT EXISTS at EXPLAIN EXTENDED

Multiple attributes in a EAV table: GROUP BY vs. NOT EXISTS

Answering questions asked on the site.

Andrew Stillard asks:

I have a store which will hold around 50,000 products in a products table. Each product will have 14 options, giving 700,000 options in total. These are held in an options table which is joined via the product id.

Users search for products based on the options via an Advanced Search menu.

The users need to be able to select multiple options upon which to query. I would normally use a JOIN if it was just the one option to select upon, but because its a variable number i thought it would be best to loop through the WHERE EXISTS statement.

The issue i have currently is that the query is taking a minimum of 18 seconds (And that was a query when the tables only had a fraction of the total products in).
If you can help us speed this up, or suggest an alternative idea that would be greatly appreciated.

The option table mentioned here is in fact an implementation of the EAV model in a relational database.

Each record basically contains 3 things: id of the product it describes; id of the option and the value of the given option for the given product. These fields represent entity, attribute and value, respectively.

This model is very easy to maintain and expand should the need arise: all we have to do to define an extra attribute is to add a record to the EAV table with the name and the value of the attribute. This is a DML operation rather than a DDL one.

However, this model has a serious drawback: we cannot efficiently search for two or more options at once. An index can only be defined on several fields from a single record, so we can only search for a single option using an index.

There are two approaches to writing a query which would search for the entities with the certain conditions on several attributes at once:

For each attribute, find all entities for which the conditions on the given attribute hold, then aggregate the resulting entities, using COUNT(*) as a filter. The number of entity entries should be equal to the number of the attributes.
Takes each entity and for each attribute, check if the condition holds.

The first approach uses a GROUP BY, the second one uses EXISTS.

Let's create a sample table and see which one is more efficient:

Table creation details

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_product (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE t_option (
product_id INT NOT NULL,
option_id INT NOT NULL,
value INT NOT NULL,
PRIMARY KEY pk_option_o_p (product_id, option_id),
KEY ix_option_o_v (option_id, value)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt &lt;= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(500000);
COMMIT;

INSERT
INTO    t_product
SELECT  id, CONCAT('Product ', id)
FROM    filler
ORDER BY
        id
LIMIT 50000;

INSERT
INTO    t_option
SELECT  p.id, f.id, CEILING(RAND(20100402 &lt;&lt; 1) * 100)
FROM    t_product p
CROSS JOIN
        (
        SELECT  id
        FROM    filler
        ORDER BY
                id
        LIMIT 40
        ) f;

Table t_product contains 50,000 products; table t_option (an EAV table) contains the values of 40 attributes for each products, randomly filled.

Assume we got a complex query involving attributes from 1 to 6. The value of attribute 1 should be from 0 to 100; that of attribute 2 should be from 10 to 90, etc., finally, the value of the attribute 6 should be from 50 to 50.

GROUP BY

Using the first approach, we should find all product_id's in the EAV table (t_option) that satisfy the ranges for each attribute, then aggregate these products and filter those whose COUNT(*) is 6. This will mean that the product was found in all 6 ranges and hence satisfies all conditions.

Here's the query to do that:

SELECT  COUNT(*)
FROM    (
SELECT  product_id
FROM    (
SELECT  product_id
FROM    (
SELECT  1 AS opt, 0 AS l, 100 AS h
UNION ALL
SELECT  2 AS opt, 10 AS l, 90 AS h
UNION ALL
SELECT  3 AS opt, 20 AS l, 80 AS h
UNION ALL
SELECT  4 AS opt, 30 AS l, 70 AS h
UNION ALL
SELECT  5 AS opt, 40 AS l, 60 AS h
UNION ALL
SELECT  6 AS opt, 50 AS l, 50 AS h
) v
JOIN    t_option o
ON      o.option_id &gt;= opt
AND o.option_id &lt;= opt
                        AND o.value BETWEEN l AND h
                ) o
        GROUP BY
                o.product_id
        HAVING  COUNT(*) = 6
        ) q;

COUNT(*)
22
1 row fetched in 0.0001s (0.3291s)

id	select_type	table	type	possible_keys	key	key_len	rows	filtered	Extra
1	PRIMARY								Select tables optimized away
2	DERIVED	<derived3>	ALL				152657	100.00	Using temporary; Using filesort
3	DERIVED	<derived4>	ALL				6	100.00
3	DERIVED	o	range	ix_option_o_v	ix_option_o_v	8	1090	183532.11	Range checked for each record (index map: 0x2)
4	DERIVED								No tables used
5	UNION								No tables used
6	UNION								No tables used
7	UNION								No tables used
8	UNION								No tables used
9	UNION								No tables used
	UNION RESULT	<union4,5,6,7,8,9>	ALL

select count(0) AS `COUNT(*)` from (select `o`.`product_id` AS `product_id` from (select `20100402_options`.`o`.`product_id` AS `product_id` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h` union all select 6 AS `opt`,50 AS `l`,50 AS `h`) `v` join `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`option_id` >= `v`.`opt`) and (`20100402_options`.`o`.`option_id` <= `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`))) `o` group by `o`.`product_id` having (count(0) = 6)) `q`

Note that there are two quite counterintuitive things in the query:

o.option_id >= opt AND o.option_id <= opt instead of mere o.option_id = opt
GROUP BY is applied to a nested query rather than being put immediately after the WHERE clause

As many readers of my blog will remember, these tricks are intended to make MySQL use Range checked for each record that we can spot in the plan. If not for these tricks, MySQL would use an index scan on only a part of the composite index on (attribute, value, product), since the joins on the mixed equality/range conditions are not optimized well in the current releases (at least up to 5.1.42).

This query is quite fast: only 320 ms.

NOT EXISTS

Using this approach, the engine takes each product and checks if all of the relevant attributes satisfy the conditions. As soon as the first attribute failing the check is found, the condition is considered FALSE and further evaluation ceases.

This approach does not use aggregation (which is good), but it needs to browse all products and do random index seeks for each of the products (which is bad).

Here's the query:

SELECT  COUNT(*)
FROM    t_product p
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    (
SELECT  1 AS opt, 0 AS l, 100 AS h
UNION ALL
SELECT  2 AS opt, 10 AS l, 90 AS h
UNION ALL
SELECT  3 AS opt, 20 AS l, 80 AS h
UNION ALL
SELECT  4 AS opt, 30 AS l, 70 AS h
UNION ALL
SELECT  5 AS opt, 40 AS l, 60 AS h
UNION ALL
SELECT  6 AS opt, 50 AS l, 50 AS h
) v
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    t_option o
WHERE   o.product_id = p.id
AND o.option_id = opt
AND o.value BETWEEN l AND h
)
);

COUNT(*)
22
1 row fetched in 0.0001s (0.8438s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	p	index		PRIMARY	4		50115	100.00	Using where; Using index
2	DEPENDENT SUBQUERY	<derived3>	ALL					6	100.00	Using where
9	DEPENDENT SUBQUERY	o	eq_ref	PRIMARY,ix_option_o_v	PRIMARY	8	20100402_options.p.id,v.opt	1	100.00	Using where
3	DERIVED									No tables used
4	UNION									No tables used
5	UNION									No tables used
6	UNION									No tables used
7	UNION									No tables used
8	UNION									No tables used
	UNION RESULT	<union3,4,5,6,7,8>	ALL

Field or reference '20100402_options.p.id' of SELECT #9 was resolved in SELECT #1
Field or reference 'v.opt' of SELECT #9 was resolved in SELECT #2
Field or reference 'v.l' of SELECT #9 was resolved in SELECT #2
Field or reference 'v.h' of SELECT #9 was resolved in SELECT #2
select count(0) AS `COUNT(*)` from `20100402_options`.`t_product` `p` where (not(exists(select NULL AS `NULL` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h` union all select 6 AS `opt`,50 AS `l`,50 AS `h`) `v` where (not(exists(select NULL AS `NULL` from `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`product_id` = `20100402_options`.`p`.`id`) and (`20100402_options`.`o`.`option_id` = `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`))))))))

This query runs for 843 ms, or three times as long as its GROUP BY counterpart.

GROUP BY version seems to be the most efficient, but can we optimize it somehow?

Mixing two approaches

The main problem with the GROUP BY approach is, well, GROUP BY. In MySQL, it requires sorting or materialization (or both) which are quite expensive operations. Also, the recordset for each of the attributes is selected independently: no filter on one of the attributes affects the other ones. The filtering is only done after the aggregation.

The main problem with the NOT EXISTS approach is that every product should be checked for the conditions, though only a tiny fraction of all products satisfies at least several of them. Checking for conditions also requires index seeks in the blocks that are quite far away from each other which is not good for performance (especially if the index does not fit completely into the cache).

However, we can mix the two approaches.

Instead of using the product table as a source for all possible entities to check against the EAV table, we can take the condition on only one attribute that is the least likely to be satisfied.

Since entity/attribute pair form a natural primary key in the EAV table, filtering on an attribute is guaranteed to result in a set of unique entities. Each of these entities should then be further checked using the NOT EXISTS approach, but this time there are much fewer checks that need to be made, since with a properly chosen primary condition (which should be the most strong one) most of the non-matching entities are already filtered out.

This approach, on the one hand, does not require aggregation; on the other hand, the conditions are sieved rather then added up.

Let's build the query:

SELECT  COUNT(*)
FROM    t_option oo
WHERE   oo.option_id = 6
AND oo.value BETWEEN 50 AND 50
AND NOT EXISTS
(
SELECT  NULL
FROM    (
SELECT  1 AS opt, 0 AS l, 100 AS h
UNION ALL
SELECT  2 AS opt, 10 AS l, 90 AS h
UNION ALL
SELECT  3 AS opt, 20 AS l, 80 AS h
UNION ALL
SELECT  4 AS opt, 30 AS l, 70 AS h
UNION ALL
SELECT  5 AS opt, 40 AS l, 60 AS h
) v
WHERE   NOT EXISTS
(
SELECT  NULL
FROM    t_option o
WHERE   o.product_id = oo.product_id
AND o.option_id = opt
AND o.value BETWEEN l AND h
)
);

COUNT(*)
22
1 row fetched in 0.0001s (0.0105s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	oo	ref	ix_option_o_v	ix_option_o_v	8	const,const	1090	100.00	Using where; Using index
2	DEPENDENT SUBQUERY	<derived3>	ALL					5	100.00	Using where
8	DEPENDENT SUBQUERY	o	eq_ref	PRIMARY,ix_option_o_v	PRIMARY	8	20100402_options.oo.product_id,v.opt	1	100.00	Using where
3	DERIVED									No tables used
4	UNION									No tables used
5	UNION									No tables used
6	UNION									No tables used
7	UNION									No tables used
	UNION RESULT	<union3,4,5,6,7>	ALL

Field or reference '20100402_options.oo.product_id' of SELECT #8 was resolved in SELECT #1
Field or reference 'v.opt' of SELECT #8 was resolved in SELECT #2
Field or reference 'v.l' of SELECT #8 was resolved in SELECT #2
Field or reference 'v.h' of SELECT #8 was resolved in SELECT #2
select count(0) AS `COUNT(*)` from `20100402_options`.`t_option` `oo` where ((`20100402_options`.`oo`.`option_id` = 6) and (`20100402_options`.`oo`.`value` between 50 and 50) and (not(exists(select NULL AS `NULL` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h`) `v` where (not(exists(select NULL AS `NULL` from `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`product_id` = `20100402_options`.`oo`.`product_id`) and (`20100402_options`.`o`.`option_id` = `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`)))))))))

As we can see, using the most selective condition as an initial row source for the further checks greatly improved the query speed.

The query time is now only 10 ms, or 30 times as fast as the GROUP BY approach.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

April 2nd, 2010 at 11:00 pm

Posted in MySQL

5 Responses to 'Multiple attributes in a EAV table: GROUP BY vs. NOT EXISTS'

Subscribe to comments with RSS

Thank you very much for the article! But what if i want to select several ranges for 1 option? For example i want to select option 1 from 10 to 10 OR from 14 to 14 OR from 22 to 22. If any of given ranges is found, consider it a valid selection. In real life i mean there are for example colors: white, black, green, orange, yellow. I want to select all products having white OR black OR orange color. Color here is an option and option id’s are for example 10 (white), 14 (black), 22 (orange). So i mean i don’t want to select from 10 to 22. I want specific numbers.

phpcoder

28 Oct 12 at 18:15

@phpcoder: if you want several values for a single option, just use this:

SELECT  product_id
FROM    t_option
WHERE   option_id = 1 -- color
        AND value IN (10, 14, 22) -- white, black, orange

Quassnoi

28 Oct 12 at 19:30

Quassnoi, thanks for your reply.

My bad, i wasn’t clear on what i want. I meant – is it possible to adapt your optimized query to select product ids in such way that i can specify (for example) that option 1 will be in range from 10 to 10 OR in range from 20 to 25 OR in range from 30 to 30 (several ranges, if found at least in one of them – count that product), while leaving other options as they are (or making some other options with several ranges, too).

For example if i have product with different colors (white, black, orange) AND it’s size is `XXL` (option_id=2, value=100) and it’s season is `summer` (option_id=3, value = 200) then i want mysql query to give me back any product with color of white,black or orange AND it’s size is `XXL` AND it’s season is `summer`. So if it’s available in black and it’s `XXL` and it’s `summer` then it’s counted.

On a website it could look something like this: user picks several colors + he selects size + season. Then i want to show him how many products left with given filters (for example count products in all categories having filtered attributes).

Currently query works great if i want option1 in ONE range AND option2 in only ONE range AND … etc

I think something like this could work, but i only understand how to apply it to your first (GROUP BY) example:

SELECT COUNT(*)
FROM (
SELECT product_id
FROM (
SELECT product_id
FROM (
SELECT 1 AS opt, 1 AS l, 1 AS h
UNION ALL
SELECT 1 AS opt, 3 AS l, 3 AS h
UNION ALL
SELECT 5 AS opt, 17 AS l, 17 AS h
UNION ALL
SELECT 6 AS opt, 22 AS l, 22 AS h
) v
JOIN article_attributes o
ON o.attribute_id >= opt
AND o.attribute_id <= opt
AND o.attribute_value BETWEEN l AND h
GROUP BY product_id, attribute_id
) o
GROUP BY
o.product_id
HAVING COUNT(*) = 3
) q;

— here for example we can assume that there is green color (option=1, value=1) OR purple (option=1, value=3) color AND size `XL` (option=5, value=17)
— AND season = `winter` (option=6, value=22)
— So if we `GROUP BY product_id, attribute_id` in subquery, it will count correctly, even if i want at least color1 OR at least color2 for the product_id

The only question is how to apply it to optimized query? (i'm not a pro, so maybe i don't fully understand some details, but i try to get it)

phpcoder

28 Oct 12 at 22:02
@phpcoder: got it. If you want a “sieve” solution you will need to throw in another EXISTS.

Will try to cover it soon, watch this space.

Quassnoi

29 Oct 12 at 22:55
Thank you! I’m dying to see it :)

phpcoder

30 Oct 12 at 00:16

EXPLAIN EXTENDED