Finding incomplete orders

Comments enabled. I *really* need your comment

Imagine we are keeping an online shop and want to find the customers that don't have complete orders.

We'll make the structure of the orders a little bit complex:

Each customer may have a number of baskets
Each basket will have a list of positions in it
Each position has a number of discounts
An order is considered complete, when all entites are present: there is at least one basket, all baskets should have at least one position, and each position should have at least one discount

We will keep the data in four tables as following:

Table creation details

CREATE TABLE t_customer (
cmr_id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
cmr_name VARCHAR(50) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE t_basket (
bsk_id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
bsk_customer INT NOT NULL,
bsk_date DATETIME NOT NULL,
KEY ix_basket_customer (bsk_customer)
) ENGINE=InnoDB;

CREATE TABLE t_position (
pos_id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
pos_basket INT NOT NULL,
pos_item INT NOT NULL,
pos_quantity INT NOT NULL,
KEY ix_position_basket (pos_basket)
) ENGINE=InnoDB;

CREATE TABLE t_discount (
dsc_id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
dsc_position INT NOT NULL,
dsc_type INT NOT NULL,
dsc_value DOUBLE NOT NULL,
KEY ix_discount_position (dsc_position)
) ENGINE=InnoDB;

Let's try to find all customers that have at least one incomplete order.

An incomplete order means that either there are no basket, or there are no positions in the basket, or there are no discounts for the position.

In our sample data, customer 1 has no baskets, customer 2 has no positions, and customer 3 has no discounts.

Having no baskets or no positions in a basket naturally also means not having any discounts. That's why we can utilize an outer join to find the missing rows, and filter out the discounts with a NULL id:

SELECT  DISTINCT cmr_id
FROM    t_customer
LEFT OUTER JOIN
t_basket
ON      bsk_customer = cmr_id
LEFT OUTER JOIN
t_position
ON      pos_basket = bsk_id
LEFT OUTER JOIN
t_discount
ON      dsc_position = pos_id
WHERE   dsc_id IS NULL

cmr_id
1
2
3
3 rows fetched in 0,0012s (3,2659s)

This runs for more than 3 seconds.

The query joins all four tables, selects all rows from them and finally filters out the rows that have a NULL dsc_id.

We have more than 100,000 rows in the largest tables, that's why these joins are quite costly.

We can reduce scanning: just find a first matching row for each customer that has a NULL dsc_id and then go to another customer instead of scanning all other baskets, positions and discounts for this one.

This may be achieved by using NOT EXISTS:

SELECT  cmr_id
FROM    t_customer
WHERE   NOT EXISTS
(
SELECT  1
FROM    t_basket, t_position, t_discount
WHERE   bsk_customer = cmr_id
AND pos_basket = bsk_id
AND dsc_position = pos_id
)

cmr_id
1
2
3
3 rows fetched in 0,0012s (0,1618s)

This is much faster.

Now let's change the rules a little. We will consider an order complete only if it has at least two discounts for each position. In our sample data, customer 4 has exactly 1 discount for each position.

This becomes complicated now. We cannot use NOT EXISTS anymore, as it will fail if exactly one discount exists, which is not what we need.

We can of course filter out the positions having less than two discounts using the following query:

SELECT  cmr_id
FROM    t_customer
WHERE   NOT EXISTS
(
SELECT  1
FROM    t_basket
LEFT OUTER JOIN
t_position
ON      pos_basket = bsk_id
LEFT OUTER JOIN
t_discount
ON      dsc_position = pos_id
WHERE   bsk_customer = cmr_id
GROUP BY
bsk_id, pos_id
HAVING  COUNT(*) &gt; 1
)

cmr_id
1
2
3
4
4 rows fetched in 0,0012s (5,6372s)

, which will give us the result we're after. But this query is quite slow.

How can we optimize it?

MySQL uses NESTED LOOPS to perform joins, with t_customer as a leading table. To calculate a COUNT(*), it first need to join all four tables, and then filter out the rows that have COUNT(*) >= 2.

But t_discount actually has all the information we need for filtering. We may first perform the filtering, and then join the other tables with the filtered results. Thus we can get rid of the unneeded joins.

So we'll make our query perform the following steps:

We find all customers that don't have discounts at all. These customers don't have any records in the t_discount, but, fortunately, we already know how to filter them out fast
We find all customers that have positions with exactly one discount. To do this, we select all such positions from t_discount by using COUNT(*), and perform the joins with the other tables. Joins will be performed after fitering.
Finally, we UNION the results of the queries above

Let's see how it works:

SELECT  cmr_id
FROM    t_customer
WHERE   NOT EXISTS
(
SELECT  1
FROM    t_basket, t_position, t_discount
WHERE   bsk_customer = cmr_id
AND pos_basket = bsk_id
AND dsc_position = pos_id
)
UNION
SELECT  DISTINCT bsk_customer
FROM    (
SELECT  dsc_position
FROM    t_discount
GROUP BY
dsc_position
HAVING COUNT(*) &lt; 2
        ) dd, t_position, t_basket
WHERE   pos_id = dsc_position
        AND bsk_id = pos_basket

cmr_id
1
2
3
4
4 rows fetched in 0,0012s (0,7321s)

Much faster.

Written by Quassnoi

March 13th, 2009 at 11:00 pm

Posted in MySQL

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31

EXPLAIN EXTENDED