Finding incomplete orders
Comments enabled. I *really* need your comment
Imagine we are keeping an online shop and want to find the customers that don't have complete orders.
We'll make the structure of the orders a little bit complex:
- Each customer may have a number of baskets
- Each basket will have a list of positions in it
- Each position has a number of discounts
- An order is considered complete, when all entites are present: there is at least one basket, all baskets should have at least one position, and each position should have at least one discount
We will keep the data in four tables as following:
Table creation details
Let's try to find all customers that have at least one incomplete order.
An incomplete order means that either there are no basket, or there are no positions in the basket, or there are no discounts for the position.
In our sample data, customer 1 has no baskets, customer 2 has no positions, and customer 3 has no discounts.
Having no baskets or no positions in a basket naturally also means not having any discounts. That's why we can utilize an outer join to find the missing rows, and filter out the discounts with a NULL id
:
SELECT DISTINCT cmr_id FROM t_customer LEFT OUTER JOIN t_basket ON bsk_customer = cmr_id LEFT OUTER JOIN t_position ON pos_basket = bsk_id LEFT OUTER JOIN t_discount ON dsc_position = pos_id WHERE dsc_id IS NULL
cmr_id | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 rows fetched in 0,0012s (3,2659s) |
This runs for more than 3 seconds.
The query joins all four tables, selects all rows from them and finally filters out the rows that have a NULL dsc_id
.
We have more than 100,000 rows in the largest tables, that's why these joins are quite costly.
We can reduce scanning: just find a first matching row for each customer that has a NULL dsc_id
and then go to another customer instead of scanning all other baskets, positions and discounts for this one.
This may be achieved by using NOT EXISTS
:
SELECT cmr_id FROM t_customer WHERE NOT EXISTS ( SELECT 1 FROM t_basket, t_position, t_discount WHERE bsk_customer = cmr_id AND pos_basket = bsk_id AND dsc_position = pos_id )
cmr_id | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 rows fetched in 0,0012s (0,1618s) |
This is much faster.
Now let's change the rules a little. We will consider an order complete only if it has at least two discounts for each position. In our sample data, customer 4 has exactly 1 discount for each position.
This becomes complicated now. We cannot use NOT EXISTS
anymore, as it will fail if exactly one discount exists, which is not what we need.
We can of course filter out the positions having less than two discounts using the following query:
SELECT cmr_id FROM t_customer WHERE NOT EXISTS ( SELECT 1 FROM t_basket LEFT OUTER JOIN t_position ON pos_basket = bsk_id LEFT OUTER JOIN t_discount ON dsc_position = pos_id WHERE bsk_customer = cmr_id GROUP BY bsk_id, pos_id HAVING COUNT(*) > 1 )
cmr_id | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 rows fetched in 0,0012s (5,6372s) |
, which will give us the result we're after. But this query is quite slow.
How can we optimize it?
MySQL uses NESTED LOOPS
to perform joins, with t_customer
as a leading table. To calculate a COUNT(*)
, it first need to join all four tables, and then filter out the rows that have COUNT(*) >= 2
.
But t_discount
actually has all the information we need for filtering. We may first perform the filtering, and then join the other tables with the filtered results. Thus we can get rid of the unneeded joins.
So we'll make our query perform the following steps:
- We find all customers that don't have discounts at all. These customers don't have any records in the t_discount, but, fortunately, we already know how to filter them out fast
- We find all customers that have positions with exactly one discount. To do this, we select all such positions from t_discount by using COUNT(*), and perform the joins with the other tables. Joins will be performed after fitering.
- Finally, we UNION the results of the queries above
Let's see how it works:
SELECT cmr_id FROM t_customer WHERE NOT EXISTS ( SELECT 1 FROM t_basket, t_position, t_discount WHERE bsk_customer = cmr_id AND pos_basket = bsk_id AND dsc_position = pos_id ) UNION SELECT DISTINCT bsk_customer FROM ( SELECT dsc_position FROM t_discount GROUP BY dsc_position HAVING COUNT(*) < 2 ) dd, t_position, t_basket WHERE pos_id = dsc_position AND bsk_id = pos_basket
cmr_id | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 rows fetched in 0,0012s (0,7321s) |
Much faster.