EXPLAIN EXTENDED

How to create fast database queries

My latest article on SQL in general: Things SQL needs: determining range cardinality. You're welcome to read and comment on it.

Counting skills

From Stack Overflow:

Ok, i’ve been trying to solve this for about 2 hours now… Please advise:

Tables:


PROFILE [id (int), name (varchar), ...]
SKILL [id (int), id_profile (int), id_app (int), lvl (int), ...]
APP [id (int), ...]

The lvl can basically go from 0 to 3.

I’m trying to get this particular stat: “What is the percentage of apps that is covered by at least two people having a skill of 2 or higher?”

Thanks a lot

We actually need to count ratio here:

(apps that have 2 or more skills with level 2+) / (total number of apps)

This is one of not so common cases when AVG aggregate apparently comes handy.

What we need here, is:

  1. Assign a 1 to apps that match the criteria above,
  2. Assign a 0 to apps that don’t match these, and
  3. Take the average.

First thought, of course, is to COUNT(*) all lvl 2+ skills for each app and filter them out:

SELECT  AVG(covered)
FROM    (
        SELECT  CASE WHEN COUNT(*) >= 2 THEN 1 ELSE 0 END AS covered
        FROM    app a
        LEFT JOIN
                skill s
        ON      (s.id_app = a.id AND s.lvl >= 2)
        GROUP BY
                a.id
        ) ao

, which will work all right.

But this query may be not so performant in certain cases. Let’s create the tables and fill them with sample data:

Database creation scripts

Now we have 10 000 rows in apps and 10 100 000 rows in skills.

Let’s see how they are distributed:

SELECT  id_profile, lvl, COUNT(*)
FROM    skill
GROUP BY
        id_profile, lvl
id_profile lvl COUNT(*)
1 1 90129
1 2 8969
1 3 902
2 3 10000000

When filling the tables, I used different profile_id‘s in different queries to separate the data. And if we talk of skills, it’s natural to call there profiles common trades and politics.

In common trades, there are exactly 10 people who master a certain trade. Their skills are distributed as shown in the table above: 90% jerks, 9% profis and 1% of genii.

In politics, there is only one science: how to run a country. Of course there are 10 000 000 people (generally taxi drivers and stylists) who know exactly how to do it.

Let’s run our initial query and see how it performs:

SELECT  AVG(covered)
FROM    (
        SELECT  CASE WHEN COUNT(*) >= 2 THEN 1 ELSE 0 END AS covered
        FROM    app a
        LEFT JOIN
                skill s
        ON      (s.id_app = a.id AND s.lvl >= 2)
        GROUP BY
                a.id
        ) ao
AVG(covered)
0.2566
1 row fetched in 0,0007s (9,2047s)

As you can see, it works for almost 10 seconds.

Why? Because for id_app of 1, it needs to COUNT(*) over more than 10 000 000 rows in skills.

But we don’t actually need to count all these millions to know they are greater than 2. We just need to make sure there are at least two rows, that’s all.

And we can achieve this with LIMIT 1, 1 in a correlated subquery in SELECT list. It will return as soon as it encounters the second row that matches our criteria. And if there is no second row, it will just return NULL:

SELECT  AVG(covered)
FROM    (
        SELECT IFNULL
               (
               (
               SELECT  1
               FROM    skill s
               WHERE   s.id_app = a.id
                       AND s.lvl >= 2
               LIMIT 1, 1
               ), 0
               ) AS covered
        FROM  app a
        ) ao
AVG(covered)
0.2566
1 row fetched in 0,0006s (0,1244s)

Almost 100 times increase in performance.

Written by Quassnoi

February 28th, 2009 at 9:50 pm

Posted in MySQL

Comments are closed.