Counting skills

Ok, i've been trying to solve this for about 2 hours now... Please advise:

Tables:

PROFILE [id (int), name (varchar), ...] SKILL [id (int), id_profile (int), id_app (int), lvl (int), ...] APP [id (int), ...]

The lvl can basically go from 0 to 3.

I'm trying to get this particular stat: "What is the percentage of apps that is covered by at least two people having a skill of 2 or higher?"

Thanks a lot

We actually need to count ratio here:

(apps that have 2 or more skills with level 2+) / (total number of apps)

This is one of not so common cases when AVG aggregate apparently comes handy.

What we need here, is:

Assign a 1 to apps that match the criteria above,
Assign a 0 to apps that don't match these, and
Take the average.

First thought, of course, is to COUNT(*) all lvl 2+ skills for each app and filter them out:

SELECT  AVG(covered)
FROM    (
        SELECT  CASE WHEN COUNT(*) >= 2 THEN 1 ELSE 0 END AS covered
        FROM    app a
        LEFT JOIN
                skill s
        ON      (s.id_app = a.id AND s.lvl >= 2)
        GROUP BY
                a.id
        ) ao

, which will work all right.

But this query may be not so performant in certain cases. Let's create the tables and fill them with sample data:

Database creation scripts

DROP TABLE IF EXISTS
 `20090228_percentage`.`app`;
CREATE TABLE
  `20090228_percentage`.`app` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(200) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS
  `20090228_percentage`.`skill`;
CREATE TABLE
  `20090228_percentage`.`skill` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `id_profile` int(11) NOT NULL,
    `id_app` int(11) NOT NULL,
    `lvl` tinyint(4) NOT NULL,
    PRIMARY KEY (`id`),
    KEY `ix_skill_app_lvl` (`id_app`,`lvl`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

DROP PROCEDURE IF EXISTS `20090228_percentage`.`prc_fill_app`$$
CREATE DEFINER=`abolen`@`%` PROCEDURE  `20090228_percentage`.`prc_fill_app`(rc INT)
BEGIN
  DECLARE i INTEGER;
    DECLARE EXIT HANDLER FOR SQLEXCEPTION
  BEGIN
     ROLLBACK;
  END;
  START TRANSACTION;
  SET i = 0;
  WHILE i < rc DO
    SET i = i + 1;
    INSERT
    INTO  app (name)
    VALUES (CONCAT('app ', i));
  END WHILE;
  COMMIT;
END $$

DELIMITER ;

CALL prc_fill_app(10000);

INSERT
INTO   skill (id_profile, id_app, lvl)
SELECT *
FROM
  (
  SELECT 1, a.id, 3 - TRUNCATE(LOG10(RAND(20090228) * 999 + 1), 0) AS lvl
  FROM
    (
    SELECT id
    FROM   app
    WHERE  id <= 10
    ) ai, app a
  ) ao;

INSERT
INTO   skill (id_profile, id_app, lvl)
SELECT 2, 1, 3
FROM (
     SELECT  1
     FROM    app
     LIMIT   1000
     ) ai,
     app;
&#91;/sourcecode&#93;
</div>

Now we have <strong>10 000</strong> rows in apps and <strong>10 100 000</strong> rows in skills.

Let's see how they are distributed:


SELECT  id_profile, lvl, COUNT(*)
FROM    skill
GROUP BY
        id_profile, lvl

id_profile	lvl	COUNT(*)
1	1	90129
1	2	8969
1	3	902
2	3	10000000

When filling the tables, I used different profile_id's in different queries to separate the data. And if we talk of skills, it's natural to call there profiles common trades and politics.

In common trades, there are exactly 10 people who master a certain trade. Their skills are distributed as shown in the table above: 90% jerks, 9% profis and 1% of genii.

In politics, there is only one science: how to run a country. Of course there are 10 000 000 people (generally taxi drivers and stylists) who know exactly how to do it.

Let's run our initial query and see how it performs:

SELECT  AVG(covered)
FROM    (
        SELECT  CASE WHEN COUNT(*) >= 2 THEN 1 ELSE 0 END AS covered
        FROM    app a
        LEFT JOIN
                skill s
        ON      (s.id_app = a.id AND s.lvl >= 2)
        GROUP BY
                a.id
        ) ao

AVG(covered)
0.2566
1 row fetched in 0,0007s (9,2047s)

As you can see, it works for almost 10 seconds.

Why? Because for id_app of 1, it needs to COUNT(*) over more than 10 000 000 rows in skills.

But we don't actually need to count all these millions to know they are greater than 2. We just need to make sure there are at least two rows, that's all.

And we can achieve this with LIMIT 1, 1 in a correlated subquery in SELECT list. It will return as soon as it encounters the second row that matches our criteria. And if there is no second row, it will just return NULL:

SELECT  AVG(covered)
FROM    (
        SELECT IFNULL
               (
               (
               SELECT  1
               FROM    skill s
               WHERE   s.id_app = a.id
                       AND s.lvl >= 2
               LIMIT 1, 1
               ), 0
               ) AS covered
        FROM  app a
        ) ao

AVG(covered)
0.2566
1 row fetched in 0,0006s (0,1244s)

Almost 100 times increase in performance.

Written by Quassnoi

February 28th, 2009 at 9:50 pm

Posted in MySQL

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28

EXPLAIN EXTENDED

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow

EXPLAIN EXTENDED

Counting skills

Share this:

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow