From Stack Overflow:
Ok, i've been trying to solve this for about 2 hours now... Please advise:
Tables:
PROFILE [id (int), name (varchar), ...]
SKILL [id (int), id_profile (int), id_app (int), lvl (int), ...]
APP [id (int), ...]
The lvl can basically go from 0 to 3.
I'm trying to get this particular stat: "What is the percentage of apps that is covered by at least two people having a skill of 2 or higher?"
Thanks a lot
We actually need to count ratio here:
(apps that have 2 or more skills with level 2+) / (total number of apps)
This is one of not so common cases when AVG
aggregate apparently comes handy.
What we need here, is:
- Assign a 1 to apps that match the criteria above,
- Assign a 0 to apps that don't match these, and
- Take the average.
First thought, of course, is to COUNT(*)
all lvl 2+
skills for each app
and filter them out:
SELECT AVG(covered)
FROM (
SELECT CASE WHEN COUNT(*) >= 2 THEN 1 ELSE 0 END AS covered
FROM app a
LEFT JOIN
skill s
ON (s.id_app = a.id AND s.lvl >= 2)
GROUP BY
a.id
) ao
, which will work all right.
But this query may be not so performant in certain cases. Let's create the tables and fill them with sample data:
Database creation scripts
DROP TABLE IF EXISTS
`20090228_percentage`.`app`;
CREATE TABLE
`20090228_percentage`.`app` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(200) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
DROP TABLE IF EXISTS
`20090228_percentage`.`skill`;
CREATE TABLE
`20090228_percentage`.`skill` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`id_profile` int(11) NOT NULL,
`id_app` int(11) NOT NULL,
`lvl` tinyint(4) NOT NULL,
PRIMARY KEY (`id`),
KEY `ix_skill_app_lvl` (`id_app`,`lvl`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
DELIMITER $$
DROP PROCEDURE IF EXISTS `20090228_percentage`.`prc_fill_app`$$
CREATE DEFINER=`abolen`@`%` PROCEDURE `20090228_percentage`.`prc_fill_app`(rc INT)
BEGIN
DECLARE i INTEGER;
DECLARE EXIT HANDLER FOR SQLEXCEPTION
BEGIN
ROLLBACK;
END;
START TRANSACTION;
SET i = 0;
WHILE i < rc DO
SET i = i + 1;
INSERT
INTO app (name)
VALUES (CONCAT('app ', i));
END WHILE;
COMMIT;
END $$
DELIMITER ;
CALL prc_fill_app(10000);
INSERT
INTO skill (id_profile, id_app, lvl)
SELECT *
FROM
(
SELECT 1, a.id, 3 - TRUNCATE(LOG10(RAND(20090228) * 999 + 1), 0) AS lvl
FROM
(
SELECT id
FROM app
WHERE id <= 10
) ai, app a
) ao;
INSERT
INTO skill (id_profile, id_app, lvl)
SELECT 2, 1, 3
FROM (
SELECT 1
FROM app
LIMIT 1000
) ai,
app;
[/sourcecode]
</div>
Now we have <strong>10 000</strong> rows in apps and <strong>10 100 000</strong> rows in skills.
Let's see how they are distributed:
SELECT id_profile, lvl, COUNT(*)
FROM skill
GROUP BY
id_profile, lvl
id_profile |
lvl |
COUNT(*) |
1 |
1 |
90129 |
1 |
2 |
8969 |
1 |
3 |
902 |
2 |
3 |
10000000 |
When filling the tables, I used different profile_id
's in different queries to separate the data. And if we talk of skills, it's natural to call there profiles common trades and politics.
In common trades, there are exactly 10 people who master a certain trade. Their skills are distributed as shown in the table above: 90% jerks, 9% profis and 1% of genii.
In politics, there is only one science: how to run a country. Of course there are 10 000 000 people (generally taxi drivers and stylists) who know exactly how to do it.
Let's run our initial query and see how it performs:
SELECT AVG(covered)
FROM (
SELECT CASE WHEN COUNT(*) >= 2 THEN 1 ELSE 0 END AS covered
FROM app a
LEFT JOIN
skill s
ON (s.id_app = a.id AND s.lvl >= 2)
GROUP BY
a.id
) ao
AVG(covered) |
0.2566 |
1 row fetched in 0,0007s (9,2047s) |
As you can see, it works for almost 10 seconds.
Why? Because for id_app
of 1, it needs to COUNT(*)
over more than 10 000 000 rows in skills
.
But we don't actually need to count all these millions to know they are greater than 2. We just need to make sure there are at least two rows, that's all.
And we can achieve this with LIMIT 1, 1
in a correlated subquery in SELECT
list. It will return as soon as it encounters the second row that matches our criteria. And if there is no second row, it will just return NULL
:
SELECT AVG(covered)
FROM (
SELECT IFNULL
(
(
SELECT 1
FROM skill s
WHERE s.id_app = a.id
AND s.lvl >= 2
LIMIT 1, 1
), 0
) AS covered
FROM app a
) ao
AVG(covered) |
0.2566 |
1 row fetched in 0,0006s (0,1244s) |
Almost 100 times increase in performance.
Leave a Reply