EXPLAIN EXTENDED

How to create fast database queries

Archive for November 17th, 2009

Selecting most relevant row

Comments enabled. I *really* need your comment

From Stack Overflow:

I have whole bunch of contact data that may or may not contain all info, like this:

Name Address1 Address2 Email
Test User Address1 Address2 test@example.com
NULL NULL NULL test@example.com

For each email address, I would love to extract the most relevant record, that is the row with the contact information rather than the row with empty fields

To do this we need a sorting criterion: how to tell that a record is considered more relevant than another record. We can use the number of the fields filled as a criterion.

Let's create a sample table:

Table creation details

This table has 20,000 records with 20 distinct emails.

To find out the most filled values, we need to do the following:

  1. Select the list of the emails
  2. For each email, sort the records so that the most filled record goes first. To do this, we just cast the boolean values (like name IS NULL) to the integers and add them up
  3. Select the first record's id and join on this id

This guarantees that for each distinct email in the table, exactly one record will be selected (despite the actual number of records filled).

And here is the query:

SELECT  ai.*
FROM    (
        SELECT  DISTINCT  email
        FROM    address
        ) ad
JOIN    address ai
ON      ai.id =
        (
        SELECT  id
        FROM    address ao
        WHERE   ao.email = ad.email
        ORDER BY
                (name IS NULL) + (address1 IS NULL) + (address2 IS NULL)
        LIMIT 1
        )

id name address1 address2 email
18390 Name 18390 Address 1 18390 Address 2 18390 test10@example.com
3591 Name 3591 Address 1 3591 test11@example.com
2052 Name 2052 Address 2 2052 test12@example.com
19473 Name 19473 Address 1 19473 Address 2 19473 test13@example.com
8234 Name 8234 Address 1 8234 Address 2 8234 test14@example.com
14355 Name 14355 Address 1 14355 Address 2 14355 test15@example.com
6916 Name 6916 Address 1 6916 Address 2 6916 test16@example.com
8937 Name 8937 Address 1 8937 Address 2 8937 test17@example.com
14378 Name 14378 Address 1 14378 Address 2 14378 test18@example.com
3099 Address 1 3099 Address 2 3099 test19@example.com
2821 Name 2821 Address 1 2821 Address 2 2821 test1@example.com
17580 Name 17580 Address 1 17580 Address 2 17580 test20@example.com
16902 Name 16902 Address 1 16902 test2@example.com
17703 Name 17703 Address 1 17703 Address 2 17703 test3@example.com
4324 Name 4324 Address 1 4324 Address 2 4324 test4@example.com
2545 Name 2545 Address 1 2545 Address 2 2545 test5@example.com
1546 Name 1546 Address 1 1546 Address 2 1546 test6@example.com
18707 Name 18707 Address 1 18707 Address 2 18707 test7@example.com
12048 Address 1 12048 Address 2 12048 test8@example.com
1009 Name 1009 Address 1 1009 Address 2 1009 test9@example.com
20 rows fetched in 0.0010s (0.2543s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 20 100.00
1 PRIMARY ai eq_ref PRIMARY PRIMARY 4 func 1 100.00 Using where
3 DEPENDENT SUBQUERY ao ref ix_address_email ix_address_email 767 ad.email 646 100.00 Using where; Using filesort
2 DERIVED address range ix_address_email 767 31 100.00 Using index for group-by
Field or reference 'ad.email' of SELECT #3 was resolved in SELECT #1
select `20091117_relevant`.`ai`.`id` AS `id`,`20091117_relevant`.`ai`.`name` AS `name`,`20091117_relevant`.`ai`.`address1` AS `address1`,`20091117_relevant`.`ai`.`address2` AS `address2`,`20091117_relevant`.`ai`.`email` AS `email` from (select distinct `20091117_relevant`.`address`.`email` AS `email` from `20091117_relevant`.`address`) `ad` join `20091117_relevant`.`address` `ai` where (`20091117_relevant`.`ai`.`id` = (select `20091117_relevant`.`ao`.`id` AS `id` from `20091117_relevant`.`address` `ao` where (`20091117_relevant`.`ao`.`email` = `ad`.`email`) order by ((isnull(`20091117_relevant`.`ao`.`name`) + isnull(`20091117_relevant`.`ao`.`address1`)) + isnull(`20091117_relevant`.`ao`.`address2`)) limit 1))

As you can see, not all records are filled completely.

Nevertheless, the query selects exactly one record per email, and this is the most filled record for this email.

Written by Quassnoi

November 17th, 2009 at 11:00 pm

Posted in MySQL