Archive for November 17th, 2009
Selecting most relevant row
Comments enabled. I *really* need your comment
From Stack Overflow:
I have whole bunch of contact data that may or may not contain all info, like this:
Name Address1 Address2 Test User Address1 Address2 test@example.com NULL NULL NULL test@example.com For each email address, I would love to extract the most relevant record, that is the row with the contact information rather than the row with empty fields
To do this we need a sorting criterion: how to tell that a record is considered more relevant
than another record. We can use the number of the fields filled as a criterion.
Let's create a sample table:
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE address ( id INT NOT NULL PRIMARY KEY, name VARCHAR(20), address1 VARCHAR(100), address2 VARCHAR(100), email VARCHAR(255) NOT NULL, KEY ix_address_email (email) ) ENGINE=InnoDB DEFAULT CHARSET=UTF8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(20000); COMMIT; INSERT INTO address (id, name, address1, address2, email) SELECT id, CASE WHEN RAND(20091117) < 0.1 THEN CONCAT('Name ', id) ELSE NULL END, CASE WHEN RAND(20091117 << 1) < 0.1 THEN CONCAT('Address 1 ', id) ELSE NULL END, CASE WHEN RAND(20091117 << 2) < 0.1 THEN CONCAT('Address 2 ', id) ELSE NULL END, CONCAT('test', ((id - 1) % 20 + 1), '@example.com') FROM filler; [/sourcecode] </div> This table has <strong>20,000</strong> records with <strong>20</strong> distinct emails. To find out the most filled values, we need to do the following: <ol> <li>Select the list of the emails</li> <li>For each email, sort the records so that the most filled record goes first. To do this, we just cast the boolean values (like <code>name IS NULL</code>) to the integers and add them up</li> <li>Select the first record's id and join on this id</li> </ol> This guarantees that for each distinct email in the table, exactly one record will be selected (despite the actual number of records filled). And here is the query: SELECT ai.* FROM ( SELECT DISTINCT email FROM address ) ad JOIN address ai ON ai.id = ( SELECT id FROM address ao WHERE ao.email = ad.email ORDER BY (name IS NULL) + (address1 IS NULL) + (address2 IS NULL) LIMIT 1 )
id | name | address1 | address2 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
18390 | Name 18390 | Address 1 18390 | Address 2 18390 | test10@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3591 | Name 3591 | Address 1 3591 | test11@example.com | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2052 | Name 2052 | Address 2 2052 | test12@example.com | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
19473 | Name 19473 | Address 1 19473 | Address 2 19473 | test13@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8234 | Name 8234 | Address 1 8234 | Address 2 8234 | test14@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
14355 | Name 14355 | Address 1 14355 | Address 2 14355 | test15@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6916 | Name 6916 | Address 1 6916 | Address 2 6916 | test16@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8937 | Name 8937 | Address 1 8937 | Address 2 8937 | test17@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
14378 | Name 14378 | Address 1 14378 | Address 2 14378 | test18@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3099 | Address 1 3099 | Address 2 3099 | test19@example.com | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2821 | Name 2821 | Address 1 2821 | Address 2 2821 | test1@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
17580 | Name 17580 | Address 1 17580 | Address 2 17580 | test20@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
16902 | Name 16902 | Address 1 16902 | test2@example.com | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
17703 | Name 17703 | Address 1 17703 | Address 2 17703 | test3@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4324 | Name 4324 | Address 1 4324 | Address 2 4324 | test4@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2545 | Name 2545 | Address 1 2545 | Address 2 2545 | test5@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1546 | Name 1546 | Address 1 1546 | Address 2 1546 | test6@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
18707 | Name 18707 | Address 1 18707 | Address 2 18707 | test7@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
12048 | Address 1 12048 | Address 2 12048 | test8@example.com | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1009 | Name 1009 | Address 1 1009 | Address 2 1009 | test9@example.com | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
20 rows fetched in 0.0010s (0.2543s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 20 | 100.00 | |||||
1 | PRIMARY | ai | eq_ref | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using where |
3 | DEPENDENT SUBQUERY | ao | ref | ix_address_email | ix_address_email | 767 | ad.email | 646 | 100.00 | Using where; Using filesort |
2 | DERIVED | address | range | ix_address_email | 767 | 31 | 100.00 | Using index for group-by |
Field or reference 'ad.email' of SELECT #3 was resolved in SELECT #1 select `20091117_relevant`.`ai`.`id` AS `id`,`20091117_relevant`.`ai`.`name` AS `name`,`20091117_relevant`.`ai`.`address1` AS `address1`,`20091117_relevant`.`ai`.`address2` AS `address2`,`20091117_relevant`.`ai`.`email` AS `email` from (select distinct `20091117_relevant`.`address`.`email` AS `email` from `20091117_relevant`.`address`) `ad` join `20091117_relevant`.`address` `ai` where (`20091117_relevant`.`ai`.`id` = (select `20091117_relevant`.`ao`.`id` AS `id` from `20091117_relevant`.`address` `ao` where (`20091117_relevant`.`ao`.`email` = `ad`.`email`) order by ((isnull(`20091117_relevant`.`ao`.`name`) + isnull(`20091117_relevant`.`ao`.`address1`)) + isnull(`20091117_relevant`.`ao`.`address2`)) limit 1))
As you can see, not all records are filled completely.
Nevertheless, the query selects exactly one record per email, and this is the most filled record for this email.
Written by Quassnoi
November 17th, 2009 at 11:00 pm
Posted in MySQL
Subscribe
Subscribe by email
Contacts
Should I?
Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback
Recent articles
- Happy New Year: Diffusion Model image generator in about 700 lines of pure SQL
- Happy New Year: GPT in 500 lines of SQL
- Happy New Year: solving the Rubik’s Cube in SQL
- A good first word for Wordle
- Happy New Year: quantum computer emulator in SQL
Calendar
Archives
- December 2024
- December 2023
- December 2022
- January 2022
- December 2021
- December 2020
- December 2019
- December 2018
- December 2017
- December 2016
- December 2015
- December 2014
- July 2014
- December 2013
- October 2013
- September 2013
- August 2013
- July 2013
- June 2013
- May 2013
- March 2013
- January 2013
- December 2012
- December 2011
- June 2011
- April 2011
- March 2011
- February 2011
- December 2010
- November 2010
- October 2010
- September 2010
- August 2010
- July 2010
- June 2010
- May 2010
- April 2010
- March 2010
- February 2010
- January 2010
- December 2009
- November 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009