July 8, 2009 - EXPLAIN EXTENDED at EXPLAIN EXTENDED

Archive for July 8th, 2009

This query for creating a list of сandidate duplicates is easy enough:
SELECT  COUNT(*), Can_FName, Can_HPhone, Can_EMail
FROM    can 
GROUP BY
        Can_FName, Can_HPhone, Can_EMail
HAVING  COUNT(*) > 1
But if the actual rule I want to check against is FName and (HPhone OR Email) — how can I adjust the GROUP BY to work with this?

This is quite a common need.

The problem with conditions like that is that GROUP BY will not work on them since they are not transitive.

Let's imagine a sample dataset:

id	name	phone	email
1	John	555-00-00	john@example.com
2	John	555-00-01	john@example.com
3	John	555-00-01	john-other@example.com
4	James	555-00-00	james@example.com
5	James	555-00-01	james-other@example.com

We see that records 1 and 2 are duplicates (I'd better use the term similar), since the emails match, and 2 and 3 are similar too, since their phone numbers match.

But 1 and 3 are not similar per se, since they have no match on either email or phone number. They wouldn't be considered similar if not for the record 2 that binds them.

Given this, we should use similarity chains instead of GROUP BY.

A record is considered similar to another record if the names match and a path can be build from one record to another changing either email or phone on each step (but not both).

Each similarity set (in this sense) constitutes a closed group: if node 1 can be reached from node 2, it can be reached from any node reachable from node 2 and vice versa: if node 1 cannot be reached from node 2, it also cannot be reached from any node reachable from 2.

This set can be identified by the id of the least node.

We can solve this tasks using recursive CTE's. This of course will not work efficiently on large datasets or datasets having large similarity chains, but will be allright for simple cases.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

July 8th, 2009 at 11:00 pm

Posted in SQL Server

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

EXPLAIN EXTENDED

Archive for July 8th, 2009

Finding duplicates on either of the fields

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow

EXPLAIN EXTENDED

Archive for July 8th, 2009

Finding duplicates on either of the fields

Share this:

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow