EXPLAIN EXTENDED

How to create fast database queries

Finding duplicates on either of the fields

Comments enabled. I *really* need your comment

From Stack Overflow:

This query for creating a list of сandidate duplicates is easy enough:

SELECT  COUNT(*), Can_FName, Can_HPhone, Can_EMail
FROM    can 
GROUP BY
        Can_FName, Can_HPhone, Can_EMail
HAVING  COUNT(*) > 1

But if the actual rule I want to check against is FName and (HPhone OR Email) — how can I adjust the GROUP BY to work with this?

This is quite a common need.

The problem with conditions like that is that GROUP BY will not work on them since they are not transitive.

Let's imagine a sample dataset:

id name phone email
1 John 555-00-00 john@example.com
2 John 555-00-01 john@example.com
3 John 555-00-01 john-other@example.com
4 James 555-00-00 james@example.com
5 James 555-00-01 james-other@example.com

We see that records 1 and 2 are duplicates (I'd better use the term similar), since the emails match, and 2 and 3 are similar too, since their phone numbers match.

But 1 and 3 are not similar per se, since they have no match on either email or phone number. They wouldn't be considered similar if not for the record 2 that binds them.

Given this, we should use similarity chains instead of GROUP BY.

A record is considered similar to another record if the names match and a path can be build from one record to another changing either email or phone on each step (but not both).

Each similarity set (in this sense) constitutes a closed group: if node 1 can be reached from node 2, it can be reached from any node reachable from node 2 and vice versa: if node 1 cannot be reached from node 2, it also cannot be reached from any node reachable from 2.

This set can be identified by the id of the least node.

We can solve this tasks using recursive CTE's. This of course will not work efficiently on large datasets or datasets having large similarity chains, but will be allright for simple cases.

Let's create a sample table:

CREATE SCHEMA [20090708_duplicate]
CREATE TABLE t_candidate (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(50) NOT NULL,
        phone VARCHAR(50) NOT NULL,
        email VARCHAR(50) NOT NULL
)
GO
INSERT
INTO    [20090708_duplicate].t_candidate
SELECT  1 AS id, 'John' AS name, '555-00-00' AS phone, 'john@example.com' AS email
UNION ALL
SELECT  2 AS id, 'John', '555-00-01', 'john@example.com'
UNION ALL
SELECT  3 AS id, 'John', '555-00-01', 'john-other@example.com'
UNION ALL
SELECT  4 AS id, 'James', '555-00-00', 'james@example.com'
UNION ALL
SELECT  5 AS id, 'James', '555-00-01', 'james-other@example.com'
GO

This script fills the table with the sample data from the above example.

To build a similarity chain we should construct a CTE that iterates the records, joining the previous result on the OR condition we need.

To avoid loops, we make sure that each next node has id greater than the previous one:

WITH    dup AS (
        SELECT  id AS chainid, id, name, phone, email, 1 as d
        FROM    [20090708_duplicate].t_candidate q
        UNION ALL
        SELECT  chainid, qo.id, qo.name, qo.phone, qo.email, d + 1
        FROM    dup
        JOIN    [20090708_duplicate].t_candidate qo
        ON      qo.name = dup.name
                AND (qo.phone = dup.phone OR qo.email = dup.email)
                AND qo.id > dup.id
        )
SELECT  *
FROM    dup
ORDER BY
        chainid, id
chainid id name phone email d
1 1 John 555-00-00 john@example.com 1
1 2 John 555-00-01 john@example.com 2
1 3 John 555-00-01 john-other@example.com 3
2 2 John 555-00-01 john@example.com 1
2 3 John 555-00-01 john-other@example.com 2
3 3 John 555-00-01 john-other@example.com 1
4 4 James 555-00-00 james@example.com 1
5 5 James 555-00-01 james-other@example.com 1

We have 5 chains here, starting from each of 5 records. Chains 4 and 5 are not met in any other chain but in their own, which means they are not similar to any other node.

However, chains starting from 2 and 3 are subsets of chain 1 and should be filtered out.

Since id's in any chain come in increasing order, the chain with least chainid will always be the longest. That means we should filter out all chains whose chainid is met among id's of some other chain:

WITH    dup AS (
        SELECT  id AS chainid, id, name, phone, email, 1 as d
        FROM    [20090708_duplicate].t_candidate q
        UNION ALL
        SELECT  chainid, qo.id, qo.name, qo.phone, qo.email, d + 1
        FROM    dup
        JOIN    [20090708_duplicate].t_candidate qo
        ON      qo.name = dup.name
                AND (qo.phone = dup.phone OR qo.email = dup.email)
                AND qo.id > dup.id
        ),
        chains AS 
        (
        SELECT  *
        FROM    dup do
        WHERE   chainid NOT IN
                (
                SELECT  id
                FROM    dup di
                WHERE   di.chainid < do.chainid
                )
        )
SELECT  *
FROM    chains
ORDER BY
        chainid

&#91;/sourcecode&#93;

<div class="terminal">
<table class="terminal">
<tr>
<th>chainid</th><th>id</th><th>name</th><th>phone</th><th>email</th><th>d</th></tr>
<tr><td class="int">1</td><td class="int">1</td><td class="varchar">John</td><td class="varchar">555-00-00</td><td class="varchar">john@example.com</td><td class="int">1</td></tr>
<tr><td class="int">1</td><td class="int">2</td><td class="varchar">John</td><td class="varchar">555-00-01</td><td class="varchar">john@example.com</td><td class="int">2</td></tr>
<tr><td class="int">1</td><td class="int">3</td><td class="varchar">John</td><td class="varchar">555-00-01</td><td class="varchar">john-other@example.com</td><td class="int">3</td></tr>
<tr><td class="int">4</td><td class="int">4</td><td class="varchar">James</td><td class="varchar">555-00-00</td><td class="varchar">james@example.com</td><td class="int">1</td></tr>
<tr><td class="int">5</td><td class="int">5</td><td class="varchar">James</td><td class="varchar">555-00-01</td><td class="varchar">james-other@example.com</td><td class="int">1</td></tr>
</table>
</div>

Now all that is left to do is to group the values by <code>chainid</code>, calculate the count of similarities and filter out the rows that have none:


WITH    dup AS (
        SELECT  id AS chainid, id, name, phone, email, 1 as d
        FROM    [20090708_duplicate].t_candidate q
        UNION ALL
        SELECT  chainid, qo.id, qo.name, qo.phone, qo.email, d + 1
        FROM    dup
        JOIN    [20090708_duplicate].t_candidate qo
        ON      qo.name = dup.name
                AND (qo.phone = dup.phone OR qo.email = dup.email)
                AND qo.id > dup.id
        ),
        chains AS 
        (
        SELECT  *
        FROM    dup do
        WHERE   chainid NOT IN
                (
                SELECT  id
                FROM    dup di
                WHERE   di.chainid < do.chainid
                )
        ),
        nc AS
        (
        SELECT  chainid, COUNT(*) AS cnt
        FROM    chains
        GROUP BY
                chainid
        HAVING  COUNT(*) > 1
        )
SELECT  q.*, nc.cnt
FROM    nc
JOIN    [20090708_duplicate].t_candidate q
ON      q.id = nc.chainid
ORDER BY
        q.id

id name phone email cnt
1 John 555-00-00 john@example.com 3

We have selected the first record in the similarity chain along with the total count of records in the chain.

Written by Quassnoi

July 8th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply