EXPLAIN EXTENDED

How to create fast database queries

Distinct pairs

Comments enabled. I *really* need your comment

From Stack Overflow:

I have this query:

SELECT  DISTINCT id, name
FROM    table1

For a given id, the name will always be the same.

Both fields are indexed. There's no separate table that maps the id to the name.

The table is very large (10,000,000 rows), so the query could take some time.

This query is very fast, since it's indexed:

SELECT  DISTINCT id
FROM    table1

Likewise for this query:

SELECT  DISTINCT id
FROM    table1

Assuming I can't get the database structure changed (a very safe assumption), what's a better way to structure the first query for performance?

The fastest way would of course be building an index on (id, name).

But since we (for whatever reason) cannot change the database structure, it's not an option.

Author, nevertheless, mentioned that for any given id, the name will always be the same. This can help us.

Let's create a sample table:

CREATE TABLE t_distinct (
        id NUMBER(10) NOT NULL,
        name VARCHAR2(50) NOT NULL,
        stuffing VARCHAR2(400) NOT NULL
)
/
INSERT
INTO    t_distinct (id, name, stuffing)
SELECT  FLOOR((level - 1) / 30000) + 1,
        'Name ' || (FLOOR((level - 1) / 30000) + 1),
        LPAD(' ', 400, ' ')
FROM    dual
CONNECT BY
        level <= 1000000
/
COMMIT
/
CREATE INDEX ix_distinct_id ON t_distinct (id)
/
CREATE INDEX ix_distinct_name ON t_distinct (name)
/
BEGIN
        DBMS_STATS.gather_schema_stats('"20090602_distinct"');
END;
/
&#91;/sourcecode&#93;

This table has <strong>1,000,000</strong> rows and <strong>34</strong> <code>DISTINCT id, name</code> pairs.

The table also has a stuffing column filled with spaces.

Author mentioned that there are lots of other columns in his table, and this column makes our table to occupy about as much space as his table does.

The original query:


SELECT  DISTINCT id, name
FROM    t_distinct

ID NAME
1 Name 1
16 Name 16
2 Name 2
25 Name 25
26 Name 26
28 Name 28
6 Name 6
13 Name 13
23 Name 23
32 Name 32
10 Name 10
11 Name 11
14 Name 14
22 Name 22
24 Name 24
17 Name 17
12 Name 12
15 Name 15
19 Name 19
27 Name 27
4 Name 4
5 Name 5
9 Name 9
21 Name 21
3 Name 3
30 Name 30
34 Name 34
18 Name 18
20 Name 20
33 Name 33
29 Name 29
31 Name 31
7 Name 7
8 Name 8
34 rows fetched in 0.0009s (2.5312s)
SELECT STATEMENT 
 HASH UNIQUE
  VIEW , 20090602_distinct.index$_join$_001
   HASH JOIN 
    INDEX FAST FULL SCAN, 20090602_distinct.IX_DISTINCT_ID
    INDEX FAST FULL SCAN, 20090602_distinct.IX_DISTINCT_NAME

runs for more than 2 seconds.

This is because two indexes are being JOIN'ed on ROWID to obtain both values. This is quite a long operation.

What can we do to improve it?

Author mentioned that for any given id, there is exactly one name.

This means that given an id we can select any name we want, and there is no need to browse all table records.

First name for an id (in no particular order) can be selected with a subquery:

SELECT  name
FROM    t_distinct
WHERE   id = :id
        AND rownum = 1

This will use an index on id.

Now everything we need to do is to select DISTINCT id's (which is efficient) and use this subquery for each of these id's:

SELECT  did,
        (
        SELECT  name
        FROM    t_distinct di
        WHERE   id = did
                AND rownum = 1
        ) AS name
FROM    (
        SELECT  /*+ NO_MERGE */
                DISTINCT id AS did
        FROM    t_distinct d
        )

DID NAME
1 Name 1
22 Name 22
25 Name 25
30 Name 30
34 Name 34
6 Name 6
11 Name 11
13 Name 13
28 Name 28
29 Name 29
2 Name 2
14 Name 14
20 Name 20
21 Name 21
26 Name 26
31 Name 31
4 Name 4
5 Name 5
24 Name 24
32 Name 32
8 Name 8
17 Name 17
23 Name 23
33 Name 33
3 Name 3
7 Name 7
18 Name 18
27 Name 27
9 Name 9
10 Name 10
12 Name 12
15 Name 15
16 Name 16
19 Name 19
34 rows fetched in 0.0009s (0.2204s)
SELECT STATEMENT 
 COUNT STOPKEY
  TABLE ACCESS BY INDEX ROWID, 20090602_distinct.T_DISTINCT
   INDEX RANGE SCAN, 20090602_distinct.IX_DISTINCT_ID
 VIEW 
  HASH UNIQUE
   INDEX FAST FULL SCAN, 20090602_distinct.IX_DISTINCT_ID

This is much faster, since everything can be done using the index on id.

The subquery is executed once per DISTINCT id. Since the subquery requires only one index scan and there are few DISTINCT id's, the subqueries conribute very little into overall query time.

Written by Quassnoi

June 2nd, 2009 at 11:00 pm

Posted in Oracle

Leave a Reply