Selecting DISTINCT pairs not having a common index at EXPLAIN EXTENDED

I have this query:
SELECT  DISTINCT id, name
FROM    table1
For a given id, the name will always be the same.

Both fields are indexed. There's no separate table that maps the id to the name.

The table is very large (10,000,000 rows), so the query could take some time.

This query is very fast, since it's indexed:
SELECT  DISTINCT id
FROM    table1
Likewise for this query:
SELECT  DISTINCT id
FROM    table1
Assuming I can't get the database structure changed (a very safe assumption), what's a better way to structure the first query for performance?

The fastest way would of course be building an index on (id, name).

But since we (for whatever reason) cannot change the database structure, it's not an option.

Author, nevertheless, mentioned that for any given id, the name will always be the same. This can help us.

Let's create a sample table:

CREATE TABLE t_distinct (
        id NUMBER(10) NOT NULL,
        name VARCHAR2(50) NOT NULL,
        stuffing VARCHAR2(400) NOT NULL
)
/
INSERT
INTO    t_distinct (id, name, stuffing)
SELECT  FLOOR((level - 1) / 30000) + 1,
        'Name ' || (FLOOR((level - 1) / 30000) + 1),
        LPAD(' ', 400, ' ')
FROM    dual
CONNECT BY
        level <= 1000000
/
COMMIT
/
CREATE INDEX ix_distinct_id ON t_distinct (id)
/
CREATE INDEX ix_distinct_name ON t_distinct (name)
/
BEGIN
        DBMS_STATS.gather_schema_stats('"20090602_distinct"');
END;
/
&#91;/sourcecode&#93;

This table has <strong>1,000,000</strong> rows and <strong>34</strong> <code>DISTINCT id, name</code> pairs.

The table also has a stuffing column filled with spaces.

Author mentioned that there are lots of other columns in his table, and this column makes our table to occupy about as much space as his table does.

The original query:


SELECT  DISTINCT id, name
FROM    t_distinct

ID	NAME
1	Name 1
16	Name 16
2	Name 2
25	Name 25
26	Name 26
28	Name 28
6	Name 6
13	Name 13
23	Name 23
32	Name 32
10	Name 10
11	Name 11
14	Name 14
22	Name 22
24	Name 24
17	Name 17
12	Name 12
15	Name 15
19	Name 19
27	Name 27
4	Name 4
5	Name 5
9	Name 9
21	Name 21
3	Name 3
30	Name 30
34	Name 34
18	Name 18
20	Name 20
33	Name 33
29	Name 29
31	Name 31
7	Name 7
8	Name 8
34 rows fetched in 0.0009s (2.5312s)

SELECT STATEMENT 
 HASH UNIQUE
  VIEW , 20090602_distinct.index$_join$_001
   HASH JOIN 
    INDEX FAST FULL SCAN, 20090602_distinct.IX_DISTINCT_ID
    INDEX FAST FULL SCAN, 20090602_distinct.IX_DISTINCT_NAME

runs for more than 2 seconds.

This is because two indexes are being JOIN'ed on ROWID to obtain both values. This is quite a long operation.

What can we do to improve it?

Author mentioned that for any given id, there is exactly one name.

This means that given an id we can select any name we want, and there is no need to browse all table records.

First name for an id (in no particular order) can be selected with a subquery:

SELECT  name
FROM    t_distinct
WHERE   id = :id
        AND rownum = 1

This will use an index on id.

Now everything we need to do is to select DISTINCT id's (which is efficient) and use this subquery for each of these id's:

SELECT  did,
        (
        SELECT  name
        FROM    t_distinct di
        WHERE   id = did
                AND rownum = 1
        ) AS name
FROM    (
        SELECT  /*+ NO_MERGE */
                DISTINCT id AS did
        FROM    t_distinct d
        )

DID	NAME
1	Name 1
22	Name 22
25	Name 25
30	Name 30
34	Name 34
6	Name 6
11	Name 11
13	Name 13
28	Name 28
29	Name 29
2	Name 2
14	Name 14
20	Name 20
21	Name 21
26	Name 26
31	Name 31
4	Name 4
5	Name 5
24	Name 24
32	Name 32
8	Name 8
17	Name 17
23	Name 23
33	Name 33
3	Name 3
7	Name 7
18	Name 18
27	Name 27
9	Name 9
10	Name 10
12	Name 12
15	Name 15
16	Name 16
19	Name 19
34 rows fetched in 0.0009s (0.2204s)

SELECT STATEMENT 
 COUNT STOPKEY
  TABLE ACCESS BY INDEX ROWID, 20090602_distinct.T_DISTINCT
   INDEX RANGE SCAN, 20090602_distinct.IX_DISTINCT_ID
 VIEW 
  HASH UNIQUE
   INDEX FAST FULL SCAN, 20090602_distinct.IX_DISTINCT_ID

This is much faster, since everything can be done using the index on id.

The subquery is executed once per DISTINCT id. Since the subquery requires only one index scan and there are few DISTINCT id's, the subqueries conribute very little into overall query time.

Written by Quassnoi

June 2nd, 2009 at 11:00 pm

Posted in Oracle

EXPLAIN EXTENDED

Distinct pairs

Leave a Reply

Subscribe

Subscribe by email

Contacts

Should I?

Recent articles

Calendar

Archives

Categories

Stack Overflow