EXPLAIN EXTENDED

How to create fast database queries

Indexing a link table

Comments enabled. I *really* need your comment

From Stack Overflow:

A relation table is the common solution to representing a many-to-many (m:n) relationship.

In the simplest form, it combines foreign keys referencing the two relating tables to a new composite primary key:

CREATE TABLE t_a (
id INT NOT NULL PRIMARY KEY,
value VARCHAR(50) NOT NULL
)

CREATE TABLE t_b (
id INT NOT NULL PRIMARY KEY,
value VARCHAR(50) NOT NULL
)

CREATE TABLE t_ab (
a INT NOT NULL,
b INT NOT NULL
)

How should it be indexed to provide optimal performance in every JOIN situation?

I'll consider SQL Server as the target platform this time.

We need all WHERE, ORDER BY and GROUP BY conditions to be covered by the indexes.

On a first thought, we need to have indexes for all columns in all orders:

CREATE SCHEMA [20090512_link]
CREATE TABLE t_a (
id INT NOT NULL PRIMARY KEY,
value VARCHAR(50) NOT NULL
)

CREATE TABLE t_b (
id INT NOT NULL PRIMARY KEY,
value VARCHAR(50) NOT NULL
)

CREATE TABLE t_ab (
a INT NOT NULL,
b INT NOT NULL,
CONSTRAINT PK_ab PRIMARY KEY (a, b)
)
GO
CREATE INDEX UX_ab_a_bdesc ON [20090512_link].t_ab (a, b DESC)
CREATE INDEX IX_b ON [20090512_link].t_ab (b)
CREATE INDEX IX_bdesc ON [20090512_link].t_ab (b DESC)
GO
DECLARE @cnt INT;
SELECT  RAND(20090512)
BEGIN TRANSACTION
SET @cnt = 1;
WHILE @cnt <= 1000
        INSERT
        INTO    [20090512_link].t_a (id, value)
        VALUES  (@cnt, 'Value A ' + CAST(@cnt AS VARCHAR))
END
SET @cnt = 1;
WHILE @cnt <= 1000
        INSERT
        INTO    [20090512_link].t_b (id, value)
        VALUES  (@cnt, 'Value B ' + CAST(@cnt AS VARCHAR))
END

INSERT
INTO    [20090512_link].t_ab (a, b)
SELECT  a.id, b.id
FROM    [20090512_link].t_a a, [20090512_link].t_b b
WHERE   RAND() <= 0.3

COMMIT
GO

We have 1,000 rows in both t_a and t_b, and about 300,000 random links between them, so that each row in t_a is linked to approximately 300 rows in t_b and vice versa.

All the queries below are covered by appropriate indexes and are instant:

SELECT  TOP 10 *
FROM    [20090512_link].t_ab a
ORDER BY
a, b

a b
1 3
1 7
1 12
1 13
1 16
1 17
1 21
1 22
1 26
1 42
10 rows fetched in 0.0004s (0.0007s)
Table 't_ab'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms. 
SELECT  TOP 10 *
FROM    [20090512_link].t_ab a
ORDER BY
a, b DESC

a b
1 999
1 998
1 994
1 990
1 989
1 970
1 969
1 966
1 965
1 961
10 rows fetched in 0.0004s (0.0007s)
Table 't_ab'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms. 
SELECT  TOP 10 *
FROM    [20090512_link].t_ab a
ORDER BY
b, a

a b
4 1
10 1
15 1
20 1
25 1
30 1
35 1
40 1
45 1
50 1
10 rows fetched in 0.0004s (0.0008s)
Table 't_ab'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms. 
SELECT  TOP 10 *
FROM    [20090512_link].t_ab a
ORDER BY
b, a DESC

a b
998 1
997 1
993 1
992 1
988 1
987 1
983 1
982 1
977 1
972 1
10 rows fetched in 0.0004s (0.0007s)
Table 't_ab'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms. 

No plan for any of the queries above contains a Sort operation. All values come sorted from the indexes.

Note that I didn't include a into the indexes on t_ab (b) and t_ab (b DESC).

This is because we have a clustered primary key on t_ab (a, b).

Each leaf in the index t_ab(b) actually contains a key value (b) and a row pointer value (a, b).

That's why the index on t_ab(b) is in fact ordered by (b, a, b), and the index on t_ab(b DESC) is ordered by (b DESC, a, b).

That's why there is no point to include a into these indexes: they already have it.

But do we really need all these indexes?

No, we don't.

Since t_a and t_b have clustered indexes on appropriate values, they may be used instead of actual indexes on t_ab. They have all required data and, taken together, they can serve as a single index in any direction.

Let's see what's happening when we sort on, say, B, A DESC:

  • All b.id's from the index on t_b(id) are read in the forward direction
  • For each b.id (read in forward direction), all leaves from t_ab (b) are read in reverse direction. For any given value of b, the values of a can be read from the index on t_ab (b) in any direction

To make this possible, we just need to join the link table with the dictionary tables.

Let's force the optimizer to use an index on t_ab (b) and see how it works:

SELECT  TOP 10
a.value, b.value
FROM    [20090512_link].t_ab ab WITH(INDEX(IX_b))
JOIN    [20090512_link].t_a a
ON      a.id = ab.a
JOIN    [20090512_link].t_b b
ON      b.id = ab.b
ORDER BY
ab.b, ab.a DESC
value value
Value A 998 Value B 1
Value A 997 Value B 1
Value A 993 Value B 1
Value A 992 Value B 1
Value A 988 Value B 1
Value A 987 Value B 1
Value A 983 Value B 1
Value A 982 Value B 1
Value A 977 Value B 1
Value A 972 Value B 1
10 rows fetched in 0.0004s (0.0011s)
Table 't_a'. Scan count 0, logical reads 20, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_ab'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_b'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms. 

This is also instant despite the index on t_ab(b) being created without DESC option.

Note that we can even include range conditions on either field, and still have no Sort in the plan:

SELECT  TOP 10
a.value, b.value
FROM    [20090512_link].t_ab ab WITH(INDEX(PK_ab))
JOIN    [20090512_link].t_a a
ON      a.id = ab.a
JOIN    [20090512_link].t_b b
ON      b.id = ab.b
WHERE   a BETWEEN 10 AND 15
AND b BETWEEN 30 AND 35
ORDER BY
ab.b, ab.a DESC
value value
Value A 15 Value B 30
Value A 10 Value B 30
Value A 15 Value B 31
Value A 10 Value B 31
Value A 13 Value B 32
Value A 12 Value B 32
Value A 13 Value B 33
Value A 12 Value B 33
Value A 15 Value B 34
Value A 10 Value B 34
10 rows fetched in 0.0004s (0.0012s)
Table 't_b'. Scan count 0, logical reads 24, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_ab'. Scan count 6, logical reads 18, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_a'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms. 

The index ranges will be applied dynamically inside each NESTED LOOP.

Summary:

For a many-to-many table, it's enough to create two indexes:

ALTER TABLE t_ab ADD CONSTRAINT PK_ab PRIMARY KEY (a, b)
CREATE INDEX IX_ab_b ON t_ab(b)

Used in a JOIN along with PRIMARY KEY's on t_a and t_b, they can efficiently serve all equality and range WHERE conditions and ORDER BY operations in any direction.

There is no need to add column a into the index on t_ab(b), since it's already contained there within the row pointer, which is in fact a PRIMARY KEY of the indexed record.

Since index leaves are sorted on (key, pointer), the whole index is in fact sorted on (b, a)

Written by Quassnoi

May 12th, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply