Indexing tiny tables
From Stack Overflow:
Hypothetically, in a SQL Server database, if I have a table with two `INT` fields (say a many-to-many relation) that participates in joins between two other tables, at what approximate size does the table become large enough where the performance benefit of indexes on the two `INT` fields overcomes the overhead imposed by said indexes?
Very nice question.
Imagine we have a table `table1 (col INT NOT NULL, value1 INT NOT NULL, value2 VARCHAR(100) NOT NULL, …)` with an index on `col`.
In SQL Server, an index can improve the following things.
- **Index lookups (range scans).** An index can be used to limit the number of rows scanned:

  ```sql
  SELECT  *
  FROM    table1 t1
  WHERE   t1.col BETWEEN @start AND @end
  ```

- **Secondary data source.** This query will get all the data it needs from the index:

  ```sql
  SELECT  t1.col
  FROM    table1 t1
  ```

- **Ordered rowsets.** This query will not need to sort:

  ```sql
  SELECT  *
  FROM    table1 t1
  ORDER BY
          t1.col
  ```

- **Efficient `MIN` / `MAX`:**

  ```sql
  SELECT  MIN(col)
  FROM    table1 t1
  ```
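For the `MIN` / `MAX` case, an index lets SQL Server answer the query by reading a single entry from one end of the index. Conceptually, the query above behaves like this rewrite (an equivalence sketch, not the literal plan):

```sql
SELECT  TOP 1 col
FROM    table1
ORDER BY
        col
```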
The latter case is probably the best illustration of the fact that even a table of 2 rows can benefit from creating an index.
Let's create the sample tables:
```sql
CREATE SCHEMA [20090621_index]
CREATE TABLE t_lookup
        (
        id INT NOT NULL,
        ival INT NOT NULL,
        uval INT NOT NULL
        )
CREATE TABLE t_main
        (
        id INT NOT NULL PRIMARY KEY,
        value INT NOT NULL
        )
GO

ALTER TABLE [20090621_index].t_lookup ADD CONSTRAINT pk_lookup_id PRIMARY KEY (id)

CREATE INDEX ix_lookup_ival ON [20090621_index].t_lookup (ival)

BEGIN TRANSACTION

SELECT  RAND(20090621)

DECLARE @cnt INT
SET @cnt = 1
WHILE @cnt <= 1000000
BEGIN
        INSERT
        INTO    [20090621_index].t_main
        VALUES  (@cnt, FLOOR(RAND() * 20))
        SET @cnt = @cnt + 1
END

SET @cnt = 1
WHILE @cnt <= 2
BEGIN
        INSERT
        INTO    [20090621_index].t_lookup
        VALUES  (@cnt, @cnt * 10, @cnt * 10)
        SET @cnt = @cnt + 1
END

COMMIT
GO
```

There are two tables. The first table, `t_main`, has **1,000,000** rows with an integer column `value` randomly filled with values from **0** to **19**. The second table, `t_lookup`, has but **2** rows:

```sql
SELECT  *
FROM    [20090621_index].t_lookup
```
| id | ival | uval |
|---|---|---|
| 1 | 10 | 10 |
| 2 | 20 | 20 |
Column `ival` is indexed; column `uval` is not.
Now, let's issue quite a common query: for each row from `t_main`, we will select the minimal value from `t_lookup` which is greater than `t_main.value`.

First, let's do it with the non-indexed column `uval`:
```sql
SELECT  SUM(val)
FROM    (
        SELECT  (
                SELECT  MIN(uval)
                FROM    [20090621_index].t_lookup
                WHERE   uval > value
                ) AS val
        FROM    [20090621_index].t_main
        ) m
```
| SUM(val) |
|---|
| 14994870 |

1 row fetched in 0.0002s (3.0469s)
```
Table 't_lookup'. Scan count 1000000, logical reads 2000000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't_main'. Scan count 3, logical reads 4221, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 6000 ms, elapsed time = 3040 ms.
```
```
SELECT
  Compute Scalar
    Stream Aggregate
      Parallelism (Gather Streams)
        Stream Aggregate
          Nested Loops (Inner Join)
            Clustered Index Scan ([20090621_index].[t_main])
            Stream Aggregate
              Clustered Index Scan ([20090621_index].[t_lookup])
```
This query ran for a little more than 3 seconds, which is quite efficient for a query that aggregates 1,000,000 values.

Note that the `CPU time` is twice the `elapsed time`: the operations are parallelized and the results are gathered using `Gather Streams`.
And now let's try the same query on an indexed column:
```sql
SELECT  SUM(val)
FROM    (
        SELECT  (
                SELECT  MIN(ival)
                FROM    [20090621_index].t_lookup
                WHERE   ival > value
                ) AS val
        FROM    [20090621_index].t_main
        ) m
```
| SUM(val) |
|---|
| 14994870 |

1 row fetched in 0.0002s (1.5312s)
```
Table 't_main'. Scan count 3, logical reads 4221, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 't_lookup'. Scan count 20, logical reads 40, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 2406 ms, elapsed time = 1529 ms.
```
```
SELECT
  Compute Scalar
    Stream Aggregate
      Parallelism (Gather Streams)
        Stream Aggregate
          Nested Loops (Inner Join)
            Sort
              Repartition Streams
                Clustered Index Scan ([20090621_index].[t_main])
            Lazy Spool
              Stream Aggregate
                Index Seek ([20090621_index].[t_lookup].[ix_lookup_val])
```
Now we see that the execution plan has changed drastically.

Since SQL Server sees an ordered rowset in the subquery, it understands that sorting the outer query rowset too will keep the number of reevaluations to a minimum. Indeed, it's better to calculate the subquery once for all `ival > 0`, then for all `ival > 1`, etc., than to calculate it each time for a new random value.
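This rewrite can also be expressed by hand (a sketch that is equivalent in result to the query above, not the literal plan the optimizer builds): evaluate the subquery once per distinct `value` and weight its result by the number of occurrences of that value:

```sql
SELECT  SUM(l.val)
FROM    (
        SELECT  value, COUNT(*) AS cnt
        FROM    [20090621_index].t_main
        GROUP BY
                value
        ) d
CROSS APPLY
        (
        SELECT  MIN(ival) * d.cnt AS val
        FROM    [20090621_index].t_lookup
        WHERE   ival > d.value
        ) l
```

With only 20 distinct values in `t_main`, the correlated subquery is evaluated at most 20 times instead of 1,000,000 — which is exactly the effect of the `Lazy Spool` in the plan above.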
Sorting `t_main`, however, is a complex task. That's why SQL Server doesn't actually sort: instead, it multiplexes the input rowset streams (which are read in parallel from the table) into several output streams, based on the `value`. Each output stream holds the rows with the same `value`. Sorting then comes down to just selecting the right stream and reading from it first.
Using this plan, each row from `t_main` is compared against only one value, not two. And though the comparison itself takes but a few microseconds, it needs to be done 1,000,000 times, and that's when the index pays for itself.

The query on the indexed field ran for only 1.5 seconds, twice as fast as the same query on the non-indexed field.
Well said. I know that this is an old post. The info seems as valid now as it was then.
Christopher
14 Aug 20 at 03:09