EXPLAIN EXTENDED

How to create fast database queries

Generating row placeholders: SQL Server

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Vladimir asks:

I have two tables: one contains document headers, the other one contains document items.

The document header defines how many items should be in this document (usually 10 or 15, but this can vary).

I need to generate a list of documents and a list of document items to feed them to a third party control. Normally, this could be easily achieved by doing a SELECT * from both tables, but there is a problem. The control requires that the number of items for each document is equal to the value of the columns in the document header.

That is, if document header says 15 items, there should be exactly 15 records for this document in the items rowset, even if there are less items in the actual table. It's OK if the extra items have a value of NULL.

I tried to use LEFT JOIN and FULL JOIN but cannot figure out how to use them.

This is quite a common problem in SQL world.

SQL as such has no means to generate sets, only to transform them. In classical SQL, if we have a relation that contains 2 records on input and need to have a relation that contains 16 records on output, then we will need to do at least 4 cross joins — just to have enough rows to begin with.

There are two solutions to that problem.

First is having a dummy rowset handy: create a table of 1,000,000 records or use some kind of a system table whose only purpose is to provide enough rows (the actual values of them can be completely ignored).

Second solution is to generate rows on the fly using system-provided means. This is impossible in MySQL, however, all other major systems can do it using generate_series (PostgreSQL, the only system that has a native tool specially for this purpose); CONNECT BY query (Oracle); or, in your case, a recursive CTE (SQL Server).

Recursive queries are somewhat less efficient than stored tables, since they require looping and therefore having much logical reads, while reading from a dummy table requires but a few table scans.

However, the recursive queries also have several advantages:

  • They are always available, even if you querying a database that lacks a dummy table and you don't have a write access to the database
  • They always have enough records
  • They have consistent plans. If the dummy table contains lots of records and the optimizer chooses a fullscan, this can kill performance

That's why I'll base my solution on the recursive CTE.

The main idea to generate a list of document items is simple:

  • Each document header we should replicate exactly as many times as the header says (using a recursive CTE), with an ordinal number
  • We also should assign an ordinal number to each document item, using ROW_NUMBER()
  • Finally, we should just left join the two resultsets

The query will look like this:

header_rows (id, rn) AS
(
SELECT  id, items
FROM    headers
UNION ALL
SELECT  hr.id, hr.rn - 1
FROM    header_rows hr
JOIN    headers h
ON      h.id = hr.id
AND hr.rn > 1
),
item_rows AS
(
SELECT  i.*, ROW_NUMBER() OVER (PARTITION BY document ORDER BY id) AS rn
FROM    items i
)
SELECT  hr.id, hr.rn, ir.name
FROM    header_rows hr
LEFT JOIN
item_rows ir
ON      ir.document = hr.id
AND ir.rn = hr.rn
ORDER BY
hr.id, hr.rn

And here's what it returns on some generated data:

WITH    headers (id, items) AS
(
SELECT  1, 5
UNION ALL
SELECT  2, 10
),
items (id, document, name) AS
(
SELECT  1, 1, 'Document 1'
UNION ALL
SELECT  2, 1, 'Document 2'
UNION ALL
SELECT  3, 2, 'Document 3'
UNION ALL
SELECT  4, 2, 'Document 4'
UNION ALL
SELECT  5, 2, 'Document 5'
),
header_rows (id, rn) AS
(
SELECT  id, items
FROM    headers
UNION ALL
SELECT  hr.id, hr.rn - 1
FROM    header_rows hr
JOIN    headers h
ON      h.id = hr.id
AND hr.rn > 1
),
item_rows AS
(
SELECT  i.*, ROW_NUMBER() OVER (PARTITION BY document ORDER BY id) AS rn
FROM    items i
)
SELECT  hr.id, hr.rn, ir.name
FROM    header_rows hr
LEFT JOIN
item_rows ir
ON      ir.document = hr.id
AND ir.rn = hr.rn
ORDER BY
hr.id, hr.rn
id rn name
1 1 Document 1
1 2 Document 2
1 3
1 4
1 5
2 1 Document 3
2 2 Document 4
2 3 Document 5
2 4
2 5
2 6
2 7
2 8
2 9
2 10

We see that there are two documents, the first requiring 5 items, the second requiring 10 ones. However, there are only 2 and 3 items for the first and the second document, respectively.

The query generated empty placeholders for the missing items, filled them will NULL values and returned 5 and 10 rows in total for the appropriate documents.

Hope that helps.


I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

October 21st, 2009 at 11:00 pm

Posted in SQL Server

Leave a Reply