EXPLAIN EXTENDED

Written by Quassnoi

August 23rd, 2009 at 11:00 pm

Posted in Miscellaneous

SQL Server: finding date closest to a birthdate

The problem is with dynamically getting upcoming what we call namedays in Europe.

Name day is a tradition in many countries in Europe and Latin America of celebrating on a particular day of the year associated with the one's given name.

Here's a sampe name day table (I store the date in this format: 1900-08-22, but we really only need month and day):

First name Name day

Bob 1900-04-22

Bob 1900-09-04

Frank 1900-01-02

The trick is that there might be multiple name days for each name, and someones name day is always the first one found after ones birthday.

So if Bob was born on August 5th, his name day would fall on September 4th, but if we were born after Sep 4th, his name day would be on April 22nd.

We need to find when a given person celebrates his/her name day based on the first name and birthdate.

What I need is an SQL Server query that will be able to get me name days for the people in my database.

First name	Name day
Bob	1900-04-22
Bob	1900-09-04
Frank	1900-01-02

We actually need to make the calendar to wrap around: to build a condition that would select 1904, Jan 1st after 1904, Dec 31th.

We can do it using a COALESCE on the simple queries in OUTER APPLY clause.

Let's create sample tables:

Written by Quassnoi

August 21st, 2009 at 11:00 pm

Posted in SQL Server

SQL Server: efficient DISTINCT on dates

I am wondering if there is a good-performing query to select distinct dates (ignoring times) from a table with a datetime field in SQL Server.

My problem isn't getting the server to actually do this (I've seen this question already, and we had something similar already in place using DISTINCT).

The problem is whether there is any trick to get it done more quickly.

With the data we are using, our current query is returning ~80 distinct days for which there are ~40,000 rows of data (after filtering on another indexed column), there is an index on the date column, and the query always manages to take 5+ seconds. Which is too slow.

The problem with this query is that no distinct list of truncated dates is easily accessible.

That's why SQL Server has to fetch all the dates from the table (or from the index) and distinctify them somehow.

In Oracle we could probably create a function-based index that would index the truncated part of a DATETIME column. Then it would just use jump over the distinct index keys.

Same is possible in MySQL (however, the latter does not support function-based indexes, so one has to update the date-only column manually).

SQL Server does support indexes on computed columns, however, its optimizer lacks ability to skip distinct records in an index.

That's why creating an index on a computed column doesn't help.

But since we are talking about the dates, it's possible to emulate this behavior to some extent using a recursive CTE that would generate a list of dates, and the checking each date.

Let's create a sample table:

Written by Quassnoi

August 20th, 2009 at 11:00 pm

Posted in SQL Server

MySQL: difference between sets

with one comment

I have a table that holds data about items that existed at a certain time (regular snapshots taken).

Simple example:

Timestamp ID

1 A

1 B

2 A

2 B

2 C

3 A

3 D

4 D

4 E

In this case, item C gets created sometime between snapshot 1 and 2, sometime between snapshot 2 and 3 B and C disappear and D gets created, etc.

The table is reasonably large (millions of records) and for each timestamp there are about 50 records.

What's the most efficient way of selecting the item ids for items that disappear between two consecutive timestamps?

So for the above example I would like to get the following:

Previous snapshot Current snapshot Removed

1 2 NULL

2 3 B, C

3 4 A

If it doesn't make the query inefficient, can it be extended to automatically use the latest (i. e. MAX) timestamp and the previous one?

Timestamp	ID
1	A
1	B
2	A
2	B
2	C
3	A
3	D
4	D
4	E

Previous snapshot	Current snapshot	Removed
1	2	NULL
2	3	B, C
3	4	A

We basically need to do the following things here:

Split the table into sets grouped by timestamp
Compare each set with the one of previous timestamp
Find the values missing in the current set and concatenate them

This is possible to do using only the standard ANSI SQL operators, however, this will be inefficient in MySQL.

Let's create a sample table and see how to work around this:

Written by Quassnoi

August 19th, 2009 at 11:00 pm

Posted in MySQL

Passing parameters in MySQL: IN list vs. temporary table

with 7 comments

When you need to pass a list of parameters into a MySQL, there are at least two method to do this:

Issue a dynamical query which would fill an IN list with constant values
Fill a temporary table with these values then use it in a JOIN

The second method is considered better for many reasons (you can easily reuse the values, it's more injection-proof etc.)

Leaving all these issues aside, let's compare these methods performance-wise.

We will create a table of 50,000 records, pass a list of 500 parameters (using both methods) and see which is faster:

Written by Quassnoi

August 18th, 2009 at 11:00 pm

Posted in MySQL

Hierarchical queries in MySQL: VARCHAR keys

Answering questions asked on the site.

Basit asks:

I'm trying to implement a hierarchical query in MySQL using solutions you posted in your article.

How can I make the tree menu if the id is unique VARCHAR, like the ones used on youtube?

We will need to fix the query and the function just a little:

The function should return a VARCHAR
The @start_with and other variables should be initialized with VARCHARs
The initial condition for _id should be a minimal VARCHAR possible, i. e. an empty string

Let's create a sample table and check how it works:

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_key (
        id INT NOT NULL PRIMARY KEY,
        string_id VARCHAR(32) NOT NULL
) ENGINE=Memory;

CREATE TABLE t_hierarchy (
        id VARCHAR(32) NOT NULL PRIMARY KEY,
        parent VARCHAR(32) NOT NULL,
        data VARCHAR(50) NOT NULL,
        KEY ix_hierarchy_parent (parent)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(20);
COMMIT;

INSERT
INTO    t_key
SELECT  id,
        (
        SELECT  GROUP_CONCAT(CHAR(65 + FLOOR(RAND(20090817) * 25)) SEPARATOR '')
        FROM    (
                SELECT  NULL
                FROM    filler
                LIMIT 8
                ) q
        )
FROM    filler;

INSERT
INTO    t_hierarchy
SELECT  ki.string_id, COALESCE(kp.string_id, '_'), CONCAT('Row ', f.id)
FROM    filler f
JOIN    t_key ki
ON      ki.id = f.id
LEFT JOIN
        t_key kp
ON      kp.id = f.id DIV 3;

Here's the function:

CREATE FUNCTION hierarchy_connect_by_parent_eq_prior_id(value VARCHAR(32))
RETURNS VARCHAR(32)
NOT DETERMINISTIC
READS SQL DATA
BEGIN
        DECLARE _id VARCHAR(32);
        DECLARE _parent VARCHAR(32);
        DECLARE CONTINUE HANDLER FOR NOT FOUND SET @id = NULL;

        SET _parent = @id;
        SET _id = '';

        IF @id IS NULL THEN
                RETURN NULL;
        END IF;

        LOOP
                SELECT  MIN(id)
                INTO    @id
                FROM    t_hierarchy
                WHERE   parent = _parent
                        AND id > _id;
                IF @id IS NOT NULL OR _parent = @start_with THEN
                        SET @level = @level + 1;
                        RETURN @id;
                END IF;
                SET @level := @level - 1;
                SELECT  id, parent
                INTO    _id, _parent
                FROM    t_hierarchy
                WHERE   id = _parent;
        END LOOP;       
END
$$

And here's the query:

SELECT  CONCAT(REPEAT('    ', level - 1), CAST(hi.id AS CHAR)) AS treeitem, parent, level
FROM    (
        SELECT  hierarchy_connect_by_parent_eq_prior_id(id) AS id, @level AS level
        FROM    (
                SELECT  @start_with := '_',
                        @id := @start_with,
                        @level := 0
                ) vars, t_hierarchy
        WHERE   @id IS NOT NULL
        ) ho
JOIN    t_hierarchy hi
ON      hi.id = ho.id

treeitem	parent	level
FDBVDDGT	_	1
ENCYPEWC	FDBVDDGT	2
AJXOXCQA	ENCYPEWC	3
CLYLKREW	ENCYPEWC	3
YFCUVVRX	ENCYPEWC	3
JVDDGUJJ	FDBVDDGT	2
ELQWNYGL	JVDDGUJJ	3
KTRBGCQA	JVDDGUJJ	3
SCGYDUQU	JVDDGUJJ	3
VEHXSYRL	FDBVDDGT	2
BIKCFVOH	VEHXSYRL	3
HAGFGSUW	VEHXSYRL	3
TCBYTARM	VEHXSYRL	3
GVPJESCI	_	1
HCNHAEWW	GVPJESCI	2
KNIDSGBM	GVPJESCI	2
SJPBIMNF	GVPJESCI	2
JETHGIVJ	SJPBIMNF	3
OBPXSXJF	SJPBIMNF	3
VPOXBLBW	SJPBIMNF	3
20 rows fetched in 0.0006s (0.0180s)

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	ALL					20	100.00
1	PRIMARY	hi	eq_ref	PRIMARY	PRIMARY	98	ho.id	1	100.00
2	DERIVED	<derived3>	system					1	100.00
2	DERIVED	t_hierarchy	index		PRIMARY	98		20	100.00	Using where; Using index
3	DERIVED									No tables used

Works just fine.

Written by Quassnoi

August 17th, 2009 at 11:00 pm

Posted in MySQL

Presentation layer

A question frequently asked by many web application developers:

Where do I build a presentation layer for my data?

Do I do it on SQL side or on PHP / ASP /whatever side?

This is a very interesting question.

A famous Chinese proverb (which like the majority of famous Chinese proverbs is invented by a Western journalist) says A picture is worth a thousand words.

A picture is worth a thousand words.
Image by zedzap

How do you think, why it is? What's so good in a picture?

Seems to be a strange question. A picture just tells more. If you look at a picture, you get an overall impression, which can be hardly expressed using words. Words just don't pull the right strings in your mind, those that a picture does. And the words are harder to read, they are more numerous. And a dozen of other reasons.

However, the main reason seems to be this: you cannot give a complete description of a picture using words.

But can it be so that a word is worth a thousand pictures?

Sure it can. And any book fan who had seen the screen version of his favorite work can confirm it.

From Douglas Adams's masterpiece The Hitchiker's Guide to the Galaxy:

A hole had just appeared in the Galaxy. It was exactly a nothingth of a second long, a nothingth of an inch wide, and quite a lot of million light years from end to end.

Or this one:

Five wild Event Maelstroms swirled in vicious storms of unreason and spewed up a pavement. On the pavement lay Ford Prefect and Arthur Dent gulping like half-spent fish.

Adams knew how to arrange his words so that they draw a perfect, vivid picture in the reader's mind.

The problem is that everyone gets his very own picture.

What do you imagine when reading of a storm of unreason? Can you show a nothingth of an inch wide on your fingers? Can you sketch some Vogon poetry to show how really bad it is?

Written by Quassnoi

August 16th, 2009 at 11:00 pm

Posted in Miscellaneous

Oracle: updating rows with partitioned incremental value

I have a table with these example data:

id idx data

1 1 A

1 3 B

2 1 C

2 1 D

2 5 E

I want:

id idx data

1 1 A

1 2 B

2 1 C

2 3 D

2 4 E

Is there a way to renumber the index per id using a single SQL UPDATE?

Note that the order of items should be preserved (i. e. the item E should still be after the items C and D but the order of C and D doesn't really matter).

The goal is to be able to create a primary key over id and index.

If not for the id, the query would be very simple:

UPDATE  t_update
SET       index = rownum

However, the query above does not take id into account.

To update the table using partitioned row numbers, we need to employ the window function ROW_NUMBER().

However, it's usable neither in the UPDATE query itself, nor in an inline view, and attempt to use it produces an error:

ORA-30483: window  functions are not allowed here

To work around the, we should rewrite the UPDATE query as MERGE.

MERGE requires two rowsets (source and target) and they need to be joined on something.

We will use the table itself as a source They only field we can join on here is ROWID pseudocolumn.

Here's the query to do this:

MERGE
INTO    t_update u
USING   (
SELECT  rowid AS rid,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY idx) AS rn
FROM    t_update
)
ON      (u.rowid = rid)
WHEN MATCHED THEN
UPDATE
SET     idx = rn

, and here's the result:

SELECT  *
FROM    t_update

ID	IDX	DATA
1	1	A
1	2	B
2	1	C
2	2	D
2	3	E

Works fine.

Written by Quassnoi

August 12th, 2009 at 11:00 pm

Posted in Oracle

Efficient date range query: SQL Server

with one comment

I have a table A that has a startDate and an endDate as 2 DATETIME columns besides some more other columns.

I have another table B that has one DATETIME column (call it dates column).

This is in SQL Server 2005.

How to best set up the indexes etc to get the following:

SELECT *
FROM a, b
WHERE a.startDate >= b.dates
AND a.endDate < b.dates [/sourcecode] Both tables have several thousand records.

Queries of this kind involve searching a table for a range that contains a given constant and plain B-Tree indexes are not good for this query.

A B-Tree index (or any other index that relies on sorting) is efficient for searching a column in a range between a minimal and maximal value (an unknown value between two constants).

However, this query is just the opposite: we search for a constaint between two unknown values.

If we create the indexes on startDate and endDate and issue a query with a condition like this:

startDate <= value AND endDate > value

, this will just give two range condition on two distinct indexes:

-INF <= startDate <= value
value < endDate <= +INF

Each of these ranges is too large to be efficient for searching, though their intersection can be quite small.

There is a special class of indexes: R-Tree indexes, which handle the conditions as described above just perfectly. They are mostly used to query 2-dimensional GIS data, that's why in MySQL they are called SPATIAL indexes.

In both these systems they can only be defined over geometry data which implies 2-dimensional objects (Points, LineStrings and such).

There is a way to use these indexes to speed up the query described above (see the article behind the link on how to do it in MySQL):

Overlapping ranges: MySQL

However, SQL Server 2005 does not support SPATIAL indexes.

In theory, an R-Tree index could be built over 1-dimensional data just as well, but no database of the major four supports it (as for now).

However, quite a close structure can be implemented using plain B-Tree indexes, and in this article I'll show how to do it.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

August 11th, 2009 at 11:00 pm

Posted in SQL Server

Fallback language names: MySQL

This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:

A quick reminder of the problem taken from Stack Overflow:

I have table item and another table language which contains names for the items in different languages:

item language data

How do I select a French name for an item if it exists, or a fallback English name if there is no French one?

We basically have three options here:

Use COALESCE on two SELECT list subqueries
Use COALESCE on the results of two LEFT JOINS
Use the combination of methods above: a LEFT JOIN for French names and a subquery for English ones

Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).

If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.

If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.

In Oracle, SQL Server and PostgreSQL, the second method (two LEFT JOINs) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN for the local language and a subquery for the fallback one.

To gain efficiency, all these systems used some kind of a join method that performs better on the large portions of the rowset, i. e. HASH JOIN or MERGE JOIN.

MySQL, however, is only capable of doing nested loops, so MySQL's performance should differ from one of the engines tested earlier.

Let's create sample tables and see: