Archive for 2009
MySQL: difference between sets
From Stack Overflow:
I have a table that holds data about items that existed at a certain time (regular snapshots taken).
Simple example:
Timestamp ID 1 A 1 B 2 A 2 B 2 C 3 A 3 D 4 D 4 E In this case, item
C
gets created sometime between snapshot 1 and 2, sometime between snapshot 2 and 3B
andC
disappear andD
gets created, etc.The table is reasonably large (millions of records) and for each timestamp there are about 50 records.
What's the most efficient way of selecting the item
id
s for items that disappear between two consecutive timestamps?So for the above example I would like to get the following:
Previous snapshot Current snapshot Removed 1 2 NULL 2 3 B, C 3 4 A If it doesn't make the query inefficient, can it be extended to automatically use the latest (i. e.
MAX
) timestamp and the previous one?
We basically need to do the following things here:
- Split the table into sets grouped by timestamp
- Compare each set with the one of previous timestamp
- Find the values missing in the current set and concatenate them
This is possible to do using only the standard ANSI SQL operators, however, this will be inefficient in MySQL
.
Let's create a sample table and see how to work around this:
Passing parameters in MySQL: IN list vs. temporary table
When you need to pass a list of parameters into a MySQL, there are at least two method to do this:
- Issue a dynamical query which would fill an
IN
list with constant values - Fill a temporary table with these values then use it in a
JOIN
The second method is considered better for many reasons (you can easily reuse the values, it's more injection-proof etc.)
Leaving all these issues aside, let's compare these methods performance-wise.
We will create a table of 50,000 records, pass a list of 500 parameters (using both methods) and see which is faster:
Hierarchical queries in MySQL: VARCHAR keys
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Basit asks:
I'm trying to implement a hierarchical query in MySQL using solutions you posted in your article.
How can I make the tree menu if the
id
is uniqueVARCHAR
, like the ones used on youtube?
We will need to fix the query and the function just a little:
- The function should return a
VARCHAR
- The
@start_with
and other variables should be initialized withVARCHAR
s - The initial condition for
_id
should be a minimalVARCHAR
possible, i. e. an empty string
Let's create a sample table and check how it works:
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_key ( id INT NOT NULL PRIMARY KEY, string_id VARCHAR(32) NOT NULL ) ENGINE=Memory; CREATE TABLE t_hierarchy ( id VARCHAR(32) NOT NULL PRIMARY KEY, parent VARCHAR(32) NOT NULL, data VARCHAR(50) NOT NULL, KEY ix_hierarchy_parent (parent) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(20); COMMIT; INSERT INTO t_key SELECT id, ( SELECT GROUP_CONCAT(CHAR(65 + FLOOR(RAND(20090817) * 25)) SEPARATOR '') FROM ( SELECT NULL FROM filler LIMIT 8 ) q ) FROM filler; INSERT INTO t_hierarchy SELECT ki.string_id, COALESCE(kp.string_id, '_'), CONCAT('Row ', f.id) FROM filler f JOIN t_key ki ON ki.id = f.id LEFT JOIN t_key kp ON kp.id = f.id DIV 3;
Here's the function:
CREATE FUNCTION hierarchy_connect_by_parent_eq_prior_id(value VARCHAR(32)) RETURNS VARCHAR(32) NOT DETERMINISTIC READS SQL DATA BEGIN DECLARE _id VARCHAR(32); DECLARE _parent VARCHAR(32); DECLARE CONTINUE HANDLER FOR NOT FOUND SET @id = NULL; SET _parent = @id; SET _id = ''; IF @id IS NULL THEN RETURN NULL; END IF; LOOP SELECT MIN(id) INTO @id FROM t_hierarchy WHERE parent = _parent AND id > _id; IF @id IS NOT NULL OR _parent = @start_with THEN SET @level = @level + 1; RETURN @id; END IF; SET @level := @level - 1; SELECT id, parent INTO _id, _parent FROM t_hierarchy WHERE id = _parent; END LOOP; END $$
And here's the query:
SELECT CONCAT(REPEAT(' ', level - 1), CAST(hi.id AS CHAR)) AS treeitem, parent, level FROM ( SELECT hierarchy_connect_by_parent_eq_prior_id(id) AS id, @level AS level FROM ( SELECT @start_with := '_', @id := @start_with, @level := 0 ) vars, t_hierarchy WHERE @id IS NOT NULL ) ho JOIN t_hierarchy hi ON hi.id = ho.id
treeitem | parent | level | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FDBVDDGT | _ | 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ENCYPEWC | FDBVDDGT | 2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AJXOXCQA | ENCYPEWC | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CLYLKREW | ENCYPEWC | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
YFCUVVRX | ENCYPEWC | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
JVDDGUJJ | FDBVDDGT | 2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ELQWNYGL | JVDDGUJJ | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
KTRBGCQA | JVDDGUJJ | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SCGYDUQU | JVDDGUJJ | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
VEHXSYRL | FDBVDDGT | 2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BIKCFVOH | VEHXSYRL | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HAGFGSUW | VEHXSYRL | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TCBYTARM | VEHXSYRL | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GVPJESCI | _ | 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HCNHAEWW | GVPJESCI | 2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
KNIDSGBM | GVPJESCI | 2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SJPBIMNF | GVPJESCI | 2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
JETHGIVJ | SJPBIMNF | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
OBPXSXJF | SJPBIMNF | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
VPOXBLBW | SJPBIMNF | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
20 rows fetched in 0.0006s (0.0180s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 20 | 100.00 | |||||
1 | PRIMARY | hi | eq_ref | PRIMARY | PRIMARY | 98 | ho.id | 1 | 100.00 | |
2 | DERIVED | <derived3> | system | 1 | 100.00 | |||||
2 | DERIVED | t_hierarchy | index | PRIMARY | 98 | 20 | 100.00 | Using where; Using index | ||
3 | DERIVED | No tables used |
Works just fine.
Presentation layer
Comments enabled. I *really* need your comment
A question frequently asked by many web application developers:
Where do I build a presentation layer for my data?
Do I do it on SQL side or on PHP / ASP /whatever side?
This is a very interesting question.
A famous Chinese proverb (which like the majority of famous Chinese proverbs is invented by a Western journalist) says A picture is worth a thousand words
.

A picture is worth a thousand words.
Image by zedzap
How do you think, why it is? What's so good in a picture?
Seems to be a strange question. A picture just tells more. If you look at a picture, you get an overall impression, which can be hardly expressed using words. Words just don't pull the right strings in your mind, those that a picture does. And the words are harder to read, they are more numerous. And a dozen of other reasons.
However, the main reason seems to be this: you cannot give a complete description of a picture using words
.
But can it be so that a word is worth a thousand pictures?
Sure it can. And any book fan who had seen the screen version of his favorite work can confirm it.
From Douglas Adams's masterpiece The Hitchiker's Guide to the Galaxy
:
A hole had just appeared in the Galaxy. It was exactly a nothingth of a second long, a nothingth of an inch wide, and quite a lot of million light years from end to end.
Or this one:
Five wild Event Maelstroms swirled in vicious storms of unreason and spewed up a pavement. On the pavement lay Ford Prefect and Arthur Dent gulping like half-spent fish.
Adams knew how to arrange his words so that they draw a perfect, vivid picture in the reader's mind.
The problem is that everyone gets his very own picture.
What do you imagine when reading of a storm of unreason
? Can you show a nothingth of an inch wide
on your fingers? Can you sketch some Vogon poetry to show how really bad it is?
Oracle: updating rows with partitioned incremental value
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a table with these example data:
id idx data 1 1 A 1 3 B 2 1 C 2 1 D 2 5 E I want:
id idx data 1 1 A 1 2 B 2 1 C 2 3 D 2 4 E Is there a way to renumber the
index
perid
using a single SQLUPDATE
?Note that the order of items should be preserved (i. e. the item E should still be after the items C and D but the order of C and D doesn't really matter).
The goal is to be able to create a primary key over
id
andindex
.
If not for the id
, the query would be very simple:
UPDATE t_update SET index = rownum
However, the query above does not take id
into account.
To update the table using partitioned row numbers, we need to employ the window function ROW_NUMBER()
.
However, it's usable neither in the UPDATE
query itself, nor in an inline view, and attempt to use it produces an error:
ORA-30483: window functions are not allowed here
To work around the, we should rewrite the UPDATE
query as MERGE
.
MERGE
requires two rowsets (source and target) and they need to be joined on something.
We will use the table itself as a source They only field we can join on here is ROWID
pseudocolumn.
Here's the query to do this:
MERGE INTO t_update u USING ( SELECT rowid AS rid, ROW_NUMBER() OVER (PARTITION BY id ORDER BY idx) AS rn FROM t_update ) ON (u.rowid = rid) WHEN MATCHED THEN UPDATE SET idx = rn
, and here's the result:
SELECT * FROM t_update
ID | IDX | DATA |
---|---|---|
1 | 1 | A |
1 | 2 | B |
2 | 1 | C |
2 | 2 | D |
2 | 3 | E |
Works fine.
Efficient date range query: SQL Server
From Stack Overflow:
I have a table
A
that has astartDate
and anendDate
as 2DATETIME
columns besides some more other columns.I have another table
B
that has oneDATETIME
column (call itdates
column).This is in SQL Server 2005.
How to best set up the indexes etc to get the following:
SELECT *
FROM a, b
WHERE a.startDate >= b.dates
AND a.endDate < b.dates [/sourcecode] Both tables have several thousand records.
Queries of this kind involve searching a table for a range that contains a given constant and plain B-Tree indexes are not good for this query.
A B-Tree index (or any other index that relies on sorting) is efficient for searching a column in a range between a minimal and maximal value (an unknown value between two constants).
However, this query is just the opposite: we search for a constaint between two unknown values.
If we create the indexes on startDate
and endDate
and issue a query with a condition like this:
startDate <= value AND endDate > value
, this will just give two range condition on two distinct indexes:
-INF <= startDate <= value
value < endDate <= +INF
Each of these ranges is too large to be efficient for searching, though their intersection can be quite small.
There is a special class of indexes: R-Tree indexes, which handle the conditions as described above just perfectly. They are mostly used to query 2-dimensional GIS data, that's why in MySQL they are called SPATIAL
indexes.
In both these systems they can only be defined over geometry data which implies 2-dimensional objects (Point
s, LineString
s and such).
There is a way to use these indexes to speed up the query described above (see the article behind the link on how to do it in MySQL):
However, SQL Server 2005 does not support SPATIAL indexes.
In theory, an R-Tree index could be built over 1-dimensional data just as well, but no database of the major four supports it (as for now).
However, quite a close structure can be implemented using plain B-Tree indexes, and in this article I'll show how to do it.
Let's create a sample table:
Read the rest of this entry »
Fallback language names: MySQL
Comments enabled. I *really* need your comment
This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:
- Fallback language names: Oracle
- Fallback language names: SQL Server
- Fallback language names: PostgreSQL
- Fallback language names: MySQL
A quick reminder of the problem taken from Stack Overflow:
I have table
item
and another tablelanguage
which contains names for theitems
in different languages:
item language data How do I select a French name for an item if it exists, or a fallback English name if there is no French one?
We basically have three options here:
- Use
COALESCE
on twoSELECT
list subqueries - Use
COALESCE
on the results of twoLEFT JOINS
- Use the combination of methods above: a
LEFT JOIN
for French names and a subquery for English ones
Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).
If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.
If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.
In Oracle, SQL Server and PostgreSQL, the second method (two LEFT JOIN
s) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN
for the local language and a subquery for the fallback one.
To gain efficiency, all these systems used some kind of a join method that performs better on the large portions of the rowset, i. e. HASH JOIN
or MERGE JOIN
.
MySQL, however, is only capable of doing nested loops, so MySQL's performance should differ from one of the engines tested earlier.
Let's create sample tables and see:
Fallback language names: PostgreSQL
Comments enabled. I *really* need your comment
This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:
- Fallback language names: Oracle
- Fallback language names: SQL Server
- Fallback language names: PostgreSQL
- Fallback language names: MySQL
A quick reminder of the problem taken from Stack Overflow:
I have table
item
and another tablelanguage
which contains names for theitems
in different languages:
item language data How do I select a French name for an item if it exists, or a fallback English name if there is no French one?
We basically have three options here:
- Use
COALESCE
on twoSELECT
list subqueries - Use
COALESCE
on the results of twoLEFT JOINS
- Use the combination of methods above: a
LEFT JOIN
for French names and a subquery for English ones
Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).
If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.
If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.
In previous articles I shown that in both Oracle and SQL Server, the second method (two LEFT JOIN
s) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN
for the local language and a subquery for the fallback one.
Now, let's create sample tables and see how these queries behave in PostgreSQL:
Read the rest of this entry »
Fallback language names: SQL Server
Comments enabled. I *really* need your comment
This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:
- Fallback language names: Oracle
- Fallback language names: SQL Server
- Fallback language names: PostgreSQL
- Fallback language names: MySQL
From Stack Overflow:
I have table
item
and another tablelanguage
which contains names for theitems
in different languages:
item language data How do I select a French name for an item if it exists, or a fallback English name if there is no French one?
We basically have three options here:
- Use
COALESCE
on twoSELECT
list subqueries - Use
COALESCE
on the results of twoLEFT JOINS
- Use the combination of methods above: a
LEFT JOIN
for French names and a subquery for English ones
Yesterday, we saw that the second option (two LEFT JOIN
s) is best in Oracle if the probability of the fallback is high (i. e. the localization is poor and but few terms are translated into the local language). I took Latin language as an example for this.
Third query (a LEFT JOIN
with the local language rowset and a subquery in a COALESCE
to retrive the fallback language names) is best if there is a low probability of a fallback, i. e. almost all terms are translated. In this case, I took French as an example, since it is widely used and localizations are likely to cover most terms.
Today we will see how same queries work in SQL Server.
Let's create sample tables:
Fallback language names: Oracle
Comments enabled. I *really* need your comment
This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:
- Fallback language names: Oracle
- Fallback language names: SQL Server
- Fallback language names: PostgreSQL
- Fallback language names: MySQL
From Stack Overflow:
I have table
item
and another tablelanguage
which contains names for theitems
in different languages:
item language data How do I select a French name for an item if it exists, or a fallback English name if there is no French one?
This is quite a common database design (most often used for localization), so I'd like to cover this question for different databases.
This design is not limited to languages and localization: it may be used for any data with default value that can be overriden. Prices, discounts, options — many kinds of things.
Today we will see how do it in Oracle.
We basically have three options here:
- Use
COALESCE
on twoSELECT
list subqueries - Use
COALESCE
on the results of twoLEFT JOINS
- Use the combination of methods above: a
LEFT JOIN
for French names and a subquery for English ones
Let's create sample tables and see what is more efficient.
Read the rest of this entry »