Archive for 2009
Descending indexes
From Stack Overflow:
When you create an index on a column or number of columns in MS SQL Server (I'm using version 2005), you can specify that the index on each column be either ascending or descending.
I'm having a hard time understanding why this choice is even here. Using binary sort techniques, wouldn't a lookup be just as fast either way?
What difference does it make which order I choose?
This matters when the composite indexes are used.
Let's create a sample table:
Read the rest of this entry »
Keeping latest rows for a group
From Stack Overflow:
Assume I have a table
foo
where I have something like this:
id, user_id, timestamp, some_value
What I want to do is remove all rows that aren't the newest
N
per user.The deletion itself could be handled by a query like this:
DELETE FROM foo WHERE id NOT IN (...)so you could rephrase the problem into this: how do I get the newest
N
(there might be less) rows for each user.This means if I have
U
users I may end up withN×U
rows, soLIMIT
won't really work.
Unfortunately, NOT IN
will not work here. MySQL doesn't allow using the target table in an IN
or NOT IN
clause in a DELETE
or UPDATE
statement.
But some time ago I posted two solutions:
- Keeping rows — how to
DELETE
all rows exceptN
- Advanced row sampling — how to select
TOP N
rows for eachGROUP
Now it's time to use these solutions together.
What we need to do here is:
- Select
DISTINCT user_id
's from the table - In a subquery,
SELECT
n'thtimestamp
and n'thid
for eachuser_id
. We need to select both these fields to handle the possible duplicates oftimestamp
's. Unfortunately, MySQL doesn't allow multiple columns in a subquery used in aSELECT
clause, that's why we need two subselects (ugly, I know) - Join the resultset with original table, for each
user_id
selecting all rows with(timestamp, id)
lesser than the(n'th timestamp, n'th id)
selected above DELETE
from the joined table. UnlikeIN
andNOT IN
conditions, using the target table in aJOIN
is OK for MySQL
Now, let's create the sample table and see how it works:
This table has 10,000 records for 5 users.
Now, let's try to delete all records except the latest 3 for each user_id
:
DELETE l.* FROM foo l JOIN ( SELECT user_id, COALESCE( ( SELECT timestamp FROM foo li WHERE li.user_id = dlo.user_id ORDER BY li.user_id DESC, li.timestamp DESC LIMIT 2, 1 ), CAST('0001-01-01' AS DATETIME)) AS mts, COALESCE( ( SELECT id FROM foo li WHERE li.user_id = dlo.user_id ORDER BY li.user_id DESC, li.timestamp DESC, li.id DESC LIMIT 2, 1 ), -1) AS mid FROM ( SELECT DISTINCT user_id FROM foo dl ) dlo ) lo ON l.user_id = lo.user_id AND (l.timestamp, l.id) < (mts, mid)
We assume here that there are no records dated earlier than 1 AD, Jan 1 and no negative id
's
Let's see what's left after the query finishes:
SELECT * FROM foo ORDER BY user_id, timestamp, id
id | user_id | timestamp | some_value | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5950 | 1 | 2009-04-25 23:56:41 | Value 5950 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
935 | 1 | 2009-04-25 23:58:52 | Value 935 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
95 | 1 | 2009-04-25 23:59:07 | Value 95 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8461 | 2 | 2009-04-25 23:57:18 | Value 8461 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4631 | 2 | 2009-04-25 23:58:14 | Value 4631 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6626 | 2 | 2009-04-25 23:59:10 | Value 6626 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7487 | 3 | 2009-04-25 23:55:41 | Value 7487 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
147 | 3 | 2009-04-25 23:56:08 | Value 147 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
27 | 3 | 2009-04-25 23:59:49 | Value 27 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2698 | 4 | 2009-04-25 23:59:17 | Value 2698 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1868 | 4 | 2009-04-25 23:59:40 | Value 1868 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7938 | 4 | 2009-04-25 23:59:56 | Value 7938 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8929 | 5 | 2009-04-25 23:57:59 | Value 8929 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1934 | 5 | 2009-04-25 23:58:19 | Value 1934 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6179 | 5 | 2009-04-25 23:58:22 | Value 6179 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
15 rows fetched in 0.0005s (0.0020s) |
Just like requested: 15 records left, 3 latest records for each user_id
.
Selecting timestamps for a time zone
Comments enabled. I *really* need your comment
From Stack Overflow:
How can I reuse a computed column in an SQL query in MySQL?
My query is something like this:
SELECT CONVERT_TZ(
IF(timestamp_start > last_update, timestamp_start, last_update),
'GMT',
user.timezone
) AS time_usr_tz
FROM schedule
JOIN user
ON schedule.user_id = user.user_id
WHERE CONVERT_TZ(
IF(timestamp_start > last_update, timestamp_start, last_update),
'GMT',
user.timezone
) < CURRENT_TIMESTAMP() [/sourcecode] As you can see, theCONVERT_TZ (…)
part is repeated.This is only a sample query. Actually, I have to use that computed column several times. So if I make change in one place, I have to change in many places. And the size of the query becomes scary, too.
Is there any way to avoid such duplication?
The duplication part is simple. We just need to wrap the query in a subquery or use HAVING
clause, like this:
SELECT * FROM ( SELECT CONVERT_TZ( IF(timestamp_start > last_update, timestamp_start, last_update), 'GMT', user.timezone ) AS time_usr_tz FROM schedule JOIN user ON schedule.user_id = user.user_id ) q WHERE time_usr_tz < CURRENT_TIMESTAMP()
, or this:
SELECT CONVERT_TZ( IF(timestamp_start > last_update, timestamp_start, last_update), 'GMT', user.timezone ) AS time_usr_tz FROM schedule JOIN user ON schedule.user_id = user.user_id HAVING time_usr_tz < CURRENT_TIMESTAMP()
But what I'd really like to write about today is the performance of such a query.
This query looks for the list of schedules and finds the schedules the were to be started by the time the query is issued.
The problem is that the schedule timestamps are not bound to any timezone, and should be converted to the user's timezone to find out if they're are due or not.
That can be seen as a design flaw, but in fact that can be exactly what a user needs. Say, a user wants to be woken up every day at 9:00 and reminded to take a pill at 13:00, but if she goes from Washington, D.C. to Honolulu, she may want all her alarm clocks to change from EST to HST. Instead of updating all the timestamps it's of course easier to update a single field in her profile.
The query uses columns from both tables inside a function in the WHERE
condition, that's why no indexes can be used to filter the timestamps.
If schedule
is large enough, this can be a problem for a busy service.
Now, let's create the sample tables and see what can be done. For the sake of simplicity, we'll leave only timestamp_start
column in the schedules
, which is enough to illustrate the point.
Read the rest of this entry »
Selecting lowest value
Comments enabled. I *really* need your comment
From Stack Overflow:
Say I have the following data:
Name Value Small 10 Medium 100 Large 1000 Imagine that these represent the volumes of boxes.
I have some items that I want to put in the boxes, and I want the smallest box possible.
I need an
SQL
query that will:
- Return the row with the smallest row greater than my query parameter
- If there is no such row, then return the largest row
I'm on Oracle 11g, so any special Oracle goodness is OK.
There certainly is a special Oracle goodness, and I'll cover it in this article.
Let's create the sample table and fill it with 1,000,000 rows to illustrate the point:
Read the rest of this entry »
Making an index UNIQUE
From Stack Overflow:
In SQL Server, if I know an index will have unique values, how will it affect performance on
INSERT
's orSELECT
's if I declare it as such?If the optimiser knows the index is
UNIQUE
, how will that affect the query plan?I understand that specifying uniqueness can serve to preserve integrity, but leaving that discussion aside for the moment, what are the perfomance consequences?
Let's see how does a UNIQUE
index perform compared to a non-UNIQUE
one.
Indexes mentioned hereafter are considered to be built on a same intrinsically unique dataset. I will leave consistency issues aside, focusing only on performance details.
A UNIQUE
index, contrary to popular opinion, does not have any performance drawbacks on DML
compared to a plain, non-UNIQUE
index.
When the engine tries to insert a new row into the indexed table, being the index UNIQUE
or not, it performs the following steps:
Read the rest of this entry »
Counting missing rows: PostgreSQL
Comments enabled. I *really* need your comment
This is the 5th of 5 articles covering implementation of NOT IN
predicate in several RDBMS'es:
NOT IN
in MySQLNOT IN
in MySQL (usingDISTINCT
)NOT IN
in SQL ServerNOT IN
in OracleNOT IN
in PostgreSQL
Finally let's look how PostgreSQL
copes with this predicate.
Let's create sample tables:
Read the rest of this entry »
Counting missing rows: Oracle
Comments enabled. I *really* need your comment
This is the 4th of 5 articles covering implementation of NOT IN
predicate in several RDBMS'es:
NOT IN
in MySQLNOT IN
in MySQL (usingDISTINCT
)NOT IN
in SQL ServerNOT IN
in OracleNOT IN
in PostgreSQL
Let's see how Oracle processes this predicate.
Let's create the sample tables:
Read the rest of this entry »
Counting missing rows: SQL Server
Comments enabled. I *really* need your comment
This is the 3rd of 5 articles covering implementation of NOT IN
predicate in several RDBMS'es:
NOT IN
in MySQLNOT IN
in MySQL (usingDISTINCT
)NOT IN
in SQL ServerNOT IN
in OracleNOT IN
in PostgreSQL
Now, let's see how the same predicate is implemented in SQL Server.
We'll create sample tables just like those we used yesterday:
Read the rest of this entry »
Counting missing rows: using DISTINCT
Comments enabled. I *really* need your comment
This is the 2nd of 5 articles covering implementation of NOT IN
predicate in several RDBMS'es:
NOT IN
in MySQLNOT IN
in MySQL (usingDISTINCT
)NOT IN
in SQL ServerNOT IN
in OracleNOT IN
in PostgreSQL
Today, I will tell about some useful tricks that help to optimize these subqueries even more.
Let's create some sample tables:
Read the rest of this entry »
Counting missing rows
Comments enabled. I *really* need your comment
From Stack Overflow:
In my first programming job I was told that a query within the
IN()
predicate gets executed for every row contained in the parent query, and therefore usingIN
should be avoided.For example, given the query:
SELECT COUNT(*) FROM Table1 WHERE Table1Id NOT IN ( SELECT Table1Id FROM Table2 WHERE id_user = 1 )
Rows Inner subquery executions 10 10 100 100 1000 1000 10000 10000 Is this correct? How does the
IN
predicate actually work?
This is the 1st of 5 articles covering implementation of NOT IN
predicate in several RDBMS'es:
NOT IN
in MySQLNOT IN
in MySQL (usingDISTINCT
)NOT IN
in SQL ServerNOT IN
in OracleNOT IN
in PostgreSQL
In this article I will describe how MySQL implements this predicate.
Let's create the sample tables:
Read the rest of this entry »