Archive for the ‘MySQL’ Category
UNION vs UNION ALL
In my previous article, I described efficient ORDER BY along with LIMIT for UNION'ed resultsets.
One thing was left unanswered: why is UNION DISTINCT in some cases faster than UNION ALL?
UNION DISTINCT (for which a mere UNION is an alias) implies selecting DISTINCT values from the resultsets, while UNION ALL just selects all values, including duplicates.
On one hand, UNION ALL does not perform duplicate checking, which should be faster (and almost always is). On the other hand, it returns more rows.
If some complex calculations are performed on these rows, this may kill all performance benefits.
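To make the difference in row counts concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained), and the tables t_a and t_b are hypothetical:

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption);
# the tables t_a and t_b are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_a (value INTEGER);
    CREATE TABLE t_b (value INTEGER);
    INSERT INTO t_a VALUES (1), (2), (2), (3);
    INSERT INTO t_b VALUES (2), (3), (4);
""")

# UNION (UNION DISTINCT) removes duplicates from the combined resultset.
union_distinct = conn.execute(
    "SELECT value FROM t_a UNION SELECT value FROM t_b ORDER BY value"
).fetchall()

# UNION ALL keeps every row from both inputs, duplicates included.
union_all = conn.execute(
    "SELECT value FROM t_a UNION ALL SELECT value FROM t_b ORDER BY value"
).fetchall()

print(len(union_distinct), len(union_all))
```

With this data, UNION returns fewer rows than UNION ALL, so anything computed per returned row is computed fewer times.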
Let's create the sample tables:
Read the rest of this entry »
Ordering UNION’s
From Stack Overflow:
When I optimize my 2 single queries to run in less than 0.02 seconds and then UNION them the resulting query takes over 1 second to run.
Also, a UNION ALL takes longer than a UNION DISTINCT. I would assume allowing duplicates would make the query run faster and not slower.
Am I really just better off running the 2 queries separately? I would prefer to use the UNION.
These kinds of things happen when the queries employ ORDER BY along with LIMIT.
Let's create the sample tables and see what is happening:
Read the rest of this entry »
Creating indexes
Answering the questions asked on the site.
Dima asks:
I have a single rather large table with 18 columns.
Data in a table are continuous time stamped records.
On one hand, the application is appending new records to this table; on the other hand, users can do various queries.
My question is: whether I should use composite indexes or few single column indexes, or a mix of the two?
There could be all sorts of combinations in WHERE clause, I don't want to limit users in their choices.
Is there any common sense strategy to decide on indexes in my case?
Unfortunately you forgot to mention what RDBMS you are using, so I'll assume MySQL. The principles are all the same.
Generally speaking, an index on (column1, column2, column3) can be used for any kind of predicate that can be represented in the following form:
ROW(@value1_start, @value2_start, @value3_start) < ROW(column1, column2, column3) < ROW(@value1_end, @value2_end, @value3_end)
In other words, if all the records are sorted by column1, column2, column3, the condition should select a contiguous block of such a sorted set of records.
All @value's here can match and can be +Infinity or -Infinity.
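The "contiguous block" idea can be sketched outside the database. The snippet below is only a toy model, not how MySQL stores an index: it keeps every (column1, column2, column3) key in a sorted Python list (illustrative data) and checks which predicates select one unbroken run of positions:

```python
from bisect import bisect_left, bisect_right
from itertools import product

# A toy "index": every (column1, column2, column3) combination, kept sorted,
# the way a composite index keeps its keys (illustrative data only).
index = sorted(product(range(3), repeat=3))

def positions(predicate):
    """Return positions in the sorted index whose key satisfies the predicate."""
    return [i for i, key in enumerate(index) if predicate(key)]

def is_contiguous(pos):
    """True if the positions form one unbroken block (or the list is empty)."""
    if not pos:
        return True
    return pos == list(range(pos[0], pos[-1] + 1))

# column1 = 1 AND column2 = 2 fixes a prefix of the key: one contiguous block.
prefix_match = positions(lambda k: k[0] == 1 and k[1] == 2)

# column2 = 2 alone skips the leading column: the matches are scattered.
skip_leading = positions(lambda k: k[1] == 2)

# A contiguous block can be found with two binary searches instead of a scan.
# The unbounded column3 plays the role of -Infinity / +Infinity.
lo = bisect_left(index, (1, 2, float("-inf")))
hi = bisect_right(index, (1, 2, float("inf")))

print(is_contiguous(prefix_match), is_contiguous(skip_leading), index[lo:hi])
```

The first predicate maps to a single range of the sorted keys, so two lookups are enough to find it; the second one does not, which is why an index led by column1 cannot serve it with a single range lookup.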
Let's look at some real-world predicates and decide whether they are sargable (i.e. an index search can be applied against them). We assume that all columns are INTEGER:
Read the rest of this entry »
Selecting first letters
From Stack Overflow:
I would like to produce a character list of all of the first letters of column in my database.
Is there a way to do this in MySQL?
Let's create a sample table of 1,000,000 records and fill it with random data:
Read the rest of this entry »
Assigning people
Answering questions asked on the site.
Greg asks:
I have two tables, tasks and resources.
Tasks contains cost of each task: first requires 5 people to complete, second requires 3 people etc.
Resources is just a list of people.
How do I assign people to tasks?
This is in MySQL.
Thanks for a nice question, Greg.
Unfortunately, you haven't sent your table structure, so I will have to make it up.
Let's create the tables:
Read the rest of this entry »
Keeping latest rows for a group
From Stack Overflow:
Assume I have a table foo where I have something like this:
id, user_id, timestamp, some_value
What I want to do is remove all rows that aren't the newest N per user.
The deletion itself could be handled by a query like this:
DELETE FROM foo WHERE id NOT IN (...)
so you could rephrase the problem into this: how do I get the newest N (there might be less) rows for each user.
This means if I have U users I may end up with N×U rows, so LIMIT won't really work.
Unfortunately, NOT IN will not work here. MySQL doesn't allow using the target table in a subquery of an IN or NOT IN clause in a DELETE or UPDATE statement.
But some time ago I posted two solutions:
- Keeping rows: how to DELETE all rows except N
- Advanced row sampling: how to select TOP N rows for each GROUP
Now it's time to use these solutions together.
What we need to do here is:
- Select DISTINCT user_id's from the table
- In a subquery, SELECT n'th timestamp and n'th id for each user_id. We need to select both these fields to handle the possible duplicates of timestamp's. Unfortunately, MySQL doesn't allow multiple columns in a subquery used in a SELECT clause, that's why we need two subselects (ugly, I know)
- Join the resultset with original table, for each user_id selecting all rows with (timestamp, id) lesser than the (n'th timestamp, n'th id) selected above
- DELETE from the joined table. Unlike IN and NOT IN conditions, using the target table in a JOIN is OK for MySQL
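The steps above can be sketched in a small, self-contained form. This is not the MySQL query itself: it assumes Python's built-in sqlite3 module as a stand-in, drives the per-user loop from Python instead of a JOIN, and fills foo with hypothetical rows (including duplicate timestamps) so that the (timestamp, id) tie-breaking is exercised:

```python
import sqlite3

N = 3  # number of newest rows to keep per user

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "timestamp TEXT, some_value TEXT)"
)

# Hypothetical data: users 1 and 2 get 5 rows each, user 3 only 2 rows.
# k // 2 makes pairs of rows share a timestamp on purpose.
rows = []
next_id = 1
for user_id, count in ((1, 5), (2, 5), (3, 2)):
    for k in range(count):
        rows.append(
            (next_id, user_id, f"2009-04-25 23:{k // 2:02d}:00", f"Value {next_id}")
        )
        next_id += 1
conn.executemany("INSERT INTO foo VALUES (?, ?, ?, ?)", rows)

# Step 1: the distinct users.
user_ids = [r[0] for r in conn.execute("SELECT DISTINCT user_id FROM foo")]

for user_id in user_ids:
    # Step 2: the N'th newest (timestamp, id) for this user, if there is one.
    cutoff = conn.execute(
        "SELECT timestamp, id FROM foo WHERE user_id = ? "
        "ORDER BY timestamp DESC, id DESC LIMIT 1 OFFSET ?",
        (user_id, N - 1),
    ).fetchone()
    if cutoff is None:
        continue  # fewer than N rows for this user: nothing to delete
    ts, row_id = cutoff
    # Steps 3 and 4: delete rows whose (timestamp, id) sorts before the cutoff.
    conn.execute(
        "DELETE FROM foo WHERE user_id = ? "
        "AND (timestamp < ? OR (timestamp = ? AND id < ?))",
        (user_id, ts, ts, row_id),
    )

remaining = conn.execute(
    "SELECT user_id, COUNT(*) FROM foo GROUP BY user_id ORDER BY user_id"
).fetchall()
kept_for_user_1 = [
    r[0] for r in conn.execute("SELECT id FROM foo WHERE user_id = 1 ORDER BY id")
]
print(remaining, kept_for_user_1)
```

Comparing on the pair (timestamp, id) rather than on timestamp alone is what keeps exactly N rows when several rows share the cutoff timestamp.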
Now, let's create the sample table and see how it works:
This table has 10,000 records for 5 users.
Now, let's try to delete all records except the latest 3 for each user_id:
DELETE l.*
FROM foo l
JOIN (
    SELECT user_id,
        COALESCE(
            (
                SELECT timestamp
                FROM foo li
                WHERE li.user_id = dlo.user_id
                ORDER BY li.user_id DESC, li.timestamp DESC
                LIMIT 2, 1
            ),
            CAST('0001-01-01' AS DATETIME)
        ) AS mts,
        COALESCE(
            (
                SELECT id
                FROM foo li
                WHERE li.user_id = dlo.user_id
                ORDER BY li.user_id DESC, li.timestamp DESC, li.id DESC
                LIMIT 2, 1
            ),
            -1
        ) AS mid
    FROM (
        SELECT DISTINCT user_id
        FROM foo dl
    ) dlo
) lo
ON l.user_id = lo.user_id
    AND (l.timestamp, l.id) < (mts, mid)
We assume here that there are no records dated earlier than January 1, 1 AD, and no negative id's.
Let's see what's left after the query finishes:
SELECT * FROM foo ORDER BY user_id, timestamp, id
id | user_id | timestamp | some_value
---|---|---|---
5950 | 1 | 2009-04-25 23:56:41 | Value 5950
935 | 1 | 2009-04-25 23:58:52 | Value 935
95 | 1 | 2009-04-25 23:59:07 | Value 95
8461 | 2 | 2009-04-25 23:57:18 | Value 8461
4631 | 2 | 2009-04-25 23:58:14 | Value 4631
6626 | 2 | 2009-04-25 23:59:10 | Value 6626
7487 | 3 | 2009-04-25 23:55:41 | Value 7487
147 | 3 | 2009-04-25 23:56:08 | Value 147
27 | 3 | 2009-04-25 23:59:49 | Value 27
2698 | 4 | 2009-04-25 23:59:17 | Value 2698
1868 | 4 | 2009-04-25 23:59:40 | Value 1868
7938 | 4 | 2009-04-25 23:59:56 | Value 7938
8929 | 5 | 2009-04-25 23:57:59 | Value 8929
1934 | 5 | 2009-04-25 23:58:19 | Value 1934
6179 | 5 | 2009-04-25 23:58:22 | Value 6179
15 rows fetched in 0.0005s (0.0020s)
Just like requested: 15 records left, 3 latest records for each user_id.
Selecting timestamps for a time zone
From Stack Overflow:
How can I reuse a computed column in an SQL query in MySQL?
My query is something like this:
SELECT CONVERT_TZ(
IF(timestamp_start > last_update, timestamp_start, last_update),
'GMT',
user.timezone
) AS time_usr_tz
FROM schedule
JOIN user
ON schedule.user_id = user.user_id
WHERE CONVERT_TZ(
IF(timestamp_start > last_update, timestamp_start, last_update),
'GMT',
user.timezone
) < CURRENT_TIMESTAMP()
As you can see, the CONVERT_TZ (…) part is repeated.
This is only a sample query. Actually, I have to use that computed column several times. So if I make change in one place, I have to change in many places. And the size of the query becomes scary, too.
Is there any way to avoid such duplication?
The duplication part is simple. We just need to wrap the query in a subquery or use a HAVING clause, like this:
SELECT *
FROM (
    SELECT CONVERT_TZ(
            IF(timestamp_start > last_update, timestamp_start, last_update),
            'GMT',
            user.timezone
        ) AS time_usr_tz
    FROM schedule
    JOIN user
    ON schedule.user_id = user.user_id
) q
WHERE time_usr_tz < CURRENT_TIMESTAMP()
, or this:
SELECT CONVERT_TZ(
        IF(timestamp_start > last_update, timestamp_start, last_update),
        'GMT',
        user.timezone
    ) AS time_usr_tz
FROM schedule
JOIN user
ON schedule.user_id = user.user_id
HAVING time_usr_tz < CURRENT_TIMESTAMP()
But what I'd really like to write about today is the performance of such a query.
This query looks through the list of schedules and finds the schedules that were to be started by the time the query is issued.
The problem is that the schedule timestamps are not bound to any timezone, and should be converted to the user's timezone to find out if they are due or not.
That can be seen as a design flaw, but in fact that can be exactly what a user needs. Say, a user wants to be woken up every day at 9:00 and reminded to take a pill at 13:00, but if she goes from Washington, D.C. to Honolulu, she may want all her alarm clocks to change from EST to HST. Instead of updating all the timestamps, it's of course easier to update a single field in her profile.
The query uses columns from both tables inside a function in the WHERE condition, that's why no indexes can be used to filter the timestamps.
If schedule is large enough, this can be a problem for a busy service.
Now, let's create the sample tables and see what can be done. For the sake of simplicity, we'll leave only the timestamp_start column in the schedules, which is enough to illustrate the point.
Read the rest of this entry »
Counting missing rows: using DISTINCT
This is the 2nd of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:
- NOT IN in MySQL
- NOT IN in MySQL (using DISTINCT)
- NOT IN in SQL Server
- NOT IN in Oracle
- NOT IN in PostgreSQL
Today, I will describe some useful tricks that help to optimize these subqueries even more.
Let's create some sample tables:
Read the rest of this entry »
Counting missing rows
From Stack Overflow:
In my first programming job I was told that a query within the IN() predicate gets executed for every row contained in the parent query, and therefore using IN should be avoided.
For example, given the query:
SELECT COUNT(*)
FROM Table1
WHERE Table1Id NOT IN (
    SELECT Table1Id
    FROM Table2
    WHERE id_user = 1
)
Rows | Inner subquery executions
---|---
10 | 10
100 | 100
1000 | 1000
10000 | 10000
Is this correct? How does the IN predicate actually work?
This is the 1st of 5 articles covering implementation of NOT IN predicate in several RDBMS'es:
- NOT IN in MySQL
- NOT IN in MySQL (using DISTINCT)
- NOT IN in SQL Server
- NOT IN in Oracle
- NOT IN in PostgreSQL
In this article I will describe how MySQL implements this predicate.
Let's create the sample tables:
Read the rest of this entry »
Matching 3 of 4
From Stack Overflow:
Say I have a query like this:
SELECT *
FROM my_table
WHERE name = "john doe"
    AND phone = "8183321234"
    AND email = "johndoe@yahoo.com"
    AND address = "330 some lane"
But say I only need 3 conditions out of the 4 to match.
I know I can write a very long query with several OR's, but I was wondering if there was a feature for this?
A nice and elegant solution proposed by Tomalak says:
SELECT *
FROM my_table
WHERE CASE WHEN name = "john doe" THEN 1 ELSE 0 END
    + CASE WHEN phone = "8183321234" THEN 1 ELSE 0 END
    + CASE WHEN email = "johndoe@yahoo.com" THEN 1 ELSE 0 END
    + CASE WHEN address = "330 some lane" THEN 1 ELSE 0 END >= 3
, but, as noted by the author, it does not use the indexes.
Since the values seem to be quite selective, it will be better to use indexes on them.
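As a rough illustration of what an index-friendly formulation could look like (one possible rewrite, not necessarily the one the rest of this entry arrives at), each condition can be run as its own indexable lookup and the matches counted per row. The sketch assumes Python's built-in sqlite3 module as a stand-in for MySQL, a hypothetical id primary key, and made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (
        id INTEGER PRIMARY KEY,  -- hypothetical key, not in the original question
        name TEXT, phone TEXT, email TEXT, address TEXT
    );
    CREATE INDEX ix_name ON my_table (name);
    CREATE INDEX ix_phone ON my_table (phone);
    CREATE INDEX ix_email ON my_table (email);
    CREATE INDEX ix_address ON my_table (address);
    INSERT INTO my_table VALUES
        (1, 'john doe', '8183321234', 'johndoe@yahoo.com', '330 some lane'),
        (2, 'john doe', '8183321234', 'johndoe@yahoo.com', 'elsewhere'),
        (3, 'john doe', '8183321234', 'other@example.com', 'elsewhere'),
        (4, 'jane roe', '0000000000', 'jane@example.com', 'nowhere');
""")
params = ("john doe", "8183321234", "johndoe@yahoo.com", "330 some lane")

# The CASE-based form: correct, but evaluated against every row.
case_sum = conn.execute("""
    SELECT id FROM my_table
    WHERE CASE WHEN name = ? THEN 1 ELSE 0 END
        + CASE WHEN phone = ? THEN 1 ELSE 0 END
        + CASE WHEN email = ? THEN 1 ELSE 0 END
        + CASE WHEN address = ? THEN 1 ELSE 0 END >= 3
    ORDER BY id
""", params).fetchall()

# One equality lookup per column (each can use its own index),
# then keep the ids that showed up in at least 3 of the 4 lookups.
union_count = conn.execute("""
    SELECT id FROM (
        SELECT id FROM my_table WHERE name = ?
        UNION ALL
        SELECT id FROM my_table WHERE phone = ?
        UNION ALL
        SELECT id FROM my_table WHERE email = ?
        UNION ALL
        SELECT id FROM my_table WHERE address = ?
    ) q
    GROUP BY id
    HAVING COUNT(*) >= 3
    ORDER BY id
""", params).fetchall()

print(case_sum, union_count)
```

Both forms return the same ids here; the second one only pays off when each individual condition is selective, as the values in the question appear to be.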
Let's create sample tables:
Read the rest of this entry »