Archive for 2009
Selecting first letters
From Stack Overflow:
I would like to produce a character list of all of the first letters of column in my database.
Is there a way to do this in MySQL?
Let's create a sample table of 1,000,000 records and fill it with random data:
Read the rest of this entry »
Checking event dates
Comments enabled. I *really* need your comment
From Stack Overflow:
Suppose the following table structure in Oracle:
CREATE TABLE event ( id INTEGER, start_date DATE, end_date DATE )Is there a way to query all of the events that fall on a particular day of the week?
For example, I would like to find a query that would find every event that falls on a Monday.
Figuring out if the
start_date
orend_date
falls on a Monday is easy, but I'm not sure how to find it out for the dates between.
This is one of the range predicates which are very unfriendly to plain B-Tree
indexes.
But even if there would be a range friendly index (like R-Tree
), that would hardly be an improvement. Monday's make up 14.3% of all days, that means that an index if there were any, would have very low selectivity even for one-day intervals.
And if the majority of intervals last for more than one day, the selectivity of the condition yet decreases: 86% of 6-day intervals have a Monday inside.
Given the drawbacks of index scanning and joining on ROWID, we can say that a FULL TABLE SCAN
will be a nice access path for this query, and we just need to represent it as an SQL condition (without bothering for its sargability)
We could check that a Monday is between end_date
's day-of-week number and the range length's offset from this number:
SELECT * FROM "20090507_dates".event WHERE 6 BETWEEN MOD(start_date - TO_DATE(1, 'J'), 7) AND MOD(start_date - TO_DATE(1, 'J'), 7) + end_date - start_date
This query converts each ranges into a pair of zero-based, Tuesday-based day of week offsets, and returns all records which have day 6 (a Monday) inside the range.
Note that we don't use Oracle's TO_DATE('D')
function here: starting day of week depends on NLS_TERRITORY
which only leads to confusion.
Now, this query works but looks quite ugly. And if we will check for more complex conditions, it will become even uglier.
What if we need to find all ranges that contain a Friday, 13th? Or a second week's Thursday? The conditions will become unreadable and unmaintainable.
Can we do it in some more elegant way?
What if we just iterate over the days of the range and check each day for the condition? This should be much more simple than inventing the boundaries.
Let's create a sample table and try it:
Read the rest of this entry »
Oracle: ROW_NUMBER vs ROWNUM
From Stack Overflow:
I have an SQL query that looks something like this:
SELECT * FROM ( SELECT t.*, row_number() OVER (ORDER BY ID) rn FROM mytable t ) WHERE rn BETWEEN :start and :endEssentially, it's the
ORDER BY
part that's slowing things down. If I were to remove it, theEXPLAIN
cost goes down by an order of magnitude (over 1,000 times).I've tried this:
SELECT t.*, row_number() OVER (ORDER BY ID) rn FROM mytable t WHERE rownum BETWEEN :start and :end, but this doesn't give correct results.
Is there any easy way to speed this up? Or will I have to spend some more time with the
EXPLAIN
tool?
First, just a quick reminder on how ROWNUM
works. From Oracle's documentation:
Conditions testing for
ROWNUM
values greater than a positive integer are always false. For example, this query returns no rows:SELECT * FROM employees WHERE ROWNUM > 1The first row fetched is assigned a
ROWNUM
of 1 and makes the condition false. The second row to be fetched is now the first row and is also assigned aROWNUM
of 1 and makes the condition false. All rows subsequently fail to satisfy the condition, so no rows are returned.
That's why the second query should look like this:
SELECT * FROM ( SELECT t.*, ROWNUM AS rn FROM mytable t ORDER BY paginator, id ) WHERE rn BETWEEN :start and :end
Now, let's see the performance. To do this, we'll create a sample table:
Read the rest of this entry »
PostgreSQL: row numbers
Note: this article concerns PostgreSQL 8.3 and below.
PostgreSQL 8.4 introduces window functions.
Window function ROW_NUMBER()
implements the functionality in question more efficiently.
ROWNUM
is a very useful pseudocolumn in Oracle that returns the position of each row in a final dataset.
Upcoming PostgreSQL 8.4 will have this pseudocolumn, but as for now will we need a hack to access it.
The main idea is simple:
- Wrap the query results into an array
- Join this array with a
generate_series()
so that numbers from 1 toarray_upper()
are returned - For each row returned, return this number (as
ROWNUM
) along the corresponding array member (which is the row from the original query)
Let's create a table with multiple columns of different datatypes, write a complex query and try to assign the ROWNUM
to the query results:
Read the rest of this entry »
Assigning people
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Greg asks:
I have two tables,
tasks
andresources
.
Tasks
contains cost of each task: first requires 5 people to complete, second requires 3 people etc.
Resources
is just a list of people.How do I assign people to tasks?
This is in
MySQL
.
Thanks for a nice question, Greg.
Unfortunately, you haven't sent your table structure, so I will have to make it up.
Let's create the tables:
Read the rest of this entry »
PostgreSQL: optimizing DISTINCT
In PostgreSQL (as of 8.3, at least), performance of DISTINCT
clause in SELECT
list is quite poor.
Probably because DISTINCT
code in PostgreSQL is very, very old, it always acts in same dumb way: sorts the resultset and filters out the duplicate records.
GROUP BY
that can be used for the same purpose is more smart, as it employs more efficient HashAggregate
, but its performance is still poor for large dataset.
All major RDBMS
's, including MySQL, are able to jump over index keys to select DISTINCT
values from an indexed table. This is extremely fast if there are lots of records in a table but not so many DISTINCT
values.
This behavior can be emulated in PostgreSQL too.
Let's create a sample table:
Read the rest of this entry »
GROUP_CONCAT in PostgreSQL without aggregate functions
In one of the previous articles:
, I described an aggregate function to concatenate strings in PostgreSQL, similar to GROUP_CONCAT
in MySQL
.
It's very useful if you have a complex GROUP BY
query with multiple conditions.
But for some simple queries it's possible to emulate GROUP_CONCAT
with pure SQL
, avoiding custom functions at all.
Let's create a table to demonstrate our task:
Read the rest of this entry »
Indexing VARCHAR(MAX)
SQL Server 2005 introduced new datatype: VARCHAR(MAX).
This is a replacement of old TEXT
that can do anything that TEXT
could, and something that TEXT
could not.
Something
includes possibility of being queried and compared like a plain VARCHAR
. A really nice feature.
SELECT * FROM [20090501_max].t_bigdata WHERE value LIKE 'lorem ipsum dolor sit%'
Unfortunately, VARCHAR(MAX)
is still far from a plain VARCHAR
. For instance, it cannot serve as an index key. It can be a part of an index (the column is included to the index leaves to avoid table lookups when using the index), but not the key of an index (the index leaves are not sorted on this column).
This means that the queries using LIKE
predicate on this column will be slow.
Let's create a sample table:
Read the rest of this entry »
Rating movies
Comments enabled. I *really* need your comment
From Stack Overflow:
I have the following query in SQL Server:
SELECT TOP 50 CustomerID FROM Ratings WHERE CustomerID != 915 AND MovieID IN ( SELECT DISTINCT MovieID FROM Ratings WHERE CustomerID = 915 ) GROUP BY CustomerID ORDER BY COUNT(*) DESCIt is super fast, but when I try to use it in a subquery like this:
SELECT * FROM Ratings WHERE MovieID = 1 AND CustomerID IN ( SELECT TOP 50 CustomerID FROM Ratings WHERE CustomerID != 915 AND MovieID IN ( SELECT DISTINCT MovieID FROM Ratings WHERE CustomerID = 915 ) ) GROUP BY CustomerID ORDER BY COUNT(*) DESC, it's getting slow.
Any ideas on why this is so slow and how I can speed it up?
My
PRIMARY KEY
is(MovieID, CustomerID)
and I added a index onCustomerID
This query is in fact a simple but efficient movie rating assistant.
When a user searches for a movie, it shows him ratings other users gave to this movie, but only from those people who mostly watches and rates the same movies as the user in question.
This is quite a fair rating system, as if a user is into Attack of the Killer Tomatoes!
and Surf Nazis Must Die
, his taste correlates with taste of people who watches and rates the same stuff better than with one of those who watches Titanic
Now, returning to the query. To build an advice on movie 1 for user 915, we need to do the following:
- Build a list of movies that the user 915 rated
- Build the list of users who have watched movie 1 and have rated the same movies as user 915 had
- Select
TOP 50
users who had rated most movies of those user 915 had also rated - Show their opinions on movie 1
If we look into this plan, we will see two search conditions. On step 1, we need to select movie
's for a given user
, while on step 2 we need to select user
's for a given movie
.
Since we don't need any other fields on steps 1 and 2, this query will benefit from creating two composite indexes of the same set of columns: MovieID
and CustomerID
, in different order.
Since we already have a PRIMARY KEY
on (MovieID, CustomerID)
, we'll need to add just one more index on (CustomerID, MovieID)
.
And since these pairs are unique, we will also make this index UNIQUE
. (My article Making an index UNIQUE describes why you should always declare UNIQUE
index on intrinsically unique data.)
Having two UNIQUE
indexes on the same set of columns may seem a redundancy. And in fact, it is.
But let's remember that any index is a redundancy. An index does not keep any data not initially contained in the table, it just arranges it so that it can be accessed faster.
So if two indexes will make our query even more fast, then hail redundancy!
Let's create the sample table:
Read the rest of this entry »
Selecting N records for each group: PostgreSQL
Comments enabled. I *really* need your comment
In one of the previous articles:
, I described how to select a certain number of records for each group in a MySQL table.
This is trivial in SQL Server and Oracle, since an analytic function ROW_NUMBER()
can be used to do this.
Now, I'll describe how to do it in PostgreSQL.
We are assuming that the records from the table should be grouped by a field called grouper
, and for each grouper
, N
first records should be selected, ordered first by ts
, then by id
.
Let's create a sample table:
Read the rest of this entry »