Author Archive
Fallback language names: PostgreSQL
Comments enabled. I *really* need your comment
This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:
- Fallback language names: Oracle
- Fallback language names: SQL Server
- Fallback language names: PostgreSQL
- Fallback language names: MySQL
A quick reminder of the problem taken from Stack Overflow:
I have table
item
and another tablelanguage
which contains names for theitems
in different languages:
item language data How do I select a French name for an item if it exists, or a fallback English name if there is no French one?
We basically have three options here:
- Use
COALESCE
on twoSELECT
list subqueries - Use
COALESCE
on the results of twoLEFT JOINS
- Use the combination of methods above: a
LEFT JOIN
for French names and a subquery for English ones
Efficiency of each of these method depends of the fallback probability (how many items are covered by the localization).
If the localization is poor and but few terms are translated into the local language, the probability of the fallback is high. I took Latin language as an example for this.
If almost all terms are translated, the probability of fallback is low. In this case, I took French as an example (as in the original quiestion), since it is widely used and localizations are likely to cover most terms.
In previous articles I shown that in both Oracle and SQL Server, the second method (two LEFT JOIN
s) is more efficient to query poorly localized languages, while for well-localized languages the third query should be used, i. e. a LEFT JOIN
for the local language and a subquery for the fallback one.
Now, let's create sample tables and see how these queries behave in PostgreSQL:
Read the rest of this entry »
Fallback language names: SQL Server
Comments enabled. I *really* need your comment
This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:
- Fallback language names: Oracle
- Fallback language names: SQL Server
- Fallback language names: PostgreSQL
- Fallback language names: MySQL
From Stack Overflow:
I have table
item
and another tablelanguage
which contains names for theitems
in different languages:
item language data How do I select a French name for an item if it exists, or a fallback English name if there is no French one?
We basically have three options here:
- Use
COALESCE
on twoSELECT
list subqueries - Use
COALESCE
on the results of twoLEFT JOINS
- Use the combination of methods above: a
LEFT JOIN
for French names and a subquery for English ones
Yesterday, we saw that the second option (two LEFT JOIN
s) is best in Oracle if the probability of the fallback is high (i. e. the localization is poor and but few terms are translated into the local language). I took Latin language as an example for this.
Third query (a LEFT JOIN
with the local language rowset and a subquery in a COALESCE
to retrive the fallback language names) is best if there is a low probability of a fallback, i. e. almost all terms are translated. In this case, I took French as an example, since it is widely used and localizations are likely to cover most terms.
Today we will see how same queries work in SQL Server.
Let's create sample tables:
Fallback language names: Oracle
Comments enabled. I *really* need your comment
This is a series of articles on efficient querying for a localized name, using a default (fallback) language if there is no localized name:
- Fallback language names: Oracle
- Fallback language names: SQL Server
- Fallback language names: PostgreSQL
- Fallback language names: MySQL
From Stack Overflow:
I have table
item
and another tablelanguage
which contains names for theitems
in different languages:
item language data How do I select a French name for an item if it exists, or a fallback English name if there is no French one?
This is quite a common database design (most often used for localization), so I'd like to cover this question for different databases.
This design is not limited to languages and localization: it may be used for any data with default value that can be overriden. Prices, discounts, options — many kinds of things.
Today we will see how do it in Oracle.
We basically have three options here:
- Use
COALESCE
on twoSELECT
list subqueries - Use
COALESCE
on the results of twoLEFT JOINS
- Use the combination of methods above: a
LEFT JOIN
for French names and a subquery for English ones
Let's create sample tables and see what is more efficient.
Read the rest of this entry »
MySQL: LIMIT on LIMIT
Comments enabled. I *really* need your comment
From Stack Overflow:
Here's my situation: I want to select first ten entries from a database with
id = $id
.But I want the results to be listed in a certain priority: if
criteria = $criteria
, then I want those results displayed first.Otherwise, I just want to keep displaying the rest of the rows.
This can be done with a very simple query:
SELECT * FROM t_preference ORDER BY category = 1 DESC, id DESC LIMIT 10
In this query, category = 1
is a boolean expression which returns TRUE
when category
is equal to 1, FALSE
otherwise.
Since TRUE
is 1 and FALSE
is 0, TRUE
is greater than FALSE
from ORDER BY
's point of view. We should apply DESC
for the TRUE
results (i. e. rows with category = 1
) to be returned first.
However, this query is not very index friendly, and it will need to select all rows and sort them.
To improve performance, we should split this query into two index-friendly queries and merge them with a UNION
.
Let's create a sample table and see how do we do it:
SQL Server: deleting adjacent values (improved)
Comments enabled. I *really* need your comment
Yesterday, I wrote an article on how to delete adjacent values from a table in SQL Server:
A quick reminder of the problem, taken from Stack Overflow:
I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.
The table can contain many records (sometimes more than 500,000) so looping is too slow.
id unitId day interval amount 1 100 10 21 9.345 2 100 10 22 9.367 3 200 11 21 4.150 4 300 11 21 4.350 5 300 11 22 4.734 6 300 11 23 5.106 7 400 13 21 10.257 8 400 13 22 10.428 Values of
(UnitID, Day, Interval)
are unique.In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).
Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.
To delete adjacent values, we should select continous ranges of interval
s and delete each even row from each continous range.
In the previous article, I used a subquery to do this:
WITH rows AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY ( SELECT TOP 1 qi.id AS mint FROM [20090803_adjacent].records qi WHERE qi.unitid = qo.unitid AND qi.day = qo.day AND qi.interval < = qo.interval AND NOT EXISTS ( SELECT NULL FROM [20090803_adjacent].records t WHERE t.unitid = qi.unitid AND t.day = qi.day AND t.interval = qi.interval - 1 ) ORDER BY qi.interval DESC ) ORDER BY interval ) AS rnm FROM [20090803_adjacent].records qo ) DELETE FROM rows WHERE rnm % 2 = 0
This subquery traverses adjacent intervals until it find the first interval that has no predecessor.
Its id
then serves as a range identifier in a PARTITION BY
clause.
However, this query can be improved.
Assume we have the following intervals:
interval |
---|
3 |
4 |
5 |
6 |
10 |
11 |
12 |
20 |
21 |
We have three ranges here: 3:6, 10:12 and 20:21.
To identify a continuous range of intervals, we first should calculate a per-partition ROW_NUMBER
of these intervals:
interval | ROW_NUMBER |
---|---|
3 | 1 |
4 | 2 |
5 | 3 |
6 | 4 |
10 | 5 |
11 | 6 |
12 | 7 |
20 | 8 |
21 | 9 |
ROW_NUMBER
s, by definition, don't have gaps within the partition, while the interval
s can have gaps. However, inside each range ROW_NUMBER
s increase with interval
s.
This means that we can just subtract ROW_NUMBER
from the interval
:
interval | ROW_NUMBER | ROW_NUMBER - interval |
---|---|---|
3 | 1 | 2 |
4 | 2 | 2 |
5 | 3 | 2 |
6 | 4 | 2 |
10 | 5 | 5 |
11 | 6 | 5 |
12 | 7 | 5 |
20 | 8 | 12 |
21 | 9 | 12 |
Within each range, this difference is a constant. When a gap occurs, this difference changes.
Therefore, the value of this difference identifies a range within each partition, and can be used for partitioning ranges just as well.
This should be more efficient since it does not use the subqueries.
Let's create a sample table and check:
SQL Server: deleting adjacent values
Comments enabled. I *really* need your comment
From Stack Overflow:
I am trying to selectively delete records from an SQL Server 2005 table without looping through a cursor.
The table can contain many records (sometimes more than 500,000) so looping is too slow.
id unitId day interval amount 1 100 10 21 9.345 2 100 10 22 9.367 3 200 11 21 4.150 4 300 11 21 4.350 5 300 11 22 4.734 6 300 11 23 5.106 7 400 13 21 10.257 8 400 13 22 10.428 Values of
(UnitID, Day, Interval)
are unique.In this example I wish to delete records 2, 5 and 8, since they are adjacent to an existing record (based on the key).
Note: record 6 would not be deleted because once 5 is gone it is not adjacent any longer.
The result of the query should be a table where no adjacent records exist, i. e. nothing should be returned after we run the following query:
SELECT * FROM mytable qi WHERE EXISTS ( SELECT NULL FROM mytable t WHERE t.unitid = qi.unitid AND t.day = qi.day AND t.interval = qi.interval - 1 )
However, we should leave as many records as possible, so that we delete only the records absolutely necessary to delete.
If we just delete everything, the query above will not return any records just as well, but it's not our goal.
The problem is that the records we delete can influence adjacency of other records. In the example above, record 6 is adjacent to 5, and record 5 is adjacent to 4.
So we could delete 4 and 6 (and leave only 5), and at the same time we could delete only 5 (and leave 4 and 6 which are not adjacent anymore since the 5 has been deleted).
In SQL Server, no DML query can see its own results, so a DELETE a with a naive EXISTS condition would delete 6 just as well, since it would not notice that 5 was deleted beforehand.
However, we can formulate the delete condition.
We need to split our records into the sets of continous ranges, without gaps within any range. All adjacent records should go into one range.
The first record of any range should never be deleted, since it is not adjacent to any record before it (if there were any record before this, that record would go into the range before the record in question, and the latter would not be first in its range).
The second record of any range should be deleted, since it's adjacent to the first record (that is guaranteed to stay in the table, as we decided a step earlier).
The third record should not be deleted, since it was adjacent to the second record which is doomed, and after the second record will be deleted, the third record will not be adjacent anymore, etc.
In other words, we should delete each even record from each continuous range. This algorithm guarantees that no adjacent records will be left and that we delete as few records as possible.
The only problem is to group the adjacent records and find their row numbers within each range so that we can filter the even rows.
It is best done by iterating the records up from the current and returning the id
of the first matching record which is not superseded (i. e. for which there is no record with (uintId, day, interval - 1)
. This can be done in a subquery right in the PARTITION BY
clause of the ROW_NUMBER
function.
Let's create a sample table and compose the query:
MySQL: DISTINCT on DISTINCT
Comments enabled. I *really* need your comment
From Stack Overflow:
I currently have the following query:
SELECT group_concat (DISTINCT usrFirst, usrLast) as receiver_name FROM …It outputs my list of names as follows:
JohnDoe,BillSmith,DaveJones, while desired ouput would be:
John Doe, Bill Smith, and Dave JonesBasically, I need help figuring out three things:
- How can I put a space, between the first and last name?
- How can insert a space after each comma?
- How can I add an
andright before the last name displayed?
This kind of output should be really processed on the client (since comma and the and
word are very cultire specific).
From MySQL, we just need a recordset of distinct full names.
Let's create a sample table and see how do we do it:
MySQL: counting items in pairs
Comments enabled. I *really* need your comment
From Stack Overflow:
I currently have an
item
table, and apair
table.The
pair
table simply contains two columns, which contain thePRIMARY KEY
from theitem
table. There is an index onid1, id2
onpair
.A common query is to find a number of items that are featured in the least number of pairs:
SELECT id, COUNT(*) AS count FROM item i LEFT JOIN pair p ON (p.id1 = i.id OR p.id2 = i.id) GROUP BY id ORDER BY count, RAND() LIMIT 100, but the query is horible performance wise.
Is there a better query, and/or data structure for this type of thing?
MySQL is not very good in optimizing OR
conditions.
It's capable of doing index_merge
, which would UNION
the results of two indexes, however, as documentation states, this access method works only when comparing the fields against the constants.
This method, therefore, can be used to serve an OR
condition neither in a JOIN
nor in a subquery.
Let's create sample tables and see the execution plan for the query:
PostgreSQL: making duplicate records unique
Comments enabled. I *really* need your comment
From Stack Overflow:
In an existing application, I have a table which has no primary key, which (rarely) has duplicate rows in it.
For integration with another system, I need to add a column to the table that makes these duplicate rows unique.
The table is essentially:
txn# detail# amount I could just use an incrementing
seq#
for every existing row, however, the application changes will be to take the existingalmost key(basically(txn#, detail#)
), and increment the sequence number for this combinationSo, e. g., I'll have two rows for
(txn#, detail#) = (1, 1)
, and haveseq#
1 for the first and 2 for the second.If
(txn#, detail#) = (513, 44)
has 3 duplicate rows, these would haveseq#
1, 2 and 3 appropriately.
We just need to update the table with a condition that would distinguish between the rows somehow.
Fortunatly, every PostgreSQL table has an implicit primary key, ctid
.
In fact, it's the pointer to the place in the datafile where the row data resides.
In new PostgreSQL 8.4, we can employ a window function to assign a seq
to each row.
This script fills the table with duplicate values (two duplicates for each (txn, detail)
pair):
CREATE TABLE t_dup (txn INT NOT NULL, detail INT NOT NULL, seq INT); CREATE INDEX ix_dup_txn_detail ON t_dup (txn, detail); INSERT INTO t_dup (txn, detail, seq) SELECT t, d, NULL FROM generate_series(1, 100) t, generate_series(1, 1000) d, generate_series (1, 2) s; ANALYZE t_dup;
Here's the query to update it:
UPDATE ONLY t_dup SET seq = rn FROM ( SELECT ctid, ROW_NUMBER() OVER (PARTITION BY txn, detail) AS rn FROM t_dup ) d WHERE t_dup.ctid = d.ctid
This works perfectly and completes in 9 seconds, but, unfortunately, it's unavailable in older versions of PostgreSQL.
For PostgreSQL 8.3 and below, we need to use a subselect with a COUNT(*)
.
Fortunately, ctid
s are comparable, and we can use less than
(<
) operator on them to build an ordered set.
This is less efficient, but still works (if the keys are almost unique
, i. e. there are not many duplicates of each key):
UPDATE ONLY t_dup d SET seq = ( SELECT COUNT(*) FROM t_dup di WHERE di.txn = d.txn AND di.detail = d.detail AND di.ctid <= d.ctid );
This takes a little bit longer (15 seconds), however still works.
PostgreSQL 8.4: efficient MIN(COUNT(*))
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Frances asks:
I have a table
user_views
which contains pages viewed by users.How do I select the user that had the most and the fewest number of page views?
If you need them at the same time, you'll just have to group by myuser
, then order by COUNT(*)
.
With new PostgreSQL 8.4, it's possible to do this in one CTE, to avoid double evaluation:
WITH q AS ( SELECT myuser, COUNT(*) FROM user_pages GROUP BY myuser ) SELECT * FROM ( SELECT 'max'::TEXT AS which, myuser, count FROM q ORDER BY count DESC LIMIT 1 ) qmax UNION ALL SELECT * FROM ( SELECT 'min'::TEXT AS which, myuser, count FROM q ORDER BY count LIMIT 1 ) qmin
However, it's takes quite a long time if you table is big.
This is probably an answer to the question you asked.
But as a bonus I'd like to tell how to optimize the query for the cases when you need just the user with minimal count of page views.
The main idea here that we should accumulate the least value of COUNT(*)
calculated so far, and stop counting when we reach this threshold. This will save us some row lookups and improve the query.
Let's create a sample table: