Archive for July, 2009
MySQL: counting items in pairs
Comments enabled. I *really* need your comment
From Stack Overflow:
I currently have an
item
table, and apair
table.The
pair
table simply contains two columns, which contain thePRIMARY KEY
from theitem
table. There is an index onid1, id2
onpair
.A common query is to find a number of items that are featured in the least number of pairs:
SELECT id, COUNT(*) AS count FROM item i LEFT JOIN pair p ON (p.id1 = i.id OR p.id2 = i.id) GROUP BY id ORDER BY count, RAND() LIMIT 100, but the query is horible performance wise.
Is there a better query, and/or data structure for this type of thing?
MySQL is not very good in optimizing OR
conditions.
It's capable of doing index_merge
, which would UNION
the results of two indexes, however, as documentation states, this access method works only when comparing the fields against the constants.
This method, therefore, can be used to serve an OR
condition neither in a JOIN
nor in a subquery.
Let's create sample tables and see the execution plan for the query:
PostgreSQL: making duplicate records unique
Comments enabled. I *really* need your comment
From Stack Overflow:
In an existing application, I have a table which has no primary key, which (rarely) has duplicate rows in it.
For integration with another system, I need to add a column to the table that makes these duplicate rows unique.
The table is essentially:
txn# detail# amount I could just use an incrementing
seq#
for every existing row, however, the application changes will be to take the existingalmost key(basically(txn#, detail#)
), and increment the sequence number for this combinationSo, e. g., I'll have two rows for
(txn#, detail#) = (1, 1)
, and haveseq#
1 for the first and 2 for the second.If
(txn#, detail#) = (513, 44)
has 3 duplicate rows, these would haveseq#
1, 2 and 3 appropriately.
We just need to update the table with a condition that would distinguish between the rows somehow.
Fortunatly, every PostgreSQL table has an implicit primary key, ctid
.
In fact, it's the pointer to the place in the datafile where the row data resides.
In new PostgreSQL 8.4, we can employ a window function to assign a seq
to each row.
This script fills the table with duplicate values (two duplicates for each (txn, detail)
pair):
CREATE TABLE t_dup (txn INT NOT NULL, detail INT NOT NULL, seq INT); CREATE INDEX ix_dup_txn_detail ON t_dup (txn, detail); INSERT INTO t_dup (txn, detail, seq) SELECT t, d, NULL FROM generate_series(1, 100) t, generate_series(1, 1000) d, generate_series (1, 2) s; ANALYZE t_dup;
Here's the query to update it:
UPDATE ONLY t_dup SET seq = rn FROM ( SELECT ctid, ROW_NUMBER() OVER (PARTITION BY txn, detail) AS rn FROM t_dup ) d WHERE t_dup.ctid = d.ctid
This works perfectly and completes in 9 seconds, but, unfortunately, it's unavailable in older versions of PostgreSQL.
For PostgreSQL 8.3 and below, we need to use a subselect with a COUNT(*)
.
Fortunately, ctid
s are comparable, and we can use less than
(<
) operator on them to build an ordered set.
This is less efficient, but still works (if the keys are almost unique
, i. e. there are not many duplicates of each key):
UPDATE ONLY t_dup d SET seq = ( SELECT COUNT(*) FROM t_dup di WHERE di.txn = d.txn AND di.detail = d.detail AND di.ctid <= d.ctid );
This takes a little bit longer (15 seconds), however still works.
PostgreSQL 8.4: efficient MIN(COUNT(*))
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Frances asks:
I have a table
user_views
which contains pages viewed by users.How do I select the user that had the most and the fewest number of page views?
If you need them at the same time, you'll just have to group by myuser
, then order by COUNT(*)
.
With new PostgreSQL 8.4, it's possible to do this in one CTE, to avoid double evaluation:
WITH q AS ( SELECT myuser, COUNT(*) FROM user_pages GROUP BY myuser ) SELECT * FROM ( SELECT 'max'::TEXT AS which, myuser, count FROM q ORDER BY count DESC LIMIT 1 ) qmax UNION ALL SELECT * FROM ( SELECT 'min'::TEXT AS which, myuser, count FROM q ORDER BY count LIMIT 1 ) qmin
However, it's takes quite a long time if you table is big.
This is probably an answer to the question you asked.
But as a bonus I'd like to tell how to optimize the query for the cases when you need just the user with minimal count of page views.
The main idea here that we should accumulate the least value of COUNT(*)
calculated so far, and stop counting when we reach this threshold. This will save us some row lookups and improve the query.
Let's create a sample table:
SQL Server: random records avoiding CTE reevaluation
Comments enabled. I *really* need your comment
From Stack Overflow:
I need to get 5 random records from a table plus a further record based on data from the users preferences.
How can I combine the two statements whilst ensuring that no results are repeated (i. e. the favourite is not present in the 5 random records)?
We should select the 6th record so that it would satisfy two conditions: first, it should be preferred by user, second, it should not be among the 5 random records selected earlier.
The problem with random records it that they are, um, random. We usually cannot reproduce 5 random records to apply further conditions on them.
That's where the CTEs would come handy, but there's a little problem.
A CTE, as Microsoft understands them, is just an alias for an online view.
Therefore, each time a CTE is used in a query, it may (or may not) be reevaluated.
It's not a problem for a plain CTEs which are built out of static columns.
However, this is totally a problem when we try to UNION
two queries using an undeterministic CTE (like, ordered by NEWID()
), as we would do in this case.
Let's create a sample table and see what happens:
Dynamic pivot
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a table and want to transpose its rows to columns, similar to a pivot table but without summarising.
For example I have the following tables:
Question (QuestionID, QuestionText) Response (ResponseID, ResponseText, QuestionID)Basically I want to be able to create a dynamic table something like:
Question 1 Question 2 Question 3 Response 1.1 Response 1.2 Response 1.3 Response 2.1 Response 2.2 Response 2.3 Response 3.1 Response 3.2 Response 3.3 Response 4.1 Response 4.2 Response 4.3 The main requirement would be I don't know at design time what the question text will be.
First, let's put it straight: SQL is not a right tool to do this.
SQL operates on sets. An SQL query is a functions that returns a set. And the column layout is an implicit property of a set. It should be defined in design time. This is the way SQL works.
However, there are several approaches that can simplify the task. One of them is using SQL Server's XML abilities.
To arrange the answers in a table, we need to make the following:
- Select all questions. This rowset will define the columns of our table.
- Select the row numbers for all answers for the question that has the most answers. This rowset will define the rows of our table. The number of rows wil be exactly as the number of answers to the most answered question.
- Assign the ordinal number to each answer on any question. This is done using
ROW_NUMBER()
. This value will uniquely define the position of the answer in a table. - For each row (taken from the rowset made on step 2), take all the questions and
LEFT JOIN
the answers that need to be put into this row, using the ordinal number calculated on the step 3 as aLEFT JOIN
condition. This will put either the answer or an empty string into the corresponding cell. - Transform the rowset retrieved on step above into an XML string made out of
<td>
's and enclosed into a<tr>
- Transform the resulting rowset into an XML made out of the XML's from the previous step and enclosed into a
<table>
Let's create a sample table and see how can we do this:
MySQL: selecting rows before and after filtered one
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a basic MySQL table,
terms
, comprised of anid
andterm
field.I want to create an alphabetically sorted dictionary index (in the literal sense), that would list ten 10 terms above the selected term, and 20 below it
An example of this could be found on Urban Dictionary where on the left column you see the current term highlighted, and a number of terms above it, and some below, all sorted alphabetically.
As we all know, MySQL doesn't support a
ROW_NUMBER()
or a similar function so we end up resorting to user variables and sub-selects.I also cannot create a view with user defined variables because MySQL doesn't allow that.
MySQL indeed doesn't support ROW_NUMBER()
and this function indeed can be emulated using session variables:
However, it will be an overkill in this case, since a more simple and more efficient set-based solution using LIMIT
can be used instead.
Let's create a sample table and see how:
MySQL: grouping continuous ranges
From Stack Overflow:
Given the following table:
ID State Date 12 1 2009-07-16 10:00 45 2 2009-07-16 13:00 67 2 2009-07-16 14:40 77 1 2009-07-16 15:00 89 1 2009-07-16 15:30 99 1 2009-07-16 16:00 , how can I group by the field
state
, while still maintaining the borders between the state changes?I need
MIN(id)
,MIN(date)
andCOUNT(*)
for each group, and this is expected:
ID State Date Count 12 1 2009-07-16 10:00 1 45 2 2009-07-16 13:00 2 77 1 2009-07-16 15:00 3
Though a pure set-based solution exists, in MySQL, this can be easily done using session variables, which is more efficient.
Let's create a sample table:
PostgreSQL: last 12 weeks of a year
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a query like this:
SELECT EXTRACT(WEEK FROM j.updated_at) as "week", count(j.id) FROM jobs WHERE EXTRACT(YEAR FROM j.updated_at)=2009 GROUP BY EXTRACT(WEEK FROM j.updated_at) ORDER BY week, which works fine, but I only want to show the last 12 weeks.
LIMIT 12
works, but only gives me the first twelve and I need the order to be in sequential week order (i. e. not reversed) for charting purposes.
To select last 12 rows in ascending order it's enough to select first 12 rows in descending order and resort them in a subquery:
SELECT * FROM ( SELECT EXTRACT(week FROM updated_at) AS week, COUNT(*) AS cnt FROM jobs WHERE EXTRACT(year FROM updated_at) = 2008 GROUP BY week ORDER BY week DESC LIMIT 12 ) q ORDER BY week
However, this is rather inefficient. This query selects and aggregate all the year data just to fetch 12 last weeks.
More than that, EXTRACT(year FROM updated)
is not a sargable predicate, and all table rows (or index rows) need to be examined.
On a sample table of 1,000,000 rows, this query runs for more that 3.5 seconds:
SELECT * FROM ( SELECT EXTRACT(week FROM updated) AS week, COUNT(*) AS cnt FROM t_week WHERE EXTRACT(year FROM updated) = 2008 GROUP BY week ORDER BY week DESC LIMIT 12 ) q ORDER BY week
week | cnt | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
41 | 3584 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
42 | 3467 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
43 | 3499 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
44 | 3535 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
45 | 3511 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
46 | 3621 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
47 | 3529 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
48 | 3500 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
49 | 3415 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
50 | 3536 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
51 | 3504 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
52 | 3479 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
12 rows fetched in 0.0004s (3.5843s) |
Sort (cost=58517.44..58517.47 rows=11 width=16) Sort Key: (date_part('week'::text, t_week.updated)) -> Limit (cost=58517.11..58517.14 rows=11 width=8) -> Sort (cost=58517.11..58517.14 rows=11 width=8) Sort Key: (date_part('week'::text, t_week.updated)) -> HashAggregate (cost=58516.75..58516.92 rows=11 width=8) -> Seq Scan on t_week (cost=0.00..58491.75 rows=5000 width=8) Filter: (date_part('year'::text, updated) = 2008::double precision)
However, this query can be easily improved. For each year, we can easily calculate the beginning and the end of each of last 12 weeks, and use these values in a more index-friendly query.
Let's create a sample table and see how to do this:
Hierarchial queries in MySQL: identifying trees
Comments enabled. I *really* need your comment
Continuing on the theme of hierarchical queries in MySQL:
- Hierarchical queries in MySQL
- Hierarchical queries in MySQL: adding level
- Hierarchical queries in MySQL: adding ancestry chains.
- Hierarchical queries in MySQL: finding leaves
- Hierarchical queries in MySQL: finding loops
- Hierarchical data in MySQL: parents and children in one query
Assume we have a table with hierarchical structure like this:
treeitem | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
9 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
11 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
12 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
13 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
14 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
15 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
16 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
17 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
18 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
18 rows fetched in 0.0003s (0.0137s) |
We have two trees here: one starting from 1, another one starting from 2.
The problem is: given any item, we should identify the whole tree this item belongs to, and return the whole tree in the hierarchical order.
This also can be easily done using hierarchical queries in MySQL.
In this article: Hierarchical queries in MySQL I shown how to implement a function that returnes tree items in correct order, being called sequentially.
This function is reentrable and keeps its state in session variables, one of which, @start_with
, defines the parent element for the tree we want to build.
We have two problems here:
- Given an item, define a root of the tree it belongs to
- Build a whole tree, starting from the root
The first problem can be solved by iterating the linked list backwards, starting from the variable given as an input.
This article:
describes how to do it in great detail, that's why I'll just put a query here.
Let's create the table described above:
PostgreSQL: round-robin order
Comments enabled. I *really* need your comment
From Stack Overflow:
I have a bunch of URLs stored in a table waiting to be scraped by a script.
However, many of those URLs are from the same site. I would like to return those URLs in a
site-friendlyorder (that is, try to avoid two URLs from the same site in a row) so I won't be accidentally blocked by making too many HTTP requests in a short time.The database layout is something like this:
create table urls ( site varchar, -- holds e.g. www.example.com or stockoverflow.com url varchar unique );Example result:
SELECT url FROM urls ORDER BY mysterious_round_robin_function(site)
url http://www.example.com/some/file http://stackoverflow.com/questions/ask http://use.perl.org/ http://www.example.com/some/other/file http://stackoverflow.com/tags
To solve this task, we just need to assign a number to an URL within each site, and order by this number first, then by site, like this:
number | site | url |
---|---|---|
1 | example.com | http://www.example.com/some/file |
1 | stackoverflow.com | http://stackoverflow.com/questions/ask |
1 | use.perl.org | http://use.perl.org/ |
2 | example.com | http://www.example.com/some/other/file |
2 | stackoverflow.com | http://stackoverflow.com/tags |
Good news: in new PostgreSQL 8.4, we just can use a window function to do this:
SELECT url FROM urls ORDER BY ROW_NUMBER() OVER (PARTITION BY site ORDER BY url), site
url | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a.example.com/url104 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
b.example.com/url1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
c.example.com/url106 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
y.example.com/url102 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
z.example.com/url103 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
a.example.com/url130 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
b.example.com/url105 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
c.example.com/url132 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
l.example.com/url999 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
m.example.com/url974 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1000 rows fetched in 0.0169s (0.0178s) |
Yes, that simple.
Bad news: in PostgreSQL 8.3 and below, we still need to hack.
Let's create a sample table and see how: