Archive for January, 2010
Aggregates: subqueries vs. GROUP BY
From Stack Overflow:
I have a table users and there is a field invited_by_id showing the user id of the person who invited this user. I need to make a MySQL query returning rows with all the fields from users plus an invites_count field showing how many people were invited by each user.
The task seems very simple (and it is, in fact), but there are at least two approaches to it. In this article, I will discuss the benefits and the drawbacks of each approach.
GROUP BY
The first approach is using GROUP BY:
SELECT  u.*, COUNT(ui.id)
FROM    users u
LEFT JOIN
        users ui
ON      ui.invited_by = u.id
GROUP BY
        u.id
This is a GROUP BY with a self join: very simple. There are only two little things I'd like to draw attention to.
First, we need to select all users, even those who invited no other members. An inner join would leave them out, so we use a LEFT JOIN, and we use COUNT(ui.id) instead of COUNT(*), because, due to the very nature of aggregation, COUNT(*) returns at least 1 in a query with GROUP BY, while COUNT(ui.id) skips NULLs (which can only result from a LEFT JOIN miss).
Second, we group by u.id but use u.* in the SELECT clause. Every other engine would fail in this case, but MySQL allows selecting fields that are neither grouped by nor aggregated. These fields return an arbitrary value from the aggregated records (in practice, that of the record first read in its group).

This behavior is often abused, since MySQL does not guarantee what exactly it will return, but it is perfectly valid for queries like ours. We don't need to clutter the GROUP BY clause with all the fields from users if we have already grouped by the PRIMARY KEY, which is unique. All other values from users are uniquely defined by the PRIMARY KEY, so it does not matter which arbitrary record the query uses to return the ungrouped values: they are all the same within the group.
Subquery
This solution involves correlated subqueries:
SELECT  u.*,
        (
        SELECT  COUNT(*)
        FROM    users ui
        WHERE   ui.invited_by = u.id
        )
FROM    users u
Here, we calculate the COUNT(*) in a correlated subquery. This query returns exactly the same records as the previous one.
Comparison
Let's create two sample tables (a MyISAM one and an InnoDB one) and see which solution is more efficient for different scenarios:
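The actual scripts are after the cut, but just to give an idea, the two tables could look something like this (my sketch with assumed names and columns, not the article's actual script):

-- hypothetical sample tables: identical structure, different storage engines
CREATE TABLE users_myisam (
        id INT NOT NULL PRIMARY KEY,
        invited_by INT,
        username VARCHAR(30),
        KEY ix_users_myisam_invited_by (invited_by)
) ENGINE=MyISAM;

CREATE TABLE users_innodb (
        id INT NOT NULL PRIMARY KEY,
        invited_by INT,
        username VARCHAR(30),
        KEY ix_users_innodb_invited_by (invited_by)
) ENGINE=InnoDB;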
Read the rest of this entry »
Counting concurrent sessions
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Steve asks:
I am trying to query a log table for a web service application and determine how many concurrent sessions are in progress at each moment a transaction is executed, based on a start date and an elapsed time for each command executed through the web service. (These metrics are logged after the fact, I'm trying to write daily performance reporting for the site).
Here's a simplified view of my base table design:
CREATE TABLE CONNECTIONS (
        USERID VARCHAR2(30),
        HANDLE VARCHAR2(40),
        PROCESS_DATE DATE,
        COMMAND NUMBER(6,0),
        COUNT NUMBER(10,0),
        ELAPSED_TIME NUMBER(10,0),
        NUM_OF_RECS NUMBER(10,0),
        COMMAND_TIMESTAMP TIMESTAMP (6)
)
The question is: at the moment each transaction started, how many other transactions were active?
At each given moment, there is some number of active transactions. A transaction is active if it begins before that moment and ends after it. This means that the moment should fall between command_timestamp and command_timestamp + elapsed_time / 86400000.
Database B-Tree indexes are not very good for queries that involve searching for a constant between two columns, so a self join on the condition described above would be possible, but not very efficient.

But there is a simpler solution.
Whenever a transaction starts, it increments the count of the open transactions. Whenever the transaction ends, it decrements it.
So we can just build a table of events: starts and ends of the transactions, ordered chronologically. Each start would be denoted with a +1, and each end with a -1. Then we just calculate the number of transactions opened so far and subtract the number of transactions closed.
This can be done merely by calculating the partial sum of these +1's and -1's, which is an easy task for Oracle's analytic functions.
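In outline (my sketch using the column names from the question, not the actual query from the article), the events approach could look like this:

SELECT  ts,
        -- partial sum of the +1/-1 deltas: the number of transactions
        -- open at each event
        SUM(delta) OVER (ORDER BY ts) AS active_transactions
FROM    (
        -- each transaction start adds one open transaction
        SELECT  command_timestamp AS ts, 1 AS delta
        FROM    connections
        UNION ALL
        -- each transaction end removes one
        SELECT  command_timestamp + elapsed_time / 86400000, -1
        FROM    connections
        )
ORDER BY ts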
Let's create a sample table. I'll put only the relevant columns there and add a stuffing column that would emulate actual payload, to measure performance:
Read the rest of this entry »
SQL Server: running totals
From Stack Overflow:
We have a table of transactions which is structured like the following:

transactions (TranxID, ItemID, TranxDate, TranxAmt)

TranxAmt can be positive or negative, so the running total of this field (for any ItemID) will go up and down as time goes by.

Getting the current total is obviously simple, but what I'm after is a performant way of getting the highest value of the running total and the TranxDate when this occurred.

Note that TranxDate is not unique.
SQL Server is a very nice system, but using it for calculating running totals is a pain.
Oracle supports additional clauses for analytic functions, RANGE and ROWS, which define the boundaries of the function's window and hence can be used to implement running totals. In fact, it is enough to just omit the RANGE clause to make an analytic function apply to the window of the records selected so far, thus transforming it into a running total.
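In Oracle syntax, with the column names from the question, such a running total could be written along these lines (a sketch, not part of the original post):

SELECT  TranxID, ItemID, TranxDate, TranxAmt,
        -- with ORDER BY present and no explicit window clause, the window
        -- defaults to everything up to the current row: a running total
        SUM(TranxAmt) OVER (PARTITION BY ItemID ORDER BY TranxDate, TranxID) AS running_total
FROM    transactions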
SQL Server's support for window functions only extends the aggregate capabilities a little, so that the aggregate can be returned along with each record that constitutes the group. For functions like SUM and COUNT, it is impossible to control the window boundaries and the order of the records. Such analytic functions cannot be used to calculate running totals.
The common way to write such a running total query is using a subquery or a self join which computes the SUM of all previous records. However, the complexity of this approach is O(n^2), and it's not usable for any real volume of data.
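For reference, the quadratic solution might look like this (a sketch; the tie-breaking on TranxID is my assumption, since TranxDate is not unique):

SELECT  t.TranxID, t.ItemID, t.TranxDate,
        (
        -- re-scans all previous records for every row: O(n^2) overall
        SELECT  SUM(ti.TranxAmt)
        FROM    transactions ti
        WHERE   ti.ItemID = t.ItemID
                AND (ti.TranxDate < t.TranxDate
                OR (ti.TranxDate = t.TranxDate AND ti.TranxID <= t.TranxID))
        ) AS running_total
FROM    transactions t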
This is one of the few cases when cursors are faster than the set-based solution described above. But we are all aware of the drawbacks of cursors, so let's search for something else.
Fortunately, this task is a little simpler than it may seem, because it deals with dates. The number of all possible dates is usually limited, and a recursive query can deal with this task quite efficiently.
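The key building block is a recursive CTE that enumerates the dates; a minimal sketch (with assumed boundary dates) would be:

WITH dates (d) AS
(
        SELECT  CAST('20100101' AS DATETIME)
        UNION ALL
        SELECT  DATEADD(day, 1, d)
        FROM    dates
        WHERE   d < '20100131'
)
SELECT  d
FROM    dates
-- lift the default limit of 100 recursion levels
OPTION (MAXRECURSION 0)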
Let's create a sample table:
Read the rest of this entry »
Cumulative values
Comments enabled. I *really* need your comment
From Stack Overflow:
There is a table that stores signal statuses that come from different devices.
- SS1 and SS2 signals are inserted into the table at random times
- If either of the SS1 and SS2 signal statuses is up, then the resulting signal should be up
- If both SS1 and SS2 signal statuses are down, then the resulting signal should be down

I want to prepare a query that shows the resulting signal status changes according to the SS1 and SS2 signals
Each record deals with only one signal type here: either SS1 or SS2. To obtain the signal statuses we should query the cumulative values of previous records.
If a record describes a change in SS2, we should query for the most recent change to SS1 recorded so far to obtain SS1's current status.
In systems other than Oracle, the previous value of a signal could easily be queried using subselects with TOP / LIMIT clauses. But Oracle does not support correlated queries nested more than one level deep, and limiting a subquery result to a single record (which a scalar subquery requires) needs exactly such nesting, since the ORDER BY has to go into an extra subquery level. This makes constructing such a subquery in Oracle quite a pain.
In Oracle, however, these things can be queried much more efficiently using analytic functions.
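One possible shape of such a query (my sketch with assumed table and column names; the actual solution is after the cut) carries the latest known status of each signal along with every record, using LAST_VALUE with IGNORE NULLS:

SELECT  record_time, signal_type, signal_status,
        -- the most recent SS1 status recorded so far
        LAST_VALUE(CASE WHEN signal_type = 'SS1' THEN signal_status END IGNORE NULLS)
                OVER (ORDER BY record_time) AS ss1_status,
        -- the most recent SS2 status recorded so far
        LAST_VALUE(CASE WHEN signal_type = 'SS2' THEN signal_status END IGNORE NULLS)
                OVER (ORDER BY record_time) AS ss2_status
FROM    signals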
Let's create a sample table:
Read the rest of this entry »
Calculating mode
Comments enabled. I *really* need your comment
From Stack Overflow:
I have this query:
WITH CTE AS
(
        SELECT  e_id, scale,
                ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER (PARTITION BY e_id) AS cn
        FROM    ScoreMaster
        WHERE   scale IS NOT NULL
)
SELECT  e_id,
        AVG(scale) AS [AVG],
        STDEV(scale) AS [StdDev],
        AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian],
        MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian]
FROM    CTE
GROUP BY e_id

How do I add Mode to this query?
A quick reminder: in statistics, mode is the value that occurs most frequently in a data set.
In other words, for each e_id, mode is the (exact) value of scale shared by most records with this e_id.
Unlike the other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, 10 records have scale = 1 and 10 other records have scale = 2 (and all other values of scale are shared by less than 10 records), then there are two modes in this set (and the set, hence, is called bimodal). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.
This means that we should define a way to choose this mode.
There can be three approaches to this:
- Return every modal value
- Return a single modal value (see the sketch after this list)
- Return an aggregate of all modal values
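For instance, the second approach (a single modal value per e_id) might be sketched like this; the tie-breaking rule (the smallest value wins) is my assumption:

SELECT  e_id, scale AS mode
FROM    (
        SELECT  e_id, scale,
                -- rank the values by how many records share them;
                -- on a tie, prefer the smallest value
                ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY COUNT(*) DESC, scale) AS rn
        FROM    ScoreMaster
        WHERE   scale IS NOT NULL
        GROUP BY e_id, scale
        ) q
WHERE   rn = 1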
To check all queries, we will generate a simple trimodal dataset:
Read the rest of this entry »
Latest visitors
Comments enabled. I *really* need your comment
From Stack Overflow:
Say you want to display the latest visitors on a users profile page.
How would you structure this?
Perhaps a table called uservisitors:

uservisitors (userid, visitorid, time)

And how would you select this with MySQL without any duplicates?
What I mean is if user 1 visits user 2's profile, then 5 minutes later visits it again, I don't want to show both entries: only the latest.
There are two approaches to this.
The first one would be just aggregating the visits, finding the max time, and ordering on it. Something like this:
SELECT  visitorid, MAX(time) AS lastvisit
FROM    uservisitors
WHERE   userid = 1
GROUP BY
        userid, visitorid
ORDER BY
        lastvisit DESC
LIMIT 5
However, there is a little problem with this solution.

Despite the fact that MySQL (with proper indexing) uses the INDEX FOR GROUP-BY optimization for this query, it will still have to sort on MAX(time) to find the 5 latest records.

This will require sorting the whole resultset, which will be huge if the service is heavily loaded.
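The proper indexing mentioned above would be a composite index covering the grouped columns and the aggregated one; a hypothetical definition (the index name is mine):

-- a composite index that allows the loose index scan (INDEX FOR GROUP-BY)
-- for the query above
ALTER TABLE uservisitors
        ADD INDEX ix_uservisitors_user_visitor_time (userid, visitorid, time);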
Let's test it on a sample table:
Read the rest of this entry »