EXPLAIN EXTENDED

How to create fast database queries

Archive for January, 2010

Aggregates: subqueries vs. GROUP BY

with one comment

From Stack Overflow:

I have a table users and there is a field invited_by_id showing user id of the person who invited this user.

I need to make a MySQL query returning rows with all the fields from users plus a invites_count field showing how many people were invited by each user.

The task seems very simple (and it is in fact), but there are at least two approaches to do it. It this article, I will discuss the benefits and the drawbacks of each approach.

GROUP BY

The first approach is using GROUP BY:

SELECT  u.*, COUNT(ui.id)
FROM    users u
LEFT JOIN
        users ui
ON      ui.invited_by = u.id
GROUP BY
        u.id

This is a GROUP BY with a self join, very simple. There are only two little things I'd like to pay some attention to.

First, we need to select all users, even those who invited no other members. An inner join would leave them out, so we use a LEFT JOIN, and use COUNT(ui.id) instead of COUNT(*), because, due to the very nature of aggregation, COUNT(*) returns at least 1 in a query with GROUP BY, and COUNT(ui.id) skips NULLs (which can only result from a LEFT JOIN miss).

Second, we group by u.id but use u.* in the SELECT clause. Every other engine would fail in this case, but MySQL allows selecting fields that are neighter grouped by nor aggregated. These fields will return an arbitrary value from any of the aggregated records (in practice, that is the record first read in its group).

This behavior is often abused, since MySQL does not guarantee what exactly it will return, but it's perfectly valid for the queries like ours. We don't need to clutter the GROUP BY clause with all fields from users if we have already grouped by the PRIMARY KEY which is already unique. All other values from users are uniquely defined by the PRIMARY KEY so there is no matter which arbitrary record will the query use to return ungrouped values: they are all same within the group.

Subquery

This solution involves correlated subqueries:

SELECT  u.*,
        (
        SELECT  COUNT(*)
        FROM    users ui
        WHERE   ui.invited_by = u.id
        )
FROM    users u

Here, we calculate the COUNT(*) in a correlated subquery. This query returns exactly same records as the previous one.

Comparison

Let's create two sample tables (a MyISAM one and an InnoDB one) and see which solution is more efficient for different scenarios:
Read the rest of this entry »

Written by Quassnoi

January 30th, 2010 at 11:00 pm

Posted in MySQL

Counting concurrent sessions

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Steve asks:

I am trying to query a log table for a web service application and determine how many concurrent sessions are in progress at each moment a transaction is executed, based on a start date and an elapsed time for each command executed through the web service. (These metrics are logged after the fact, I'm trying to write daily performance reporting for the site).

Here's a simplified view of my base table design:

CREATE TABLE CONNECTIONS
(
USERID VARCHAR2(30),
HANDLE VARCHAR2(40),
PROCESS_DATE DATE,
COMMAND NUMBER(6,0),
COUNT NUMBER(10,0),
ELAPSED_TIME NUMBER(10,0),
NUM_OF_RECS NUMBER(10,0),
COMMAND_TIMESTAMP TIMESTAMP (6)
)

The question is: at there moment each transaction started, how many other transactions were active?

At each given moment, there is some number of active transaction. A transaction is active if the transaction begins before that moment and ends after it. This means that the moment should fall between command_timestamp and command_timestamp + elapsed_time / 86400000.

Database B-Tree indexes are not very good in queries that involve searching for a constant between two columns, so a self-join on the condition described above would be possible but not very efficient.

But these is a more simple solution.

Whenever a transaction starts, it increments the count of the open transactions. Whenever the transaction ends, it decrements it.

So we just can build a table of events: starts and ends of the transactions, ordered chronologically. Each start would be denoted with a +1, and each end with a -1. Then we should just calculate the number of the transactions open so far and subtract the number of the transactions closed.

This can be easily done merely by calculating the partial sum of these +1's and -1's, which is an easy task for Oracle's analytic functions.

Let's create a sample table. I'll put only the relevant columns there and add a stuffing column that would emulate actual payload, to measure performance:
Read the rest of this entry »

Written by Quassnoi

January 25th, 2010 at 11:00 pm

Posted in Oracle

SQL Server: running totals

with one comment

From Stack Overflow:

We have a table of transactions which is structured like the following :

transactions
TranxID ItemID TranxDate TranxAmt

TranxAmt can be positive or negative, so the running total of this field (for any ItemID) will go up and down as time goes by.

Getting the current total is obviously simple, but what I'm after is a performant way of getting the highest value of the running total and the TranxDate when this occurred.

Note that TranxDate is not unique.

SQL Server is a very nice system, but using it for calculating running totals is a pain.

Oracle supports additional clauses for analytic functions, RANGE and ROWS, which define the boundaries of the function's windows and hence can be used to implement running totals. By default, it is just enough to omit the RANGE clause to make the analytic function apply to the window of the records selected so far, thus transforming it to a running total.

SQL Server's support for window functions only extends aggregate capabilities a little so that the aggregate can be returned along with each record that constitutes the group. For functions like SUM and COUNT it is impossible to control the window boundaries and the records order. Such analytic functions can not be used to calculate running totals.

The common way to write such a running total query is using a subquery or a self join which would count the SUM of all previous records. However, the complexity of this query is O(n^2) and it's not usable for any real volumes of data.

This is one of the few cases when the cursors are faster than a set-based solution described above. But we all are aware of the drawbacks of cursors and better search for something else.

This task, fortunately, is a little more simple than it may seem, because it deals with dates. The number of all possible dates is usually limited and a recursive query can deal with this task quite efficiently.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

January 22nd, 2010 at 11:00 pm

Posted in SQL Server

Cumulative values

Comments enabled. I *really* need your comment

From Stack Overflow:

There is a table that stores signal statuses that come from different devices.

  • SS1 and SS2 signals are inserted to table in random times
  • If either of SS1 and SS2 signal statuses is up, then the resulting signal should be up
  • If both SS1 and SS2 signal statuses are down, then resulting signal should be down

I want to prepare a query that shows the result signal status changes according to SS1 and SS2 signals

Each record deals with only one signal type here: either SS1 or SS2. To obtain the signal statuses we should query the cumulative values of previous records.

If a record describes a change in SS2, we should query for the most recent change to SS1 that had been recorded so far to obtain the SS1's current status.

In systems other than Oracle, the previous value of a signal could be easily queried using subselects with TOP / LIMIT clauses. But Oracle does not support correlated queries nested more than one level deep, and limiting a subquery result to a single record (which is required by a subquery) requires it (ORDER BY should be nested). This makes constructing such a subquery in Oracle quite a pain.

However, in Oracle, these things can be queries using analytics functions much more efficiently.

Let's create a sample table:
Read the rest of this entry »

Written by Quassnoi

January 20th, 2010 at 11:00 pm

Posted in Oracle

Calculating mode

Comments enabled. I *really* need your comment

From Stack Overflow:

I have this query:

WITH    CTE AS (
        SELECT  e_id, 
                scale, 
                ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn,
                COUNT(scale) OVER(PARTITION BY e_id) AS cn
        FROM    ScoreMaster
        WHERE   scale IS NOT NULL
        )
SELECT  e_id,
        AVG(scale) AS [AVG],
        STDEV(scale) AS [StdDev],
        AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [FinancialMedian]
        MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS [StatisticalMedian]
FROM    CTE
GROUP BY
        e_id

How do I add Mode to this query?

A quick reminder: in statistics, mode is the value that occurs most frequently in a data set.

In other words, for each e_id, mode is the (exact) value of scale shared by most records with this e_id.

Unlike other statistical parameters used in this query, mode is not guaranteed to have a single value. If, say, 10 records have scale = 1 and 10 other records have scale = 2 (and all other values of scale are shared by less than 10 records), then there are two modes in this set (and the set, hence, is called bimodal). Likewise, there can be trimodal, quadrimodal or, generally speaking, multimodal sets.

This means that we should define a way on how to choose this mode.

There can be three approaches to this:

  1. Return every modal value
  2. Return a single modal value
  3. Return an aggregate of all modal values

To check all queries, we will generate a simple trimodal dataset:
Read the rest of this entry »

Written by Quassnoi

January 18th, 2010 at 11:00 pm

Posted in SQL Server

Latest visitors

Comments enabled. I *really* need your comment

From Stack Overflow:

Say you want to display the latest visitors on a users profile page.

How would you structure this?

Perhaps a table called uservisitors:

uservisitors
userid visitorid time

And how would you select this with MySQL without any duplicates?

What I mean is if user 1 visits user 2's profile, then 5 minutes later visits it again, I don't want to show both entries: only the latest.

There are two approaches to this.

First one would be just aggregating the visits, finding the max time and ordering on it. Something like this:

SELECT  visitorid, MAX(time) AS lastvisit
FROM    uservisitors
WHERE   userid = 1
GROUP BY
userid, visitorid
ORDER BY
lastvisit DESC
LIMIT 5

However, there is a little problem with this solution.

Despite the fact that MySQL (with proper indexing) uses INDEX FOR GROUP-BY optimization for this query, it will still have to sort on MAX(time) to find 5 latest records.

This will require sorting the whole resultset which will be huge if the service is heavily loaded.

Let's test it on a sample table:
Read the rest of this entry »

Written by Quassnoi

January 12th, 2010 at 11:00 pm

Posted in MySQL