EXPLAIN EXTENDED

How to create fast database queries

Converting currencies

When developing a financial application, one needs to deal with currencies and convert from one currency to another.

There may be different scenarios, but for accounting purposes the currency rates set and published once a day by the appropriate official regulator are most often used.

A typical task looks like this: given a set of transactions in one currency, get an appropriate amount in another currency.

Let's create sample tables and see how it may be done:

CREATE TABLE t_transaction (
        xac_id INT NOT NULL,
        xac_currency INT NOT NULL,
        xac_amount NUMBER(10, 2) NOT NULL,
        xac_date DATE NOT NULL
)
/
CREATE TABLE t_rate (
        rte_currency INT NOT NULL,
        rte_date DATE NOT NULL,
        rte_rate NUMBER(10, 4) NOT NULL
)
/
ALTER TABLE t_transaction
ADD CONSTRAINT pk_transaction_id
PRIMARY KEY (xac_id)
/
CREATE INDEX ix_transaction_date ON t_transaction (xac_date)
/
ALTER TABLE t_rate
ADD CONSTRAINT pk_rate_currency_date
PRIMARY KEY (rte_currency, rte_date)
/
BEGIN
        DBMS_RANDOM.seed(20090518);
END;
/
INSERT
INTO    t_transaction (xac_id, xac_currency, xac_amount, xac_date)
SELECT  level,
        DBMS_RANDOM.value() + 1,
        FLOOR(DBMS_RANDOM.value() * 1000000 + 1) / 100,
        TO_DATE('27.05.2009', 'dd.mm.yyyy') - (level / 1440)
FROM    dual
CONNECT BY
        level <= 1000000
/
INSERT  ALL
INTO    t_rate (rte_currency, rte_date, rte_rate)
VALUES  (
        1,
        TO_DATE('27.05.2009', 'dd.mm.yyyy') - lvl,
        25 + CAST((DBMS_RANDOM.value() - 0.5) * 30000 AS INT) / 10000
        )
INTO    t_rate (rte_currency, rte_date, rte_rate)
VALUES  (
        2,
        TO_DATE('27.05.2009', 'dd.mm.yyyy') - lvl,
        30 + CAST((DBMS_RANDOM.value() - 0.5) * 30000 AS INT) / 10000
        )
SELECT  level AS lvl
FROM    dual
WHERE   MOD(level, 7) NOT IN (3, 4)
CONNECT BY
        level <= 1000
/
COMMIT
/
BEGIN
        DBMS_STATS.gather_schema_stats('"20090527_rates"');
END;
/

There are 1,000,000 transactions in 2 currencies, one transaction per minute, with amounts randomly distributed from 0.01 to 9999.99:

SELECT  *
FROM    t_transaction
WHERE   rownum <= 10

XAC_ID  XAC_CURRENCY  XAC_AMOUNT  XAC_DATE
     1             2     5527,24  26.05.2009 23:59:00
     2             1     7066,16  26.05.2009 23:58:00
     3             2     6849,27  26.05.2009 23:57:00
     4             1     5654,12  26.05.2009 23:56:00
     5             2     6748,84  26.05.2009 23:55:00
     6             2     5184,32  26.05.2009 23:54:00
     7             1     6832,27  26.05.2009 23:53:00
     8             2     1572,45  26.05.2009 23:52:00
     9             2     2287,1   26.05.2009 23:51:00
    10             1     2916,78  26.05.2009 23:50:00

For each currency, we have one rate per day, except for Saturdays and Sundays (that's how most central banks work):

SELECT  *
FROM    (
        SELECT  *
        FROM    t_rate
        ORDER BY
                rte_date DESC
        )
WHERE   rownum <= 15

RTE_CURRENCY  RTE_DATE             RTE_RATE
           1  26.05.2009 00:00:00   23,9699
           2  26.05.2009 00:00:00   29,0941
           1  25.05.2009 00:00:00   24,6972
           2  25.05.2009 00:00:00   29,6727
           1  22.05.2009 00:00:00   25,0568
           2  22.05.2009 00:00:00   30,5586
           1  21.05.2009 00:00:00   24,0338
           2  21.05.2009 00:00:00   29,2726
           1  20.05.2009 00:00:00   26,4241
           2  20.05.2009 00:00:00   31,1259
           1  19.05.2009 00:00:00   24,3832
           2  19.05.2009 00:00:00   31,2603
           1  18.05.2009 00:00:00   25,891
           2  18.05.2009 00:00:00   31,1934
           1  15.05.2009 00:00:00   25,5071

Now, we need to find the sum of these transactions, converted to the main currency.

To convert to the main currency, we should multiply the amount by the effective rate for the given date.

An effective rate is the last rate published up to the given date.
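In plain terms, an effective-rate lookup is "find the latest published rate on or before the given day". Here is a minimal Python sketch of that rule, using the currency 1 rates from the sample data above (the function name is illustrative):

```python
from bisect import bisect_right
from datetime import date

def effective_rate(published, day):
    """Return the rate with the latest publication date on or before day.

    published: list of (date, rate) pairs, sorted by date ascending.
    """
    dates = [d for d, _ in published]
    pos = bisect_right(dates, day)          # first index with date > day
    if pos == 0:
        raise ValueError("no rate published on or before %s" % day)
    return published[pos - 1][1]

# Rates for currency 1, taken from the sample data above;
# May 23-24, 2009 is a weekend, so no rates were published then
rates = [
    (date(2009, 5, 21), 24.0338),
    (date(2009, 5, 22), 25.0568),
    (date(2009, 5, 25), 24.6972),
]

# A Saturday transaction falls back to Friday's rate
print(effective_rate(rates, date(2009, 5, 23)))   # 25.0568
```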

A typical query to do this looks like this:

SELECT  TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999')
FROM    t_transaction x
JOIN    t_rate r
ON      (rte_currency, rte_date) IN
        (
        SELECT  xac_currency, MAX(rte_date)
        FROM    t_rate
        WHERE   rte_currency = xac_currency
                AND rte_date <= xac_date
        )

TO_CHAR(SUM(XAC_AMOUNT*RTE_RATE),'FM999G999G999G999G999G999D999999')
137 716 183 880,132757
1 row fetched in 0.0001s (9.7030s)
SELECT STATEMENT 
 SORT AGGREGATE
  NESTED LOOPS 
   TABLE ACCESS FULL, 20090527_rates.T_TRANSACTION
   TABLE ACCESS BY INDEX ROWID, 20090527_rates.T_RATE
    INDEX UNIQUE SCAN, 20090527_rates.PK_RATE_CURRENCY_DATE
     SORT AGGREGATE
      FIRST ROW 
       INDEX RANGE SCAN (MIN/MAX), 20090527_rates.PK_RATE_CURRENCY_DATE

This works, and works quite fast (less than 10 seconds), which is OK for 1,000,000 rows.

But can we improve on it?

If we look at the query plan, we will see that NESTED LOOPS is used to join t_transaction and t_rate.

This join method is quite inefficient for large datasets, but here it's the only possible method, since we need to find the effective rate for each transaction.

Neither HASH JOIN nor MERGE JOIN can be used here, since the JOIN condition involves a less-than-or-equal comparison.

To use a HASH JOIN, which would be the most efficient method here, we need to build a rowset that can be joined to each row of t_transaction with an equality condition.

Can we build such a resultset?

We only have to use an inequality condition because there are gaps in the rate table on the days when no rates were published.

If we had a complete set of days, without any gaps, we could just truncate each transaction's date and use it in a join.

With Oracle, it's possible to build such a set.

Here are the steps to do that:

  1. Build a set of days, starting from the minimal date and ending with the maximal one found in t_rate. This can be done with a hierarchical query over dual.
  2. Build a set of currencies. Since there are only 2 of them, a plain UNION ALL will do.
  3. Make a cartesian product of these sets, so that we have every day and every currency in the range covered by t_rate.
  4. Join this with t_rate, finding the effective rate for each day and each currency.
  5. Group all transactions by truncated date and find their sums within each date.
  6. Join the resulting sets and find the total sum.
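The steps above can be sketched in Python with a toy in-memory dataset (the values and variable names are invented for illustration; the point is that the final join uses only equality on (currency, day)):

```python
from datetime import date, timedelta
from itertools import product

# Steps 1-2: toy rate table with a gap (no rate on May 23), two currencies
rates = {
    (1, date(2009, 5, 22)): 25.0, (2, date(2009, 5, 22)): 30.0,
    (1, date(2009, 5, 24)): 26.0, (2, date(2009, 5, 24)): 31.0,
}
days = [date(2009, 5, 22) + timedelta(d) for d in range(3)]  # 22, 23, 24
currencies = [1, 2]

# Steps 3-4: cartesian product, filling each gap with the last known rate
effective = {}
for cur, day in product(currencies, days):
    eff_date = max(d for c, d in rates if c == cur and d <= day)
    effective[(cur, day)] = rates[(cur, eff_date)]

# Step 5: transactions summed up by currency and (truncated) date
sums = {}
for cur, day, amount in [
    (1, date(2009, 5, 23), 100.0),   # gap day: uses the May 22 rate
    (2, date(2009, 5, 24), 200.0),
]:
    key = (cur, day)
    sums[key] = sums.get(key, 0.0) + amount

# Step 6: equality join against the gap-free rate set
total = sum(amount * effective[key] for key, amount in sums.items())
print(total)   # 100 * 25.0 + 200 * 31.0 = 8700.0
```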

Here is the resulting query:

WITH v_rate AS
        (
        SELECT  cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
        FROM    (
                SELECT  cur_id, dte_date,
                        (
                        SELECT  MAX(rte_date)
                        FROM    t_rate ri
                        WHERE   rte_currency = cur_id
                                AND rte_date <= dte_date
                        ) AS rte_effdate
                FROM    (
                        SELECT  (
                                SELECT  MAX(rte_date)
                                FROM    t_rate
                                ) - level + 1 AS dte_date
                        FROM    dual
                        CONNECT BY
                                level <=
                                (
                                SELECT  MAX(rte_date) - MIN(rte_date)
                                FROM    t_rate
                                )
                        ) v_date,
                        (
                        SELECT  1 AS cur_id
                        FROM    dual
                        UNION ALL
                        SELECT  2 AS cur_id
                        FROM    dual
                        ) v_currency
                ) v_eff
        LEFT JOIN
                t_rate
        ON      rte_currency = cur_id
                AND rte_date = rte_effdate
        )
SELECT  TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
FROM    (
        SELECT  xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
        FROM    t_transaction x
        GROUP BY
                xac_currency, TRUNC(xac_date)
        )
JOIN    v_rate
ON      eff_currency = xac_currency
        AND eff_date = xac_date
TO_CHAR(SUM(XAC_AMOUNT*EFF_RATE),'FM999G999G999G999G999G999D999999')
137 716 183 880,132757
1 row fetched in 0.0001s (1.6563s)
SELECT STATEMENT 
 SORT AGGREGATE
  HASH JOIN 
   NESTED LOOPS OUTER
    VIEW 
     MERGE JOIN CARTESIAN
      VIEW 
       CONNECT BY WITHOUT FILTERING
        FAST DUAL 
      BUFFER SORT
       VIEW 
        UNION-ALL 
         FAST DUAL 
         FAST DUAL 
    TABLE ACCESS BY INDEX ROWID, 20090527_rates.T_RATE
     INDEX UNIQUE SCAN, 20090527_rates.PK_RATE_CURRENCY_DATE
   VIEW 
    HASH GROUP BY
     TABLE ACCESS FULL, 20090527_rates.T_TRANSACTION

Now it's only 1.65 seconds, or 6 times as fast.

This query is quite complex, so it needs some explanation.

                        SELECT  (
                                SELECT  MAX(rte_date)
                                FROM   t_rate
                                ) - level + 1 AS dte_date
                        FROM    dual
                        CONNECT BY
                                level <=
                                (
                                SELECT  MAX(rte_date) - MIN(rte_date)
                                FROM    t_rate
                                )
                        ) v_date

This is a plain hierarchical query over dual, which is widely used in Oracle to generate a series of numbers.

This generates consecutive dates, counting down from MAX(rte_date) toward MIN(rte_date).
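A Python sketch of the same series generator, with boundary dates standing in for the MIN/MAX subqueries (values chosen for illustration):

```python
from datetime import date, timedelta

# CONNECT BY level <= n over dual yields the numbers 1..n;
# subtracting (level - 1) days from the maximal date gives the series
max_date = date(2009, 5, 26)    # stands in for MAX(rte_date)
min_date = date(2009, 5, 18)    # stands in for MIN(rte_date)

n = (max_date - min_date).days
series = [max_date - timedelta(days=level - 1) for level in range(1, n + 1)]

print(series[0], len(series))
```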

                        SELECT  (
                                SELECT  MAX(rte_date)
                                FROM    t_rate
                                ) - level + 1 AS dte_date
                        FROM    dual
                        CONNECT BY
                                level <=
                                (
                                SELECT  MAX(rte_date) - MIN(rte_date)
                                FROM    t_rate
                                )
                        ) v_date,
                        (
                        SELECT  1 AS cur_id
                        FROM    dual
                        UNION ALL
                        SELECT  2 AS cur_id
                        FROM    dual
                        ) v_currency

The previous query is cross-joined with a set of currencies (1 and 2) to produce a cartesian product: the resultset has every date paired with every currency.
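The MERGE JOIN CARTESIAN step in the plan corresponds to itertools.product in this Python sketch (a three-day range is used just for illustration):

```python
from datetime import date, timedelta
from itertools import product

days = [date(2009, 5, 26) - timedelta(days=d) for d in range(3)]  # illustrative
currencies = [1, 2]

# every date is paired with every currency, exactly like the cartesian join
pairs = list(product(days, currencies))
print(len(pairs))   # 3 dates x 2 currencies = 6 pairs
```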

                SELECT  cur_id, dte_date,
                        (
                        SELECT  MAX(rte_date)
                        FROM    t_rate ri
                        WHERE   rte_currency = cur_id
                                AND rte_date <= dte_date
                        ) AS rte_effdate
                FROM    (
                        SELECT  (
                                SELECT  MAX(rte_date)
                                FROM    t_rate
                                ) - level + 1 AS dte_date
                        FROM    dual
                        CONNECT BY
                                level <=
                                (
                                SELECT  MAX(rte_date) - MIN(rte_date)
                                FROM    t_rate
                                )
                        ) v_date,
                        (
                        SELECT  1 AS cur_id
                        FROM    dual
                        UNION ALL
                        SELECT  2 AS cur_id
                        FROM    dual
                        ) v_currency
                ) v_eff

We select the dates and currencies from the previous set, plus one more field: the last date on or before the given one when a rate was published. Since the (rte_currency, rte_date) pair is unique for each rate, we can use it in a JOIN.

WITH v_rate AS
        (
        SELECT  cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
        FROM    (
                SELECT  cur_id, dte_date,
                        (
                        SELECT  MAX(rte_date)
                        FROM    t_rate ri
                        WHERE   rte_currency = cur_id
                                AND rte_date <= dte_date
                        ) AS rte_effdate
                FROM    (
                        SELECT  (
                                SELECT  MAX(rte_date)
                                FROM    t_rate
                                ) - level + 1 AS dte_date
                        FROM    dual
                        CONNECT BY
                                level <=
                                (
                                SELECT  MAX(rte_date) - MIN(rte_date)
                                FROM    t_rate
                                )
                        ) v_date,
                        (
                        SELECT  1 AS cur_id
                        FROM    dual
                        UNION ALL
                        SELECT  2 AS cur_id
                        FROM    dual
                        ) v_currency
                ) v_eff
        LEFT JOIN
                t_rate
        ON      rte_currency = cur_id
                AND rte_date = rte_effdate
        )

And here's the JOIN. The CTE above produces the effective rate for every (currency, date) pair in the range covered by t_rate.

        SELECT  xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
        FROM    t_transaction x
        GROUP BY
                xac_currency, TRUNC(xac_date)

This query selects the sums of all transactions, grouped by currency and date. Note that we have to use TRUNC(xac_date) here: unlike the rates, the transactions have their time part set, and it matters.
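An illustrative Python analogue of this GROUP BY, truncating the timestamps to dates (the three rows are taken from the sample output above):

```python
from datetime import datetime

# The first three transactions from the sample data above
transactions = [
    (2, datetime(2009, 5, 26, 23, 59), 5527.24),
    (1, datetime(2009, 5, 26, 23, 58), 7066.16),
    (2, datetime(2009, 5, 26, 23, 57), 6849.27),
]

sums = {}
for currency, ts, amount in transactions:
    key = (currency, ts.date())      # TRUNC(xac_date): drop the time part
    total, cnt = sums.get(key, (0.0, 0))
    sums[key] = (total + amount, cnt + 1)

print(sums)
```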

Finally, we just join these queries:

WITH v_rate AS
        (
        SELECT  cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
        FROM    (
                SELECT  cur_id, dte_date,
                        (
                        SELECT  MAX(rte_date)
                        FROM    t_rate ri
                        WHERE   rte_currency = cur_id
                                AND rte_date <= dte_date
                        ) AS rte_effdate
                FROM    (
                        SELECT  (
                                SELECT  MAX(rte_date)
                                FROM    t_rate
                                ) - level + 1 AS dte_date
                        FROM    dual
                        CONNECT BY
                                level <=
                                (
                                SELECT  MAX(rte_date) - MIN(rte_date)
                                FROM    t_rate
                                )
                        ) v_date,
                        (
                        SELECT  1 AS cur_id
                        FROM    dual
                        UNION ALL
                        SELECT  2 AS cur_id
                        FROM    dual
                        ) v_currency
                ) v_eff
        LEFT JOIN
                t_rate
        ON      rte_currency = cur_id
                AND rte_date = rte_effdate
        )
SELECT  TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
FROM    (
        SELECT  xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
        FROM    t_transaction x
        GROUP BY
                xac_currency, TRUNC(xac_date)
        )
JOIN    v_rate
ON      eff_currency = xac_currency
        AND eff_date = xac_date

Since NESTED LOOPS is now used only to build a relatively small resultset, and the large one is joined with a HASH JOIN, overall query performance is much better.
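The intuition can be sketched as a dictionary lookup in Python: hash the small gap-free rate set once, then probe it for each aggregated row (toy amounts; the rates are the May 26 values from the sample data):

```python
# Build side: the small v_rate rowset, keyed by (currency, day)
rate_hash = {(1, "2009-05-26"): 23.9699, (2, "2009-05-26"): 29.0941}

# Probe side: aggregated transaction sums; each lookup is O(1),
# instead of an index seek per row as with nested loops
grouped = [((1, "2009-05-26"), 100.0), ((2, "2009-05-26"), 200.0)]

total = sum(amount * rate_hash[key] for key, amount in grouped)
print(round(total, 2))   # 100 * 23.9699 + 200 * 29.0941 = 8215.81
```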

Written by Quassnoi

May 27th, 2009 at 11:00 pm

Posted in Oracle
