EXPLAIN EXTENDED

How to create fast database queries

Efficient INSERT ON DUPLICATE KEY UPDATE


From Stack Overflow:

I'm implementing a memory cache for a table in MySQL that looks like this:

item1 item2 cnt

The original table includes millions of pairs like this and updates rapidly.

To make it all more efficient I want to write new pairs to an identical memory table and update the real table on disk periodically by cron.

The cron job should do the following: for each pair, if a similar pair exists in the non-memory table, increase its count by the count from the memory table.

If no such pair exists, create it with the count from the memory table.

How can I make the flush (from memory table to real table) most efficient?

This is of course best achieved with INSERT ON DUPLICATE KEY UPDATE.

It's MySQL's flavor of the INSERT statement: it automatically handles key violations and updates the row that caused the violation instead of inserting a new one.

If we provide a SELECT statement as a row source for INSERT ON DUPLICATE KEY UPDATE, multiple input rows may cause the key violation on the target row. In this case, the target row will be updated many times.
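Applied to the question, the basic shape of the flush might look like the sketch below. The names t_counts (the on-disk table), t_cache (the memory table) and the columns item1, item2, cnt are assumptions based on the question, not the asker's actual schema. The important part is that the pair (item1, item2) is covered by a PRIMARY KEY or UNIQUE key, since a violation of that key is what triggers the UPDATE branch:

-- Assumed on-disk table: (item1, item2) must be covered by a
-- PRIMARY KEY or UNIQUE key for ON DUPLICATE KEY UPDATE to fire
CREATE TABLE t_counts (
        item1 INT NOT NULL,
        item2 INT NOT NULL,
        cnt INT NOT NULL,
        PRIMARY KEY (item1, item2)
) ENGINE=InnoDB;

-- Assumed memory cache with the same structure
CREATE TABLE t_cache (
        item1 INT NOT NULL,
        item2 INT NOT NULL,
        cnt INT NOT NULL,
        PRIMARY KEY (item1, item2)
) ENGINE=Memory;

-- The flush: new pairs are inserted, existing pairs get their
-- counter incremented instead of raising a duplicate key error
INSERT
INTO    t_counts
SELECT  item1, item2, cnt
FROM    t_cache c
ON DUPLICATE KEY UPDATE
        t_counts.cnt = t_counts.cnt + c.cnt;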

To achieve the desired result, we have two options:

  1. Use the memory table as a row source for INSERT ON DUPLICATE KEY UPDATE as is. The target rows will be updated multiple times.
  2. Aggregate the counters in the memory table and use the aggregated values as the row source. The target rows will be updated at most once.

The first solution will update the rows many times (which can be slow), while the second one involves some overhead on GROUP BY.

Let's create the sample tables and see which one is more efficient:

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_target (
id INT NOT NULL PRIMARY KEY,
cnt INT NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE t_dense (
id INT NOT NULL PRIMARY KEY,
target INT NOT NULL,
cnt INT NOT NULL,
KEY ix_dense_target (target)
) ENGINE=Memory;

CREATE TABLE t_sparse (
id INT NOT NULL PRIMARY KEY,
target INT NOT NULL,
cnt INT NOT NULL,
KEY ix_sparse_target (target)
) ENGINE=Memory;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT
INTO    t_target
SELECT  id, 1
FROM    filler;

INSERT
INTO    t_sparse
SELECT  id, id, (RAND(20090627) * 10000) + 1
FROM    filler;

INSERT
INTO    t_dense
SELECT  id, FLOOR((id - 1) / 1000) + 1, (RAND(20090627) * 10) + 1
FROM    filler;

We create one target table (InnoDB) and two source tables (Memory).

The target table, t_target, has 100,000 rows.

The first source table, t_sparse, has 100,000 rows with 100,000 distinct values of target.

The second source table, t_dense, has 100,000 rows, but only 100 distinct values of target.
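These counts are easy to verify:

SELECT  COUNT(*), COUNT(DISTINCT target)
FROM    t_sparse;
-- 100000 rows, 100000 distinct values of target (target = id)

SELECT  COUNT(*), COUNT(DISTINCT target)
FROM    t_dense;
-- 100000 rows, 100 distinct values of target (1000 rows per value)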

Let's run the queries against the sparse table first:

INSERT
INTO    t_target
SELECT  target, cnt
FROM    t_sparse d
ON DUPLICATE KEY UPDATE
t_target.cnt = t_target.cnt + d.cnt

Query OK, 200000 rows affected (3.41 sec)
Records: 100000  Duplicates: 100000  Warnings: 0

INSERT
INTO    t_target
SELECT  target, cnt
FROM    (
SELECT  target, SUM(cnt) AS cnt
FROM    t_sparse di
GROUP BY
target
) d
ON DUPLICATE KEY UPDATE
t_target.cnt = t_target.cnt + d.cnt

Query OK, 200000 rows affected (4.70 sec)
Records: 100000  Duplicates: 100000  Warnings: 0

As we can see, the GROUP BY is inefficient here.

Since the values of target are in fact unique in this table, GROUP BY adds nothing to performance (exactly the same number of rows still needs to be updated), but it takes some time to aggregate the values.

Now, let's run the same queries on the dense table:

INSERT
INTO    t_target
SELECT  target, cnt
FROM    t_dense d
ON DUPLICATE KEY UPDATE
t_target.cnt = t_target.cnt + d.cnt

Query OK, 200000 rows affected (3.27 sec)
Records: 100000  Duplicates: 100000  Warnings: 0

Using the table as is yields the same performance for both the dense and the sparse table, since exactly the same number of rows needs to be updated.

Let's try the GROUP BY query against the dense table:

INSERT
INTO    t_target
SELECT  target, cnt
FROM    (
SELECT  target, SUM(cnt) AS cnt
FROM    t_dense di
GROUP BY
target
) d
ON DUPLICATE KEY UPDATE
t_target.cnt = t_target.cnt + d.cnt

Query OK, 200 rows affected (0.11 sec)
Records: 100  Duplicates: 100  Warnings: 0

And this query is much more efficient: only 0.11 sec.

GROUP BY pays for itself here: it shrinks the source rowset from 100,000 rows to 100 rows, so the target table needs to be updated only 100 times, which takes almost no time.

Summary:

  • For a source rowset with many distinct key values relative to its size (few source rows per target row), it's better to feed the rowset to INSERT ON DUPLICATE KEY UPDATE as is and let the rows causing key violations be updated multiple times.
  • For a source rowset with few distinct key values (many source rows per target row), it's better to aggregate it with GROUP BY first and update the target table with the aggregated values: the performance gain from doing far fewer updates outweighs the GROUP BY overhead. A sketch applying this to the original question follows below.
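Coming back to the original question: which variant to use for the cron flush depends on how many duplicate pairs the memory table accumulates between flushes. If it holds many rows per pair, the aggregated variant applies. A sketch, reusing the assumed t_counts / t_cache tables and (item1, item2, cnt) columns from above:

-- Flush the cache into the on-disk table, aggregating first so that
-- each (item1, item2) pair in t_counts is updated at most once
INSERT
INTO    t_counts
SELECT  item1, item2, cnt
FROM    (
        SELECT  item1, item2, SUM(cnt) AS cnt
        FROM    t_cache
        GROUP BY
                item1, item2
        ) c
ON DUPLICATE KEY UPDATE
        t_counts.cnt = t_counts.cnt + c.cnt;

-- Empty the cache; rows written to t_cache between the flush above
-- and this statement would be lost, so a real job would lock the
-- cache table or swap in a fresh one first
DELETE
FROM    t_cache;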

Written by Quassnoi

June 27th, 2009 at 11:00 pm

Posted in MySQL
