Efficient INSERT ON DUPLICATE KEY UPDATE
From Stack Overflow:
I'm implementing a memory cache for a table in MySQL that looks like this:
item1   item2   cnt

The original table includes millions of pairs like this and updates rapidly.
To make it all more efficient I want to write new pairs to an identical memory table and update the real table on disk periodically by cron.
The cron should do the following: for each pair, if there is a similar pair in the non-memory table, increase the count by the count from the memory table.
If no such pair exists, create it with count from the memory table.
How can I make the flush (from memory table to real table) most efficient?
This is, of course, best achieved with INSERT ON DUPLICATE KEY UPDATE.

It's a MySQL flavor of the INSERT statement that automatically handles key violations and updates the row that causes the violation instead of inserting a new one.
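A minimal sketch of the syntax (the table t_counter and its columns are hypothetical, used here only to illustrate the clause):

-- If no row with id = 1 exists, this inserts (1, 1); otherwise it
-- increments the existing row's cnt by the value we tried to insert
-- (VALUES(cnt) refers to the value from the INSERT part).
INSERT
INTO    t_counter (id, cnt)
VALUES  (1, 1)
ON DUPLICATE KEY UPDATE
        cnt = cnt + VALUES(cnt);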
If we provide a SELECT statement as a row source for INSERT ON DUPLICATE KEY UPDATE, multiple input rows may cause a key violation on the same target row. In this case, the target row will be updated many times.
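For example, with the hypothetical t_counter above starting empty, a single statement whose row source repeats a key will update the same target row more than once:

INSERT
INTO    t_counter (id, cnt)
VALUES  (1, 1), (1, 2), (1, 5)
ON DUPLICATE KEY UPDATE
        cnt = cnt + VALUES(cnt);

-- (1, 1) is inserted; (1, 2) and (1, 5) each violate the key and
-- update the same row, leaving cnt = 1 + 2 + 5 = 8.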
To achieve the desired result, we have two options:
- Use the memory table as a row source for INSERT ON DUPLICATE KEY UPDATE as is. The target rows will be updated multiple times.
- Aggregate the counters in the memory table and use the aggregated values as the row source. The target rows will be updated at most once.

The first solution will update the rows many times (which can be slow), while the second one involves some overhead on GROUP BY.
Let's create the sample tables and see which one is more efficient:
CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_target (
        id INT NOT NULL PRIMARY KEY,
        cnt INT NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE t_dense (
        id INT NOT NULL PRIMARY KEY,
        target INT NOT NULL,
        cnt INT NOT NULL,
        KEY ix_dense_target (target)
) ENGINE=Memory;

CREATE TABLE t_sparse (
        id INT NOT NULL PRIMARY KEY,
        target INT NOT NULL,
        cnt INT NOT NULL,
        KEY ix_sparse_target (target)
) ENGINE=Memory;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT INTO filler SELECT _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;

INSERT INTO t_target
SELECT  id, 1
FROM    filler;

-- t_sparse: target = id, so all 100,000 values of target are distinct
INSERT INTO t_sparse
SELECT  id, id, (RAND(20090627) * 10000) + 1
FROM    filler;

-- t_dense: each value of target repeats 1,000 times, so there are only 100 distinct values
INSERT INTO t_dense
SELECT  id, FLOOR((id - 1) / 1000) + 1, (RAND(20090627) * 10) + 1
FROM    filler;
We create one target table (InnoDB) and two source tables (Memory).
The target table, t_target, has 100,000 rows.

The first source table, t_sparse, has 100,000 rows with 100,000 distinct values of target.

The second source table, t_dense, has 100,000 rows with only 100 distinct values of target.
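A quick sanity check (not part of the measurements below) confirms the shape of the two source tables:

SELECT  COUNT(*), COUNT(DISTINCT target)
FROM    t_sparse;
-- 100000, 100000: every row has its own target

SELECT  COUNT(*), COUNT(DISTINCT target)
FROM    t_dense;
-- 100000, 100: each target is shared by 1,000 rows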
Let's run the queries against the sparse table first:
INSERT
INTO    t_target
SELECT  target, cnt
FROM    t_sparse d
ON DUPLICATE KEY UPDATE
        t_target.cnt = t_target.cnt + d.cnt;
Query OK, 200000 rows affected (3.41 sec)
Records: 100000  Duplicates: 100000  Warnings: 0
INSERT
INTO    t_target
SELECT  target, cnt
FROM    (
        SELECT  target, SUM(cnt) AS cnt
        FROM    t_sparse di
        GROUP BY
                target
        ) d
ON DUPLICATE KEY UPDATE
        t_target.cnt = t_target.cnt + d.cnt;
Query OK, 200000 rows affected (4.70 sec)
Records: 100000  Duplicates: 100000  Warnings: 0
As we can see, the GROUP BY is inefficient here.

Since the values of target are in fact unique in this table, GROUP BY adds nothing to performance (exactly the same number of rows still needs to be updated), but it takes extra time to aggregate the values.
Now, let's run the same queries on the dense table:
INSERT
INTO    t_target
SELECT  target, cnt
FROM    t_dense d
ON DUPLICATE KEY UPDATE
        t_target.cnt = t_target.cnt + d.cnt;
Query OK, 200000 rows affected (3.27 sec)
Records: 100000  Duplicates: 100000  Warnings: 0
Using the table as is yields the same performance for both the dense and the sparse table, since exactly the same number of rows needs to be updated.
Let's try the GROUP BY query against the dense table:

INSERT
INTO    t_target
SELECT  target, cnt
FROM    (
        SELECT  target, SUM(cnt) AS cnt
        FROM    t_dense di
        GROUP BY
                target
        ) d
ON DUPLICATE KEY UPDATE
        t_target.cnt = t_target.cnt + d.cnt;
Query OK, 200 rows affected (0.11 sec)
Records: 100  Duplicates: 100  Warnings: 0
And this query is much more efficient: only 0.11 sec.

GROUP BY pays for itself here: it shrinks the source rowset from 100,000 rows to 100, so the target table needs to be updated only 100 times, which takes almost no time.
Summary:
- For a source rowset with many distinct key values that will lead to key violations (relative to the rowset size), it's better to feed the rowset to INSERT ON DUPLICATE KEY UPDATE as is, and let the rows causing key violations be updated multiple times.
- For a source rowset with few such values, it's better to aggregate the rowset with GROUP BY and update the target table with the aggregated values, since the performance gain from far fewer updates outweighs the GROUP BY overhead.
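Applied to the original question's schema (the names t_pairs and t_pairs_memory are my assumption; the question only mentions the columns item1, item2 and cnt, with a unique key over the pair), the periodic flush might look like this:

-- Aggregate the memory table first: if the cache accumulates many
-- duplicate pairs between flushes, this is the efficient variant.
INSERT
INTO    t_pairs
SELECT  item1, item2, cnt
FROM    (
        SELECT  item1, item2, SUM(cnt) AS cnt
        FROM    t_pairs_memory
        GROUP BY
                item1, item2
        ) m
ON DUPLICATE KEY UPDATE
        t_pairs.cnt = t_pairs.cnt + m.cnt;

-- Empty the memory table so the next period starts from zero.
DELETE
FROM    t_pairs_memory;

If the pairs in the cache are mostly distinct, drop the GROUP BY subquery and feed t_pairs_memory to INSERT ON DUPLICATE KEY UPDATE as is.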