EXPLAIN EXTENDED

How to create fast database queries

Choosing column order

with one comment

From Stack Overflow:

I know you can ALTER the column order in MySQL with FIRST and AFTER, but why would you want to bother?

Since good queries explicitly name columns when inserting data, is there really any reason to care what order your columns are in in the table?

Well, there is reason.

MySQL stores some types (out of which VARCHAR is probably the most important) with variable length.

That means that each variable field stored in the row is prepended with its actual length, like in Pascal string type.

Since the rows have no TOC, it means that the engine should traverse all previous values to find out where the required value begins.

If we need the value of the first column (first non-PRIMARY KEY column, of course), the engine will do it once per row.

But if we need the value of, say, 20-th column, the engine should find out the lengths of previous 19 columns and add them together.

Let's create a sample table and see how big the impact is:

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_column (
id INT NOT NULL PRIMARY KEY,
col1 VARCHAR(10),
col2 VARCHAR(10),
col3 VARCHAR(10),
col4 VARCHAR(10),
col5 VARCHAR(10),
col6 VARCHAR(10),
col7 VARCHAR(10),
col8 VARCHAR(10),
col9 VARCHAR(10),
col10 VARCHAR(10),
col11 VARCHAR(10),
col12 VARCHAR(10),
col13 VARCHAR(10),
col14 VARCHAR(10),
col15 VARCHAR(10),
col16 VARCHAR(10),
col17 VARCHAR(10),
col18 VARCHAR(10),
col19 VARCHAR(10),
col20 VARCHAR(10)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_column
SELECT  id,
        LPAD('', RAND(20090521) * 10, 'a'), 
        LPAD('', RAND(20090521 << 1) * 10, 'a'), 
        LPAD('', RAND(20090521 << 2) * 10, 'a'), 
        LPAD('', RAND(20090521 << 3) * 10, 'a'), 
        LPAD('', RAND(20090521 << 4) * 10, 'a'), 
        LPAD('', RAND(20090521 << 5) * 10, 'a'), 
        LPAD('', RAND(20090521 << 6) * 10, 'a'), 
        LPAD('', RAND(20090521 << 7) * 10, 'a'), 
        LPAD('', RAND(20090521 << 8) * 10, 'a'), 
        LPAD('', RAND(20090521 << 9) * 10, 'a'), 
        LPAD('', RAND(20090521 << 10) * 10, 'a'), 
        LPAD('', RAND(20090521 << 11) * 10, 'a'), 
        LPAD('', RAND(20090521 << 12) * 10, 'a'), 
        LPAD('', RAND(20090521 << 13) * 10, 'a'), 
        LPAD('', RAND(20090521 << 14) * 10, 'a'), 
        LPAD('', RAND(20090521 << 15) * 10, 'a'), 
        LPAD('', RAND(20090521 << 16) * 10, 'a'), 
        LPAD('', RAND(20090521 << 17) * 10, 'a'), 
        LPAD('', RAND(20090521 << 18) * 10, 'a'), 
        LPAD('', RAND(20090521 << 19) * 10, 'a')
FROM    filler;

We have a table with an INTEGER id and 20 nullable VARCHAR(10) columns, filled with strings of random length.

Now, let's try to find the sum of lengths of the first column:

SELECT  SUM(LENGTH(col1))
FROM    t_column

SUM(LENGTH(col1))
5005271
1 row fetched in 0.0001s (1.0937s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_column ALL 997594 100.00

This query takes 1.09 seconds.

Same on the 10'th column:

SELECT  SUM(LENGTH(col10))
FROM    t_column

SUM(LENGTH(col10))
5002918
1 row fetched in 0.0001s (1.2031s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_column ALL 997594 100.00

Now, it 1.20 seconds, or 10% more.

And, finally, same query for the 20'th column:

SELECT  SUM(LENGTH(col20))
FROM    t_column

SUM(LENGTH(col20))
5006319
1 row fetched in 0.0001s (1.2812s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_column ALL 997594 100.00

It's 1.28 seconds, or 17% more than for the first column.

Summary:

In case of many variable length columns on the table, the column order may affect the performance of the queries.

The less close a column is to the beginning of the row, the more preceding columns InnoDB engine should examine to find out the offset of a given one.

Columns that are closer to the beginning of the table are selected faster.

Written by Quassnoi

May 21st, 2009 at 11:00 pm

Posted in MySQL

One Response to 'Choosing column order'

Subscribe to comments with RSS

  1. Further experimenting (and probably newer machine, version (5.6.12), etc):

    col1 takes 0.34s; any other col takes 0.36s (6% slower). From that, I conclude that the first column is special cased.

    We really need someone to look at the code and see what differences are really likely.

    Rick James

    9 Jun 15 at 21:38

Leave a Reply