Archive for April 1st, 2011
Shared Plan and Algorithm Network Cache (SPANC)
Due to the nature of my work I have to deal with various database systems.
While SQL is more or less standardized, the optimizers are implemented differently in the different systems. Some systems cannot join tables with anything other than nested loops, the other can only GROUP BY
using a sort, etc.
So when you write a join in, say, MySQL, you cannot expect it to be a sort merge join (and you should consider this fact when designing the query). Or, when you write a DISTINCT
in SQL Server, you can't expect a loose index scan. These are limitations put by their optimizers.
However, in the last three months I noticed a great improvement in the queries where I could not expect any.
It started when I tried to debug this in SQL Server:
SELECT DISTINCT order FROM orderItem
while yielded this plan:
Similar results were obtained on Oracle:
Plan hash value: 1345318323 --------------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | | 200 | 2 (50)| 00:00:01 | | 1 | REMOTE SPANC QUERY (SQLSERVER, MYSQL, POSTGRESQL, STACKOVERFLOW) | | | 200 | 2 (50)| 00:00:01 | ---------------------------------------------------------------------------------------------------------------------------------
, MySQL:
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+ | 1 | SIMPLE | orderItem | spanc | NULL | ALL | NULL | NULL | | Using Oracle, PostgreSQL, SQL Server, StackOverflow | +----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+
and PostgreSQL:
Seq Scan on OrderItem (cost=0.00..6.44 width=4) -> Remote Scan on SPANC (Oracle, MySQL, SQL Server, StackOverflow) (cost=0.00..100.00 width=4)
Network analysis has shown weird encrypted activity between the servers in my internal network which host SQL Server, Oracle, PostgreSQL and MySQL servers.
Ultimately, there was unencrypted activity outside of the internal network which turned out to be an HTTP POST
request followed by several GET
polls to http://stackoverflow.com/questions/5518080/distinct-optimization.
It seems that the developers of major database systems agreed to share the knowledge about the most efficient query plans in some kind of a distributed storage (which probably is called SPANC as we can see in the query plans) and provide an interface to access each other's systems.
It also seems that these systems treat Stack Overflow as an external optimization engine where the most experienced developers can build their plans for them in a most efficient way.
I would be glad to have further clarification from the companies staff.
This also begs a question: how many of regular Stack Overflow participants are in fact query engines disguised as curious fellow developers?
It would be definitely nice to know.