Archive for the ‘Miscellaneous’ Category
Happy New Year!
This winter is anomalously warm in Europe, there is no snow and no New Year mood. So today we will be drawing a snowflake in PostgreSQL.
#1. A little theory
Core of a snowflake is six large symmetrical ice crystals growing from the common center. Out of these larger crystals other, smaller, crystals grow.
The overall shape of the snowflake is defined by how large do crystals grow and where exactly are they attached to each other.
These things are defined by fluctuations in air and temperature conditions around the snowflake. Because the flake itself is very small, in any given moment the conditions are nearly identical around each crystal, that’s why the offspring crystals pop up in almost same places and grow to almost same lengths. Different flakes, though, constantly move to and from each other and are subject to very different fluctuations, and that’s why they grow so unique.
Except for the root crystals (of which there are six), the child icicles grow in symmetrical pairs. More than that, each branch grows their own children (also in pairs), so on each step there are twice as many crystals, but they all share almost same length and angle. This gives the snowflake its symmetrical look.
So we can easily see that, despite the fact there may be many child crystals, the shape of a snowflake is defined by a relatively small number of parameters: how many children each crystal produces, where are they attached to it, at which angle they grow and to which length.
Now, let’s try to model it.
Read the rest of this entry »
Shared Plan and Algorithm Network Cache (SPANC)
Due to the nature of my work I have to deal with various database systems.
While SQL is more or less standardized, the optimizers are implemented differently in the different systems. Some systems cannot join tables with anything other than nested loops, the other can only GROUP BY using a sort, etc.
So when you write a join in, say, MySQL, you cannot expect it to be a sort merge join (and you should consider this fact when designing the query). Or, when you write a DISTINCT in SQL Server, you can’t expect a loose index scan. These are limitations put by their optimizers.
However, in the last three months I noticed a great improvement in the queries where I could not expect any.
It started when I tried to debug this in SQL Server:
SELECT DISTINCT order FROM orderItem
while yielded this plan:

Similar results were obtained on Oracle:
Plan hash value: 1345318323 --------------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | | 200 | 2 (50)| 00:00:01 | | 1 | REMOTE SPANC QUERY (SQLSERVER, MYSQL, POSTGRESQL, STACKOVERFLOW) | | | 200 | 2 (50)| 00:00:01 | ---------------------------------------------------------------------------------------------------------------------------------
, MySQL:
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+ | 1 | SIMPLE | orderItem | spanc | NULL | ALL | NULL | NULL | | Using Oracle, PostgreSQL, SQL Server, StackOverflow | +----+-------------+-----------+-------+---------------+---------+---------+------+---------+-----------------------------------------------------+
and PostgreSQL:
Seq Scan on OrderItem (cost=0.00..6.44 width=4) -> Remote Scan on SPANC (Oracle, MySQL, SQL Server, StackOverflow) (cost=0.00..100.00 width=4)
Network analysis has shown weird encrypted activity between the servers in my internal network which host SQL Server, Oracle, PostgreSQL and MySQL servers.
Ultimately, there was unencrypted activity outside of the internal network which turned out to be an HTTP POST request followed by several GET polls to http://stackoverflow.com/questions/5518080/distinct-optimization.
It seems that the developers of major database systems agreed to share the knowledge about the most efficient query plans in some kind of a distributed storage (which probably is called SPANC as we can see in the query plans) and provide an interface to access each other’s systems.
It also seems that these systems treat Stack Overflow as an external optimization engine where the most experienced developers can build their plans for them in a most efficient way.
I would be glad to have further clarification from the companies staff.
This also begs a question: how many of regular Stack Overflow participants are in fact query engines disguised as curious fellow developers?
It would be definitely nice to know.
Things SQL needs: SERIES()
Recently I had to deal with several scenarios which required processing and aggregating continuous series of data.

I believe this could be best illustrated with an example:
| id | source | value |
|---|---|---|
| 1 | 1 | 10 |
| 2 | 1 | 20 |
| 3 | 2 | 15 |
| 4 | 2 | 25 |
| 5 | 1 | 45 |
| 6 | 3 | 50 |
| 7 | 3 | 35 |
| 8 | 1 | 40 |
| 9 | 1 | 10 |
The records are ordered by id, and within this order there are continuous series of records which share the same value of source. In the table above, the series are separated by thick lines.
We want to calculate some aggregates across each of the series: MIN, MAX, SUM, AVG, whatever:
| source | min | max | sum | avg |
|---|---|---|---|---|
| 1 | 10 | 20 | 30 | 15.00 |
| 2 | 15 | 25 | 40 | 20.00 |
| 1 | 45 | 45 | 45 | 45.00 |
| 3 | 35 | 50 | 85 | 42.50 |
| 1 | 10 | 40 | 50 | 25.00 |
This can be used for different things. I used that for:
- Reading sensors from a moving elevator (thus tracking its position)
- Recording user’s activity on a site
- Tracking the primary node in a server cluster
, but almost any seasoned database developer can recall a need for such a query.
As you can see, the values of source are repeating so a mere GROUP BY won’t work here.
In the systems supporting window functions there is a workaround for that:
Happy New Year!
Some say SQL is not good at graphics.
Well, they have some point. Database engines lack scanner drivers, there is no easy way to do sepia, and magic wand, let’s be honest, is just poor.
However, you can make some New Year paintings with SQL.
Let’s make a New Year clock showing 12 o’clock in Oracle.
Things SQL needs: determining range cardinality
What is the problem with this query?
SELECT *
FROM orders
WHERE quantity < = 4
AND urgency <= 4
The problem is indexing strategy, of course. Which columns should we index?
If we index quantity, the optimizer will be able to use the index to filter on it. However, filtering on urgency will require scanning all records with quantity < 4 and applying the urgency filter to each record found.
Same with urgency. We can use range access on urgency using an index, but this will require filtering on quantity.
Why, create a composite index!
, some will say.
Unfortunately, that won’t help much.
A composite B-Tree index maintains what is called a lexicographical order of the records. This means that an index on (quantity, urgency) will sort on quantity, and only if the quantities are equal, it will take the urgency into account.
The picture below shows how would the records be ordered in such an index:

As we can see, with a single index range scan (i. e. just following the arrows) we cannot select only the records within the dashed rectangle. There is no single index range that could be used to filter on both columns.
Even if we changed the field order in the index, it would just change the direction of the arrows connecting the records:
Read the rest of this entry »
Things SQL needs: MERGE JOIN that would seek
One of the most known and least used join algorithms in SQL engines is MERGE JOIN.
This algorithm operates on two sorted recordsets, keeping two pointers that chase each other.
The Wikipedia entry above describes it quite well in terms of algorithms. I’ll just make an animated GIF to make it more clear:

This is quite a nice and elegant algorithm, which, unfortunately, has two major drawbacks:
- It needs the recordsets to be sorted
- Even with the recordsets sorted, it is no better than a
HASH JOIN
The sorting part is essential for this algorithm and there is nothing that can be done with it: the recordsets should be sorted, period. Databases, however, often provide the records in the sorted order: from clustered tables, indexes, previously sorted and ordered subqueries, spool tables etc.
But even when the recordsets are already sorted, on equijoins the MERGE JOIN is hardly faster than a HASH JOIN.
Why?
Read the rest of this entry »
NoSQL
I had a vision tonight.

Image by nataliej
A huge, dark, grim figure approached me, seized me with its long bony arms and made me see all the vanity of the world we are living in.
Bloated database engines, useless ACID requirements, meaningless joins are now in the past for me.
I decided to move to NoSQL.
Where do I begin?
Things SQL needs: sargability of monotonic functions

Image by ndevil
I’m going to write a series of articles about the things SQL needs to work faster and more efficienly.
With these things implemented, most of the tricks I explain in my blog will become redundant and I’ll finally have some more time to spend with the family.
Ever wondered why a condition like this:
WHERE TRUNC(mydate) = TRUNC(SYSDATE)
, which searches for the current day’s records, is so elegant but so slow?
Of course this is because even if you create an index on mydate, this index cannot be used.
The expression in the left part of the equality is not a mydate. The database engine cannot find a way to use an index to search for it. It is said that this expression is not sargable.
Now, a little explanation about the indexes and sargability. If you are familiar with these, you can skip this chapter. But beware that this chapter is the only one illustrated, so skipping it will make the article too boring to read.
Ahem.
To locate a record in a B-Tree index, the keys of the index should be compared to the value being searched for.
Let’s consider this sample B-Tree index:
Happy New Year!
Oracle New Year tree
WITH tree AS
(
SELECT /*+ MATERIALIZE */
' ,,,"''' AS needles, 'Oo%$&' AS decorations
FROM dual
),
branches AS
(
SELECT level AS id, level - TRUNC((level - 2) / 6) * 2 AS s
FROM dual
CONNECT BY
level <= 24
)
SELECT RPAD(' ', 18, ' ') || '*' AS tree
FROM dual
UNION ALL
SELECT RPAD(' ', 18 - s, ' ') || '/' ||
(
SELECT REPLACE(
SYS_CONNECT_BY_PATH
(
CASE
WHEN DBMS_RANDOM.value < 0.1 THEN
SUBSTR(decorations, TRUNC(DBMS_RANDOM.value * LENGTH(decorations)) + 1, 1)
ELSE
SUBSTR(needles, TRUNC(DBMS_RANDOM.value * LENGTH(needles)) + 1, 1)
END , '/'
),
'/', ''
)
FROM tree
WHERE CONNECT_BY_ISLEAF = 1
CONNECT BY
level <= s * 2 - 1
)
|| '\'
FROM branches
| TREE |
|---|
| * |
| /,\ |
| /,, \ |
| /$," ,\ |
| /, ',, \ |
| / ,'o'' ,\ |
| /,, ,'",,,, \ |
| /,',,$, '' ,,\ |
| /,, ,'",,,, \ |
| /",, ,,"' , ",\ |
| / $' ,, " ,' '\ |
| /, , '"O,,'o"," "\ |
| /,,,," , ',,',,',',\ |
| /,',,"' ' ' ,,,,&"o,\ |
| /,' ,,," ,',%" ,,,\ |
| /,',,"' ' ' ,,,,&"o,\ |
| /,$,,," ,,$'%",",,%,,''"\ |
| / ' ,, ," ,,& , ,, ,, \ |
| /,",O' ,',, , ,$, "',",,,',\ |
| /,," ",, ,,, ,',', " ,""',\ |
| / ,O ',,,,," o , '','', ,&,\ |
| /,," ",, ,,, ,',', " ,""',\ |
| /,"'',,""$"," ,$,%,' "" ,,,',",\ |
| /, '" "O,,o' , ,'% $',,, ,""" , \ |
| /, O' ,,, &,',,' , ',$, ,' ,,, ,,'\ |
SQL Server bottle of champagne
WITH lines AS
(
SELECT 1 AS n
UNION ALL
SELECT n + 1
FROM lines
WHERE n < 24
),
bottle_parts (part, lft, body, rgt, w) AS
(
SELECT 'mouth', '.', '_', '.', 2
UNION ALL
SELECT 'neck', '|', ' ', '|', 2
UNION ALL
SELECT 'bell', '/', ' ', '\', NULL
UNION ALL
SELECT 'border', '/', '=', '\', NULL
UNION ALL
SELECT 'body', '|', ' ', '|', 7
UNION ALL
SELECT 'bottom', '(', '_', ')', 7
),
bottle_lines AS
(
SELECT CASE
WHEN n = 1 THEN 'mouth'
WHEN n BETWEEN 2 AND 8 THEN 'neck'
WHEN n = 11 THEN 'border'
WHEN n BETWEEN 9 AND 13 THEN 'bell'
WHEN n BETWEEN 13 AND 23 THEN 'body'
WHEN n = 24 THEN 'bottom'
END AS part,
n
FROM lines
),
bottle_width AS
(
SELECT bp.*, n, COALESCE(w, (n - 6)) AS width
FROM bottle_lines bl
JOIN
bottle_parts bp
ON bp.part = bl.part
),
bottle AS
(
SELECT n, width, REPLICATE(' ', 18 - width) + lft + REPLICATE(body, width * 2) + rgt AS line
FROM bottle_width
),
bubbles (pos, m, bubble) AS
(
SELECT CAST(ROUND(RAND() * 4, 0, 1) AS INTEGER), 1, ':' AS bubble
UNION ALL
SELECT pos + ASCII(SUBSTRING(CAST(NEWID() AS VARCHAR(MAX)), 1, 1)) % 3 - 1, m + 1, SUBSTRING(':.oO', m / 3 + 1, 1)
FROM bubbles
WHERE m < 12
),
limits AS
(
SELECT *,
18 +
CASE
WHEN pos < 1 - width THEN 1 - width
WHEN pos > width - 1 THEN width - 1
ELSE pos
END AS rpos
FROM bottle bo
LEFT HASH JOIN
bubbles bu
ON bu.m = 24 - bo.n
)
SELECT SUBSTRING(line, 1, COALESCE(rpos, 1) - 1) + COALESCE(bubble, ' ') + SUBSTRING(line, COALESCE(rpos, 1) + 1, 200) AS bottle
FROM limits
ORDER BY
n
| bottle |
|---|
| .____. |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| / \ |
| / \ |
| /==========\ |
| / O \ |
| / O \ |
| | O | |
| | o | |
| | o | |
| | o | |
| | . | |
| | . | |
| | . | |
| | : | |
| | : | |
| | : | |
| (______________) |
PostgreSQL fireworks
WITH centers AS
(
SELECT angle,
len,
ROUND(len * SIN(2 * PI() * angle)) AS x,
ROUND(len * COS(2 * PI() * angle)) AS y,
ROUND(len * 0.3)::INTEGER + 1 AS trace
FROM (
SELECT RANDOM() AS angle,
8 * (1 - POWER(RANDOM(), 3)) AS len
FROM generate_series (1, 50) s
) q
),
traces AS
(
SELECT *,
generate_series(1, trace) AS part
FROM centers
),
parts AS
(
SELECT CASE
WHEN trace = part THEN
LEAST(len * 0.2, 2)::INTEGER
ELSE
TRUNC(angle * 8 - 0.5)::INTEGER % 4 + 3
END AS symbol,
TRUNC(x + part * SIN(2 * PI() * angle)) AS x,
TRUNC(y + part * COS(2 * PI() * angle)) AS y
FROM traces
)
SELECT ARRAY_TO_STRING(
ARRAY(
SELECT COALESCE(
(
SELECT SUBSTR(E'.xX\\-/|', MIN(symbol) + 1, 1)
FROM parts
WHERE x = col - 14
AND y = row - 12
), ' ')
FROM generate_series(1, 25) col
), ''
) AS FIREWORKS
FROM generate_series(1, 24) row
| fireworks |
|---|
| X |
| | X |
| x| | X |
| x | | / |
| X\ \x | /x |
| X\ \ \ / |
| X \ \ x / /xx |
| \\ x |. /// |
| X– \ | |
| xxx– . |
| Xx– .. |
| xxx-/ –x |
| . \\ –X |
| . . |
| / \ |
| X/ | \ \x |
| / |\ \\ |
| // |x\ \ \x |
| x/ |\x \ X |
| X x\x X |
| X |
MySQL postcard
SELECT CONCAT(COALESCE(border, '|'), RPAD(COALESCE(postcard, ''), 61, COALESCE(filler, ' ')), COALESCE(border, '|')) AS postcard
FROM (
SELECT id2 * 5 + id1 AS line
FROM (
SELECT 1 AS id1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
) q1
CROSS JOIN
(
SELECT 0 AS id2 UNION ALL SELECT 1
) q2
) dummy
LEFT JOIN
(
SELECT 3 AS line, CONCAT(' Dear ', USER(), ',') AS postcard
UNION ALL
SELECT 4, ' I wish you good luck in the New Year.'
UNION ALL
SELECT 5, ' May all your plans be optimal, all your queries be answered'
UNION ALL
SELECT 6, ' and all your estimations be correct.'
UNION ALL
SELECT 8, ' Yours sincerely,'
UNION ALL
SELECT 9, ' Quassnoi'
) ln
ON ln.line = dummy.line
LEFT JOIN
(
SELECT 1 AS line, '.' AS border, '_' AS filler
UNION ALL
SELECT 10 AS line, '.' AS border, '_' AS filler
) borders
ON borders.line = dummy.line
ORDER BY
dummy.line
| postcard |
|---|
| ._____________________________________________________________. |
| | | |
| | Dear dbuser@localhost, | |
| | I wish you good luck in the New Year. | |
| | May all your plans be optimal, all your queries be answered | |
| | and all your estimations be correct. | |
| | | |
| | Yours sincerely, | |
| | Quassnoi | |
| ._____________________________________________________________. |
What is entity-relationship model?

Image by Samuel Mann
A relational database, as we all know from the previous article, stores relations between integers, strings and other simple types in a very plain way: it just enumerates all values related.
This model is extremely flexible, since other relations can be easily derived from existing ones using mathematical formulae, and the relational database takes care of that.
However, database should reflect the real world in some way to be really of use. Since the relational databases store relations between mathematical abstractions and not real world things, we should make some kind of a mapping of ones to the others.
This is what entity-relationship model is for.
An entity-relationship model, as one can easily guess from its name, models relationships between entities.
But since we know that databases do essentially the same, how does it differ from the database model?
- An entity-relationship model states which data and relations between them should be stored
- A database model states how these relations are stored
In other words, ER model is design and database model is one of the ways to implement it. ER model is said to be above the database model in the waterfall developement.
Note that in both cases I use the word stored above. The model says nothing of data and relations between them that may or should exist, only of those that should be stored. Every human being participates in thousands if not millions relationships, but an ER model should state which of them are to be stored, and a relational model should decide how to store them. Of course it would be nice to store them all (and police and insurance would be just happy), but the technology does not allow it yet. Don’t know whether it is good or bad.
A typical entity-relationship diagram in Peter Chen’s notation looks like this:

What does it mean?
Read the rest of this entry »

Subscribe in a reader