EXPLAIN EXTENDED

How to create fast database queries

Sargability of monotonic functions: example

with 2 comments

In my previous article I presented a proposal to add sargability of monotonic functions into the SQL engines.

In a nutshell: a monotonic function is a function that preserves the order of the argument so that it gives the larger results for the larger values of the argument. It is easy to prove that a B-tree with each key replaced by the result of the function will remain the valid B-Tree and hence can be used to search for ranges of function results just like it is used to search for ranges of values.

With a little effort, a B-Tree can also be used to search for the ranges of piecewise monotonic functions: those whose domain can be split into a number of continuous pieces with the function being monotonic within each piece (but it may be not monotonic and even not continuous across the pieces).

In this article, I'll demonstrate the algorithm to do that (implemented in pure SQL on PostgreSQL).

I will show how the performance of simple query

SELECT  *
FROM    t_sine
WHERE   SIN(value) BETWEEN 0.1234 AND 0.1235

could be improved if the sargability of monotonic functions had been implemented in the optimizer.

To do this, I will create a sample table:

Table creation details

This table contains 1,000,000 records with value randomly distributed from 0 to 101.

To select the records we need, we can use a very simple and straightforward query:

SELECT  *
FROM    t_sine
WHERE   SIN(value) BETWEEN 0.4452 AND 0.4453

View query details

which returns 31 records in 3.25 seconds.

The query uses a full table scan with the filter applied to each record.

Let's try to improve it.

Function description

According to the notation I proposed in the previous article, the monotony of the function SIN() should be described as this:

SIN(arg FLOAT) MONOTONIC PIECEWISE
DEFINED BY FLOOR(arg / PI() + 0.5)
CASE PIECE % 2
WHEN 0 THEN DECREASING INVERSE PIECE * PI() + ASIN(RESULT)
ELSE INCREASING INVERSE PIECE * PI() - ASIN(RESULT)
END

This means that:

  1. The function is piecewise monotonic,

  2. The pieces are defined by the function FLOOR(arg / PI() + 0.5) (which essentially returns the number of the half-wave the argument belongs too),

  3. The function monotony varies depending on the piece,

  4. On odd pieces, the function increases,

  5. On even pieces, the function decreases.

  6. A single inverse expression is provided for each monotony

Note that mathematically the function is strictly monotonic on each of its pieces. However, due to the rounding errors, different arguments can yield same function results, so the function value may map back to a range of the arguments rather than a single value.

In theory, it is possible to write a single expression which would map the function's result to the pair of values defining the beginning and the end of such a range. However, the expression would be quite complex. So for illustration purposes I'll make do with a single inverse function that yields an approximation of the back mapping. To find the exact range, some extra effort will be required.

Building the pieces

The function is piecewise monotonic and the pieces are defined by a function. For the pieces to be continuous, the function that defines them should be itself monotonic over all its domain.

The function that defines the pieces is FLOOR(arg / PI() + 0.5).

It is a superposition of the three functions:

  • OPERATOR_DIVISION(arg1 FLOAT, arg2 FLOAT)
    MONOTONIC OVER (arg1)
    CASE WHEN arg2 > 0 THEN INCREASING INVERSE RESULT * arg2
    WHEN arg2 = 0 THEN UNDEFINED
    WHEN arg2 < 0 THEN DECREASING INVERSE RESULT * arg2 END

  • OPERATOR_PLUS(arg1 FLOAT, arg2 FLOAT) MONOTONIC
    OVER (arg1) STRICTLY INCREASING INVERSE RESULT - arg2,
    OVER (arg2) STRICTLY INCREASING INVERSE RESULT - arg1

  • FLOOR(arg FLOAT) MONOTONIC INCREASING
    INVERSE
    FROM RESULT EXACT
    TO RESULT + 1 EXACT EXCLUDE

which are, given the values of the constants provided in the secondary arguments, are increasing over the argument. As we know from math, a superposition of monotonic functions is also monotonic.

Each function is defined with a single inverse condition which maps the result of the function back to a value near the range of the arguments yielding the result. The exact range has to be sought for using index seek over (hopefully) not too many records.

Sequentially applying the inverse expressions of each of the constituent functions to the result of the piece defining function, we get the following inverse expression for the latter:

  1. FLOOR(OPERATOR_PLUS(OPERATOR_DIVISION(arg, PI()), 0.5)) = PIECE
  2. OPERATOR_PLUS(OPERATOR_DIVISION(arg, PI()), 0.5) ∈ [ PIECE, PIECE + 1 )
  3. OPERATOR_DIVISION(arg, PI()) ∈ [ ≈(PIECE - 0.5), ≈((PIECE + 1) - 0.5) ]
  4. arg ∈ [ ≈((PIECE - 0.5) * PI()), ≈(((PIECE + 1) - 0.5) * PI()) ]

For each piece, we how have a pair of values approximately defining the range of values belonging to the piece.

To find out the exact bounds, we need to do the following:

  1. Calculate the piece for the minimal value

  2. Find the approximate upper bound for the piece.

  3. Scanning the keys to the left, find the rightmost key to the left of the upper bound that belongs to the current (or previous) piece.

  4. Scanning the keys to the right, find the first key of the next piece.

  5. Scanning a single key to the left, find the last key of the current piece.

  6. Recursively repeat steps 1 to 5, taking the first value the next piece calculated on step 4 as a seed for the step 1, until step 4 fails (which means that the pieces are over).

This procedure guarantees that we always get the correct bounds even with the inexact inverse value, since it correctly handles both overflow and underflow of the inverse value, as show on the pictures below:

Overflow

Underflow

Here's a query that selects the first and the last key of each piece:

View the query

Despite being huge in size, the query is very efficient and completes in only 48 ms.

Locating values within the pieces

Now, when we have the exact bounds of each piece, we need to locate the records within each piece.

Since we don't have exact inverse function here, the basic idea is the same as above: given the approximate inverse, locate the exact bound using the iterative approach:

  1. Locate the first key to the left of the inverse which yields the function result less than the one sought for, up to the first key of the piece. Should this search fail, the first key of the piece is the lower bound.
  2. Locate the first key to the right of that found on the previous step that yields the function value equal to or greater than the one sought for, up to the last key of the piece. Return NULL should it fail

Since we have an inclusive range here, we don't need the third step (final scan to the left to find the rightmost least value) that we used when searching for the pieces.

This algorithm searches for the lower bound; to search for the upper bound, we just need to inverse both directions and tests (left becomes right, less becomes greater etc).

Since the monotony of the function varies from piece to piece, we should take this into account. For the pieces where the function's monotony is DECREASING we should swap the order of the bounds: the upper bound of the expression becomes the lower bound or the range of values and vice versa. This can be handled merely by substituting the conditions into the very same CASE expression that defines the monotony.

When we locate the upper and the lower bounds for each piece, we should just join t_sine on the following condition:

ON value BETWEEN llimit and ulimit

It can happen so that the lower bound found by the algorithm exceeds the upper bound. This is a perfectly normal situation meaning that no keys match the condition and the range diverged. BETWEEN predicate will handle this.

It also can happen that one of the bounds is a NULL. This is also a valid situation, meaning that no value within the piece exceeds the lower bound (or falls short of the upper one). BETWEEN will also take care of it.

Final query

And here's the final query. Be ready, you'll have to spin your mouse wheel a lot:

WITH    RECURSIVE
        d AS (
        SELECT  piece,
                minv,
                COALESCE(
                (
                SELECT  value
                FROM    t_sine
                WHERE   value < nv[1]
                ORDER BY
                        value DESC
                LIMIT 1
                ),
                (
                SELECT  MAX(value)
                FROM    t_sine
                )
                ) AS maxv,
                nv[1] AS nextv,
                nv[2] AS nextpiece
        FROM    (
                SELECT  minv, piece,
                        (
                        SELECT  ARRAY[value, FLOOR(value / PI() + 0.5)]
                        FROM    t_sine
                        WHERE   value >
                                (
                                SELECT  value
                                FROM    t_sine
                                WHERE   value <= ((piece + 1) - 0.5) * PI()
                                        AND FLOOR(value / PI() + 0.5) <= piece
                                ORDER BY
                                        value DESC
                                LIMIT 1
                                )
                        ORDER BY
                                value
                        LIMIT 1
                        ) nv
                FROM    (
                        SELECT  minv, FLOOR(minv / PI() + 0.5) AS piece
                        FROM    (
                                SELECT  MIN(value) AS minv
                                FROM    t_sine
                                ) q
                        ) q2
                ) q3
        UNION ALL
        SELECT  piece,
                minv,
                COALESCE(
                (
                SELECT  value
                FROM    t_sine
                WHERE   value < nv[1]
                ORDER BY
                        value DESC
                LIMIT 1
                ),
                (
                SELECT  MAX(value)
                FROM    t_sine
                )
                ) AS maxv,
                nv[1] AS nextv,
                nv[2] AS nextpiece
        FROM    (
                SELECT  minv, piece,
                        (
                        SELECT  ARRAY[value, FLOOR(value / PI() + 0.5)]
                        FROM    t_sine
                        WHERE   value >
                                (
                                SELECT  value
                                FROM    t_sine
                                WHERE   value <= ((piece + 1) - 0.5) * PI()
                                        AND FLOOR(value / PI() + 0.5) <= piece
                                ORDER BY
                                        value DESC
                                LIMIT 1
                                )
                        ORDER BY
                                value
                        LIMIT 1
                        ) nv
                FROM    (
                        SELECT  nextv AS minv, nextpiece AS piece
                        FROM    d
                        WHERE   nextpiece IS NOT NULL
                        ) q2
                ) q3
        )
SELECT  l.*, s.*, SIN(value)
FROM    (
        SELECT  minv, maxv,
                CASE piece::INTEGER % 2
                WHEN 0 THEN
                        (
                        SELECT  value
                        FROM    t_sine
                        WHERE   value >=
                                COALESCE(
                                (
                                SELECT  value
                                FROM    t_sine
                                WHERE   value <= LEAST(piece * PI() + ASIN(0.4452), maxv)
                                        AND value >= minv
                                        AND SIN(value) < 0.4452
                                ORDER BY
                                        value DESC
                                LIMIT 1
                                ),
                                minv
                                )
                                AND value <= maxv 
                                AND SIN(value) >= 0.4452
                        ORDER BY
                                value
                        LIMIT 1
                        )
                ELSE
                        (
                        SELECT  value
                        FROM    t_sine
                        WHERE   value >=
                                COALESCE(
                                (
                                SELECT  value
                                FROM    t_sine
                                WHERE   value <= LEAST(piece * PI() - ASIN(0.4453), maxv)
                                        AND value >= minv
                                        AND SIN(value) > 0.4453
                                ORDER BY
                                        value DESC
                                LIMIT 1
                                ),
                                minv
                                )
                                AND value <= maxv 
                                AND SIN(value) <= 0.4453
                        ORDER BY
                                value
                        LIMIT 1
                        )
                END AS llimit,
                CASE piece::INTEGER % 2
                WHEN 0 THEN
                        (
                        SELECT  value
                        FROM    t_sine
                        WHERE   value <=
                                COALESCE(
                                (
                                SELECT  value
                                FROM    t_sine
                                WHERE   value >= GREATEST(piece * PI() + ASIN(0.4453), minv)
                                        AND value <= maxv
                                        AND SIN(value) > 0.4453
                                ORDER BY
                                        value
                                LIMIT 1
                                ),
                                maxv
                                )
                                AND value >= minv
                                AND SIN(value) <= 0.4453
                        ORDER BY
                                value DESC
                        LIMIT 1
                        )
                ELSE
                        (
                        SELECT  value
                        FROM    t_sine
                        WHERE   value <=
                                COALESCE(
                                (
                                SELECT  value
                                FROM    t_sine
                                WHERE   value >= GREATEST(piece * PI() - ASIN(0.4452), minv)
                                        AND value <= maxv
                                        AND SIN(value) < 0.4452
                                ORDER BY
                                        value
                                LIMIT 1
                                ),
                                maxv
                                )
                                AND value >= minv
                                AND SIN(value) >= 0.4452
                        ORDER BY
                                value DESC
                        LIMIT 1
                        )
                END AS ulimit
        FROM    d
        ) l
JOIN    t_sine s
ON      value BETWEEN llimit AND ulimit

View query details

This 200-line monster completes in only 110 ms, or 30 times as fast as the original 3-liner:

SELECT  *
FROM    t_sine
WHERE   SIN(value) BETWEEN 0.1234 AND 0.1235

, yielding the same results.

Summary

This example was to demonstrate feasibility of the B-Tree indexes to be used in a search for the predicates involving monotonic functions and the performance gain achieved.

The performance gain is over 30 times for a table that fits completely into the cache, and will increase with the number of the cache misses increases.

The SQL implementation of the algorithm is in fact not optimal, since iterative searches for the value boundaries are implemented as the subqueries. Each subquery requires reentering the B-Tree and traversing it starting from the root. The native algorithm working within the optimizer could avoid this by caching the key position in the index and issuing next_key / prev_key commands, which would improve the algorithm yet more.

Sargability of the monotonic functions, as shown above, can help to make the queries like the one described in this article much more legible, maintainable and efficient.

Written by Quassnoi

February 23rd, 2010 at 11:00 pm

Posted in PostgreSQL

2 Responses to 'Sargability of monotonic functions: example'

Subscribe to comments with RSS

  1. Why not use functional index?

    create index sin_idx on t_sin(sin(value));

    satek

    14 Oct 10 at 16:08

  2. @satek: because this way you will need a separate index for each function you are going to filter on in your query.

    The point is that you can filter on sin(value), as well as any other monotonic function, using an index on value.

    Read the original article: https://explainextended.com/2010/02/19/things-sql-needs-sargability-of-monotonic-functions/

    Quassnoi

    14 Oct 10 at 16:33

Leave a Reply