Archive for September, 2010
OLAP in MySQL: four ways to filter on higher level dimensions
Answering questions asked on the site.
Denis Kuzmenok asks:
I need some help with a query I'm fighting with.
I have these tables:
Product id parent name
Site id region name
Price id product site value For the products with a certain
product.parent
, I need to select minimal and maximal price within the givensite.region
, and return first 10 products ordered by the minimal price.Each
parent
has 100 to 10,000 products, and there are 10,000,000 records inprice
table.
We have a classical OLAP task here: a fact table (price
) and two dimension tables (product
and site
).
The task would be almost trivial if we were given the exact values of product
and site
. This way, we could build a composite index on these fields and value
, filter on the exact values and get the first 10 values from the index.
However, we are given not the values of the dimensions themselves but those of the higher levels (parent
for product
and region
for site
).
Since the values of the levels are not stored in the facts table, we cannot index them. We need to join the dimension tables and filter on them.
This article describes four ways to join the tables, their efficiency varying depending on the density of the dimension values.
Since the only algorithm to make the joins MySQL is capable of is nested loops, basically, we need to define the order in which the tables would be joined.
There are three tables and, hence, 3! = 6 permutations that define all possible join orders:
product | site | price |
product | price | site |
site | price | product |
site | product | price |
price | product | site |
price | site | product |
However, two dimension tables are completely independent on each other. This means that if they come one after another in the join their order does not actually matter: they both will be searched for independent values. This reduces the number of actual combinations:
product/site | price | |
product | price | site |
site | price | product |
price | product/site |
The joins must be designed so that the tables with most selective conditions go first. This means that the join order is determined by the density of the values satisfying the criteria. The more values from the table satisfy the search criteria, the later should the table come in the join, so that by the time the join occurs, most values would have been already sifted out.
Now, let's build the queries for all types of the join. To do this, we will create sample tables:
Read the rest of this entry »