EXPLAIN EXTENDED

How to create fast database queries

PostgreSQL: last 12 weeks of a year

Comments enabled. I *really* need your comment

From Stack Overflow:

I have a query like this:

SELECT  EXTRACT(WEEK FROM j.updated_at) as "week",  count(j.id)
FROM    jobs
WHERE   EXTRACT(YEAR FROM j.updated_at)=2009
GROUP BY
EXTRACT(WEEK FROM j.updated_at)
ORDER BY
week

, which works fine, but I only want to show the last 12 weeks.

LIMIT 12 works, but only gives me the first twelve and I need the order to be in sequential week order (i. e. not reversed) for charting purposes.

To select last 12 rows in ascending order it's enough to select first 12 rows in descending order and resort them in a subquery:

SELECT  *
FROM    (
SELECT  EXTRACT(week FROM updated_at) AS week, COUNT(*) AS cnt
FROM    jobs
WHERE   EXTRACT(year FROM updated_at) = 2008
GROUP BY
week
ORDER BY
week DESC
LIMIT 12
) q
ORDER BY
week

However, this is rather inefficient. This query selects and aggregate all the year data just to fetch 12 last weeks.

More than that, EXTRACT(year FROM updated) is not a sargable predicate, and all table rows (or index rows) need to be examined.

On a sample table of 1,000,000 rows, this query runs for more that 3.5 seconds:

SELECT  *
FROM    (
SELECT  EXTRACT(week FROM updated) AS week, COUNT(*) AS cnt
FROM    t_week
WHERE   EXTRACT(year FROM updated) = 2008
GROUP BY
week
ORDER BY
week DESC
LIMIT 12
) q
ORDER BY
week

week cnt
41 3584
42 3467
43 3499
44 3535
45 3511
46 3621
47 3529
48 3500
49 3415
50 3536
51 3504
52 3479
12 rows fetched in 0.0004s (3.5843s)
Sort  (cost=58517.44..58517.47 rows=11 width=16)
  Sort Key: (date_part('week'::text, t_week.updated))
  ->  Limit  (cost=58517.11..58517.14 rows=11 width=8)
        ->  Sort  (cost=58517.11..58517.14 rows=11 width=8)
              Sort Key: (date_part('week'::text, t_week.updated))
              ->  HashAggregate  (cost=58516.75..58516.92 rows=11 width=8)
                    ->  Seq Scan on t_week  (cost=0.00..58491.75 rows=5000 width=8)
                          Filter: (date_part('year'::text, updated) = 2008::double precision)

However, this query can be easily improved. For each year, we can easily calculate the beginning and the end of each of last 12 weeks, and use these values in a more index-friendly query.

Let's create a sample table and see how to do this:

CREATE TABLE t_week (
id INT NOT NULL PRIMARY KEY,
updated TIMESTAMP NOT NULL,
value VARCHAR(300) NOT NULL
);

CREATE INDEX ix_week_updated ON t_week (updated);

SELECT  SETSEED(0.20090723);

INSERT
INTO    t_week
SELECT  s,
'2009-07-23'::TIMESTAMP - ((RANDOM() * 2000)::INTEGER || ' DAY')::INTERVAL,
RPAD('Value ' || s, 300, ' ')
FROM    generate_series(1, 1000000) s;

This table has 1,000,000 rows filled with random timestamps, and an index on updated.

To count rows for last 12 week in a year more efficiently, we can select a dummy rowsource of last 12 week numbers and just calculate COUNT in a subquery, using a range condition on updated.

By definition, ISO week always contains Jan 4th of each year. This is, EXTRACT(DOW FROM (year || '-01-04')::DATE) guaranteed to return 1 for any value of year.

This means that the date 7 days earlier, namely Dec 28th, is always inside the last ISO week of the year, and EXTRACT(DOW FROM (year || '-12-28')::DATE) will return us this last week number.

To get the last 12 weeks, we can run the following query:

SELECT  s
FROM    generate_series(EXTRACT(week FROM '2008-12-28'::DATE)::INTEGER - 11, EXTRACT(week FROM '2008-12-28'::DATE)::INTEGER) s
s
41
42
43
44
45
46
47
48
49
50
51
52

This rowset contains the numbers of 12 last ISO weeks in year 2008.

To counts rows corresponding to these weeks, we can select each COUNT inside a subquery which will use a range condition on updated to filter the values for the current week only.

Since week 1 always contains Jan 4th, we can subtract EXTRACT(DOW FROM s || '-01-03') from this date (which will give us the Monday of the week 1), and then just add a week number multiplied by 7 to get a range condition.

Here's the query to do this:

SELECT  s,
(
SELECT  COUNT(*)
FROM    t_week
WHERE   updated >= '2008-01-04'::DATE - EXTRACT(DOW FROM '2008-01-03'::DATE)::INTEGER + (s - 1) * 7
AND updated < '2008-01-04'::DATE - EXTRACT(DOW FROM '2008-01-03'::DATE)::INTEGER + s * 7
        )
FROM    generate_series(EXTRACT(week FROM '2008-12-28'::DATE)::INTEGER - 11, EXTRACT(week FROM '2008-12-28'::DATE)::INTEGER) s
s ?column?
41 3584
42 3467
43 3499
44 3535
45 3511
46 3621
47 3529
48 3500
49 3415
50 3536
51 3504
52 3479
12 rows fetched in 0.0004s (0.4969s)
Function Scan on generate_series s  (cost=0.00..14514875.04 rows=1000 width=4)
  SubPlan 1
    ->  Aggregate  (cost=14514.85..14514.86 rows=1 width=0)
          ->  Bitmap Heap Scan on t_week  (cost=127.66..14502.35 rows=5000 width=0)
                Recheck Cond: ((updated >= ('2007-12-31'::date + (($0 - 1) * 7))) AND (updated < ('2007-12-31'::date + ($0 * 7))))
                ->  Bitmap Index Scan on ix_week_updated  (cost=0.00..126.41 rows=5000 width=0)
                      Index Cond: ((updated >= ('2007-12-31'::date + (($0 - 1) * 7))) AND (updated < ('2007-12-31'::date + ($0 * 7))))

This query selects COUNT(*) using a range scan condition over each of the week numbers generated by generate_series.

This query need to examine much fewer rows and completes much faster (in only 0.5 seconds, or 7 times as fast as the original query)

Written by Quassnoi

July 23rd, 2009 at 11:00 pm

Posted in PostgreSQL

Leave a Reply