Date ranges: overlapping with priority at EXPLAIN EXTENDED

Date ranges: overlapping with priority

Comments enabled. I *really* need your comment

Answering questions asked on the site.

Jason Foster asks:

We have a table of student registrations:

Students
student_code course_code course_section session_cd

987654321 ESC102H1 Y 20085

998766543 ELEE203H F 20085

course_code and course_section identify a course, session_cd is an academic session, e. g. 20085, 20091, 20079.

The courses (stored in another table) have associated values for engineering design, complementary studies, etc., like that:

Courses
course_code course_section start_session end_session design science studies

ESC102H1 F 20071 20099 10 0 0

AER201Y1 Y 20059 NULL 0 0 30

, or like that:

In-house courses
course_code course_section student_code design science studies

ESC102H1 F 998766543 10 0 0

We are required by an external accreditation body to add up all of the units of engineering design, complementary studies, etc., taken by an individual student.

Where it gets really messy is that we have multiple data feeds for the associated values of courses. For example we have a set from the Registrar's Office, the Civil Department, our In-House version, etc.

The rule is that In-House beats Civil beats the Registrar's Office in the case of any duplication within the overlapping intervals.

The session_cd is of the form YYYY{1,5,9}.

Students
student_code	course_code	course_section	session_cd
987654321	ESC102H1	Y	20085
998766543	ELEE203H	F	20085

Courses
course_code	course_section	start_session	end_session	design	science	studies
ESC102H1	F	20071	20099	10	0	0
AER201Y1	Y	20059	NULL	0	0	30

In-house courses
course_code	course_section	student_code	design	science	studies
ESC102H1	F	998766543	10	0	0

Basically, we have three sets here.

To get the course hours for a given student we should find a record for him in the in-house set, or, failing that, find if the session is within the ranges of one of the external sets (Civil or Registrar). If both ranges contain the academic session the student took the course, Civil should be taken.

The first part is quite simple: we just LEFT JOIN students with the in-house courses and get the hours for the courses which are filled. The real problem is the next part: searching for the ranges containing a given value.

As I already mentioned in the previous posts, relational databases are not in general that efficient for the queries like that. It's easy to use an index to find a value of a column within a given range, but B-Tree indexes are of little help in searching for a range of two columns containing a given value.

However, in this case, the data domain of session_cd is quite a limited set. For each pair of session_start and session_end it is easy to create a set of all possible values between session_start and session_end.

The overlapping parts of the session ranges from the two sets will yields two records for each of the sessions belonging to the range. Of these two records we will need to take the relevant one (that is Civil) by using DISTINCT ON with the additional sorting on the source (Civil goes first).

Then we just join the relevant records to the subset of the students which does not have corresponding records in the in-house version.

Finally, we need to union this with the in-house recordset.

Pictures

Here's the same thing in pictures:

Within each source, courses are defined by the a single record holding the start and end session:
To find the courses superposition (regarding the priority) we split each range into a number of records, each corresponding to a single session, then combine these records in a singe recordset, ordered by session then by source:
From each session, we take a single record with the higher priority and use to to join with the students table:

Query

Now, let's create some sample tables and see how it works:

Table creation details

CREATE TABLE t_inhouse
        (
        course INT NOT NULL,
        student INT NOT NULL,
        hours1 INT NOT NULL, hours2 INT NOT NULL, hours3 INT NOT NULL,
        PRIMARY KEY (course, student)
        );

CREATE TABLE t_civil
        (
        id INT NOT NULL PRIMARY KEY,
        session_start INT NOT NULL, session_end INT NOT NULL,
        hours1 INT NOT NULL, hours2 INT NOT NULL, hours3 INT NOT NULL
        );

CREATE TABLE t_registrar
        (
        id INT NOT NULL PRIMARY KEY,
        session_start INT NOT NULL, session_end INT NOT NULL,
        hours1 INT NOT NULL, hours2 INT NOT NULL, hours3 INT NOT NULL
        );

CREATE TABLE t_student
        (
        id INT NOT NULL,
        course INT NOT NULL,
        session INT NOT NULL,
        PRIMARY KEY (id, course, session)
        );

SELECT  SETSEED(0.20100407);

INSERT
INTO    t_civil
SELECT  n,
        ((e / 3)) * 100 + (ARRAY[1, 5, 9])[e % 3 + 1],
        (((e + l) / 3)) * 100 + (ARRAY[1, 5, 9])[(e + l) % 3 + 1],
        CEILING(RANDOM() * 50)::INTEGER,
        CEILING(RANDOM() * 50)::INTEGER,
        CEILING(RANDOM() * 50)::INTEGER
FROM    (       
        SELECT  n,
                6000 + CEILING(RANDOM() * 10)::INTEGER AS e,
                CEILING(RANDOM() * 20)::INTEGER AS l
        FROM    generate_series(1, 200) n
        ) q;

INSERT
INTO    t_registrar
SELECT  n,
        ((e / 3)) * 100 + (ARRAY[1, 5, 9])[e % 3 + 1],
        (((e + l) / 3)) * 100 + (ARRAY[1, 5, 9])[(e + l) % 3 + 1],
        CEILING(RANDOM() * 50)::INTEGER,
        CEILING(RANDOM() * 50)::INTEGER,
        CEILING(RANDOM() * 50)::INTEGER
FROM    (       
        SELECT  n,
                6000 + CEILING(RANDOM() * 10)::INTEGER AS e,
                CEILING(RANDOM() * 20)::INTEGER AS l
        FROM    generate_series(1, 200) n
        ) q;

INSERT
INTO    t_inhouse
SELECT  n,
        s,
        CEILING(RANDOM() * 50)::INTEGER,
        CEILING(RANDOM() * 50)::INTEGER,
        CEILING(RANDOM() * 50)::INTEGER
FROM    (
        SELECT  n, generate_series(1, 100) s
        FROM    (
                SELECT  n
                FROM    generate_series(1, 200) n
                ) q
        ) q;

INSERT
INTO    t_student
SELECT  i, c,
        ((s / 3)) * 100 + (ARRAY[1, 5, 9])[s % 3 + 1]
FROM    (
        SELECT  *, 6000 + CEILING(RANDOM() * 30)::INTEGER AS s,
                RANDOM() AS rnd
        FROM    (
                SELECT  *,
                        generate_series(1, 200) c
                FROM    (       
                        SELECT  i
                        FROM    generate_series(1, 500) i
                        ) q
                ) q
        ) q
WHERE   rnd < 0.1
&#91;/sourcecode&#93;
</div>

The tables contain <strong>500</strong> students, random civil and registrar ranges for <strong>200</strong> courses, and <strong>20,000</strong> in-house records for first <strong>100</strong> students.

And here's the query (limited to return first <strong>10</strong> records for the sake of readability):


SELECT  1 AS sourse, s.id AS student, i.course AS course, hours1, hours2, hours3
FROM    t_student s
JOIN    t_inhouse i
ON      i.student = s.id
        AND i.course = s.course
UNION ALL
SELECT  source, s.id, q.id, hours1, hours2, hours3
FROM    t_student s
JOIN    (
        SELECT  DISTINCT ON (current_session, id) *
        FROM    (
                SELECT  *,
                        ((cs / 3)) * 100 + (ARRAY[1, 5, 9])[cs % 3 + 1] AS current_session
                FROM    (
                        SELECT  *,
                                generate_series(e, l) AS cs
                        FROM    (
                                SELECT  *,
                                        session_start * 3 / 100 + CASE (session_start % 100) WHEN 1 THEN 0 WHEN 5 THEN 1 ELSE 2 END AS e,
                                        session_end * 3 / 100 + CASE (session_end % 100) WHEN 1 THEN 0 WHEN 5 THEN 1 ELSE 2 END AS l
                                FROM    (
                                        SELECT  2 AS source, *
                                        FROM    t_civil
                                        UNION ALL
                                        SELECT  3 AS source, *
                                        FROM    t_registrar
                                        ) q
                                ) q
                        ) q
                ) q
        ORDER BY
                current_session, id, source
        ) q
ON      q.current_session = s.session
        AND q.id = s.course
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_inhouse ih
        WHERE   ih.student = s.id
                AND ih.course = s.course
        )
ORDER BY
        student, course
LIMIT 10

sourse	student	course	hours1	hours2	hours3
1	1	4	21	25	44
1	1	10	18	49	49
1	1	26	12	26	2
1	1	27	32	38	22
1	1	39	39	27	36
1	1	44	32	26	30
1	1	51	3	21	3
1	1	54	7	10	34
1	1	57	46	34	10
1	1	63	50	18	7
10 rows fetched in 0.0008s (0.0968s)

Limit  (cost=2323.47..2323.50 rows=10 width=20)
  ->  Sort  (cost=2323.47..2328.52 rows=2019 width=20)
        Sort Key: s.id, i.course
        ->  Append  (cost=817.31..2279.84 rows=2019 width=20)
              ->  Merge Join  (cost=817.31..1837.75 rows=1979 width=20)
                    Merge Cond: ((i.course = s.course) AND (i.student = s.id))
                    ->  Index Scan using t_inhouse_pkey on t_inhouse i  (cost=0.00..825.94 rows=20000 width=20)
                    ->  Sort  (cost=817.21..842.18 rows=9986 width=8)
                          Sort Key: s.course, s.id
                          ->  Seq Scan on t_student s  (cost=0.00..153.86 rows=9986 width=8)
              ->  Nested Loop Anti Join  (cost=362.94..421.91 rows=40 width=24)
                    ->  Hash Join  (cost=362.94..404.44 rows=50 width=28)
                          Hash Cond: ((((((q.cs / 3) * 100) + ('{1,5,9}'::integer[])[((q.cs % 3) + 1)])) = s.session) AND (q.id = s.course))
                          ->  Unique  (cost=59.29..62.29 rows=200 width=40)
                                ->  Sort  (cost=59.29..60.29 rows=400 width=40)
                                      Sort Key: ((((q.cs / 3) * 100) + ('{1,5,9}'::integer[])[((q.cs % 3) + 1)])), q.id, q.source
                                      ->  Subquery Scan q  (cost=0.00..42.00 rows=400 width=40)
                                            ->  Result  (cost=0.00..33.00 rows=400 width=56)
                                                  ->  Append  (cost=0.00..8.00 rows=400 width=56)
                                                        ->  Seq Scan on t_civil  (cost=0.00..4.00 rows=200 width=56)
                                                        ->  Seq Scan on t_registrar  (cost=0.00..4.00 rows=200 width=56)
                          ->  Hash  (cost=153.86..153.86 rows=9986 width=12)
                                ->  Seq Scan on t_student s  (cost=0.00..153.86 rows=9986 width=12)
                    ->  Index Scan using t_inhouse_pkey on t_inhouse ih  (cost=0.00..0.35 rows=1 width=8)
                          Index Cond: ((ih.course = s.course) AND (ih.student = s.id))

The query returns the required hours and the source of these hours for each of the courses a student attended.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question

Written by Quassnoi

April 7th, 2010 at 11:00 pm

Posted in PostgreSQL

Subscribe in a reader

Subscribe by email

Contacts

Ask me a question

Drop me a line

Should I?

Yes. Feel free to ask questions and write me. An interesting question is a pleasure to answer, and I really enjoy receiving feedback

EXPLAIN EXTENDED