Archive for April 7th, 2010
Date ranges: overlapping with priority
Comments enabled. I *really* need your comment
Answering questions asked on the site.
Jason Foster asks:
We have a table of student registrations:
Students student_code course_code course_section session_cd 987654321 ESC102H1 Y 20085 998766543 ELEE203H F 20085
course_code
andcourse_section
identify a course,session_cd
is an academic session, e. g. 20085, 20091, 20079.The courses (stored in another table) have associated values for
engineering design,complementary studies, etc., like that:
Courses course_code course_section start_session end_session design science studies ESC102H1 F 20071 20099 10 0 0 AER201Y1 Y 20059 NULL 0 0 30 , or like that:
In-house courses course_code course_section student_code design science studies ESC102H1 F 998766543 10 0 0 We are required by an external accreditation body to add up all of the units of
engineering design,complementary studies, etc., taken by an individual student.Where it gets really messy is that we have multiple data feeds for the associated values of courses. For example we have a set from the Registrar's Office, the Civil Department, our In-House version, etc.
The rule is that In-House beats Civil beats the Registrar's Office in the case of any duplication within the overlapping intervals.
The
session_cd
is of the formYYYY{1,5,9}
.
Basically, we have three sets here.
To get the course hours for a given student we should find a record for him in the in-house set, or, failing that, find if the session is within the ranges of one of the external sets (Civil or Registrar). If both ranges contain the academic session the student took the course, Civil should be taken.
The first part is quite simple: we just LEFT JOIN
students with the in-house courses and get the hours for the courses which are filled. The real problem is the next part: searching for the ranges containing a given value.
As I already mentioned in the previous posts, relational databases are not in general that efficient for the queries like that. It's easy to use an index to find a value of a column within a given range, but B-Tree indexes are of little help in searching for a range of two columns containing a given value.
However, in this case, the data domain of session_cd
is quite a limited set. For each pair of session_start
and session_end
it is easy to create a set of all possible values between session_start
and session_end
.
The overlapping parts of the session ranges from the two sets will yields two records for each of the sessions belonging to the range. Of these two records we will need to take the relevant one (that is Civil) by using DISTINCT ON
with the additional sorting on the source (Civil goes first).
Then we just join the relevant records to the subset of the students
which does not have corresponding records in the in-house version.
Finally, we need to union this with the in-house recordset.
Read the rest of this entry »