Data Mining: Problem #2

Problem description

The problem is essentially the same as Problem #1:

The Department of Computer Science has over the years accumulated a good amount of data about computer science courses taken by its students. While there are relatively clear recommended curricula, in addition to strict degree requirements, a general feeling is that students exercise their academic freedom and may deviate a lot from the recommendations. The department would like to know what the actual curricula taken by students are like. A sample of course registration data is available in Moodle in a simple and anonymous form.

Hints

Based on results for Problem #1, it is evident that frequent itemsets are not very useful as such. Some of them are simply not interesting, many are redundant, and dependencies between courses are not clearly expressed. Neither a support threshold nor limits on the minimum and maximum itemset size are sufficient to find the most interesting or descriptive itemsets.

This time, consider the use of maximal frequent itemsets and closed frequent itemsets as compact representations of frequent itemsets. Also consider association rules to describe statistical dependencies between courses. Use confidence and some other interestingness measures to identify rules that are potentially more useful.
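
To make these concepts concrete, here is a minimal Python sketch on a made-up toy dataset. It is an illustration, not a recommended solution. The course names, the transactions, the support threshold, and the helper names (support, frequent_itemsets, rule_measures) are all assumptions for the example and have nothing to do with the actual Moodle data. The sketch uses the standard definitions: a frequent itemset is closed if no proper superset has the same support, and maximal if no proper superset is frequent. For a rule X -> Y, confidence is support(X and Y) / support(X), and lift (one possible additional interestingness measure) is confidence divided by the relative support of Y. The brute-force enumeration is only feasible for tiny examples.

```python
from itertools import combinations

# Hypothetical toy registrations: one set of course names per student.
transactions = [
    {"Programming", "Databases", "Algorithms"},
    {"Programming", "Databases"},
    {"Programming", "Algorithms"},
    {"Programming", "Databases", "Algorithms"},
    {"Databases", "Networks"},
]
MIN_SUPPORT = 2  # absolute count, chosen arbitrarily for this example


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(min_support):
    """Brute-force enumeration; stops at the first size with no frequent set."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate)
            if count >= min_support:
                result[candidate] = count
                found = True
        if not found:
            break  # no frequent set of this size, so none of a larger size
    return result


freq = frequent_itemsets(MIN_SUPPORT)

# Closed: no proper superset with the same support.
# (Such a superset would itself be frequent, so checking `freq` is enough.)
closed = {
    s for s, count in freq.items()
    if not any(s < other and count == freq[other] for other in freq)
}

# Maximal: no proper superset that is frequent at all.
maximal = {s for s in freq if not any(s < other for other in freq)}


def rule_measures(antecedent, consequent):
    """Return (confidence, lift) of the rule antecedent -> consequent."""
    n = len(transactions)
    confidence = support(antecedent | consequent) / support(antecedent)
    lift = confidence / (support(consequent) / n)
    return confidence, lift


print(len(freq), "frequent,", len(closed), "closed,", len(maximal), "maximal")
for a, c in [("Algorithms", "Programming"), ("Databases", "Programming")]:
    conf, lift = rule_measures(frozenset({a}), frozenset({c}))
    print(f"{a} -> {c}: confidence={conf:.2f}, lift={lift:.2f}")
```

On this toy data the closed and maximal sets are much smaller than the full collection of frequent itemsets. The second rule also shows why confidence alone can mislead: Databases -> Programming has fairly high confidence, yet its lift is below 1, because Programming is taken by almost everyone anyway.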

Again, try out different study methods, work out exercises, and experiment with the data! You may use an existing implementation or write your own code. If you write your own, avoid nasty details and pay attention to the general principles. Recall that there is no single right solution to the problem.

Additional material

Here are pointers to additional material related to the new concepts in this problem and issues discussed in the class.