Data Mining: Problem #2
Problem description
The problem is essentially the same as Problem #1:
The Department of Computer Science has over the years accumulated a good amount of data about computer science courses taken by its students. While there are relatively clear recommended curricula, in addition to strict degree requirements, a general feeling is that students exercise their academic freedom and may deviate a lot from the recommendations. The department would like to know what the actual curricula taken by students are like. A sample of course registration data is available in Moodle in a simple and anonymous form.
Hints
Based on the results for Problem #1, it is evident that frequent itemsets are not very useful as such. Some of them are simply not interesting, many are redundant, and dependencies between courses are not clearly expressed. A support threshold, or minimum and maximum sizes for frequent itemsets, is not sufficient to find the most interesting or descriptive itemsets.
This time, consider the use of maximal frequent itemsets and closed frequent itemsets as compact representations of frequent itemsets. Also consider association rules to describe statistical dependencies between courses. Use confidence and some other interestingness measures to identify rules that are potentially more useful.
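To make the two compact representations concrete, here is a minimal sketch in Python. It assumes a toy set of registrations with made-up course names (not the Moodle data) and enumerates frequent itemsets by brute force, which is only feasible for tiny examples. The function names are illustrative. An itemset is treated as closed if no proper superset has the same support, and as maximal if no proper superset is frequent.

```python
from itertools import combinations

# Toy registrations: one set of courses per student (hypothetical names).
transactions = [
    {"prog1", "prog2", "datastruct"},
    {"prog1", "prog2", "datastruct"},
    {"prog1", "prog2"},
    {"prog1", "databases"},
]


def frequent_itemsets(transactions, min_count):
    """Brute-force enumeration of all itemsets with support count >= min_count."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            count = sum(1 for t in transactions if itemset <= t)
            if count >= min_count:
                result[itemset] = count
    return result


def closed_itemsets(freq):
    """Keep frequent itemsets that have no proper superset with equal support."""
    return {
        s: c for s, c in freq.items()
        if not any(s < t and c == freq[t] for t in freq)
    }


def maximal_itemsets(freq):
    """Keep frequent itemsets that have no frequent proper superset."""
    return {s: c for s, c in freq.items() if not any(s < t for t in freq)}


freq = frequent_itemsets(transactions, min_count=2)
closed = closed_itemsets(freq)
maximal = maximal_itemsets(freq)

for name, family in [("frequent", freq), ("closed", closed), ("maximal", maximal)]:
    print(name, len(family))
    for itemset, count in family.items():
        print("   ", sorted(itemset), count)
```

On this toy data the maximal itemsets form a subset of the closed ones, which in turn form a subset of all frequent itemsets. The closed itemsets still carry enough information to recover the support of every frequent itemset, whereas the maximal ones only record which itemsets are frequent.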
Again, try out different study methods, work out exercises, and experiment with the data! You may use existing implementations or write your own code. If you write your own, avoid nasty details and pay attention to the general principles. Recall that there are no right solutions to the problem.
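If you write your own code for the association rules mentioned above, the general principle can be sketched as follows. This is an illustration only: it again assumes toy registrations with made-up course names, uses lift as one possible choice for "some other interestingness measure", and generates rules from a single given itemset instead of from all frequent itemsets. The function names are hypothetical.

```python
from itertools import combinations

# Toy registrations: one set of courses per student (hypothetical names).
transactions = [
    {"prog1", "prog2", "datastruct"},
    {"prog1", "prog2", "datastruct"},
    {"prog1", "prog2"},
    {"prog1", "databases"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rules_from_itemset(itemset, transactions, min_conf):
    """Return (antecedent, consequent, confidence, lift) for each rule X -> Y
    with X and Y non-empty, X union Y = itemset, and confidence >= min_conf."""
    itemset = frozenset(itemset)
    supp_all = support(itemset, transactions)
    if supp_all == 0:
        return []  # itemset never occurs, so no rule is supported by the data
    rules = []
    for size in range(1, len(itemset)):
        for combo in combinations(sorted(itemset), size):
            antecedent = frozenset(combo)
            consequent = itemset - antecedent
            # confidence(X -> Y) = supp(X union Y) / supp(X)
            conf = supp_all / support(antecedent, transactions)
            # lift(X -> Y) = confidence(X -> Y) / supp(Y)
            lift = conf / support(consequent, transactions)
            if conf >= min_conf:
                rules.append((antecedent, consequent, conf, lift))
    return rules


for itemset in [{"prog1", "prog2"}, {"prog2", "datastruct"}]:
    for ante, cons, conf, lift in rules_from_itemset(itemset, transactions, 0.6):
        print(sorted(ante), "->", sorted(cons),
              f"confidence={conf:.2f}", f"lift={lift:.2f}")
```

In this toy data every student takes prog1, so the rule prog2 -> prog1 has full confidence but a lift of exactly 1: the antecedent tells us nothing new. The rule datastruct -> prog2 has the same confidence and a lift above 1, which is why confidence alone can be a misleading guide to interesting rules.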
Additional material
Here are pointers to additional material related to the new concepts in this problem and issues discussed in the class.
- Finding maximal frequent itemsets: Roberto J. Bayardo Jr: Efficiently Mining Long Patterns from Databases. In Proceedings ACM SIGMOD International Conference on Management of Data, June 1998, Seattle, Washington, USA. 85-93.
- Generating "representative" association rules using closed itemsets: Marzena Kryszkiewicz: Closed Set Based Discovery of Representative Association Rules. In Advances in Intelligent Data Analysis, Lecture Notes in Computer Science 2189. Springer 2001. 350-359. (The link should be accessible from the university network.)
- Entries for Frequent Pattern, Apriori Algorithm, Association Rule, and Frequent Itemset in Encyclopedia of Machine Learning (http://www.cs.helsinki.fi/u/htoivone/pubs/)
- Slides on closed sets and generators: http://www.cs.helsinki.fi/u/htoivone/teaching/timuS02/closedsets.pdf