Data Mining: Problem #2
Problem Description
The data and the question are essentially the same as in Problem #1, but this time we consider more advanced data mining concepts to try to obtain more useful results.
Based on the results for Problem #1, it is evident that frequent itemsets are not very useful as such. Some of them are simply not interesting, many are redundant, and dependencies between courses are not clearly expressed. A support threshold, or limits on the minimum and maximum size of frequent itemsets, is not sufficient to select a reasonably small number of the most interesting or descriptive itemsets.
This time, consider the use of maximal frequent itemsets and closed frequent itemsets as compact representations of frequent itemsets. Also consider association rules to describe statistical dependencies between courses. Use confidence and some other interestingness measures to identify rules that are potentially more useful.
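To make these concepts concrete, here is a minimal Python sketch on toy data. Everything in it is an assumption made for illustration: the transactions, the course codes "A" to "D", the function names, and the thresholds are hypothetical and not taken from the course dataset. The frequent itemsets are found by brute-force enumeration rather than Apriori, which is only workable for a handful of items. An itemset is treated as closed if no proper superset has the same support, and as maximal if no proper superset is frequent. Rules are scored with confidence and, as one example of another interestingness measure, lift.

```python
from itertools import combinations

# Toy transactions: each set is one student's courses (hypothetical codes).
TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "D"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_count):
    """Brute force over all candidate itemsets; suitable for toy data only."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support_count(candidate, transactions)
            if count >= min_count:
                frequent[candidate] = count
    return frequent


def closed_itemsets(frequent):
    """Frequent itemsets with no proper superset of equal support."""
    return {
        s: c for s, c in frequent.items()
        if not any(s < t and c == ct for t, ct in frequent.items())
    }


def maximal_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return {
        s: c for s, c in frequent.items()
        if not any(s < t for t in frequent)
    }


def association_rules(frequent, n_transactions, min_conf):
    """Rules lhs -> rhs from each frequent itemset, with confidence and lift."""
    rules = []
    for itemset, count in frequent.items():
        for k in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), k):
                lhs = frozenset(lhs_items)
                rhs = itemset - lhs
                # Subsets of a frequent itemset are frequent, so lookups succeed.
                conf = count / frequent[lhs]
                lift = conf / (frequent[rhs] / n_transactions)
                if conf >= min_conf:
                    rules.append((lhs, rhs, conf, lift))
    return rules


frequent = frequent_itemsets(TRANSACTIONS, min_count=2)
closed = closed_itemsets(frequent)
maximal = maximal_itemsets(frequent)
rules = association_rules(frequent, len(TRANSACTIONS), min_conf=0.9)

print(len(frequent), "frequent,", len(closed), "closed,", len(maximal), "maximal")
for lhs, rhs, conf, lift in rules:
    print(sorted(lhs), "->", sorted(rhs), f"conf={conf:.2f} lift={lift:.2f}")
```

On this toy data the maximal itemsets are the most compact summary but lose the supports of their subsets, whereas the closed itemsets retain enough information to recover the support of every frequent itemset. Comparing the sizes of the three collections on the real data, at several support thresholds, is one way to see how much redundancy each representation removes.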
Again, try out different study methods, work out exercises, and experiment with the data! You may use existing implementations or write your own code (which will definitely help you make sure you have understood the algorithms), and compare them. Try to avoid nasty details, especially if you write your own code, and pay attention to the general principles. Recall that there is no single right solution to the problem.
A new version of the dataset is available on the Moodle course page. It contains information about the program of each course, the years it was taught, its level, etc. Have a look at the README file for more details.
Additional resources
Here are pointers to material related to the new concepts in this problem:
- Entries for Frequent Pattern, Apriori Algorithm, Association Rule, and Frequent Itemset in Encyclopedia of Machine Learning (http://www.cs.helsinki.fi/u/htoivone/pubs/)
- Slides on closed sets and generators: http://www.cs.helsinki.fi/u/htoivone/teaching/timuS02/closedsets.pdf
To go deeper into the topic, you might be interested in these additional resources:
- Finding maximal frequent itemsets: Roberto J. Bayardo Jr: Efficiently Mining Long Patterns from Databases. In Proceedings ACM SIGMOD International Conference on Management of Data, June 1998, Seattle, Washington, USA. 85-93.
- Generating "representative" association rules using closed itemsets: Marzena Kryszkiewicz: Closed Set Based Discovery of Representative Association Rules. In Advances in Intelligent Data Analysis, Lecture Notes in Computer Science 2189. Springer 2001. 350-359. (The link should be accessible from the university network.)