Data Mining: Problem #1

Problem description

The Department of Computer Science has over the years accumulated a good amount of data about computer science courses taken by its students. While there are relatively clear recommended curricula, in addition to strict degree requirements, a general feeling is that students exercise their academic freedom and may deviate a lot from the recommendations. The department would like to know what the actual curricula taken by students are like. A sample of course registration data is available in Moodle in a simple and anonymous form.

Hints

The goal of this problem is to make you (very!) familiar with the concept of frequent itemsets, including their search space, the Apriori algorithm for finding them, and the practical use and implications of finding frequent itemsets in real data. Do many different exercises, both on paper and with a computer (see below), to gain an understanding of the related concepts and phenomena, their behavior and effects. Look for and write down the issues and problems that you run into, as well as possible approaches to solving them if you can think of any.
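
To make the level-wise idea behind Apriori concrete, here is a minimal Python sketch. It is not a reference implementation and is not tuned for large data. The function name apriori, the toy transactions, and the absolute support threshold of 3 are all made up for illustration. It relies only on the standard property that every subset of a frequent itemset is itself frequent.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets that occur
    in at least min_support transactions (absolute threshold)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count the individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: keep a candidate only if all of its
        # (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: one pass over the data per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Toy data (hypothetical): each set is the items of one transaction.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]
frequent_sets = apriori(transactions, min_support=3)
for itemset, support in sorted(frequent_sets.items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

With n distinct items the search space contains 2^n - 1 non-empty itemsets, so the pruning step is what keeps the number of counted candidates manageable. Tracing the toy data by hand for a few different thresholds is a useful paper exercise.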

Study the problem and possible approaches to find solutions. Do NOT look for the right answer, as there is none. Really. This is largely an open problem, with different good answers but probably no single perfect answer. (The rest of the course will cover more concepts and methods that are also suitable for this problem.)

Practical advice

Use the Moodle link above to access the data and to submit your report and journal.

Frequent itemset mining (using an implementation of the Eclat algorithm)
Obtain the Eclat Frequent Itemset Miner from Christian Borgelt's Web Page: http://www.borgelt.net/eclat.html

To see usage information for eclat, simply call it without parameters:
> ./eclat

Download some classical datasets from the FIMI repository (http://fimi.ua.ac.be/data/) to try it out.

All the datasets in the repository are in the input format accepted by eclat, i.e., one line per transaction containing the ids of the items present. Examples include "retail" and "chess", which contain market basket data and chess board positions, respectively (a more precise description of the data can be found online).
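
Before running a miner, it can help to inspect a file in this format yourself. The following Python sketch assumes whitespace-separated item ids, one transaction per line, as described above. The helper names read_transactions and item_supports and the three-line sample are hypothetical and not part of eclat or the repository.

```python
import io
from collections import Counter


def read_transactions(lines):
    """Parse an iterable of text lines into a list of item-id sets.
    Item ids are kept as strings; blank lines are skipped."""
    transactions = []
    for line in lines:
        items = line.split()
        if items:
            transactions.append(frozenset(items))
    return transactions


def item_supports(transactions):
    """Count in how many transactions each single item occurs."""
    counts = Counter()
    for t in transactions:
        counts.update(t)
    return counts


# Tiny in-memory sample (made up) standing in for a real data file.
sample = io.StringIO("1 2 3\n2 3\n1 3 4\n")
sample_transactions = read_transactions(sample)
supports = item_supports(sample_transactions)

print(len(sample_transactions), "transactions")
for item, count in supports.most_common():
    print(item, count)
```

To use it on a downloaded dataset, pass an open file object instead of the sample, for example the result of open() on wherever you saved the file. Looking at the number of transactions, the number of distinct items, and the distribution of item supports gives a first idea of which support thresholds are reasonable to try.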

Try out other datasets as well, including your own.