Data Mining : Problem #3
Problem description
This week you have a choice between two problems. Please choose one.
Problem 3A
An updated form of Problems #1 and #2:
The Department of Computer Science has over the years accumulated a good amount of data about computer science courses taken by its students. While there are relatively clear recommended curricula, in addition to strict degree requirements, a general feeling is that students exercise their academic freedom and may deviate a lot from the recommendations. The department would like to know what the actual curricula taken by students are like. A sample of course registration data is available in Moodle in a simple and anonymous form. A new version of the data contains the sequence of courses taken by each student, so the order of the courses is now available.
Problem 3B
Wikipedia (www.wikipedia.org) is a free online encyclopedia. Its entries can be tagged to belong to different categories. People, for instance, can belong to categories such as "members of the irish republican brotherhood" or "murdered educators". Given the open, collaborative nature of Wikipedia, it lacks a well-specified, rigid structure for the categories. Essentially, categories are just simple English language expressions. How could one find some structure (e.g. hierarchy) among a large set of category labels for people, such as those given in Moodle?
Hints
New concepts to be learned and used: sequential patterns and generalized Apriori (i.e., levelwise search).
Note that the course data has a more complex structure: it is not simply a sequence of courses, but a sequence of sets of courses (each set corresponds to courses taken parallel during one term). The course book covers this case. The wiki category data can be seen more simply as a sequence of words. Feel free to choose either one of these problems and settings.
Concepts you will need to refreshed and generalize: Apriori algorithm (or another frequent pattern algorithm) and candidate generation. How can their ideas be applied to mining frequent sequences? What is the search space? How can candidates be generated and pruned when dealing with sequences? How about subsequences (allowing gaps between items) versus substrings (no gaps) as sequential patterns?
It is also time to move to the third column of the learning objectives: implementing algorithms yourselves. When doing this, try to use a language and libraries that help you focus on the data mining algorithms, and avoid work on small details. There is no need to pay attention to efficiency here -- it is much more important that your program works right, and that you learn the essential things (cf. learning objectives!). (For instance, implementing efficient operations for handling sets or sequences is not a learning objective for this course, so avoid spending time on it. Spend your time on getting candidate generation right, instead.)