Data Mining : Problem #3

Problem description

This week you have a choice between two problems. Please choose one.

Problem 3A

Third and last version of the course data. This time the information about course registration is provided with temporal information, in the sense that it contains the sequence of courses each student enrolled to. So the order of the courses is now available.

Using this additional information, can you deepen your analysis of students enrollement to provide a better understanding of the type of curricula following? You are welcome to formulate your own question, i.e., define yourself precisely what patterns you are looking for in this dataset.

Problem 3B

Phrases extracted from blogs and news articles gathered over a year and grouped by month can be found online: www.memetracker.org/data.html#raw . This data can be analysed to identify catch phrases that have spread quickly around the web and study their evolution. This is called here <em>meme tracking</em>. A simpler version of meme tracking, or a first step toward that analysis would be to identify sequences of words that appear frequently within those phrases. The data is very large, you do not need to use the whole of it, nor to consider the full information, url, timestamp, etc. You can for example focus on the raw phrases (lines prefixed with Q) that appeared in April 2008 and try to find frequent word subsequences in that subset.

Hints

New concepts to be learned and used: sequential patterns and generalized Apriori (i.e., levelwise search).

Note that the course data has a more complex structure: it is not simply a sequence of courses, but a sequence of sets of courses (each set corresponds to courses taken parallel during one term). The course book covers this case. The wiki category data can be seen more simply as a sequence of words or a sequence of characters. Feel free to choose either one of these problems and settings.

Concepts you will need to refresh and generalize: Apriori algorithm (or another frequent pattern algorithm) and candidate generation. How can their ideas be applied to mining frequent sequences? What is the search space? How can candidates be generated and pruned when dealing with sequences? How about subsequences (allowing gaps between items) versus substrings (no gaps) as sequential patterns?

First define precisely the type of patterns you are looking for in the data, e.g. presence and length of gaps, total length of the sequences,  frequency count, maximality, etc. Then make sure your algorithm finds excatly the patterns you specified. After obtaining a set of patterns, you might want to refine you pattern specifications...

It is also time to move to the third column of the learning objectives: implementing algorithms yourselves. When doing this, try to use a language and libraries that help you focus on the data mining algorithms, and avoid work on small details. There is no need to pay attention to efficiency here -- it is much more important that your program works right, and that you learn the essential things (cf. learning objectives!). (For instance, implementing efficient operations for handling sets or sequences is not a learning objective for this course, so avoid spending time on it. Spend your time on getting candidate generation right, instead.)

 

Please sumbit your code together with the usual group report.