Data Mining: Problem #5
Problem #5 = Problem #3
Let's return to problem #3, sequence mining. The problem is the same (either course enrollments or Wikipedia categories), but this time you need to think -- and report -- more about what you do, how you do it, and why you do it like that.
Contents of the work
You are supposed to implement an algorithm. (You may continue and improve your previous one, and you may use and extend existing implementations in a non-trivial way.) Your algorithm should be based on sequence mining, not on frequent itemset mining. It should follow the spirit of frequent pattern mining as described in the book, i.e., explore the search space in a scalable manner (e.g., breadth-first search as in Apriori, depth-first search, FP-growth, or other approaches described in the book). Shortcuts using brute-force methods should not be used, even if they might work with these relatively small data sets.
Alternatively, you may continue working on Problem #4 (species and environments). The general remarks and requirements above still hold. However, instead of sequences, the emphasis is on clever use of continuous variables and hierarchies. The handling of at least one of these should be built into the algorithm (especially candidate generation), not done simply as a pre-processing step.
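To make "exploring the search space in a scalable manner" concrete, here is a minimal sketch of a breadth-first (Apriori-style) search over sequences. It is only an illustration under simplifying assumptions that are not part of the assignment: each data sequence is a tuple of single items, a pattern matches if it occurs as a subsequence with arbitrary gaps, and the names `contains` and `mine_level_wise` are hypothetical. Your own pattern language may well differ.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is respected.
    return all(item in it for item in pattern)


def mine_level_wise(database, min_support):
    """Return {pattern: support} for all patterns with support >= min_support."""

    def support(pattern):
        return sum(1 for seq in database if contains(seq, pattern))

    # Level 1: frequent single items.
    level = {}
    for item in sorted({x for seq in database for x in seq}):
        count = support((item,))
        if count >= min_support:
            level[(item,)] = count

    frequent = {}
    while level:
        frequent.update(level)
        # Join step: combine p and q when p without its first item
        # equals q without its last item.
        candidates = {
            p + q[-1:] for p in level for q in level if p[1:] == q[:-1]
        }
        # Prune step: dropping any single item must give a frequent pattern.
        candidates = {
            c for c in candidates
            if all(c[:i] + c[i + 1:] in level for i in range(len(c)))
        }
        # Count step: only surviving candidates are checked against the data.
        level = {}
        for c in candidates:
            count = support(c)
            if count >= min_support:
                level[c] = count
    return frequent
```

The point of the sketch is that support is counted only for candidates built from frequent patterns of the previous level, instead of enumerating every possible sequence.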
Report
The report should define, or at least describe in detail,
- the pattern language used,
  - including a specification of when a pattern matches a data sequence,
- the structure of the search space,
  - including a specification of when a pattern is a subpattern of another one,
- the candidate generation method (not the code, but the specification),
  - including a justification for it, explaining issues such as coverage of all frequent patterns and efficiency,
- the support counting procedure.
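As an illustration of the level of precision expected, the following sketch writes down one possible specification as code. It assumes, purely for the example, that patterns and data sequences are tuples of single items and that gaps are allowed between matched items. The function names and the toy course data are made up and are not part of the assignment.

```python
def matches(pattern, sequence):
    """A pattern matches a data sequence if its items occur in the
    sequence in the same order, not necessarily next to each other."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def is_subpattern(p, q):
    """Under this pattern language, p is a subpattern of q exactly when
    p matches q, so every sequence matched by q is also matched by p."""
    return matches(p, q)


def support(pattern, database):
    """Support = number of data sequences that the pattern matches."""
    return sum(1 for sequence in database if matches(pattern, sequence))


# Toy data, invented for this example only.
database = [
    ("intro", "algorithms", "databases"),
    ("intro", "databases"),
    ("algorithms", "intro"),
]
print(support(("intro", "databases"), database))  # prints 2
```

With definitions like these, the anti-monotonicity argument needed to justify candidate generation is short: if p is a subpattern of q, then the support of p is at least the support of q.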
Make sure that your implementation really works according to your specification!
Schedule
The specification part is submitted first, by Tue, 19 April. This is to help you focus on the problem first.
The rest of the report (including your solution to the problem and self-reflection on the group work) is submitted by Thu, 28 April.