Data Mining : Problem #4

This is the last assignment for this course. Again, you are given the choice between two problems.

Unlike previous assignments, this one is individual: you are expected to work on your own. You are nevertheless encouraged to discuss the problem and ideas with fellow students.
 

Problem 4A: Finding frequent socio-economic patterns.

This problem uses data collected during a past US census. It contains a set of attributes describing individuals living in the US, for example their age, profession, origin and income. We would like to find frequent patterns over these socio-economic indicators.

The data can be found online:
http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Some explanation about the attributes is provided at the very bottom of the following page:
http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
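
As a starting point, the file can be read as comma-separated text with one individual per line. The sketch below is only one possible way to do it: the column names follow the attribute list in adult.names, missing values appear as "?" in the file, and the two sample rows are illustrative lines written in the file's layout rather than a substitute for the real data. The function name parse_adult is hypothetical.

```python
import csv
import io

# Column names as listed in adult.names; the last column is the income label.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]


def parse_adult(text):
    """Parse the contents of adult.data into a list of dicts, one per person."""
    records = []
    # Fields are separated by a comma followed by a space.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        records.append(dict(zip(COLUMNS, (value.strip() for value in row))))
    return records


# Illustrative rows in the layout of adult.data (not the full dataset).
SAMPLE = """\
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K

54, Private, 140359, 7th-8th, 4, Divorced, ?, Unmarried, White, Female, 0, 3900, 40, United-States, <=50K
"""

records = parse_adult(SAMPLE)
print(len(records), records[0]["occupation"], records[1]["occupation"])
```

All values are kept as strings at this stage, so numerical attributes still need to be converted before discretization.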

The new topic to learn and apply here is frequent pattern mining on non-binary data, in particular categorical and continuous attributes and their binarization/discretization, as well as hierarchies.

The data contains different types of attributes:

  • continuous numerical (e.g. capital),
  • categorical (e.g. occupation), 
  • attributes that can be organised into a hierarchy (e.g. native country: individual countries can be grouped by regions, then by continents).
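
To make the three attribute types concrete, here is a minimal sketch of how one record could be turned into a set of "attribute=value" items. It is only an illustration: the age bins, the tiny country hierarchy and the helper names (discretize, ancestors, record_to_items) are all assumptions made for the example, and choosing good bins and groupings is part of the problem.

```python
# Hypothetical age bins: (item label, lower bound inclusive, upper bound exclusive).
AGE_BINS = [("age=<30", 0, 30), ("age=30-49", 30, 50), ("age=50+", 50, 200)]

# Hypothetical, deliberately tiny hierarchy: country -> region -> continent.
PARENT = {
    "United-States": "Northern-America",
    "Canada": "Northern-America",
    "Germany": "Western-Europe",
    "France": "Western-Europe",
    "Northern-America": "America",
    "Western-Europe": "Europe",
}


def discretize(value, bins):
    """Return the label of the bin that contains value, or None if no bin does."""
    for label, low, high in bins:
        if low <= value < high:
            return label
    return None


def ancestors(value, parent):
    """Return all ancestors of value in the hierarchy, nearest first."""
    chain = []
    while value in parent:
        value = parent[value]
        chain.append(value)
    return chain


def record_to_items(record):
    """Turn one record (a dict of strings) into a set of binary items."""
    items = set()
    # Continuous attribute: one item for the bin the value falls into.
    age_item = discretize(int(record["age"]), AGE_BINS)
    if age_item is not None:
        items.add(age_item)
    # Categorical attribute: one item per observed value, missing ("?") skipped.
    if record["occupation"] != "?":
        items.add("occupation=" + record["occupation"])
    # Hierarchical attribute: the value itself plus every ancestor.
    country = record["native-country"]
    if country != "?":
        items.add("native-country=" + country)
        for level in ancestors(country, PARENT):
            items.add("native-country=" + level)
    return items


example = {"age": "39", "occupation": "Adm-clerical", "native-country": "Germany"}
print(sorted(record_to_items(example)))
```

Adding the ancestors as extra items lets a pattern be frequent at the region or continent level even when no single country is frequent, at the cost of longer transactions.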

Here we are interested in how these attributes can be represented using binarization and discretization, and in how the basic frequent itemset mining algorithms can be adapted to best mine patterns from this database.
You will need to implement some data preprocessing and adapt an existing mining algorithm to take into account the specificities of the data generated by this preprocessing.
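
One possible adaptation, sketched below under stated assumptions, is a level-wise (Apriori-style) search that never builds an itemset containing two items of the same attribute. With items named "attribute=value", two values of one attribute either exclude each other (two age bins) or are redundant (a country together with its continent), so such candidates can be skipped before counting. The item naming scheme, the function names and the three sample transactions are hypothetical and only serve the illustration; other adaptations are equally valid.

```python
from itertools import combinations


def attribute_of(item):
    """Return the attribute part of an 'attribute=value' item."""
    return item.split("=", 1)[0]


def frequent_itemsets(transactions, min_count):
    """Level-wise search; returns {itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(current)
    k = 2
    while current:
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Adaptation: keep at most one item per attribute.
            if len({attribute_of(i) for i in union}) < k:
                continue
            # Standard pruning: every (k-1)-subset must be frequent.
            if all(frozenset(s) in current for s in combinations(union, k - 1)):
                candidates.add(union)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        result.update(current)
        k += 1
    return result


# Hypothetical preprocessed transactions (hierarchy ancestors already added).
TRANSACTIONS = [
    {"age=30-49", "occupation=Sales", "native-country=Germany", "native-country=Europe"},
    {"age=30-49", "occupation=Sales", "native-country=France", "native-country=Europe"},
    {"age=<30", "occupation=Sales", "native-country=Germany", "native-country=Europe"},
]

patterns = frequent_itemsets(TRANSACTIONS, min_count=2)
for itemset, count in sorted(patterns.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), count)
```

The same-attribute check does not interfere with the usual pruning, because every subset of an itemset with distinct attributes also has distinct attributes.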

Problem 4B: Comparing frequent pattern mining algorithms.

The purpose of this problem is to compare different methods for finding frequent patterns.

Select a few algorithms among those that can be found online, for example on the FIMI repository:
http://fimi.ua.ac.be/src/

The algorithms should be comparable, in the sense that they have very similar pattern languages but use different methods for finding them.

You should present a detailed comparison of the methods: the exact patterns they mine, how the search space is structured and explored, and how their performance varies.
The comparison should be theoretical as well as experimental. You should study the algorithms and run them on different datasets to compare their performance.
Common datasets can be found on the FIMI repository as well as on the UCI Machine Learning repository.
Of course, you are welcome to include an implementation of your own, for example from past problems, in the comparison...
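
For the experimental part, a small harness along the following lines may help. It is a sketch, not a prescribed setup: it assumes the usual FIMI dataset layout (one transaction per line, items as space-separated integers), and the two miners are simple stand-ins written for the example (a brute-force subset enumeration and an Eclat-style depth-first search on transaction-id sets), not the implementations from the repository. All function names are hypothetical.

```python
import time
from collections import Counter
from itertools import combinations


def read_fimi(lines):
    """Parse transactions: one per line, items as space-separated integers."""
    return [frozenset(int(x) for x in line.split()) for line in lines if line.strip()]


def mine_by_enumeration(transactions, min_count):
    """Brute force: count every subset of every transaction (tiny data only)."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def mine_eclat_style(transactions, min_count):
    """Depth-first search that intersects transaction-id sets."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    result = {}

    def extend(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < min_count:
                continue
            itemset = prefix | {item}
            result[itemset] = len(tids)
            # Only later items extend the prefix, so each itemset is visited once.
            extend(itemset, [(o, tids & otids) for o, otids in candidates[i + 1:]])

    extend(frozenset(), sorted(tidsets.items()))
    return result


def compare(miners, transactions, min_counts):
    """Time each miner per threshold and check that all miners agree."""
    rows = []
    for min_count in min_counts:
        reference = None
        for name, miner in miners.items():
            start = time.perf_counter()
            patterns = miner(transactions, min_count)
            elapsed = time.perf_counter() - start
            if reference is None:
                reference = patterns
            elif patterns != reference:
                raise ValueError(f"{name} disagrees at min_count={min_count}")
            rows.append((name, min_count, len(patterns), elapsed))
    return rows


MINERS = {"enumeration": mine_by_enumeration, "eclat-style": mine_eclat_style}
SAMPLE_LINES = ["1 2 3", "1 2", "2 3", "1 2 3 4"]  # toy data, not a FIMI dataset

rows = compare(MINERS, read_fimi(SAMPLE_LINES), min_counts=[2, 3])
for name, min_count, n_patterns, seconds in rows:
    print(f"{name:12s} min_count={min_count} patterns={n_patterns} time={seconds:.6f}s")
```

Checking that all miners return the same patterns before comparing run times guards against comparing algorithms that in fact mine different pattern languages. Timings on toy data are meaningless, so real measurements should use the larger datasets and several repetitions.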

 

Reporting.

Work this week is individual, and so is reporting. You will need to submit two reports: a technical report and a learning journal.

The technical report is the individual equivalent of section 1 (substance) of the previous weeks' group reports (see the Reporting tab). Problem 4B requires more detailed and structured reporting on the compared methods, whereas Problem 4A involves more implementation work. In either case, the report should not exceed two pages, excluding figures and tables. You will not give an oral presentation of the work, so be particularly clear and precise in your written report. You must mention any collaboration with other students on solving the problem.

The expected content of the personal learning journal is similar to previous weeks. In addition, you should include a synthesis section listing the important points you learnt during this course and summarising your observations about the organisation of the course, the group work, etc. Constructive criticism and suggestions for improvement are welcome. These summary points will be discussed and presented by groups during the last session.

The deadline for reporting is the evening before the last session, i.e. Wed. 25th.