Project in Information-Theoretic Modeling: Details for problem 3

The data for problem 3 is "Shuttle (statlog)" from the UCI data repository. The file you are to produce is

shuttle.class

and you may use side information
shuttle.side

This is a classic classification data set of 58,000 instances with 7 classes and 9 numerical predictor variables. About 80% of all instances belong to class 1. Because the instances have been shuffled, the data can be assumed to be i.i.d., even though time is in fact one of the predictor variables. A brief description of the data set can be found at

http://archive.ics.uci.edu/ml/datasets/Statlog+%28Shuttle%29
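
Since about 80% of the instances belong to class 1, the class labels are already cheap to encode without any side information. The following is a minimal sketch of that baseline, assuming a sequential add-one (Laplace) estimator over the 7 classes. The function name and the toy label sequence are made up for illustration and are not part of the project files.

```python
import math
from collections import Counter

def adaptive_code_length_bits(labels, num_classes=7):
    """Ideal code length in bits of a label sequence under a sequential
    add-one (Laplace) estimator. No side information is used."""
    counts = Counter()
    total_bits = 0.0
    for seen, label in enumerate(labels):
        # Predictive probability of this label given the labels seen so far.
        p = (counts[label] + 1) / (seen + num_classes)
        total_bits -= math.log2(p)
        counts[label] += 1
    return total_bits

# Toy sequence (made up, not the real data): 8 of 10 labels are class 1.
toy_labels = [1, 1, 4, 1, 1, 1, 5, 1, 1, 1]
bits = adaptive_code_length_bits(toy_labels)

# Cost of a fixed-length code that ignores the skewed class distribution.
fixed_bits = len(toy_labels) * math.log2(7)

print(f"adaptive: {bits:.2f} bits, fixed-length: {fixed_bits:.2f} bits")
```

Any method that uses the side information should be compared against a baseline of this kind, because the skewed class distribution alone accounts for part of the compression.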

One way of getting an idea about the data is to use B-Course, which can be found at

http://b-course.cs.helsinki.fi/obc/

B-Course requires a specific data format and has a 1M size constraint. Here is a truncated version of the data in a suitable format:

shuttle.bc

Standard methods such as Naive Bayes, logistic regression, and SVM should perform reasonably well, i.e., enable you to compress the data to a fraction of its original size. Because all of them predict the classes well, the simplicity of the model and the length of its encoding will play an important role.
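
The trade-off between prediction quality and model size can be sketched as a two-part code length: the bits needed to describe the model plus the bits needed to encode the classes given the model's predicted probabilities. This is only an illustration under stated assumptions. The function names, the toy predictions, and the fixed cost of 32 bits per parameter are hypothetical and are not prescribed by the project.

```python
import math

def data_bits(labels, predicted):
    """Sum of -log2 p(label | side information) over all instances.
    `predicted` holds one dict per instance mapping class -> probability.
    Every true label must have nonzero probability."""
    return sum(-math.log2(probs[label]) for label, probs in zip(labels, predicted))

def two_part_bits(labels, predicted, num_params, bits_per_param=32):
    """Data cost plus a crude model cost.
    Assumption: each parameter is stored with a fixed width."""
    return num_params * bits_per_param + data_bits(labels, predicted)

# Toy example (made up): four instances and two competing models.
toy_labels = [1, 1, 4, 1]
flat = [{1: 0.8, 4: 0.2}] * 4                      # ignores side information
sharp = [{1: 0.9, 4: 0.1}, {1: 0.9, 4: 0.1},
         {1: 0.2, 4: 0.8}, {1: 0.9, 4: 0.1}]       # uses side information

simple_total = two_part_bits(toy_labels, flat, num_params=1)
complex_total = two_part_bits(toy_labels, sharp, num_params=20)

print(f"simple model:  {simple_total:.2f} bits")
print(f"complex model: {complex_total:.2f} bits")
```

On four instances the model cost dominates and the simpler model wins. With 58,000 instances the data cost carries far more weight, so a richer model may pay for itself, but only if its parameters are encoded compactly.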