Introduction to Machine Learning: Lectures

This page briefly summarises the contents of the lectures.  The purpose is to help students who do not attend keep track of what is going on.

"Lecture slides" means this set.  In page numbering, "page" means text book, "slide" means lecture slides.


Week 1

Lecture 1 (Tuesday 27 October) covered lecture slides 1–36: a brief introduction to the topic and matters related to the organisation of the course.  There was also a quiz about prerequisites.

Lecture 2 (Friday 30 October) covered slides 37–61.  Feel free to ignore the bit about chess unless you are personally interested.

Overall, the lectures for week 1 correspond to the Prologue and Chapter 1 of the textbook.  (We didn't quite finish Chapter 1; the rest will be covered next Tuesday.)  However, there are some differences in how the material is presented.  In particular, regarding Chapter 1, the lectures present many topics more briefly.  Our aim is to get faster to the point where we can start actually doing things with learning algorithms.  You should still read the entire Prologue and Chapter 1 to get a perhaps slightly different point of view.  There are some more technical topics to which we will return later during the course, so you should not spend too much time on them if they seem unclear.  Such topics include Bayes optimality and naive Bayes.

In contrast, one topic that is presented in more detail in the first week's lectures is distance and similarity measures, which will be the main topic of our first homework assignment.  Related to that, you should additionally read Section 8.1 of the textbook.
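
If you want something concrete to play with before the homework, here is a small illustrative Python sketch (not official course material) of two such measures, Euclidean distance and cosine similarity; it assumes NumPy is available.

    import numpy as np

    def euclidean_distance(x, y):
        # Square root of the sum of squared coordinate differences.
        return np.sqrt(np.sum((x - y) ** 2))

    def cosine_similarity(x, y):
        # Cosine of the angle between the vectors: 1 = same direction, 0 = orthogonal.
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 2.0, 1.0])
    print(euclidean_distance(x, y), cosine_similarity(x, y))

Note that the first is a distance (smaller means more alike) while the second is a similarity (larger means more alike); Section 8.1 discusses this distinction.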


Week 2

Lecture 3 (Tuesday 3 November) first finished Chapter 1 of the textbook by covering models (Section 1.2).  However, except for geometric models, the lectures (slides 62–72) have much less detail than the textbook.  We then covered the basic concepts of classification (Sections 2.0 and 2.1 in the textbook, slides 73–88).

Lecture 4 (Friday 6 November) concentrated on Bayes optimality and related topics (slides 89–101; we also discussed some additional examples on the blackboard).  Bayes optimality is introduced in the textbook very briefly on pages 28–29, but based on previous experience we used more time on it.


Week 3

Lecture 5 (Tuesday 10 November) focussed mainly on scoring, ranking, and ROC curves and other visualisations of ranking performance (slides 102–117, pages 61–72).  At the end of the lecture we started the discussion of generalisation (the performance of a model on the training set vs. the test set), which, as explained in the slides, is not covered in any single place in the textbook (slides 118–126).
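
For those who want to experiment on their own, the following rough sketch (not taken from the slides) shows one way to trace ROC points from classifier scores by sweeping the decision threshold; NumPy is assumed, and ties in the scores are ignored for simplicity.

    import numpy as np

    def roc_points(scores, labels):
        # Sort instances by decreasing score and sweep the threshold,
        # collecting (false positive rate, true positive rate) points.
        order = np.argsort(-scores)
        labels = labels[order]
        pos, neg = labels.sum(), len(labels) - labels.sum()
        tp = np.cumsum(labels)        # true positives above each threshold
        fp = np.cumsum(1 - labels)    # false positives above each threshold
        return np.concatenate(([0.0], fp / neg)), np.concatenate(([0.0], tp / pos))

    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4])
    labels = np.array([1, 1, 0, 1, 0])   # 1 = positive, 0 = negative
    fpr, tpr = roc_points(scores, labels)
    print(list(zip(fpr, tpr)))

Plotting tpr against fpr gives the ROC curve discussed on the slides and on pages 61–72.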

Lecture 6 (Friday 13 November) covered the rest of the discussion on generalisation (slides 127–152; however, we skipped slides 130–131 and 151–152, which contained some more theoretical asides).  Important practical points that will be further practised in homework include using a separate validation set to get an unbiased estimate of model performance, and cross-validation as an efficient empirical tool for choosing the right model complexity.  We also discussed a couple of extra slides on Bayes optimality and error in the common case of binary classification with 0-1 loss when the instance distribution is continuous.
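
As an illustration of the cross-validation idea, a minimal sketch is given below.  It is not the homework solution; the train and evaluate functions are hypothetical placeholders for whatever learning algorithm and error measure you happen to be using, and NumPy is assumed.

    import numpy as np

    def k_fold_cv_error(X, y, train, evaluate, k=5, seed=0):
        # Split the data into k folds; each fold in turn serves as the
        # validation set while the model is trained on the remaining folds.
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), k)
        errors = []
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            model = train(X[trn], y[trn])
            errors.append(evaluate(model, X[val], y[val]))
        return np.mean(errors)   # average validation error over the k folds

Running this for models of different complexity and picking the one with the lowest average validation error is the model selection use we discussed in the lecture.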


Week 4

Lecture 7 (Tuesday 17 November) covered decision trees (slides 154–173) and an introduction to rule sets (slides 174–178). Compared to the textbook, we more or less covered pages 129–138 on decision trees. Trees in ranking and probabilistic prediction were discussed only briefly in the lecture. This corresponds mainly to the basic idea of how to do this (last paragraph on page 141) and reduced error pruning (pages 142–143 and Algorithm 5.3, although this is not really about ranking).
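
Purely as an illustration of the impurity-based splitting behind decision tree learning (assuming binary 0/1 labels and a single numeric feature; this is a sketch, not the textbook's algorithm), here is a small example that picks the threshold minimising the weighted entropy of the two children.

    import numpy as np

    def entropy(y):
        # Impurity of a set of binary 0/1 labels.
        p = np.mean(y)
        if p == 0 or p == 1:
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def best_split(x, y):
        # Try thresholds halfway between consecutive distinct feature values and
        # return the one minimising the weighted entropy of the two children.
        u = np.unique(x)
        best_t, best_score = None, np.inf
        for t in (u[:-1] + u[1:]) / 2:
            left, right = y[x <= t], y[x > t]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if score < best_score:
                best_t, best_score = t, score
        return best_t, best_score

A full tree learner applies this kind of search recursively to each node, as described on pages 129–138.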

Lecture 8 (Friday 20 November) covered the main part of rule sets (slides 179–186). The material covered in class basically corresponds to Sections 6.0, 6.1 and 6.2 in the textbook, but on a somewhat more superficial level. In particular, we did not discuss coverage curve analysis, and probabilistic rules were introduced only in their most basic form. We then moved on to linear models, of which we covered linear regression in the univariate and multivariate cases (slides 187–202; pages 194–204).
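
As a rough sketch of multivariate linear regression fitted by least squares (illustration only; NumPy assumed, and the notation is not the textbook's):

    import numpy as np

    def fit_linear_regression(X, y):
        # Append a constant column for the intercept and solve the
        # least-squares problem min_w ||Xw - y||^2.
        Xb = np.hstack([X, np.ones((len(X), 1))])
        w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        return w   # last component is the intercept

    def predict(w, X):
        return X @ w[:-1] + w[-1]

    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([2.1, 3.9, 6.2, 8.1])   # roughly y = 2x
    w = fit_linear_regression(X, y)
    print(w, predict(w, X))

The univariate case from the lecture is just the special case where X has a single column.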


Week 5

Lecture 9 (Tuesday 24 November) covered mainly linear classification, in particular the Perceptron algorithm (slides 203–220; pages 204–211). We also introduced the Pocket algorithm, which is not in the textbook but will be needed in the homework to use the Perceptron on non-separable data.
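
The sketch below shows one common way of combining the Perceptron update with a "pocket" that remembers the best weight vector seen so far.  It is only an illustration, not the reference implementation for the homework; it assumes labels in {-1, +1} and that a bias column has already been added to the data, and NumPy is available.

    import numpy as np

    def pocket_perceptron(X, y, epochs=100):
        # Standard Perceptron updates, but keep ("pocket") the weight vector
        # that has made the fewest training errors so far.
        w = np.zeros(X.shape[1])
        best_w, best_errors = w.copy(), np.inf
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (xi @ w) <= 0:            # misclassified: Perceptron update
                    w = w + yi * xi
                    errors = np.sum(np.sign(X @ w) != y)
                    if errors < best_errors:      # better than anything seen so far
                        best_w, best_errors = w.copy(), errors
        return best_w

On non-separable data the plain Perceptron never settles down, which is exactly why returning the pocketed weights is useful.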

Lecture 10 (Friday 27 November) finished up linear models by introducing the 1-vs-rest and 1-vs-1 methods of applying binary classification algorithms to multiclass tasks (slides 221–228).  Notice that these techniques can be used with any binary classifier, not just linear ones, and are in the textbook on pages 82–84.  We then started the section on probabilistic models with a detailed example of naive Bayes classification with categorical variables.  This corresponds to slides 229–237, although we will return to the general theory in the next lecture.  In the textbook, this corresponds more or less to Sections 9.0 and 9.2, one major difference being that the textbook includes a lot of specifics about the application to document classification (including the multinomial model), which we'll discuss in the next lecture.
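
To make the 1-vs-rest idea concrete, here is a small sketch; train_binary and score are hypothetical placeholders for any binary learner that produces real-valued scores (they are not functions from the course code), and NumPy is assumed.

    import numpy as np

    def one_vs_rest_train(X, y, train_binary):
        # Train one binary classifier per class: that class versus all the others.
        classes = np.unique(y)
        return classes, [train_binary(X, np.where(y == c, 1, -1)) for c in classes]

    def one_vs_rest_predict(classes, models, score, X):
        # Each model scores every instance; predict the class whose model scores highest.
        scores = np.column_stack([score(m, X) for m in models])
        return classes[np.argmax(scores, axis=1)]

The 1-vs-1 variant instead trains one classifier per pair of classes and lets them vote.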


Week 6

Lecture 11 (Tuesday 1 December) had a quick review of general ideas of generative models and naive Bayes (slides 230–239) and details about naive Bayes in text classification (slides 240–242). We also discussed in some detail the properties of multivariate Gaussians (slides 243–249).

Lecture 12 (Friday 4 December) covered generative models and in particular naive Bayes with continuous input features (slides 250–255). We wrapped up the section on probabilistic models with logistic regression (slides 256–261).  Altogether, our lectures on probabilistic models covered Sections 9.0 to 9.3 of the textbook, although we progressed in a different order (starting with categorical features), omitted some mathematical derivations (pages 272–273 and 284–285), and tried to give an alternative intuitive explanation of using multivariate Gaussians.
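
As an illustrative sketch of naive Bayes with continuous features, fitting one Gaussian per class and feature (a common choice, though not necessarily exactly what the slides do), assuming NumPy:

    import numpy as np

    def fit_gaussian_nb(X, y):
        # For each class: prior, per-feature mean and variance
        # (the naive independence assumption is over the features).
        params = {}
        for c in np.unique(y):
            Xc = X[y == c]
            params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
        return params

    def predict_gaussian_nb(params, X):
        # Compare log prior + sum of per-feature Gaussian log-likelihoods.
        def log_post(x, prior, mu, var):
            return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        classes = list(params)
        return np.array([classes[np.argmax([log_post(x, *params[c]) for c in classes])]
                         for x in X])

This is the continuous-feature counterpart of the categorical example from Lecture 10: only the per-feature likelihood model changes.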

For clustering, we covered the general setting (slides 262–270) and introduced the K-means and K-medoids algorithms (slides 271–278, 283–284) and hierarchical clustering (slides 291–300). The gaps left in our coverage of clustering will be filled in Lecture 13.  Overall, our discussion of clustering follows rather closely the material in textbook Sections 8.4 and 8.5, although we are again progressing in a slightly different order.
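
Finally, for reference, a minimal K-means sketch (Lloyd's algorithm).  This is an illustration only, NumPy is assumed, and K-medoids and hierarchical clustering are left to the slides.

    import numpy as np

    def k_means(X, k, iters=100, seed=0):
        # Alternate between assigning points to the nearest centroid and
        # moving each centroid to the mean of the points assigned to it.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            assign = np.argmin(dists, axis=1)
            new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                      else centroids[j] for j in range(k)])
            if np.allclose(new_centroids, centroids):   # converged
                break
            centroids = new_centroids
        return centroids, assign

K-medoids follows the same alternating pattern but restricts the cluster centres to be actual data points.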