Multi-armed bandits in machine learning
The multi-armed bandit problem captures, in its most basic form, the exploration-exploitation dilemma that is central to reinforcement learning.
Suppose you are standing in front of a long row of slot machines ("one-armed bandits"). You are given a finite supply of coins that you must play, and you keep all your winnings. The machines are not assumed to be identical, so playing one machine might be much better than playing another, but initially you know nothing about which machines are good and which are bad. To maximise your total winnings, how should you balance exploration (playing machines you haven't tried yet) and exploitation (playing machines that have given good results so far)?
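To make the trade-off concrete, here is a minimal sketch of the well-known UCB1 index policy, which handles the dilemma by always playing the machine with the highest optimistic confidence bound on its mean payoff. The Bernoulli success probabilities and the number of rounds are illustrative, and this is only one of many possible strategies.

```python
import math
import random

def ucb1(means, n_rounds, rng=random.Random(0)):
    """Play n_rounds pulls on Bernoulli machines with the given success
    probabilities (unknown to the player), using the UCB1 index policy."""
    k = len(means)
    counts = [0] * k      # times each machine has been played
    totals = [0.0] * k    # total reward obtained from each machine
    winnings = 0.0
    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1   # play every machine once to initialise estimates
        else:
            # Index = empirical mean + confidence radius: machines with high
            # observed payoff OR few plays get priority, balancing the two.
            arm = max(range(k),
                      key=lambda i: totals[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        winnings += reward
    return winnings, counts

winnings, counts = ucb1(means=[0.2, 0.5, 0.7], n_rounds=10_000)
print(winnings, counts)   # most pulls should go to the 0.7 machine
```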
The problem has traditionally been analysed from a statistical point of view. More recently, algorithms have also been developed based on regret analysis, as used in online machine learning, where probabilistic assumptions are avoided. One topic of particular recent interest is developing computationally efficient algorithms for the case where the number of "machines" is very large but the machines are related to each other. For example, each "machine" might represent a path through a weighted graph with unknown weights, and the amount you "win" by choosing a path is the sum of the weights of its edges. Since different paths share edges, knowing the weights along one path also gives information about many other paths.
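The following toy sketch illustrates how shared edges let one pull inform many paths. For simplicity it assumes semi-bandit feedback (the weight of each traversed edge is observed, not just their sum, which is the harder setting studied in the Cesa-Bianchi and Lugosi paper below) and uses a hand-picked three-path graph and a simple ε-greedy rule; all names and parameters are illustrative.

```python
import random

# Each "machine" is a path, i.e. a set of edges; its payoff is the sum of
# its edges' random weights. Edge "a" is shared by two paths, edge "c" by
# two others, so pulling one path updates estimates relevant to its rivals.
edge_means = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.4}   # unknown to player
paths = {"top": ("a", "c"), "middle": ("a", "d"), "bottom": ("b", "c")}

rng = random.Random(0)
counts = {e: 0 for e in edge_means}    # observations per edge
totals = {e: 0.0 for e in edge_means}  # summed observed weights per edge

def estimate(path):
    # Estimated payoff of a path = sum of per-edge empirical means,
    # with an optimistic default of 1.0 for edges never observed.
    return sum(totals[e] / counts[e] if counts[e] else 1.0
               for e in paths[path])

for t in range(2000):
    if rng.random() < 0.1:              # explore: random path
        choice = rng.choice(list(paths))
    else:                               # exploit: best current estimate
        choice = max(paths, key=estimate)
    for e in paths[choice]:             # semi-bandit feedback per edge
        counts[e] += 1
        totals[e] += rng.gauss(edge_means[e], 0.1)

print({p: round(estimate(p), 2) for p in paths})  # "top" should dominate
```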
The thesis would be based mainly on the literature, which is often somewhat mathematical but does not require knowledge of advanced mathematics. Depending on the student's interests, it may be possible to include some empirical comparison of the algorithms.
Some references (mainly about non-statistical approaches):
- Nicolò Cesa-Bianchi, Gábor Lugosi (2010). Combinatorial bandits. Submitted for journal publication.
- Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, Robert E. Schapire (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32(1):48–77.
- Peter Auer (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3:397–422.