Abstract
In this paper, we consider a time-varying stochastic multi-armed bandit (MAB) problem where the unknown reward distribution of each arm can change arbitrarily over time. We obtain a lower bound on the regret order and demonstrate that an online learning algorithm achieves this lower bound. We further consider a piece-wise stationary model of the arm reward distributions and establish the regret performance of an online learning algorithm in terms of the number of change points experienced by the reward distributions over the time horizon.
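The record does not spell out the paper's algorithm, so as a rough illustration of the piecewise-stationary setting described in the abstract, here is a minimal discounted-UCB-style sketch — a standard approach for bandits with change points, not necessarily the authors' method. All names and parameter values (`gamma`, `xi`) are illustrative assumptions.

```python
import math

def discounted_ucb(reward_fns, horizon, gamma=0.98, xi=2.0):
    """Discounted-UCB sketch for a piecewise-stationary bandit (illustrative,
    not the paper's algorithm).

    reward_fns: one callable per arm, mapping round t to a reward in [0, 1].
    gamma:      discount factor; old observations are down-weighted so the
                index can track changes in the reward distributions.
    xi:         exploration constant in the confidence padding.
    Returns the list of arms pulled at each round.
    """
    k = len(reward_fns)
    disc_count = [0.0] * k  # discounted number of pulls per arm
    disc_sum = [0.0] * k    # discounted cumulative reward per arm
    pulls = []
    for t in range(horizon):
        if t < k:
            arm = t  # pull each arm once to initialize the statistics
        else:
            total = sum(disc_count)
            # index = discounted empirical mean + exploration padding
            arm = max(
                range(k),
                key=lambda i: disc_sum[i] / disc_count[i]
                + math.sqrt(xi * math.log(total) / disc_count[i]),
            )
        reward = reward_fns[arm](t)
        for i in range(k):  # decay all past statistics ...
            disc_count[i] *= gamma
            disc_sum[i] *= gamma
        disc_count[arm] += 1.0  # ... then record the new sample
        disc_sum[arm] += reward
        pulls.append(arm)
    return pulls
```

With two arms whose rewards swap at a change point (e.g. arm 0 pays 1 before round 500 and 0 after, arm 1 the reverse), the pull sequence shifts to the newly better arm within roughly 1/(1 − gamma) rounds, since the discounting lets the stale statistics decay.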
Original language | English |
---|---|
Title of host publication | Conference Record of the 48th Asilomar Conference on Signals, Systems and Computers |
Editors | Michael B. Matthews |
Pages | 2103-2107 |
Number of pages | 5 |
ISBN (Electronic) | 9781479982974 |
State | Published - Apr 24 2015 |
Event | 48th Asilomar Conference on Signals, Systems and Computers, ACSSC 2015 - Pacific Grove, United States. Duration: Nov 2 2014 → Nov 5 2014 |
Publication series
Name | Conference Record - Asilomar Conference on Signals, Systems and Computers |
---|---|
Volume | 2015-April |
ISSN (Print) | 1058-6393 |
Conference
Conference | 48th Asilomar Conference on Signals, Systems and Computers, ACSSC 2015 |
---|---|
Country/Territory | United States |
City | Pacific Grove |
Period | 11/2/14 → 11/5/14 |
Bibliographical note
Publisher Copyright: © 2014 IEEE.
ASJC Scopus subject areas
- Signal Processing
- Computer Networks and Communications