Time-varying stochastic multi-armed bandit problems

Sattar Vakili, Qing Zhao, Yuan Zhou

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

7 Scopus citations

Abstract

In this paper, we consider a time-varying stochastic multi-armed bandit (MAB) problem where the unknown reward distribution of each arm can change arbitrarily over time. We obtain a lower bound on the regret order and demonstrate that an online learning algorithm achieves this lower bound. We further consider a piece-wise stationary model of the arm reward distributions and establish the regret performance of an online learning algorithm in terms of the number of change points experienced by the reward distributions over the time horizon.
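The paper's algorithms are not reproduced in this record, but a standard baseline for the piece-wise stationary setting it describes is sliding-window UCB, which discards observations older than a fixed window so that per-arm estimates can track change points in the reward distributions. The sketch below is purely illustrative (the function name, parameters, and the exploration constant are assumptions, not the authors' method):

```python
import math

def sliding_window_ucb(arms, horizon, window):
    """Illustrative sliding-window UCB for piece-wise stationary bandits.

    `arms` is a list of callables returning a reward in [0, 1]. Only the
    most recent `window` plays inform each arm's estimate, so the policy
    can adapt after a change point. Not the paper's algorithm.
    """
    history = []  # (arm index, reward) pairs for the last `window` rounds
    total = 0.0
    for _ in range(horizon):
        # Recompute windowed counts and reward sums for each arm.
        counts = [0] * len(arms)
        sums = [0.0] * len(arms)
        for i, r in history:
            counts[i] += 1
            sums[i] += r
        # Play each arm once (within the window) before trusting the bounds.
        untried = [i for i in range(len(arms)) if counts[i] == 0]
        if untried:
            choice = untried[0]
        else:
            n = len(history)
            # Pick the arm maximizing windowed mean + confidence bonus.
            choice = max(
                range(len(arms)),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(n) / counts[i]),
            )
        reward = arms[choice]()
        total += reward
        history.append((choice, reward))
        if len(history) > window:
            history.pop(0)  # forget observations outside the window
    return total
```

Forgetting old samples trades some stationary-case regret for the ability to recover after each change point, which is the tension the paper's regret bounds quantify in terms of the number of change points.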

Original language: English
Title of host publication: Conference Record of the 48th Asilomar Conference on Signals, Systems and Computers
Editors: Michael B. Matthews
Pages: 2103-2107
Number of pages: 5
ISBN (Electronic): 9781479982974
State: Published - Apr 24, 2015
Event: 48th Asilomar Conference on Signals, Systems and Computers, ACSSC 2015 - Pacific Grove, United States
Duration: Nov 2, 2014 - Nov 5, 2014

Publication series

Name: Conference Record - Asilomar Conference on Signals, Systems and Computers
Volume: 2015-April
ISSN (Print): 1058-6393

Conference

Conference: 48th Asilomar Conference on Signals, Systems and Computers, ACSSC 2015
Country/Territory: United States
City: Pacific Grove
Period: 11/2/14 - 11/5/14

Bibliographical note

Publisher Copyright:
© 2014 IEEE.

ASJC Scopus subject areas

  • Signal Processing
  • Computer Networks and Communications
