Publisher: Institute for Operations Research and the Management Sciences (INFORMS), Applied Probability Society
Abstract: We consider a two-armed bandit problem which involves sequential sampling from two non-homogeneous populations. The response in each is determined by a random covariate vector and a vector of parameters whose values are not known a priori. The goal is to maximize cumulative expected reward. We study this problem in a minimax setting, and develop rate-optimal policies that combine myopic action based on least squares estimates with a suitable "forced sampling" strategy. It is shown that the regret grows logarithmically in the time horizon n and that no policy can achieve a slower growth rate over all feasible problem instances. In this setting of linear response bandits, the identity of the sub-optimal action changes with the values of the covariate vector, and the optimal policy is subject to sampling from the inferior population at a rate that grows like √n.
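As a rough illustration of the policy structure the abstract describes (myopic least squares play plus forced sampling), here is a minimal Python sketch. The geometric forced-sampling schedule, the small ridge term that keeps the design matrix invertible, and all variable names are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 2000
beta = [rng.normal(size=d), rng.normal(size=d)]   # unknown arm parameters

# Per-arm least squares statistics (tiny ridge keeps X'X invertible early on).
A = [1e-6 * np.eye(d) for _ in range(2)]          # running X'X
b = [np.zeros(d) for _ in range(2)]               # running X'y

# Forced-sampling schedule: a sparse (geometric) set of times at which the
# policy explores regardless of the estimates -- an illustrative choice.
forced = set()
m = 1
while m <= n:
    forced.update((m, m + 1))                     # one forced pull per arm per epoch
    m *= 2

total_reward = 0.0
for t in range(1, n + 1):
    x = rng.normal(size=d)                        # covariate vector for round t
    if t in forced:
        arm = t % 2                               # forced exploration, alternating arms
    else:
        est = [np.linalg.solve(A[i], b[i]) for i in range(2)]
        arm = int(x @ est[1] > x @ est[0])        # myopic choice via estimated rewards
    y = x @ beta[arm] + rng.normal()              # linear response plus noise
    A[arm] += np.outer(x, x)                      # update least squares statistics
    b[arm] += y * x
    total_reward += y

print(f"cumulative reward over {n} rounds: {total_reward:.1f}")
```

The forced-sampling component guards against the failure mode of a purely greedy rule, which can lock onto one arm before its least squares estimates are reliable; the abstract's point is that a suitably calibrated schedule of this kind attains the logarithmic minimax regret rate.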