Publisher: Institute for Operations Research and the Management Sciences (INFORMS), Applied Probability Society
Abstract: We consider a two-armed bandit problem which involves sequential sampling from two non-homogeneous populations. The response in each is determined by a random covariate vector and a vector of parameters whose values are not known a priori. The goal is to maximize cumulative expected reward. We study this problem in a minimax setting, and develop rate-optimal policies that combine myopic action based on least squares estimates with a suitable "forced sampling" strategy. It is shown that the regret grows logarithmically in the time horizon n and that no policy can achieve a slower growth rate over all feasible problem instances. In this setting of linear response bandits, the identity of the sub-optimal action changes with the values of the covariate vector, and the optimal policy is subject to sampling from the inferior population at a rate that grows like √n.
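As a rough illustration of the policy structure the abstract describes (myopic least squares play plus forced sampling), here is a minimal Python sketch. The geometric forced-sampling schedule, the small ridge term that keeps the design matrix invertible, and all variable names are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 2000
beta = [rng.normal(size=d), rng.normal(size=d)]   # unknown arm parameters

# Per-arm least squares statistics (tiny ridge keeps X'X invertible early on).
A = [1e-6 * np.eye(d) for _ in range(2)]          # running X'X
b = [np.zeros(d) for _ in range(2)]               # running X'y

# Forced-sampling schedule: a sparse (geometric) set of times at which the
# policy explores regardless of the estimates -- an illustrative choice.
forced = set()
m = 1
while m <= n:
    forced.update((m, m + 1))                     # one forced pull per arm per epoch
    m *= 2

total_reward = 0.0
for t in range(1, n + 1):
    x = rng.normal(size=d)                        # covariate vector for round t
    if t in forced:
        arm = t % 2                               # forced exploration, alternating arms
    else:
        est = [np.linalg.solve(A[i], b[i]) for i in range(2)]
        arm = int(x @ est[1] > x @ est[0])        # myopic choice via estimated rewards
    y = x @ beta[arm] + rng.normal()              # linear response plus noise
    A[arm] += np.outer(x, x)                      # update least squares statistics
    b[arm] += y * x
    total_reward += y

print(f"cumulative reward over {n} rounds: {total_reward:.1f}")
```

The forced-sampling component guards against the failure mode of a purely greedy rule, which can lock onto one arm before its least squares estimates are reliable; the abstract's point is that a suitably calibrated schedule of this kind attains the logarithmic minimax regret rate.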