文章基本信息
- 标题:Noisy, Greedy and Not so Greedy k-Means++
- 本地全文:下载
- 作者:Anup Bhattacharya ; Jan Eube ; Heiko R{"o}glin 等
- 期刊名称:LIPIcs : Leibniz International Proceedings in Informatics
- 电子版ISSN:1868-8969
- 出版年度:2020
- 卷号:173
- 页码:18:1-18:21
- DOI:10.4230/LIPIcs.ESA.2020.18
- 出版社:Schloss Dagstuhl -- Leibniz-Zentrum fuer Informatik
- 摘要:The k-means++ algorithm due to Arthur and Vassilvitskii [David Arthur and Sergei Vassilvitskii, 2007] has become the most popular seeding method for Lloydâs algorithm. It samples the first center uniformly at random from the data set and the other k-1 centers iteratively according to D²-sampling, i.e., the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of ð'ª(log k) in expectation. Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation called greedy k-means++ algorithm in which in each iteration multiple possible centers are sampled according to D²-sampling and only the one that decreases the objective the most is chosen as a center for that iteration. It is stated as an open question whether this also leads to an ð'ª(log k)-approximation (or even better). We show that this is not the case by presenting a family of instances on which greedy k-means++ yields only an Ω(ð"â<.log k)-approximation in expectation where ð" is the number of possible centers that are sampled in each iteration. Inspired by the negative results, we study a variation of greedy k-means++ which we call noisy k-means++ algorithm. In this variation only one center is sampled in every iteration but not exactly by D²-sampling. Instead in each iteration an adversary is allowed to change the probabilities arising from D²-sampling individually for each point by a factor between 1-εâ, and 1+εâ,, for parameters εâ, â^^ [0,1) and εâ,, ⥠0. We prove that noisy k-means++ computes an ð'ª(log² k)-approximation in expectation. We use the analysis of noisy k-means++ to design a moderately greedy k-means++ algorithm.
- 关键词:k-means++; greedy; adaptive sampling