文章基本信息

标题：Noisy, Greedy and Not so Greedy k-Means++
本地全文：下载
作者：Anup Bhattacharya ; Jan Eube ; Heiko R{"o}glin 等
期刊名称：LIPIcs : Leibniz International Proceedings in Informatics
电子版ISSN：1868-8969
出版年度：2020
卷号：173
页码：18:1-18:21
DOI：10.4230/LIPIcs.ESA.2020.18
出版社：Schloss Dagstuhl -- Leibniz-Zentrum fuer Informatik
摘要：The k-means++ algorithm due to Arthur and Vassilvitskii [David Arthur and Sergei Vassilvitskii, 2007] has become the most popular seeding method for Lloydâs algorithm. It samples the first center uniformly at random from the data set and the other k-1 centers iteratively according to DÂ²-sampling, i.e., the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of ð'ª(log k) in expectation. Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation called greedy k-means++ algorithm in which in each iteration multiple possible centers are sampled according to DÂ²-sampling and only the one that decreases the objective the most is chosen as a center for that iteration. It is stated as an open question whether this also leads to an ð'ª(log k)-approximation (or even better). We show that this is not the case by presenting a family of instances on which greedy k-means++ yields only an Î©(ð"â<.log k)-approximation in expectation where ð" is the number of possible centers that are sampled in each iteration. Inspired by the negative results, we study a variation of greedy k-means++ which we call noisy k-means++ algorithm. In this variation only one center is sampled in every iteration but not exactly by DÂ²-sampling. Instead in each iteration an adversary is allowed to change the probabilities arising from DÂ²-sampling individually for each point by a factor between 1-Îµâ, and 1+Îµâ,, for parameters Îµâ, â^^ [0,1) and Îµâ,, â¥ 0. We prove that noisy k-means++ computes an ð'ª(logÂ² k)-approximation in expectation. We use the analysis of noisy k-means++ to design a moderately greedy k-means++ algorithm.
关键词：k-means++; greedy; adaptive sampling