Journal: International Journal of Innovative Research in Science, Engineering and Technology
Print ISSN: 2347-6710
Electronic ISSN: 2319-8753
Publication year: 2017
Volume: 6
Issue: 6
Pages: 12259
DOI: 10.15680/IJIRSET.2017.0606296
Publisher: S&S Publications
Abstract: As more and more applications deliver streaming data, clustering data streams has become a crucial method for data and knowledge engineering. A common approach is to summarize the data stream in real time with an online process into a large number of so-called micro-clusters. Micro-clusters represent local density estimates by aggregating the information of many data points in a defined area. On request, a (modified) traditional clustering algorithm is used in a second, offline step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-clusters are used as pseudo points, with the density estimates used as their weights. However, information about density in the area between micro-clusters is not preserved in the online process, and reclustering is based on possibly inaccurate assumptions about the distribution of data within and between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-cluster-based online clustering component that explicitly captures the density between micro-clusters via a shared density graph. The density information in this graph is then exploited for reclustering based on the actual density between modified micro-clusters. We discuss the space and time complexity of maintaining the shared density graph. Experiments on a wide range of artificial and real data sets highlight that using shared density improves clustering quality over other popular data stream clustering methods, which require the creation of a larger number of smaller micro-clusters to achieve comparable results.
Keywords: Data mining; data stream clustering; density-based clustering; ensemble clustering.
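
The two-step idea in the abstract (an online pass that maintains micro-clusters plus a shared density graph, followed by offline reclustering over that graph) can be illustrated with a minimal sketch. This is not the DBSTREAM implementation: the class name `SharedDensitySketch`, the fixed `radius`, the `min_shared` threshold, and the omission of fading and center movement are all assumptions made for illustration only.

```python
from itertools import combinations
from math import dist


class SharedDensitySketch:
    """Toy micro-cluster summary with a shared density graph (no fading)."""

    def __init__(self, radius):
        self.radius = radius      # hypothetical fixed micro-cluster radius
        self.centers = []         # micro-cluster centers (kept fixed here)
        self.weights = []         # points absorbed by each micro-cluster
        self.shared = {}          # (i, j) with i < j -> points seen in both

    def insert(self, point):
        # Online step: find every micro-cluster whose radius covers the point.
        hits = [i for i, c in enumerate(self.centers)
                if dist(c, point) <= self.radius]
        if not hits:
            # No micro-cluster covers the point, so it starts a new one.
            self.centers.append(tuple(point))
            self.weights.append(1)
            return
        for i in hits:
            self.weights[i] += 1
        # A point covered by two micro-clusters is evidence of density
        # in the region between them.
        for i, j in combinations(hits, 2):
            self.shared[(i, j)] = self.shared.get((i, j), 0) + 1

    def recluster(self, min_shared):
        # Offline step: join micro-clusters whose shared count reaches
        # min_shared, then return the connected components (union-find).
        parent = list(range(len(self.centers)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for (i, j), count in self.shared.items():
            if count >= min_shared:
                parent[find(i)] = find(j)
        groups = {}
        for i in range(len(self.centers)):
            groups.setdefault(find(i), []).append(i)
        return sorted(groups.values())
```

In this sketch, two micro-clusters are merged only when enough points have actually fallen in their overlap, so the reclustering relies on observed density between micro-clusters rather than on an assumed uniform or Gaussian distribution.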