Abstract: We consider the problem of finding an optimal partition, with the most appropriate number of clusters, of an incomplete data set that may contain several outliers. Special attention is given to the Least Squares distance-like function. We propose a procedure for preparing the incomplete data set and a procedure for eliminating outliers, so that the clustering process yields acceptable solutions, and we justify both procedures with proofs. An incremental algorithm that searches for optimal partitions with 2, 3, ... clusters is then applied to the prepared data set, and the most appropriate number of clusters is determined using the Davies-Bouldin and Calinski-Harabasz indexes. The whole procedure is organized as an algorithm given in the paper. To illustrate its applicability, the procedure is applied to a real data set on public buildings and their energy efficiency, yielding clear clusters that can be used in further modeling.
Keywords: clustering; incomplete data; missing data; optimal partition; energy efficiency of public buildings
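
The step of choosing the number of clusters by the Davies-Bouldin and Calinski-Harabasz indexes can be illustrated with a minimal sketch. It is not the incremental algorithm of the paper: as a stand-in it assumes plain Lloyd-style k-means with a farthest-point initialization, the common textbook definitions of both indexes (the paper's exact variants may differ), and a small synthetic, complete data set with no outliers. All function and variable names are hypothetical.

```python
import math

def sqdist(a, b):
    """Squared Euclidean (least squares) distance-like function."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def kmeans(data, k, iters=100):
    """Stand-in clustering step (assumption): Lloyd iterations started
    from a deterministic farthest-point initialization."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda p: min(sqdist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sqdist(p, centers[j])) for p in data]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(data, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = centroid(members)
    return labels, centers

def calinski_harabasz(data, labels, centers):
    """Ratio of between-cluster to within-cluster dispersion; larger is better."""
    n, k = len(data), len(centers)
    overall = centroid(data)
    within = sum(sqdist(p, centers[l]) for p, l in zip(data, labels))
    between = sum(labels.count(j) * sqdist(centers[j], overall) for j in range(k))
    return (between / (k - 1)) / (within / (n - k))

def davies_bouldin(data, labels, centers):
    """Mean worst-case ratio of cluster scatter to center separation; smaller is better."""
    k = len(centers)
    scatter = []
    for j in range(k):
        members = [p for p, l in zip(data, labels) if l == j]
        scatter.append(sum(math.sqrt(sqdist(p, centers[j])) for p in members) / len(members))
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / math.sqrt(sqdist(centers[i], centers[j]))
                     for j in range(k) if j != i)
    return total / k

# Synthetic data (assumption): three well-separated groups of five points each.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
data = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (5, 9)] for dx, dy in offsets]

scores = {}
for k in range(2, 6):
    labels, centers = kmeans(data, k)
    scores[k] = (calinski_harabasz(data, labels, centers),
                 davies_bouldin(data, labels, centers))

best_k_ch = max(scores, key=lambda k: scores[k][0])  # maximize Calinski-Harabasz
best_k_db = min(scores, key=lambda k: scores[k][1])  # minimize Davies-Bouldin
print(best_k_ch, best_k_db)
```

On this toy data both indexes select three clusters. On real data the two indexes can disagree, and the preparation of incomplete records and the elimination of outliers, which this sketch omits, would have to come first.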