Abstract: Rising concentrations of air pollutants are a global concern because they are a major underlying cause of other serious problems, including premature deaths, global warming, and increased susceptibility to heart disease, lung disorders and skin disorders. Exposure to particulate pollutants also increases vulnerability to Covid-19 and the risk of succumbing to the virus. Air pollution analysis is widely undertaken by government officials and research scholars, and K-means is frequently used to understand the condition of the atmosphere from massive volumes of sensor-generated data. The algorithm has drawbacks, however. Random selection of the initial centroids can lead to poor clustering. The alternative, K-means++, avoids this problem but takes more execution time and iterations, which is not ideal. We propose an advanced K-means++ initialization algorithm that incorporates an oversampling factor to select the initial centroids more intelligently using probability theory and weight assignment. We also propose a probability-based convergence algorithm which, unlike the regular convergence step, selects only a portion of the data points to recompute the centroids, so that the final clusters form faster. Real-time air pollution data for Bengaluru, India, is scraped, pre-processed and clustered using the proposed technique. All the K-means variants under study are compared in terms of execution time, number of iterations and performance metrics. The work is also extended to handle future air data points using a fast ensemble model. The proposed solution is reliable and fast and yields better clustering, which in turn supports better air quality analysis and prediction and helps in taking apt precautions to mitigate and regulate air pollution.
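The abstract does not spell out the proposed initialization, so the following Python sketch is only an illustration of the general idea of seeding with an oversampling factor and candidate weights. It assumes a scheme similar to the published k-means|| (scalable K-means++) approach: several sampling rounds collect more candidates than needed, each candidate is weighted by the number of points closest to it, and a weighted K-means++ pass reduces the candidates to k centroids. The function names and the parameters `oversample` and `rounds` are hypothetical and may differ from the authors' method.

```python
import numpy as np


def sq_dist_to_nearest(X, centers):
    """Return (squared distance to the nearest center, index of that center)."""
    # Pairwise squared Euclidean distances, shape (n_points, n_centers).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1), d2.argmin(axis=1)


def weighted_kmeanspp(C, w, k, rng):
    """K-means++ seeding over candidates C, where candidate i has weight w[i]."""
    chosen = [rng.choice(len(C), p=w / w.sum())]
    while len(chosen) < k:
        d2, _ = sq_dist_to_nearest(C, C[chosen])
        score = w * d2
        if score.sum() > 0:
            p = score / score.sum()
        else:
            # Degenerate case: fall back to a uniform pick among unchosen candidates.
            p = np.ones(len(C))
            p[chosen] = 0.0
            p /= p.sum()
        chosen.append(rng.choice(len(C), p=p))
    return C[chosen]


def oversampled_init(X, k, oversample=2.0, rounds=5, seed=0):
    """Assumed k-means||-style seeding; not necessarily the paper's algorithm."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Start from a single point chosen uniformly at random.
    candidates = X[rng.integers(n)][None, :]
    for _ in range(rounds):
        d2, _ = sq_dist_to_nearest(X, candidates)
        total = d2.sum()
        if total == 0:
            break
        # Each point is kept independently with probability proportional to its
        # squared distance, scaled so about oversample * k points are kept per round.
        probs = np.minimum(1.0, oversample * k * d2 / total)
        picked = X[rng.random(n) < probs]
        if len(picked):
            candidates = np.vstack([candidates, picked])
    if len(candidates) < k:
        # Too few candidates were sampled: pad with random data points.
        extra = X[rng.choice(n, size=k - len(candidates), replace=False)]
        candidates = np.vstack([candidates, extra])
    # Weight each candidate by how many data points are closest to it.
    _, nearest = sq_dist_to_nearest(X, candidates)
    weights = np.bincount(nearest, minlength=len(candidates)).astype(float)
    return weighted_kmeanspp(candidates, weights, k, rng)
```

The returned array can be passed as the starting centroids of any standard K-means loop. The pairwise distance matrix is built in memory for brevity, so a sketch like this would need chunking before being applied to very large sensor data sets.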