Journal: International Journal of Computer Science and Network Security
Print ISSN: 1738-7906
Publication year: 2020
Volume: 20
Issue: 1
Pages: 106-116
Publisher: International Journal of Computer Science and Network Security
Abstract: Big Data Stream (BDS) analysis has a pivotal role in the current computing revolution. A BDS behaves dynamically and evolves continuously, so its data distribution may change arbitrarily over time. This change in data distribution over time is known as Concept Drift (CD). CD makes classical Machine Learning (ML) approaches ineffective, so ML approaches need to adapt to such change to maintain their accuracy. CD detection and mitigation are two critical issues. CD detection is a prerequisite for mitigation; it aims to characterize and quantify CD by identifying change points in the Big Data input stream. Current CD detection techniques are based on Statistical Analysis and Data Distribution Analysis. However, these approaches do not provide a satisfactory way to differentiate between concept drift and noise. Furthermore, existing CD detection techniques still need to optimize detection time and minimize the error rate. Therefore, this research aims to propose a computationally efficient and accurate concept drift detection approach. The proposed approach is divided into two modules: Unsupervised and Supervised. In the Unsupervised module, the training data is clustered using K-Means clustering, and the resulting Centroids are compared with the input data using the Cosine Distance. In the Supervised module, classification is performed using an ANN model. Together, the outputs of the Unsupervised and Supervised modules are intended to make the proposed model more accurate. In this paper, we present initial experiments that determine the Clusters and their Centroid points and measure the similarity between the Centroids and input data samples using the Cosine Distance formula. Finally, we report experiments that identify the best-performing classifier for the classification module.
In future work, we will validate the proposed solution using Synthetic and Real Concept Drifted Big Data Streams.
Keywords: Big Data Analysis; Non-Stationary Environment; Drift Detection.
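
The Unsupervised module described in the abstract (cluster the training data with K-Means, then compare incoming samples with the Centroids using the Cosine Distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, the plain Lloyd-style K-Means, and in particular the `threshold` used to flag a sample are all assumptions made for illustration, since the abstract does not state how the distance is turned into a drift decision.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cosine similarity (0 means the vectors point the same way)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumed convention for zero vectors
    return 1.0 - dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd-style K-Means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                updated.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids


def looks_drifted(sample, centroids, threshold=0.1):
    """Flag a sample whose cosine distance to every centroid exceeds the
    threshold. The threshold value is a hypothetical choice, not from the paper."""
    return min(cosine_distance(sample, c) for c in centroids) > threshold


# Toy training data (invented for this sketch): two groups of directions.
train = [(1.0, 0.1), (2.0, 0.2), (1.5, 0.1), (0.1, 1.0), (0.2, 2.0), (0.1, 1.5)]
centroids = kmeans(train, k=2)

print(looks_drifted((3.0, 0.3), centroids))    # sample aligned with a training group
print(looks_drifted((-1.0, -1.0), centroids))  # sample pointing away from all groups
```

Because the Cosine Distance depends only on direction, a sample that is a scaled copy of a training point stays close to a Centroid, whereas a sample pointing in a new direction does not. Separating such a change from noise, which the abstract names as an open problem, would need more than a single fixed threshold.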