Journal: International Journal of Computer Science and Network Security
Print ISSN: 1738-7906
Publication year: 2017
Volume: 17
Issue: 1
Pages: 18-24
Publisher: International Journal of Computer Science and Network Security
Abstract: Many application fields produce massive numbers of datasets. Each dataset consists of a number of variables (features). One of these variables is treated as the dependent variable (target variable) and is used for prediction in supervised learning tasks in data mining. Data mining is necessary for building automatic analyses that extract knowledge from datasets. The extracted knowledge is useful for recommendation systems and decision making. The data type and characteristics of the dependent variable play an important role in selecting a specific data mining task. One of the most challenging issues in data mining research is selecting the most appropriate technique for a particular dataset. This paper proposes a supervised learning approach that uses k-means clustering to convert a regression task into a classification task. The proposed approach is a flexible data mining approach that employs a variety of techniques. Flexibility here means that a dataset whose dependent variable is numeric is not restricted to a regression task; the same dataset can also be used in a classification task by categorizing the dependent variable into class labels. The experimental results validate the proposed approach on two datasets. The first is the CPU dataset from the UCI repository, and the second is a road traffic dataset from a real-world domain. The results show the effectiveness of the proposed approach, which integrates several techniques, namely MLP, REPTree, and CART, that are widely used for both classification and regression tasks. The results also demonstrate that clustering the dependent variable from numeric values into class labels can produce high accuracy on the datasets used.
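The conversion step described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function name `kmeans_1d`, the choice of k = 3, the deterministic initialisation, the sample target values, and the class names are all assumptions, since the abstract does not give these details. The sketch runs plain Lloyd-style k-means on a numeric dependent variable and returns one cluster index per record, which can then serve as a class label.

```python
from typing import List, Tuple


def kmeans_1d(values: List[float], k: int, max_iter: int = 100) -> Tuple[List[int], List[float]]:
    """Cluster 1-D numeric values with Lloyd's algorithm.

    Returns (labels, centroids); labels[i] is the cluster index of values[i],
    and centroids are in ascending order.
    """
    distinct = sorted(set(values))
    n = len(distinct)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of distinct values")

    # Assumed initialisation: spread the starting centroids evenly over the
    # sorted distinct values so the result is deterministic.
    if k == 1:
        centroids = [float(distinct[0])]
    else:
        centroids = [float(distinct[(i * (n - 1)) // (k - 1)]) for i in range(k)]

    def assign(cents: List[float]) -> List[int]:
        # Assignment step: each value goes to its nearest centroid.
        return [min(range(k), key=lambda c: abs(v - cents[c])) for v in values]

    for _ in range(max_iter):
        labels = assign(centroids)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return assign(centroids), centroids


if __name__ == "__main__":
    # Hypothetical numeric target values (not taken from the paper's datasets).
    target = [1.0, 2.0, 3.0, 50.0, 52.0, 55.0, 200.0, 210.0]
    labels, centroids = kmeans_1d(target, k=3)
    names = ["low", "medium", "high"]  # assumed class names
    print([names[c] for c in labels])
    print(centroids)
```

Once every record carries a class label in place of its numeric target, the same feature columns can be passed to any classifier. Which value of k suits a given dataset is left open here, since the abstract does not state how it was chosen.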