Journal name: International Journal of Advanced Computer Research
Print ISSN: 2249-7277
Online ISSN: 2277-7970
Publication year: 2020
Volume: 10
Issue: 49
Pages: 138-152
DOI: 10.19101/IJACR.2020.1048037
Publisher: Association of Computer Communication Education for National Triumph (ACCENT)
Abstract: For the last three decades, the World Wide Web (WWW) has become one of the most widely used platforms, generating an immense amount of heterogeneous data every day. Presently, many organizations aim to process their domain data to make quick decisions and improve organizational performance. However, high dimensionality in datasets is one of the biggest obstacles preventing researchers and domain engineers from achieving the desired performance with their chosen machine learning (ML) algorithms. In ML, feature selection is a core technique for selecting the most relevant features of high-dimensional data and thus improving the performance of the trained learning model. Moreover, the feature selection process also eliminates inappropriate and redundant features and ultimately shrinks computational time. Owing to its significance and applications, feature selection has become a well-researched area of ML. Nowadays, feature selection plays a vital role in most effective spam detection systems, pattern recognition systems, automated organization and management of documents, and information retrieval systems. Since relevant feature selection is the most important task for accurate classification, this study starts with an overview of text classification, followed by a survey of the popular feature selection methods commonly used for text classification; the survey also sheds light on the applications of these methods. The focus of this study is three feature selection algorithms: Principal Component Analysis (PCA), Chi-Square (CS), and Information Gain (IG). This study is helpful for researchers looking for a suitable criterion to decide which technique gives a better understanding of classifier performance. The experiments were conducted on the web spam uk2007 dataset.
Subsets of ten, twenty, thirty, and forty features were selected as optimal subsets from the web spam uk2007 dataset. Among the three feature selection algorithms, CS and IG achieved the highest F1-score (F-measure = 0.911) but at the same time suffered from longer model building times.
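The selection procedure described above can be sketched with scikit-learn: score every feature with Chi-Square and with a mutual-information estimator (used here as an Information Gain analogue), then keep the top k features for k = 10, 20, 30, 40. This is only an illustrative sketch under stated assumptions: the synthetic array below stands in for the web spam uk2007 dataset, which is not reproduced here, and the exact scoring variants used in the study may differ.

```python
# Sketch: fixed-size feature subsets via Chi-Square and a mutual-information
# score (an Information Gain analogue). Synthetic data stands in for the
# web spam uk2007 dataset used in the actual experiments.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((200, 96))           # 200 samples, 96 candidate features (chi2 needs X >= 0)
y = rng.integers(0, 2, size=200)    # binary spam / non-spam labels

for k in (10, 20, 30, 40):          # subset sizes evaluated in the study
    cs = SelectKBest(chi2, k=k).fit(X, y)
    ig = SelectKBest(mutual_info_classif, k=k).fit(X, y)
    # get_support() is a boolean mask over the 96 columns; exactly k are kept
    print(k, cs.get_support().sum(), ig.get_support().sum())
```

Each selector's `transform(X)` then yields the reduced matrix on which a classifier can be trained and its F-measure compared across subset sizes.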