Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Year of publication: 2021
Volume: 99
Issue: 18
Language: English
Publisher: Journal of Theoretical and Applied
Abstract: Text classification is the task of assigning known class labels to unlabelled text documents. A classifier assigns a single label or multiple labels to a document based on its content. Text classification is used in applications such as sentiment analysis, authorship analysis, fake news detection and spam email classification. In the classification process, the words in the documents are treated as features, and the words with the greatest discriminating power are used to represent a document. Identifying such words, or features, is a primary step in the classification process. The high dimensionality of the data representation is a primary issue in text classification: a huge number of features not only decreases classification performance but also increases computational time. In this work, a new feature selection technique based on Category specific Feature Distribution without Redundancy Information (CFDRI) is proposed to identify the most informative features and eliminate redundant ones. The effectiveness of the proposed technique is compared with existing techniques such as mutual information, information gain, chi-square and relative discriminative criterion. The traditional Bag of Words model is used to represent the documents as vectors, and the term frequency and inverse document frequency (TF-IDF) measure is used to compute the term weights in each document vector. Various machine learning algorithms such as Decision Tree, Support Vector Machine, Naïve Bayes, k-Nearest Neighbour, Logistic Regression and Random Forest are used to generate the learned models. Six popular text classification datasets are used in the experiments to train the learning algorithms. The proposed feature selection technique obtained the best accuracies when compared with popular existing solutions for text classification.
Keywords: Text Classification;Feature Selection Techniques;Bag of Words Model;Machine Learning Algorithms;Accura
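
The pipeline outlined in the abstract (score terms, keep the top-ranked ones, then weight them with TF-IDF in a Bag of Words vector) can be illustrated with a minimal sketch. It uses chi-square, one of the baseline criteria the abstract names, and not CFDRI, whose formula is not given in this record. The toy corpus, the function names, the whitespace tokenizer and the unsmoothed IDF form log(N/df) are all assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def chi_square_scores(docs, labels):
    """Score each term by its maximum chi-square statistic over all classes."""
    n = len(docs)
    doc_terms = [set(tokenize(d)) for d in docs]
    vocab = set().union(*doc_terms)
    scores = {}
    for term in vocab:
        best = 0.0
        for c in set(labels):
            pairs = list(zip(doc_terms, labels))
            n11 = sum(1 for t, y in pairs if term in t and y == c)      # term present, class c
            n10 = sum(1 for t, y in pairs if term in t and y != c)      # term present, other class
            n01 = sum(1 for t, y in pairs if term not in t and y == c)  # term absent, class c
            n00 = n - n11 - n10 - n01                                   # term absent, other class
            denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
            if denom:
                best = max(best, n * (n11 * n00 - n10 * n01) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    # Ties are broken alphabetically so the selection is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def tfidf_vectors(docs, features):
    """Bag of Words vectors restricted to the selected features, weighted by tf * log(N/df)."""
    n = len(docs)
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / df[t]) for t in features if df[t]}
    return [[Counter(toks)[t] * idf.get(t, 0.0) for t in features] for toks in tokenized]


# Hypothetical toy corpus, not one of the six datasets used in the paper.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "monday project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

scores = chi_square_scores(docs, labels)
selected = select_top_k(scores, 5)
vectors = tfidf_vectors(docs, selected)
print(selected)
```

The resulting vectors could then be passed to any of the classifiers listed in the abstract. Reducing the vocabulary before vectorization is what keeps the dimensionality, and therefore the training time, down.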