Abstract: Textual data on the internet is growing exponentially through blogs, Twitter, and other social media sites. Users typically do not specify the type of text they upload, so researchers are looking for automated tools that classify this data by assigning class labels to unlabeled documents. Text classification addresses this problem, and researchers have proposed several solutions for it. A text classification approach generally consists of collecting training data, preprocessing the text, extracting features, reducing the feature set, representing the documents, and finally applying classification algorithms to build a model that predicts the class label of a new document. Among these phases, document representation is an important step for improving classification accuracy. In this work, a new document representation approach is proposed. Experiments were conducted on the 20-Newsgroup and Reuters-21578 datasets with different types of classification algorithms. Our approach attained the best accuracy in these experiments, and the results are more promising than those of most popular text classification approaches.
Other keywords: Accuracy, bag-of-words model, document representation, document weight measure, term weight measure, text classification.
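
The pipeline phases named in the abstract (preprocessing, feature extraction and reduction, document representation, classification) can be illustrated with a minimal sketch. This is not the representation proposed in this work: it assumes a standard bag-of-words model with smoothed TF-IDF term weights and a nearest-centroid classifier as stand-ins, and the function names and toy training data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Preprocessing: lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequencies(token_lists, min_df=1):
    """Feature extraction and reduction: count the documents containing each
    term and drop terms that occur in fewer than min_df documents."""
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return {term: n for term, n in df.items() if n >= min_df}


def tfidf_vector(tokens, doc_freq, n_docs):
    """Document representation: sparse, L2-normalised TF-IDF vector
    (assumed stand-in, not the paper's weighting scheme)."""
    tf = Counter(t for t in tokens if t in doc_freq)
    vec = {
        t: count * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
        for t, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def train_centroids(vectors, labels):
    """Model building: sum the document vectors of each class."""
    centroids = {}
    for vec, label in zip(vectors, labels):
        centroid = centroids.setdefault(label, Counter())
        for t, w in vec.items():
            centroid[t] += w
    return centroids


def predict(vec, centroids):
    """Prediction: pick the class whose centroid is most similar to vec
    (dot product divided by the centroid's length)."""
    def score(label):
        centroid = centroids[label]
        norm = math.sqrt(sum(w * w for w in centroid.values())) or 1.0
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items()) / norm
    return max(centroids, key=score)


# Hypothetical toy training set; the paper uses 20-Newsgroup and Reuters-21578.
TRAIN = [
    ("the team won the football match", "sport"),
    ("the player scored a goal in the match", "sport"),
    ("the stock market fell as shares dropped", "finance"),
    ("the bank raised interest rates on shares", "finance"),
]

train_tokens = [tokenize(text) for text, _ in TRAIN]
doc_freq = document_frequencies(train_tokens)
train_vectors = [tfidf_vector(toks, doc_freq, len(TRAIN)) for toks in train_tokens]
centroids = train_centroids(train_vectors, [label for _, label in TRAIN])


def classify(text):
    """Run a new document through the same preprocessing and representation."""
    return predict(tfidf_vector(tokenize(text), doc_freq, len(TRAIN)), centroids)
```

Calling `classify` on a new string runs it through the same preprocessing and representation steps as the training documents. Replacing `tfidf_vector` with a different weighting function is the point at which an alternative document representation would plug into such a pipeline.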