Article Information

Title: REPRESENTING TEXT DOCUMENTS IN TRAINING DOCUMENT SPACES: A NOVEL MODEL FOR DOCUMENT REPRESENTATION
Authors: ASMAA MOUNTASSIR ; HOUDA BENBRAHIM ; ILHAM BERRADA et al.
Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2013
Volume: 56
Issue: 1
Publisher: Journal of Theoretical and Applied
Abstract: In this paper, we propose a novel model for Document Representation in an attempt to address the problems of high dimensionality and vector sparseness that are commonly faced in Text Classification tasks. The proposed model first represents text documents in the space of training documents. The generated vectors are then projected into a new space whose number of dimensions equals the number of categories. To evaluate the effectiveness of our model, we focus on a binary classification problem. We conduct our experiments on Arabic and English Opinion Mining data sets. As classifiers, we use Support Vector Machines (SVM) and k-Nearest Neighbors (k-NN), which are known for their effectiveness in classical Text Classification tasks. We compare the performance of our model with that of the classical Vector Space Model (VSM) on three evaluation criteria: the dimensionality of the generated vectors, the time (for learning and testing) taken by the classifiers, and the classification accuracy. Our experiments show that the effectiveness of our model, relative to the classical VSM, depends on the classifier used. Results yielded by k-NN with our model are better than or equal to those obtained with the classical VSM. For SVM, results obtained with our model are, in general, slightly lower than those obtained with VSM. However, the gains in time and dimensionality are promising, since both are dramatically reduced by applying our model.
Keywords: Document Representation; Text Classification; Opinion Mining; Machine Learning; Natural Language Processing
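
Illustrative sketch: The abstract describes two stages (representing a document in the space of training documents, then projecting into a space with one dimension per category) but gives no formulas. The Python sketch below is therefore only one plausible reading, not the authors' method. It assumes that each training-space coordinate is the cosine similarity between term-count vectors, and that each category coordinate is the mean similarity to that category's training documents. The function names, the toy sentences, and the labels `pos` and `neg` are all hypothetical.

```python
import math
from collections import Counter


def term_counts(text):
    """Bag-of-words term frequencies (lowercased whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def to_training_space(doc, train_docs):
    """Stage 1 (assumed): one coordinate per training document."""
    vec = term_counts(doc)
    return [cosine(vec, term_counts(d)) for d in train_docs]


def to_category_space(train_space_vec, train_labels, categories):
    """Stage 2 (assumed): one coordinate per category, the mean similarity
    to the training documents carrying that label."""
    projected = []
    for c in categories:
        sims = [s for s, y in zip(train_space_vec, train_labels) if y == c]
        projected.append(sum(sims) / len(sims) if sims else 0.0)
    return projected


# Hypothetical toy data for a binary opinion-mining task.
train_docs = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "boring plot awful acting",
]
train_labels = ["pos", "pos", "neg", "neg"]
categories = ["pos", "neg"]

doc = "loved the great acting"
stage1 = to_training_space(doc, train_docs)                   # length = number of training documents
stage2 = to_category_space(stage1, train_labels, categories)  # length = number of categories
print(len(stage1), len(stage2))
print(stage2)
```

Under these assumptions, the final vector has as many dimensions as there are categories (two in the binary case), regardless of vocabulary size. This matches the dimensionality reduction the abstract reports. The resulting vectors could then be passed to a classifier such as k-NN or SVM.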