Journal: Journal of Emerging Technologies in Web Intelligence
Print ISSN: 1798-0461
Year: 2012
Volume: 4
Issue: 3
Pages: 259-263
DOI: 10.4304/jetwi.4.3.259-263
Language: English
Publisher: Academy Publisher
Abstract: Text classification is the task of assigning free-text documents to predefined groups. Many algorithms have been proposed; in particular, dimensionality reduction (DR), an important data pre-processing step, has been studied extensively. DR reduces the feature representation space, which in turn improves the efficiency of text classification. Two DR methods, Attribute Overlap Minimization (AOM) and Outlier Elimination (OE), are applied to downsize the feature representation space, on the number of attributes and the number of instances respectively, prior to training a decision model for text classification. AOM works by switching the membership of an overlapped attribute (also known as a feature or keyword) to the group in which it has the higher occurrence frequency; dimensionality is lowered because each group is then described only by significant, unique attributes. OE eliminates instances that are described by infrequent attributes. These two DR techniques can be combined with conventional feature selection to further enhance its effectiveness. In this paper, two datasets, one on classifying languages and one on categorizing online news into six emotion groups, are tested with a combination of AOM, OE and a wide range of classification algorithms. Significant improvements in prediction accuracy, tree size and speed are observed.
Keywords: Data stream mining; optimized very fast decision tree; incremental optimization
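
The abstract describes AOM and OE only at a high level, and the paper's exact procedure is not reproduced here. The Python sketch below is one possible reading of the two steps under assumed data layouts: `group_term_counts`, `instances` and the `min_support` threshold are hypothetical names and parameters, not the authors' implementation.

```python
from collections import Counter, defaultdict


def attribute_overlap_minimization(group_term_counts):
    """AOM sketch: an attribute (keyword) that appears in several groups is
    kept only in the group where its occurrence frequency is highest, so each
    group ends up described by unique attributes."""
    # attribute -> {group: occurrence frequency in that group}
    freq_by_attr = defaultdict(dict)
    for group, counts in group_term_counts.items():
        for attr, freq in counts.items():
            freq_by_attr[attr][group] = freq

    reduced = defaultdict(set)
    for attr, group_freqs in freq_by_attr.items():
        best_group = max(group_freqs, key=group_freqs.get)  # highest frequency wins
        reduced[best_group].add(attr)
    return dict(reduced)


def outlier_elimination(instances, min_support=2):
    """OE sketch: drop instances whose attributes are all infrequent in the
    corpus; the support threshold is an assumed parameter."""
    support = Counter(attr for attrs in instances for attr in attrs)
    return [attrs for attrs in instances
            if any(support[a] >= min_support for a in attrs)]


if __name__ == "__main__":
    # Per-group attribute frequencies; "win" overlaps two groups.
    counts = {
        "sports":   {"goal": 12, "match": 8, "win": 5},
        "politics": {"vote": 10, "win": 9, "law": 4},
    }
    print(attribute_overlap_minimization(counts))
    # "win" is kept only in "politics", where it occurs more often.

    docs = [["goal", "match"], ["goal", "win"], ["rare_term"],
            ["vote", "law"], ["vote", "win"]]
    print(outlier_elimination(docs))
    # The ["rare_term"] instance is eliminated, since its only
    # attribute falls below the assumed support threshold.
```

In this reading, AOM shrinks the attribute dimension (columns) while OE shrinks the instance dimension (rows), matching the abstract's statement that the two methods act on the number of attributes and the number of instances respectively.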