Journal title: International Journal of Advanced Research In Computer Science and Software Engineering
Print ISSN: 2277-6451
Electronic ISSN: 2277-128X
Publication year: 2013
Volume: 3
Issue: 1
Publisher: S.S. Mishra
Abstract: Text mining is the process of discovering new, previously unknown information from a usually large collection of unstructured textual resources. Text categorization is the task of assigning predefined categories to natural language text, and it belongs to the preprocessing stage of the text mining process. A feature can be a unit or a weight used to represent a document. Feature selection is the technique of choosing the subset of features that best characterizes a document. Features for text categorization can be words, phrases, or sentences that occur in the training documents. A bag-of-words representation cannot fully capture the information in a document, since the selected features may be redundant or irrelevant. Statistical methods can select better features, namely those that depend on a category. Moreover, the position at which a feature appears plays a vital role in selecting good features. Distributional features, which include the compactness of the appearances and the position of the first appearance, are therefore incorporated into the statistical methods. In this paper, performance is evaluated by incorporating distributional features into statistical methods and is compared with other feature selection techniques, for both words and phrases.
Keywords: Distributional features; Text Categorization; Data Mining; Text Mining; Statistical Methods
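
The abstract names two distributional features, the position of the first appearance and the compactness of the appearances, but this record does not give their formulas. The Python sketch below shows one plausible way to compute them, assuming the document is split into equal-sized parts by token position. The function name `distributional_features`, the `num_parts` parameter, and the two compactness measures (number of parts containing the term, and variance of the part indices) are illustrative assumptions, not the paper's definitions.

```python
from statistics import pvariance


def distributional_features(tokens, term, num_parts=4):
    """Return distributional features of `term` in a tokenized document.

    Assumption: the document is divided into `num_parts` equal-sized parts
    by token position. The paper's exact definitions are not given in the
    abstract, so these measures are illustrative only.
    """
    n = len(tokens)
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    if not positions:
        return None  # the term does not occur in this document

    # Map each token position to the index of the part that contains it.
    parts = [p * num_parts // n for p in positions]

    return {
        # Index of the part in which the term first appears.
        "first_appearance": parts[0],
        # Compactness (assumed measure 1): number of distinct parts
        # containing the term; fewer parts means a more compact term.
        "part_count": len(set(parts)),
        # Compactness (assumed measure 2): variance of the part indices.
        "position_variance": float(pvariance(parts)),
    }


tokens = "mining text data with text mining tools today".split()
features = distributional_features(tokens, "text", num_parts=4)
```

How these values are combined with a statistical feature selection score is not stated in the abstract, so the sketch stops at the feature values themselves.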