
Basic article information

  • Title: Document representations for classification of short web-page descriptions
  • Authors: Radovanović Miloš; Ivanović Mirjana
  • Journal: Yugoslav Journal of Operations Research
  • Print ISSN: 0354-0243
  • Electronic ISSN: 1820-743X
  • Publication year: 2008
  • Volume: 18
  • Issue: 1
  • Pages: 123-138
  • DOI: 10.2298/YJOR0801123R
  • Publisher: Faculty of Organizational Sciences, Belgrade, Mihajlo Pupin Institute, Belgrade, Economics Institute, Belgrade, Faculty of Transport and Traffic Engineering, Belgrade, Faculty of Mechanical Engineering, Belgrade
  • Abstract:

    Motivated by applying Text Categorization to classification of Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers: Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts, representing short Web-page descriptions sorted into a large hierarchy of topics, are taken from the dmoz Open Directory Web-page ontology, and classifiers are trained to automatically determine the topics which may be relevant to a previously unseen Web-page. Different transformations of input data (stemming, normalization, logtf and idf), together with dimensionality reduction, are found to have a statistically significant improving or degrading effect on classification performance measured by classical metrics: accuracy, precision, recall, F1 and F2. The emphasis of the study is not on determining the best document representation for each classifier, but rather on describing the effects of every individual transformation on classification, together with their mutual relationships.

  • Keywords: text categorization; document representation; machine learning
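
The transformations named in the abstract (logtf, idf, normalization) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `bow_vectors`, the toy documents, and the exact formulas (logtf as log(1 + tf), idf as log(N / df), normalization as scaling to unit Euclidean length) are assumptions chosen as common textbook definitions. Stemming and dimensionality reduction are left out to keep the example free of external libraries.

```python
import math
from collections import Counter


def bow_vectors(docs, use_logtf=True, use_idf=True, normalize=True):
    """Turn tokenized documents into weighted bag-of-words vectors (dicts).

    Hypothetical helper for illustration; the formulas are assumed,
    not taken from the paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        vec = {}
        for term, count in Counter(doc).items():
            weight = float(count)                    # raw term frequency
            if use_logtf:
                weight = math.log(1.0 + weight)      # assumed logtf: log(1 + tf)
            if use_idf:
                weight *= math.log(n_docs / df[term])  # assumed idf: log(N / df)
            vec[term] = weight
        if normalize:
            # Scale the vector to unit Euclidean length.
            norm = math.sqrt(sum(w * w for w in vec.values()))
            if norm > 0:
                vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors


# Toy stand-ins for short Web-page descriptions (made up for this example).
docs = [
    "open directory of web pages".split(),
    "web search results".split(),
    "directory of topics".split(),
]
vectors = bow_vectors(docs)
raw_vectors = bow_vectors(docs, use_logtf=False, use_idf=False, normalize=False)
```

Each flag switches one transformation on or off independently, which mirrors the abstract's focus on the effect of individual transformations rather than on a single best representation.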