
Basic Article Information

  • Title: A Novel Method for Arabic Multi-Word Term Extraction
  • Authors: Hadni Meryem; Said Alaoui Ouatik; Abdelmonaime Lachkar
  • Journal: International Journal of Database Management Systems
  • Print ISSN: 0975-5985
  • Electronic ISSN: 0975-5705
  • Year of publication: 2014
  • Volume: 6
  • Issue: 3
  • Pages: 53
  • DOI: 10.5121/ijdms.2014.6304
  • Publisher: Academy & Industry Research Collaboration Center (AIRCC)
  • Abstract: Arabic Multiword Terms (AMWTs) are relevant strings of words in text documents. Once automatically extracted, they can be used to improve the performance of Arabic text mining applications such as categorization, clustering, information retrieval, machine translation, and summarization. Existing methods for AMWT extraction fall into three approaches: linguistic-based, statistic-based, and hybrid. These methods have drawbacks that limit their use: they can only deal with bi-gram terms, and they do not yield good accuracy. To overcome these drawbacks, this paper proposes a new and efficient method for AMWT extraction based on a hybrid approach. The method is composed of two main filtering steps: a linguistic filter and a statistical filter. The linguistic filter uses our proposed Part Of Speech (POS) tagger and the sequence identifier as patterns in order to extract candidate AMWTs. The statistical filter incorporates contextual information and a newly proposed association measure, named NTC-Value, based on termhood and unithood estimation. To evaluate and illustrate the efficiency of the proposed method, a comparative study was conducted on the Kalimat Corpus using nine experiment schemes. In the linguistic filter, we used three POS taggers: Taani's rule-based method, an HMM-based statistical method, and our recently proposed hybrid tagger. In the statistical filter, we used three statistical measures: C-Value, NC-Value, and our proposed NTC-Value. The obtained results demonstrate the efficiency of the proposed method for AMWT extraction: it outperforms the other methods and can deal correctly with tri-gram terms.
  • Keywords: Multiword Terms extraction; contextual information; Part Of Speech; Termhood Estimation; Unithood Estimation
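
The abstract names C-Value as one of the baseline statistical measures. As a rough illustration only, the following is a minimal sketch of the C-Value score as it is commonly stated in the term-extraction literature, not the paper's NTC-Value, whose definition is not given in this record. The function name `c_value`, the helper `contains`, and the English placeholder candidates are assumptions made for the example. Candidates are assumed to be multi-word terms that have already passed a linguistic filter.

```python
import math
from collections import defaultdict


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous word sequence in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freq):
    """Score multi-word candidates with the commonly stated C-Value formula.

    `freq` maps each candidate (a tuple of two or more words) to its corpus
    frequency. A candidate nested in longer candidates is penalised by the
    mean frequency of those longer candidates. (Hypothetical helper, not the
    paper's implementation.)
    """
    # Collect, for each candidate, the longer candidates that contain it.
    nested_in = defaultdict(list)
    for a in freq:
        for b in freq:
            if len(b) > len(a) and contains(b, a):
                nested_in[a].append(b)

    scores = {}
    for a, f_a in freq.items():
        weight = math.log2(len(a))  # longer terms receive a larger weight
        containers = nested_in.get(a, [])
        if not containers:
            scores[a] = weight * f_a
        else:
            penalty = sum(freq[b] for b in containers) / len(containers)
            scores[a] = weight * (f_a - penalty)
    return scores


# Placeholder candidates (assumed data, not taken from the Kalimat Corpus).
freq = {
    ("text", "mining"): 5,
    ("arabic", "text", "mining"): 2,
}
scores = c_value(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 3))
```

In this sketch the bi-gram is discounted because two of its five occurrences are explained by the longer tri-gram, which is how the measure lets a well-supported tri-gram rank above a bi-gram nested inside it.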