Journal: International Journal of Database Management Systems
Print ISSN: 0975-5985
Online ISSN: 0975-5705
Publication year: 2014
Volume: 6
Issue: 3
Pages: 53
DOI: 10.5121/ijdms.2014.6304
Publisher: Academy & Industry Research Collaboration Center (AIRCC)
Abstract: Arabic Multiword Terms (AMWTs) are relevant strings of words in text documents. Once automatically extracted, they can be used to improve the performance of Arabic text mining applications such as categorization, clustering, information retrieval, machine translation, and summarization. Existing methods for AMWT extraction fall into three approaches: linguistic-based, statistic-based, and hybrid. These methods have drawbacks that limit their use: they can only deal with bi-gram terms, and their accuracy is not good. To overcome these drawbacks, this paper proposes a new and efficient method for AMWT extraction based on a hybrid approach. The method is composed of two main filtering steps: a linguistic filter and a statistical filter. The linguistic filter uses our proposed Part Of Speech (POS) tagger and the sequence identifier as patterns in order to extract candidate AMWTs. The statistical filter incorporates contextual information and a newly proposed association measure, based on termhood and unithood estimation, named NTC-Value. To evaluate and illustrate the efficiency of the proposed method, a comparative study was conducted on the Kalimat Corpus using nine experiment schemes. In the linguistic filter, we used three POS taggers: Taani's rule-based method, a statistical HMM-based method, and our recently proposed hybrid tagger. In the statistical filter, we used three statistical measures: C-Value, NC-Value, and our proposed NTC-Value. The results demonstrate the efficiency of the proposed method for AMWT extraction: it outperforms the other methods and deals correctly with tri-gram terms.
Keywords: Multiword Terms extraction; contextual information; Part Of Speech; Termhood Estimation; Unithood Estimation
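
Note on the statistical filter: the abstract does not define NTC-Value, so it is not reproduced here. As a rough illustration of the kind of measure it is compared against, the following is a minimal sketch of the standard C-Value baseline, assuming candidates have already passed a linguistic filter and are given as word tuples with corpus frequencies. The function names `contains` and `c_value` and the toy English candidates are hypothetical and are not taken from the paper.

```python
import math


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous word sequence
    strictly inside the longer candidate `longer`."""
    n, m = len(longer), len(shorter)
    return n > m and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq):
    """Compute the standard C-Value for each candidate term.

    `freq` maps a candidate (tuple of words, length >= 2) to its corpus
    frequency. For a candidate nested in longer candidates, the mean
    frequency of those longer candidates is subtracted from its own.
    """
    scores = {}
    for a, f_a in freq.items():
        # Frequencies of longer candidates that contain `a`.
        nesting = [f_b for b, f_b in freq.items() if contains(b, a)]
        if nesting:
            adjusted = f_a - sum(nesting) / len(nesting)
        else:
            adjusted = f_a
        # log2 of the term length favours longer terms (e.g. tri-grams).
        scores[a] = math.log2(len(a)) * adjusted
    return scores


# Toy frequencies, invented for illustration only.
freq = {
    ("information", "retrieval"): 10,
    ("information", "retrieval", "system"): 4,
    ("machine", "translation"): 6,
}
scores = c_value(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

In this toy data the bi-gram "information retrieval" is penalised because it also appears inside the tri-gram "information retrieval system", which is the nesting behaviour that lets C-Value-style measures rank terms longer than bi-grams.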