Publisher: Academy & Industry Research Collaboration Center (AIRCC)
Abstract: Arabic Multiword Terms (AMWTs) are relevant strings of words in text documents. Once they are automatically extracted, they can be used to improve the performance of text mining applications such as Categorisation, Clustering, Information Retrieval, Machine Translation, and Summarization. This paper introduces our proposed Multiword Term extraction system based on contextual information. Specifically, we propose a new method based on a hybrid approach for Arabic Multiword Term extraction. Like other methods based on a hybrid approach, our method is composed of two main steps: a Linguistic approach and a Statistical one. In the first step, the Linguistic approach uses a Part Of Speech (POS) Tagger (Taani's Tagger) and the Sequence Identifier as patterns in order to extract the candidate AMWTs. In the second step, which includes our main contribution, the Statistical approach incorporates contextual information by using a newly proposed association measure based on Termhood and Unithood for AMWT extraction. To evaluate the efficiency of our proposed method for AMWT extraction, it has been tested and compared using three different association measures: the proposed one (NTC-Value), NC-Value, and C-Value. The experimental results, using Arabic texts taken from the environment domain, show that our hybrid method outperforms the other ones in terms of precision; in addition, it can deal correctly with tri-gram Arabic Multiword Terms.
Keywords: Multiword Term extraction; Part Of Speech; Categorisation; Clustering; Information Retrieval; Summarization.
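
For orientation, the C-Value baseline named in the abstract can be sketched as follows. This is a minimal illustration of the commonly cited C-Value formulation (log2 of the term length times the frequency, discounted by the average frequency of longer candidates that contain the term), not the authors' implementation. The abstract does not define NTC-Value, so it is not reproduced here. The function names `contains` and `c_value`, the tuple-of-words representation, and the English toy frequencies are all assumptions made for this example.

```python
import math


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous sub-sequence of the strictly longer tuple `longer`."""
    n, m = len(longer), len(shorter)
    return n > m and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq):
    """Score candidate terms with a simple form of C-Value.

    `freq` maps each candidate (a tuple of words) to its corpus frequency.
    Assumption: candidates were already selected by a linguistic filter.
    Single-word candidates score 0 because log2(1) == 0.
    """
    scores = {}
    for term, f_term in freq.items():
        # Longer candidates in which this term appears nested.
        containers = [other for other in freq if contains(other, term)]
        if containers:
            # Discount by the average frequency of the containing candidates.
            nested_avg = sum(freq[c] for c in containers) / len(containers)
            scores[term] = math.log2(len(term)) * (f_term - nested_avg)
        else:
            scores[term] = math.log2(len(term)) * f_term
    return scores


# Hypothetical toy counts, not data from the paper.
example_freq = {
    ("soil", "pollution"): 5,
    ("urban", "soil", "pollution"): 2,
}
example_scores = c_value(example_freq)
```

In this toy input the bi-gram is nested inside the tri-gram, so its frequency is discounted before weighting, while the tri-gram is scored on its own frequency and benefits from the larger length weight.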