Basic Article Information

Title: Developing an Arabic Plagiarism Detection Corpus
Authors: Muazzam Ahmed Siddiqui; Imtiaz Hussain Khan; Kamal Mansoor Jambi; et al.
Journal: Computer Science & Information Technology
Electronic ISSN: 2231-5403
Publication year: 2014
Volume: 4
Issue: 12
Pages: 261-269
DOI: 10.5121/csit.2014.41221
Publisher: Academy & Industry Research Collaboration Center (AIRCC)
Abstract: A corpus is a collection of documents. It is a valuable resource in linguistics research for performing statistical analysis and testing hypotheses about different linguistic rules. An annotated corpus consists of documents or entities annotated with task-related labels such as part-of-speech tags, sentiment, etc. One such task is plagiarism detection, which seeks to identify whether a given document is plagiarized. This paper describes our efforts to build a plagiarism detection corpus for Arabic. The corpus consists of about 350 plagiarized-source document pairs and more than 250 documents in which no plagiarism was found. The plagiarized documents consist of student-submitted assignments. For each plagiarized document, the source document was located on the Web and downloaded for further investigation. We report corpus statistics, including the number of documents, sentences, and tokens, for each of the plagiarized and source categories.
Keywords: Plagiarism detection; corpus linguistics; Arabic natural language processing; text mining