Article Information

  • Title: Developing an Arabic Plagiarism Detection Corpus
  • Authors: Muazzam Ahmed Siddiqui; Imtiaz Hussain Khan; Kamal Mansoor Jambi
  • Journal: Computer Science & Information Technology
  • Electronic ISSN: 2231-5403
  • Publication year: 2014
  • Volume: 4
  • Issue: 12
  • Pages: 261-269
  • DOI: 10.5121/csit.2014.41221
  • Publisher: Academy & Industry Research Collaboration Center (AIRCC)
  • Abstract: A corpus is a collection of documents. It is a valuable resource in linguistics research, used to perform statistical analysis and test hypotheses about different linguistic rules. An annotated corpus consists of documents or entities annotated with task-related labels such as part-of-speech tags, sentiment, etc. One such task is plagiarism detection, which seeks to identify whether a given document is plagiarized. This paper describes our efforts to build a plagiarism detection corpus for Arabic. The corpus consists of about 350 plagiarized-source document pairs and more than 250 documents in which no plagiarism was found. The plagiarized documents consist of student-submitted assignments. For each plagiarized document, the source document was located on the Web and downloaded for further investigation. We report corpus statistics, including the number of documents, sentences, and tokens for each of the plagiarized and source categories.
  • Keywords: Plagiarism detection; corpus linguistics; Arabic natural language processing; text mining