Publisher: Academy & Industry Research Collaboration Center (AIRCC)
Abstract: A corpus is a collection of documents. It is a valuable resource in linguistics research for performing statistical analysis and testing hypotheses about linguistic rules. An annotated corpus consists of documents or entities annotated with task-related labels such as part-of-speech tags or sentiment. One such task is plagiarism detection, which seeks to determine whether a given document is plagiarized. This paper describes our efforts to build a plagiarism detection corpus for Arabic. The corpus consists of about 350 plagiarized–source document pairs and more than 250 documents in which no plagiarism was found. The plagiarized documents consist of assignments submitted by students. For each plagiarized document, the source document was located on the Web and downloaded for further investigation. We report corpus statistics, including the number of documents, sentences, and tokens for each of the plagiarized and source categories.
Keywords: Plagiarism detection; corpus linguistics; Arabic natural language processing; text mining