Journal: International Journal of Software Engineering and Its Applications
Print ISSN: 1738-9984
Year of publication: 2015
Volume: 9
Issue: 3
Pages: 107-116
DOI: 10.14257/ijseia.2015.9.3.11
Publisher: SERSC
Abstract: Building large relevance datasets is important for training and evaluating Information Retrieval (IR) systems. The process involves collecting documents, queries, and assessors' judgments of how relevant each document is to a query. It is expensive and time-consuming. It is also not a one-off project: it can be repeated for different languages, for corpora of different scope, and with different techniques. This paper presents a software engineering solution for creating relevance corpora that achieves reusability, flexibility, multilingualism, and modularity, in order to respect the experimental nature of the IR field. The solution is presented as UML models. The paper then shows how the proposed design model was used to implement the process of building an open-source Arabic relevance corpus based on the Clue Web 2009 data set, with the aim of supporting research on evaluating and improving search engines for the Arabic language.
Keywords: Software Engineering Models; Information Retrieval; Relevance Corpus; Language Engineering