
Article Information

  • Title: Building a DDC-annotated Corpus from OAI Metadata
  • Authors: Mathias Lösch; Ulli Waltinger; Wolfram Horstmann
  • Journal: Journal of Digital Information
  • Print ISSN: 1368-7506
  • Electronic ISSN: 1368-7506
  • Publication year: 2011
  • Volume: 12
  • Issue: 2
  • Language: English
  • Publisher: Texas A&M University Libraries
  • Abstract: Document servers complying with the standards of the Open Archives Initiative (OAI) are a rich, yet seldom exploited, source of textual primary data for research fields in text mining, natural language processing, or computational linguistics. We present a bilingual (English and German) text corpus consisting of bibliographic OAI records and the associated full texts. A particular added value is that we annotated each record with at least one Dewey Decimal Classification (DDC) number, inducing a subject-based categorization of the corpus. By this means, it can be used as training data for machine learning-based text categorization tasks in digital libraries, but also as a primary data source for linguistic research on academic language use related to specific disciplines. We describe the construction of the corpus using data from the Bielefeld Academic Search Engine (BASE), as well as its characteristics.