Article information

  • Title: How to Do Lexical Quality Estimation of a Large OCRed Historical Finnish Newspaper Collection with Scarce Resources
  • Author: Kimmo Kettunen
  • Journal: Digital Studies
  • Electronic ISSN: 1918-3666
  • Publication year: 2020
  • Volume: 10
  • Issue: 1
  • Pages: 1-27
  • DOI: 10.16995/dscn.315
  • Publisher: Open Library of Humanities
  • Abstract: The National Library of Finland has digitized and made available the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2014; Kettunen et al. 2014). This collection contains approximately 1.95 million pages in Finnish and Swedish. The Finnish part of the collection consists of about 2.40 billion words. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. An open data package of the whole collection was released in early 2017 (Pääkkönen et al. 2016). Quality of OCRed collections is an important topic in digital humanities, as it affects general usability and searchability of collections. There is no single available method to assess quality of large collections, but different methods can be used to approximate quality. This paper discusses different corpus analysis style methods to approximate overall lexical quality of the Finnish part of the Digi collection. Methods include usage of parallel samples and word error rates, usage of morphological analysers, frequency analysis of words and comparisons to comparable edited lexical data. Our aim in the quality analysis is twofold: firstly to analyse the present state of the lexical data and secondly, to establish a set of assessment methods that build up a compact procedure for quality assessment after e.g. re-OCRing or postcorrection of the material.
  • Keywords: OCR quality; lexical quality estimation; 19th century Finnish newspaper collection