Article information

Title: Mining academic publications to automatically identify data sources
Authors: Athanasios Anastasiou; Karen Tingay
Journal: International Journal of Population Data Science
Electronic ISSN: 2399-4908
Publication year: 2018
Volume: 3
Issue: 2
Pages: 1-1
DOI: 10.23889/ijpds.v3i2.532
Publisher: Swansea University
Abstract: Background: Discovering suitable datasets is an important part of health research, particularly for projects working with cohort data, but with the proliferation of national and international initiatives, it is becoming increasingly difficult for research teams to locate the real-world datasets most relevant to their project objectives. Methods: To assist researchers in this, we developed bibInsight, a data analysis platform to identify potentially useful data sources and, more generally, to enable large-scale research over bibliographical datasets. Data source names were identified from a broad, topic-specific literature search. Context-specific terms like “annual”, “longitudinal”, and “prospective” were used to train a classifier that identified potential datasets. Results: The classifier identified 1588 of 1961 abstracts as containing cohort-relevant information: a precision of approximately 80%. Further analysis, such as topic analysis, geographical mapping, and collaboration networks, can refine and prioritise the search results to determine the most relevant data source(s) for a research project. Conclusions: A very large amount of information, including data source description and use, remains unexploited in unstructured bibliographical datasets. Here, we used a thematic search to provide a more manageable starting point towards locating disease-specific datasets.
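
Illustrative sketch (not part of the published record): the abstract does not describe how bibInsight's classifier works internally, so the Python below is only a hypothetical keyword-matching stand-in, not the authors' method. It flags an abstract when it contains one of the three context terms quoted above, and it reproduces the arithmetic behind the reported figure (1588 of 1961). The names `CONTEXT_TERMS`, `mentions_context_term`, and `flagged_share` are assumptions introduced for illustration.

```python
import re

# Context terms quoted in the abstract. Matching them as whole words is an
# assumption made for this sketch, not the published classifier.
CONTEXT_TERMS = ("annual", "longitudinal", "prospective")


def mentions_context_term(abstract: str) -> bool:
    """Return True if the abstract contains any context term as a whole word."""
    words = set(re.findall(r"[a-z]+", abstract.lower()))
    return any(term in words for term in CONTEXT_TERMS)


def flagged_share(flagged: int, total: int) -> float:
    """Fraction of abstracts that were flagged."""
    return flagged / total


# Counts reported in the abstract: 1588 of 1961 abstracts.
share = flagged_share(1588, 1961)
print(f"{share:.1%}")
```

The computed share is slightly above 80%, which is consistent with the "approximately 80%" stated in the abstract.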