Article Information

  • Title: Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics
  • Authors: Riza Batista-Navarro ; Rafal Rak ; Sophia Ananiadou
  • Journal: Journal of Cheminformatics
  • Print ISSN: 1758-2946
  • Electronic ISSN: 1758-2946
  • Year of publication: 2015
  • Volume: 7
  • Issue: 1
  • Pages: S6
  • DOI: 10.1186/1758-2946-7-S1-S6
  • Language: English
  • Publisher: BioMed Central
  • Abstract: The development of robust methods for chemical named entity recognition, a challenging natural language processing task, was previously hindered by the lack of publicly available, large-scale, gold standard corpora. The recent public release of a large chemical entity-annotated corpus as a resource for the CHEMDNER track of the Fourth BioCreative Challenge Evaluation (BioCreative IV) workshop greatly alleviated this problem and allowed us to develop a conditional random fields-based chemical entity recogniser. In order to optimise its performance, we introduced customisations in various aspects of our solution. These include the selection of specialised pre-processing analytics, the incorporation of chemistry knowledge-rich features in the training and application of the statistical model, and the addition of post-processing rules. Our evaluation shows that optimal performance is obtained when our customisations are integrated into the chemical entity recogniser. When its performance is compared with that of state-of-the-art methods, under comparable experimental settings, our solution achieves competitive advantage. We also show that our recogniser that uses a model trained on the CHEMDNER corpus is suitable for recognising names in a wide range of corpora, consistently outperforming two popular chemical NER tools. The contributions resulting from this work are two-fold. Firstly, we present the details of a chemical entity recognition methodology that has demonstrated performance at a competitive, if not superior, level as that of state-of-the-art methods. Secondly, the developed suite of solutions has been made publicly available as a configurable workflow in the interoperable text mining workbench Argo. This allows interested users to conveniently apply and evaluate our solutions in the context of other chemical text mining tasks.
  • Keywords: Chemical named entity recognition ; Text mining ; Sequence labelling ; Conditional random fields ; Feature engineering ; Configurable workflows ; Workflow optimisation