
Basic Article Information

  • Title: Visual-Text Reference Pretraining Model for Image Captioning
  • Authors: Pengfei Li; Min Zhang; Peijie Lin
  • Journal: Computational Intelligence and Neuroscience
  • Print ISSN: 1687-5265
  • Electronic ISSN: 1687-5273
  • Year: 2022
  • Volume: 2022
  • DOI: 10.1155/2022/9400999
  • Language: English
  • Publisher: Hindawi Publishing Corporation
  • Abstract: People can accurately describe an image by repeatedly referring to its visual information and its key textual information. Inspired by this idea, we propose VTR-PTM (Visual-Text Reference Pretraining Model) for image captioning. First, building on a pretraining model (BERT/UniLM), we design a dual-stream input mode of image reference and text reference and use two different mask modes (bidirectional and sequence-to-sequence) to adapt VTR-PTM to generation tasks. Second, the target dataset is used to fine-tune VTR-PTM. To the best of our knowledge, VTR-PTM is the first reported pretraining model to use visual-text references in the learning process. To evaluate the model, we conduct experiments on the image captioning benchmark datasets MS COCO and Visual Genome and achieve significant improvements on most metrics. The code is available at https://github.com/lpfworld/VTR-PTM.
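
The two mask modes named in the abstract follow the UniLM convention: a fully bidirectional mask for encoding and a sequence-to-sequence mask for generation. The authors' actual implementation is in the GitHub repository above; the following is only a minimal illustrative sketch of how such masks can be built over a hypothetical [reference tokens | caption tokens] layout, with all function and parameter names assumed for illustration.

```python
import torch

def build_attention_mask(ref_len: int, gen_len: int, seq2seq: bool) -> torch.Tensor:
    """Illustrative UniLM-style attention mask (not the VTR-PTM source).

    The sequence is assumed to be laid out as
    [reference tokens (image/text references) | caption tokens].

    seq2seq=False -> bidirectional: every token attends to every token.
    seq2seq=True  -> reference tokens attend only among themselves;
                     caption tokens attend to all reference tokens and,
                     causally, to earlier caption tokens.
    Returns a (total, total) mask of 1.0 (attend) / 0.0 (blocked).
    """
    total = ref_len + gen_len
    if not seq2seq:
        # Bidirectional mode: no positions are blocked.
        return torch.ones(total, total)
    mask = torch.zeros(total, total)
    # Reference segment is bidirectional within itself.
    mask[:ref_len, :ref_len] = 1.0
    # Caption tokens see the entire reference segment...
    mask[ref_len:, :ref_len] = 1.0
    # ...and a causal (lower-triangular) view of the caption segment.
    mask[ref_len:, ref_len:] = torch.tril(torch.ones(gen_len, gen_len))
    return mask

# Example: 4 reference tokens followed by 3 caption tokens.
print(build_attention_mask(4, 3, seq2seq=True))
```

In this sketch the seq2seq mask is what makes a bidirectional pretraining model usable for caption generation: the reference stream is encoded with full context while the caption is decoded left to right.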