Abstract: With the rapid growth of deep learning technologies, automatic image description generation has become an interesting problem at the intersection of computer vision and natural language generation. It helps improve access to photo collections on social media and provides guidance for visually impaired people. Deep neural networks currently play a vital role in computer vision and natural language processing tasks. The main objective of this work is to generate a grammatically correct description of an image using the semantics learned from the training captions. An encoder-decoder framework built on deep neural networks implements the image description generation task: the encoder is an image parsing module, and the decoder is a surface realization module.
The framework uses a densely connected convolutional network (DenseNet) for image encoding and a bidirectional long short-term memory (BLSTM) network for language modeling; the encoded image features are fed to the BLSTM in the caption generator, which is trained to optimize the log-likelihood of the target description of the image.
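As a rough illustration of this kind of encoder-decoder, a minimal Keras sketch is given below: precomputed DenseNet-201 image features are merged with a BLSTM encoding of the partial caption to predict the next word. The layer sizes, vocabulary size, and maximum caption length are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch, assuming TensorFlow/Keras; all sizes are placeholders.
from tensorflow.keras import layers, Model

VOCAB_SIZE = 8000   # assumed vocabulary size
MAX_LEN = 34        # assumed maximum caption length
FEAT_DIM = 1920     # DenseNet-201 global-average-pooled feature size

# Image encoder branch: precomputed DenseNet features for each photo.
img_in = layers.Input(shape=(FEAT_DIM,))
img_emb = layers.Dense(512, activation="relu")(img_in)

# Language model branch: the partial caption runs through a BLSTM.
seq_in = layers.Input(shape=(MAX_LEN,))
seq_emb = layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_in)
seq_feat = layers.Bidirectional(layers.LSTM(256))(seq_emb)  # 2 x 256 = 512

# Merge both branches and predict the next word of the caption.
merged = layers.add([img_emb, seq_feat])
hidden = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = Model(inputs=[img_in, seq_in], outputs=out)
# Minimizing cross-entropy on the next word maximizes the log-likelihood
# of the target description, as described above.
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```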
Most existing image captioning works use RNNs or LSTMs for language modeling. RNNs are computationally expensive and have limited memory, and an LSTM processes its input in only one direction; a BLSTM reads the sequence in both directions and thus avoids both limitations. In this work, the best combination of words in caption generation is selected using beam search and a game-theoretic search, and the results show that the game-theoretic search outperforms beam search.
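Beam search keeps only the few highest-scoring partial captions at each decoding step instead of committing greedily to one word. The sketch below illustrates the idea; `predict_next` (returning a next-word distribution for an image and a partial caption), the start/end token ids, and the beam width are hypothetical placeholders, and the game-theoretic search is not reproduced here.

```python
import math

def beam_search(predict_next, image, start_id, end_id, beam_width=3, max_len=34):
    """Return the most probable caption as a list of token ids."""
    # Each hypothesis is (token_ids, cumulative log-probability).
    beams = [([start_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:       # finished caption: carry forward
                candidates.append((seq, score))
                continue
            probs = predict_next(image, seq)
            # Expand with the beam_width most probable next words.
            top = sorted(range(len(probs)), key=lambda w: probs[w])[-beam_width:]
            for w in top:
                candidates.append((seq + [w], score + math.log(probs[w] + 1e-12)))
        # Keep only the beam_width best hypotheses overall.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]
```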
The model was evaluated on the standard benchmark dataset Flickr8k. The Bilingual Evaluation Understudy (BLEU) score is taken as the evaluation measure of the system, and a new evaluation measure called Gcorrect is used to check
the grammatical correctness of the description. The proposed model achieves notable improvements over previous methods on the Flickr8k dataset, producing grammatically correct sentences for images with a Gcorrect of 0.040625 and a BLEU score of 69.96%.
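As a usage note, the BLEU score reported above can be computed with NLTK as sketched below; the reference and candidate captions are made-up placeholders rather than Flickr8k data, and the weights shown select BLEU-1 (unigram precision).

```python
# A sketch of the BLEU evaluation, assuming NLTK is installed.
from nltk.translate.bleu_score import corpus_bleu

# Each test image has several reference captions and one generated caption.
references = [
    [["a", "dog", "runs", "on", "the", "beach"],
     ["a", "brown", "dog", "is", "running", "along", "the", "shore"]],
]
hypotheses = [["a", "dog", "is", "running", "on", "the", "beach"]]

# weights pick the n-gram orders; (1, 0, 0, 0) gives unigram-only BLEU-1.
print("BLEU-1:", corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
```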