The task of automatically generating captions for images has gained prominence over the last few years. Ideally, an image caption should capture the objects in the image and express how they relate to each other and to their surroundings. Traditionally, captions are centered around a key object in the image and describe its relevance to the surroundings. For instance, for the image in Figure 1, the state-of-the-art Show, Attend and Tell [1] model generates the caption ‘a plate of food on a table’. However, a user might be interested in other objects present in the image, and might want captions catering to specific demands, such as dessert, knife, or plate.
Solutions to this goal-oriented captioning task are useful in various applications. For instance, a visually impaired person could start from a general-purpose caption and then inquire about other objects in the image. Moreover, not every object is well represented in a general-purpose caption, so supporting different demands lets users interact with images freely and holistically.
Thus, we propose the task of goal-directed image captioning, which we formulate as follows:
Given an image and a demand word, generate a caption that is centered around the demand word and its interactions with the surroundings.
<img src="/projects/show_demand_and_tell.png" style="width:90%;">
<figcaption>Figure 1: A motivating example. The state-of-the-art Show, Attend and Tell model generates a general-purpose caption, whereas our captioning model, Show, Demand and Tell, caters to various demands, like dessert, knife, plate and pastry.</figcaption>
For this new task, we create a dataset by processing the MS-COCO image captioning dataset. Each image in MS-COCO is associated with at least five captions provided by human annotators. We would like to construct a supervised training set of (image, demand, demand-focused caption) triples to train our model. For now, we restrict the demand word to be an object, i.e. a noun, present in the image. For some MS-COCO images, all five captions are centered around the same object, and are therefore not very interesting to us. Rather, we are interested in images where different captions are centered around different objects. Hence, we build the dataset by selecting only such images and extracting the corresponding (image, demand, caption) triples.
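To make the extraction step concrete, here is a minimal sketch of how such triples could be pulled out of the standard MS-COCO annotation JSON. It assumes, purely for illustration, that the demand word of a caption is its first noun; the heuristic we actually use will be described in the follow-up post on dataset creation.

```python
# Sketch: build (image_id, demand, caption) triples from MS-COCO annotations.
# Assumption (hypothetical): the demand word is the first noun in the caption.
import json
from collections import defaultdict

import spacy

nlp = spacy.load("en_core_web_sm")

def first_noun(caption):
    """Return the lemma of the first noun in the caption, or None."""
    for tok in nlp(caption):
        if tok.pos_ == "NOUN":
            return tok.lemma_
    return None

def build_triples(annotation_file):
    # Group the (at least five) human captions by image id.
    with open(annotation_file) as f:
        anns = json.load(f)["annotations"]
    captions_by_image = defaultdict(list)
    for ann in anns:
        captions_by_image[ann["image_id"]].append(ann["caption"])

    triples = []
    for image_id, captions in captions_by_image.items():
        demands = [first_noun(c) for c in captions]
        # Keep only images whose captions focus on different objects;
        # images where every caption centers on the same noun are dropped.
        if len({d for d in demands if d}) < 2:
            continue
        for demand, caption in zip(demands, captions):
            if demand is not None:
                triples.append((image_id, demand, caption))
    return triples
```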
Once we have the dataset, we extend current state-of-the-art image captioning models [1][2] to incorporate the demand word as an additional input, and design an attention mechanism over the input demand word. Our model generates captions that are centered around the demand word, and achieves an improvement of over 5 BLEU points over baseline methods.
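As a rough illustration of how a demand word can be injected into a Show, Attend and Tell style decoder, the sketch below simply concatenates the demand word embedding with the attended image context at every LSTM step. This is a simplified stand-in, not our actual architecture (which attends over the demand word); layer sizes and the attention form are illustrative assumptions.

```python
# Sketch: a soft-attention caption decoder conditioned on a demand word.
# This is a simplified variant for illustration, not the model described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DemandConditionedDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # shared for words and demand
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim * 2 + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward_step(self, prev_word, demand, feats, state):
        # feats: (batch, num_regions, feat_dim) spatial CNN features
        h, c = state
        # Soft attention over image regions, as in Show, Attend and Tell.
        scores = self.att_score(torch.tanh(
            self.att_feat(feats) + self.att_hidden(h).unsqueeze(1)))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)
        context = (alpha * feats).sum(dim=1)                          # (B, feat_dim)
        # Condition every decoding step on the demand word embedding.
        x = torch.cat([self.embed(prev_word), self.embed(demand), context], dim=1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), (h, c)
```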
(I will update the post with the details and architecture of our model, and describe in detail the dataset creation process)
[1]: Xu, Kelvin, et al. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” International Conference on Machine Learning, 2015.
[2]: Vinyals, Oriol, et al. “Show and Tell: A Neural Image Caption Generator.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.