Accepted Paper: Multi-Scale Visual Semantics Aggregation with Self-Attention for End-to-End Image-Text Matching


Zhuobin Zheng (Tsinghua University); Youcheng Ben (Tsinghua University); Chun Yuan (Tsinghua University)


Image-text matching has recently become an active research topic. One promising direction is to infer fine-grained correspondences between visual instances and textual concepts, which makes learning instance-level visual features fundamental to this task. Detection-based approaches extract visual features directly from region proposals, but they are not end-to-end learnable, require extensive annotations, and do not adapt to unseen instances. Attention-based approaches sequentially attend to different visual semantics over a fixed number of time steps with the global context as reference, but they lack the flexibility to handle images that contain varying numbers of instances. In this paper, we propose Self-Attention Visual-semantic Embeddings (SAVE), which aggregates instance-level semantics from all potential positions of the image in an end-to-end manner. Specifically, feature maps with spatial size k × k are first divided into k^2 instance candidates. For each instance candidate, we explore two variants of self-attention mechanisms to model its correlation with the others and aggregate similar semantics, which exploits flexible spatial dependencies between distant regions. Furthermore, a multi-scale feature fusion technique is used to obtain semantic concepts at different levels, providing richer information from different representation scales. We evaluate our model on two benchmark datasets, MS-COCO and Flickr30K; the results demonstrate both the effectiveness and the applicability of our method, with performance competitive with state-of-the-art approaches.
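
The aggregation step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the two self-attention variants, so the code below is an assumption: it uses generic scaled dot-product self-attention with no learned projections, written in pure Python for readability. The function names (`split_into_candidates`, `softmax`, `self_attention_aggregate`) and the toy dimensions are hypothetical and are not taken from the paper.

```python
import math
import random


def split_into_candidates(feature_map):
    """Flatten a k x k grid of d-dim feature vectors into k*k instance candidates."""
    return [vec for row in feature_map for vec in row]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_aggregate(candidates):
    """Aggregate each candidate with all others via scaled dot-product attention.

    Assumption: queries, keys, and values are the raw candidate vectors
    (no learned projection matrices), which is a simplification.
    """
    d = len(candidates[0])
    aggregated = []
    for query in candidates:
        # Similarity of this candidate to every position, including distant ones.
        scores = [
            sum(q * c for q, c in zip(query, key)) / math.sqrt(d)
            for key in candidates
        ]
        weights = softmax(scores)
        # Weighted sum of all candidates: similar semantics contribute more.
        aggregated.append(
            [sum(w * value[j] for w, value in zip(weights, candidates)) for j in range(d)]
        )
    return aggregated


# Toy usage: a 3 x 3 feature map with 4-dimensional features (hypothetical sizes).
rng = random.Random(0)
k, d = 3, 4
feature_map = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)] for _ in range(k)]
candidates = split_into_candidates(feature_map)
aggregated = self_attention_aggregate(candidates)
print(len(candidates), len(aggregated[0]))
```

Because every candidate attends to every other position, the cost grows with the square of the number of candidates (k^2 candidates give k^4 pairwise scores), which is one reason such a sketch would typically operate on a coarse feature map rather than on raw pixels.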