Accepted Paper: Multi-Scale Visual Semantics Aggregation with Self-Attention for End-to-End Image-Text Matching


Authors

Zhuobin Zheng (Tsinghua University); Youcheng Ben (Tsinghua University); Chun Yuan (Tsinghua University)

Abstract

Image-text matching has recently attracted considerable research interest. One promising direction is to infer fine-grained correspondences between visual instances and textual concepts, which makes learning instance-level visual features fundamental to this task. Detection-based approaches extract visual features directly from region proposals, but they are not end-to-end learnable, require extensive annotations, and do not adapt to unseen instances. Attention-based approaches sequentially attend to different visual semantics over a fixed number of time steps with global context as reference, but they lack the flexibility to handle images containing varying numbers of instances. In this paper, we propose Self-Attention Visual-semantic Embeddings (SAVE), which aggregates instance-level semantics from all potential positions of the image in an end-to-end manner. Specifically, a feature map with spatial size k×k is first divided into k^2 instance candidates. For each instance candidate, we explore two variants of self-attention mechanisms to model its correlation with the others and aggregate similar semantics, exploiting flexible spatial dependencies between distant regions. Furthermore, a multi-scale feature fusion technique is utilized to obtain different levels of semantic concepts, yielding richer information from different representation scales. We evaluate our model on two benchmark datasets, MS-COCO and Flickr30K; the results demonstrate both the effectiveness and the applicability of our method, with performance competitive with state-of-the-art approaches.
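
The following is a minimal sketch, not the authors' implementation, of the idea described in the abstract: treating the k^2 positions of a k×k feature map as instance candidates, letting each candidate attend to all others via dot-product self-attention, and fusing candidates from two spatial scales. All module names, dimensions, and the concatenation-based fusion are illustrative assumptions.

```python
# Illustrative sketch (assumed design, not the SAVE paper's code):
# self-attention aggregation over k*k instance candidates from a CNN feature map.
import torch
import torch.nn as nn


class SelfAttentionAggregation(nn.Module):
    """Treats each of the k*k spatial positions as an instance candidate and
    lets every candidate attend to all others before projection."""

    def __init__(self, channels: int, embed_dim: int = 1024):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj = nn.Linear(channels, embed_dim)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, k, k) feature map from a CNN backbone
        b, c, k, _ = feat.shape
        q = self.query(feat).flatten(2).transpose(1, 2)   # (B, k^2, C/8)
        k_ = self.key(feat).flatten(2)                     # (B, C/8, k^2)
        v = self.value(feat).flatten(2).transpose(1, 2)    # (B, k^2, C)
        # Pairwise correlations between all instance candidates
        attn = torch.softmax(q @ k_ / (q.shape[-1] ** 0.5), dim=-1)  # (B, k^2, k^2)
        candidates = attn @ v                               # aggregated semantics
        return self.proj(candidates)                        # (B, k^2, embed_dim)


# Multi-scale fusion (assumption): embed feature maps of different spatial
# sizes and concatenate the resulting instance candidates.
if __name__ == "__main__":
    coarse = torch.randn(2, 512, 4, 4)   # e.g. k = 4 -> 16 candidates
    fine = torch.randn(2, 512, 8, 8)     # e.g. k = 8 -> 64 candidates
    agg = SelfAttentionAggregation(512)
    fused = torch.cat([agg(coarse), agg(fine)], dim=1)  # (2, 80, 1024)
    print(fused.shape)
```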