Jiayi Ji; Xiaoyang Huang; Xiaoshuai Sun; Yiyi Zhou; Gen Luo; Liujuan Cao; Jianzhuang Liu; Ling Shao; Rongrong Ji
Abstract:
Self-attention (SA) based networks have achieved great success in image captioning, constantly dominating the leaderboards of online benchmarks. However, existing SA networks still suffer from two issues: distance insensitivity and the low-rank bottleneck. In this paper, we optimize SA in two respects to address these issues. First, we propose Distance-sensitive Self-Attention (DSA), which incorporates the raw geometric distances between key-value pairs in the 2D image into SA modeling. Second, we present a simple yet effective approach, named Multi-branch Self-Attention (MSA), to alleviate the low-rank bottleneck. MSA treats a multi-head self-attention layer as a branch and duplicates it multiple times to increase the expressive power of SA. To validate the effectiveness of the two designs, we apply them to a standard self-attention network and conduct extensive experiments on the highly competitive MS-COCO dataset. We achieve new state-of-the-art performance on both the local and online test sets, i.e., 135.1% CIDEr on the Karpathy split and 135.4% CIDEr on the official online split. The source code and trained models for all our experiments will be made publicly available.
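
Below is a minimal PyTorch-style sketch of the two ideas as described in the abstract. The module names (DistanceBiasedAttention, MultiBranchSelfAttention), the per-head linear distance penalty, the coords argument, and the averaging of branch outputs are all assumptions made for illustration; they are not the authors' implementation.

```python
# Hypothetical sketch of a DSA-style attention layer and an MSA-style wrapper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistanceBiasedAttention(nn.Module):
    """Multi-head self-attention whose logits are penalized by the pairwise
    geometric distance between 2D grid positions (an assumed form of DSA)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # One learnable distance weight per head (assumption).
        self.dist_weight = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d_model); coords: (N, 2) grid positions of the N regions.
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        logits = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # (B, H, N, N)
        # Euclidean distance between every pair of positions, scaled per head.
        dist = torch.cdist(coords, coords)                      # (N, N)
        logits = logits - self.dist_weight.view(1, -1, 1, 1) * dist
        out = F.softmax(logits, dim=-1) @ v                     # (B, H, N, d_head)
        return self.proj(out.transpose(1, 2).reshape(B, N, -1))


class MultiBranchSelfAttention(nn.Module):
    """Duplicates the multi-head attention layer into several branches and
    averages their outputs (an assumed way of combining the branches)."""

    def __init__(self, d_model: int, n_heads: int, n_branches: int = 2):
        super().__init__()
        self.branches = nn.ModuleList(
            DistanceBiasedAttention(d_model, n_heads) for _ in range(n_branches))

    def forward(self, x: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        return torch.stack([b(x, coords) for b in self.branches]).mean(dim=0)
```

In practice the branches might be combined with learned weights or dropout rather than a plain average, and the distance bias could take a different functional form; the sketch only shows the overall structure of adding a geometric-distance term to the attention logits and replicating the attention layer into parallel branches.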