Spatial Temporal Enhanced Contrastive and Pretext Learning for Skeleton-based Action Representation

Yiwen Zhan (Beijing University of Posts and Telecommunications); Yuchen Chen (Beijing University of Posts and Telecommunications)*; Pengfei Ren (Beijing University of Posts and Telecommunications); Haifeng Sun (Beijing university of posts and telecommunications); Jingyu Wang (Beijing University of Posts and Telecommunications); Qi Qi (Beijing University of Posts and Telecommunications); Jianxin Liao (beijing university of posts and telecommunications)


In this paper, we focus on unsupervised representation learning for skeleton-based action recognition. The critical issue of this task is extracting discriminative spatial-temporal information from skeleton sequences to form action representation. To better solve this, we propose a novel unsupervised framework named contrastive-pretext spatial-temporal network (CP-STN), aiming to achieve accurate action recognition by better exploiting discriminative spatial-temporal enhanced features from massive unlabeled data. We combine contrastive and pretext tasks learning paradigms in one framework by using asymmetric spatial and temporal augmentations to enable network extracting discriminative representations with spatial-temporal information fully. Furthermore, graph-based convolution is used as the backbone to explore natural spatial-temporal graph information in skeleton data. Extensive experimental results show that our CP-STN signi´Čücantly boosts the performance of existing skeleton-based action representations learning networks and achieves state-of-the-art accuracy on two challenging benchmarks in both unsupervised and semi-supervised settings.