- Session 1: Deep Learning -- Day 2 (Nov. 18), talks: 09:00-11:00 (5th floor Hall 1), poster session: 11:00-13:30
- Poster number: Mon02
- Download paper
Wanming Huang (UTS); Richard Yi Da Xu (University of Technology, Sydney); Ian Oppermann (The Treasury)
Stochastic Gradient Descent (SGD) has been widely adopted for training deep neural networks of various structures. Instead of using the full dataset, a so-called mini-batch is selected at each gradient descent iteration, which speeds up learning when a large amount of training data is present. Without knowledge of the true underlying distribution, the data indices are often sampled uniformly. Recently, researchers have applied a diversified mini-batch selection scheme based on the Determinantal Point Process (DPP) to avoid highly correlated samples within a batch (Zhang et al. (2017)). Despite its success, this approach is restrictive in that it uses fixed features to construct the Gram matrix for the DPP; relying on raw or fixed higher-layer features limits the potential improvement in convergence rate. In this paper, we instead propose to use variable higher-layer features that are updated at each iteration as the parameters change. To avoid the resulting high computational cost, we make several contributions to speed up DPP sampling, including: (1) hierarchical sampling, which breaks down a single DPP sampling with a large Gram matrix into many DPP samplings with much smaller Gram matrices, and (2) Markov k-DPP sampling, which encourages diversity across iterations. Empirical results show a much more diversified mini-batch at each iteration as well as much improved convergence compared with the previous approach.
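To make the diversified selection step concrete, the sketch below draws one exact k-DPP sample from a feature-based Gram matrix, using the standard eigendecomposition algorithm of Kulesza & Taskar. It is not the authors' implementation, and the hierarchical decomposition and Markov k-DPP variants described in the abstract are not reproduced here; the function name `sample_k_dpp`, the feature array `phi`, the linear-kernel choice `L = phi @ phi.T`, and the pool/batch sizes are all illustrative assumptions.

```python
import numpy as np

def sample_k_dpp(L, k, rng=None):
    """Draw one exact sample from a k-DPP with Gram (kernel) matrix L.

    Returns k item indices whose selection probability is proportional to
    the determinant of the corresponding k x k submatrix of L.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = L.shape[0]
    lam, V = np.linalg.eigh(L)                 # eigenvalues (ascending), eigenvectors as columns
    lam = np.clip(lam, 0.0, None)              # guard against tiny negative eigenvalues

    # Elementary symmetric polynomials: E[j, n] = e_j(lam[0..n-1]).
    E = np.zeros((k + 1, N + 1))
    E[0, :] = 1.0
    for j in range(1, k + 1):
        for n in range(1, N + 1):
            E[j, n] = E[j, n - 1] + lam[n - 1] * E[j - 1, n - 1]

    # Phase 1: choose k eigenvectors with the correct marginal probabilities.
    chosen = []
    remaining = k
    for n in range(N, 0, -1):
        if remaining == 0:
            break
        if rng.random() < lam[n - 1] * E[remaining - 1, n - 1] / E[remaining, n]:
            chosen.append(n - 1)
            remaining -= 1
    Vsub = V[:, chosen]                        # N x k basis of the elementary DPP

    # Phase 2: sample k items, projecting out each selected item's direction.
    items = []
    for _ in range(k):
        probs = np.sum(Vsub ** 2, axis=1)
        probs /= probs.sum()
        i = rng.choice(N, p=probs)
        items.append(i)
        j = np.argmax(np.abs(Vsub[i, :]))      # pick a basis vector with nonzero i-th entry
        v = Vsub[:, j].copy()
        Vsub = np.delete(Vsub, j, axis=1)
        if Vsub.shape[1] > 0:
            Vsub = Vsub - np.outer(v, Vsub[i, :] / v[i])
            Vsub, _ = np.linalg.qr(Vsub)       # re-orthonormalize the remaining basis
    return np.array(items)

# Hypothetical usage: `phi` stands in for the current higher-layer activations of a
# candidate pool of training points; one k-DPP draw gives the next mini-batch indices.
phi = np.random.randn(256, 64)                 # 256 candidates, 64-dim features (illustrative)
L = phi @ phi.T                                # linear-kernel Gram matrix (one possible choice)
batch_idx = sample_k_dpp(L, k=32)
```

Note that the eigendecomposition is cubic in the size of the candidate pool, which is the motivation for the paper's hierarchical scheme of replacing one large DPP sampling with many samplings over much smaller Gram matrices.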