Accepted Paper: Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

Session 1: Deep Learning -- Day 2 (Nov.18), talks: 09:00-11:00 (5th floor Hall 1), poster session: 11:00-13:30
Poster number: Mon01

Authors

Difan Zou (University of California, Los Angeles); Yuan Cao (UCLA); Dongruo Zhou (UCLA); Quanquan Gu (University of California, Los Angeles)

Abstract

We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using gradient descent. We show that, with proper random weight initialization, gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of gradient descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for gradient descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work (Allen-Zhu et al., 2018; Du et al., 2018) along this line, our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of gradient descent for training deep neural networks.
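
The setting described in the abstract can be illustrated with a small numerical sketch. It is not the authors' code or experimental setup. It assumes unit-norm random inputs, random labels in {-1, +1}, a He-style Gaussian initialization, and arbitrary choices of width, depth, step size, and iteration count. All function names (init_weights, forward, loss_and_grads, train) are hypothetical. The sketch runs full-batch gradient descent on a wide fully connected ReLU network with the cross-entropy loss, then reports how far the weights moved relative to their initial norm.

```python
import numpy as np

def init_weights(dims, rng):
    # Gaussian initialization, entries ~ N(0, 2 / fan_in).
    # Assumption of this sketch, not necessarily the paper's exact scaling.
    return [rng.normal(0.0, np.sqrt(2.0 / dims[i]), size=(dims[i], dims[i + 1]))
            for i in range(len(dims) - 1)]

def forward(weights, X):
    # Returns the scalar outputs and the per-layer activations used in backprop.
    acts = [X]
    for W in weights[:-1]:
        acts.append(np.maximum(acts[-1] @ W, 0.0))  # ReLU hidden layers
    out = (acts[-1] @ weights[-1]).ravel()           # linear output layer
    return out, acts

def loss_and_grads(weights, X, y):
    # Cross-entropy loss for labels y in {-1, +1}: mean of log(1 + exp(-y * f(x))).
    out, acts = forward(weights, X)
    margins = y * out
    loss = np.mean(np.logaddexp(0.0, -margins))
    # d loss / d out = -y * sigmoid(-margin) / n, computed in a stable way.
    delta = (-y * np.exp(-np.logaddexp(0.0, margins)) / len(y))[:, None]
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        grads[l] = acts[l].T @ delta
        if l > 0:
            # Propagate through W_l, then through the ReLU that produced acts[l].
            delta = (delta @ weights[l].T) * (acts[l] > 0)
    return loss, grads

def train(X, y, width=512, depth=3, lr=0.01, steps=300, seed=0):
    # Full-batch gradient descent from a Gaussian random initialization.
    rng = np.random.default_rng(seed)
    dims = [X.shape[1]] + [width] * (depth - 1) + [1]
    weights = init_weights(dims, rng)
    init = [W.copy() for W in weights]
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(weights, X, y)
        losses.append(loss)
        weights = [W - lr * g for W, g in zip(weights, grads)]
    # Distance travelled from initialization, relative to the initial weight norm.
    moved = np.sqrt(sum(np.sum((W - W0) ** 2) for W, W0 in zip(weights, init)))
    norm0 = np.sqrt(sum(np.sum(W0 ** 2) for W0 in init))
    return losses, moved / norm0

def make_data(n=20, d=10, seed=1):
    # Synthetic data (assumption): unit-norm inputs with random binary labels.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.choice([-1.0, 1.0], size=n)
    return X, y

X, y = make_data()
losses, rel_dist = train(X, y)
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}")
print(f"relative distance from initialization {rel_dist:.4f}")
```

In runs of this kind one would expect the training loss to fall while the relative distance from initialization stays small, which is the qualitative behavior the abstract describes. The sketch does not verify any of the paper's quantitative conditions on width or convergence rate.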