In this talk, I will overview recent results towards understanding how we learn large capacity machine learning models. In the modern practice of machine learning, especially deep learning, many successful models have far more trainable parameters compared to the number of training examples. Consequently, the optimization objective for training such models have multiple minimizers that perfectly fit the training data. More problematically, while some of these minimizers generalize well to new examples, most minimizers will simply overfit or memorize the training data and will perform poorly on new examples. In practice though, when such ill-posed objectives are minimized using local search algorithms like (stochastic) gradient descent ((S)GD), the "special" minimizers returned by these algorithms have remarkably good performance on new examples. In this talk, we will explore the role optimization algorithms like (S)GD in learning overparameterized models in simpler setting of learning linear predictors.