Interpreting gradient methods as fixed-point iterations, we provide a detailed analysis of those methods for minimizing convex objective functions. Due to their conceptual and algorithmic simplicity, gradient methods are widely used in machine learning for massive data sets (big data).
In particular, stochastic gradient methods are considered the de-facto standard for training deep neural networks. Studying gradient methods within the realm of fixed-point theory provides us with powerful tools to analyze their convergence properties. For instance, gradient methods using inexact or noisy gradients, such as stochastic gradient descent, can be studied conveniently using well-known results on inexact fixed-point iterations.
Moreover, as we demonstrate in this paper, the fixed-point approach allows an elegant derivation of accelerations for basic gradient methods.
In particular, we will show how gradient descent can be accelerated by a fixed-point preserving transformation of an operator associated with the objective function.
One of the main recent trends within machine learning and data analytics using massive data sets is to leverage the inferential strength of the vast amounts of data by using relatively simple, but fast, optimization methods as algorithmic primitives [1]. Many of these optimization methods are modifications of the basic gradient descent (GD) method. Indeed, computationally heavier approaches, such as interior point methods, are often infeasible for a given limited computational budget [2].
Moreover, the rise of deep learning has significantly boosted the interest in gradient methods. Indeed, a major insight within the theory of deep learning is that for typical high-dimensional models, e.
These local minima can be found efficiently by gradient methods such as stochastic gradient descent (SGD), which is considered the de-facto standard algorithmic primitive for training deep neural networks [3]. This paper elaborates on the interpretation of some basic gradient methods, such as GD and its variants, as fixed-point iterations. These fixed-point iterations are obtained for operators associated with the convex objective function.
The connection to fixed-point theory unleashes some powerful tools, e.
In particular, we detail how the convergence of the basic GD iterations can be understood from the contraction properties of a specific operator that is naturally associated with a differentiable objective function. Moreover, we work out in some detail how the basic GD method can be accelerated by modifying the operator underlying GD in a way that preserves its fixed points but decreases the contraction factor, which implies faster convergence by the contraction mapping theorem.
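The two ideas in this paragraph can be illustrated with a minimal sketch. It assumes a hypothetical two-dimensional quadratic objective f(x) = 0.5 x^T Q x - q^T x with diagonal Q; the values of Q and q, the step sizes, and all function names are chosen for illustration only. The GD step is written as an operator T(x) = x - alpha * grad f(x), whose fixed point is the minimizer. As one well-known example of a modification that keeps the fixed point, the sketch uses heavy-ball momentum; this is not claimed to be the construction developed later in the paper.

```python
# Illustrative sketch; Q, q, the step sizes and all names are assumptions.
import math

Q_DIAG = [1.0, 4.0]   # eigenvalues of a diagonal Q, so mu = 1 and L = 4
Q_VEC = [2.0, 4.0]    # linear term q; the minimizer solves Q x = q
X_STAR = [qi / d for qi, d in zip(Q_VEC, Q_DIAG)]


def grad_f(x):
    """Gradient of f(x) = 0.5 x^T Q x - q^T x for diagonal Q."""
    return [d * xi - qi for d, xi, qi in zip(Q_DIAG, x, Q_VEC)]


def gd_operator(x, alpha):
    """GD step viewed as an operator T(x) = x - alpha * grad f(x)."""
    return [xi - alpha * gi for xi, gi in zip(x, grad_f(x))]


def heavy_ball_operator(x, x_prev, alpha, beta):
    """Momentum variant T(x) + beta * (x - x_prev).

    At x = x_prev = X_STAR the momentum term vanishes, so the
    fixed point of the GD operator is preserved.
    """
    t = gd_operator(x, alpha)
    return [ti + beta * (xi - pi) for ti, xi, pi in zip(t, x, x_prev)]


def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def run_gd(x, alpha, num_iters):
    """Plain fixed-point iteration x <- T(x)."""
    for _ in range(num_iters):
        x = gd_operator(x, alpha)
    return x


def run_heavy_ball(x, alpha, beta, num_iters):
    """Fixed-point iteration on the pair (x, x_prev)."""
    x_prev = x
    for _ in range(num_iters):
        x, x_prev = heavy_ball_operator(x, x_prev, alpha, beta), x
    return x


MU, L = min(Q_DIAG), max(Q_DIAG)
ALPHA_GD = 2.0 / (L + MU)        # step size that minimizes max |1 - alpha * lambda|
KAPPA_GD = (L - MU) / (L + MU)   # resulting contraction factor of T for this quadratic
SQRT_SUM = math.sqrt(L) + math.sqrt(MU)
ALPHA_HB = 4.0 / SQRT_SUM ** 2   # classical heavy-ball tuning for quadratics
BETA_HB = ((math.sqrt(L) - math.sqrt(MU)) / SQRT_SUM) ** 2

x_gd = run_gd([0.0, 0.0], ALPHA_GD, 20)
x_hb = run_heavy_ball([0.0, 0.0], ALPHA_HB, BETA_HB, 20)
print("GD error:", dist(x_gd, X_STAR))
print("heavy-ball error:", dist(x_hb, X_STAR))
```

Both iterations share the same fixed point, but for this toy problem the momentum variant reaches a much smaller error after the same number of steps, which is the effect a smaller contraction factor is meant to achieve.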
We discuss the basic problem of minimizing convex functions in Section 2. We then derive GD, which is a particular first-order method, as a fixed-point iteration in Section 3. In Section 4, we introduce one of the most widely used computational models for convex optimization methods, i. In order to assess the efficiency of GD, which is a particular instance of a first-order method, we present in Section 5 a lower bound on the number of iterations required by any first-order method to reach a given sub-optimality.
Using the insight provided by the fixed-point interpretation, we show how to obtain an accelerated variant of GD in Section 6, which turns out to be optimal in terms of convergence rate. The spectral norm of a matrix $M$ is denoted $\|M\|$. Given a convex function $f(x)$, we aim at finding a point $x_0$ with lowest function value $f(x_0)$, i. In order to motivate our interest in optimization problems like Equation (1), consider a machine learning problem based on training data $X$. We wish to predict the label $y_i$ by a linear combination of the features, i.
Thus, the learning problem amounts to solving the optimization problem. The learning problem (3) is precisely of the form (1) with the convex objective function.
We will focus on a particular class of twice differentiable convex functions, i. In particular, we have Rudin [10, Theorem 5. In particular, we can rewrite (9) as. The last summand in (10) quantifies the approximation error. Let us now verify that learning a regularized linear regression model cf.
The eigenvalues of the matrix $Q_{\rm LR}$ obey [11]. Let us now show how one of the most basic methods for solving the problem (1), i.
Our point of departure is the necessary and sufficient condition [7]. By tailoring a fundamental result of analysis cf. Assume there were two different fixed points $x$, $y$ such that.
Thus, we have shown that no two different fixed points can exist. The existence of one unique fixed point $x_0$ follows from Rudin [10, Theorem 9. Here, we used in step (a) the mean value theorem of vector calculus [10, Theorem 5. Combining (21) with the submultiplicativity of Euclidean and spectral norm [11, p.
It will be handy to write out the straightforward combination of Lemma 2 and Lemma 3. Then, starting from an arbitrary initial guess $x^{(0)}$, the iterates $x^{(k)}$ cf. According to Lemma 4, and also illustrated in Figure 1, starting from an arbitrary initial guess $x^{(0)}$, the sequence $x^{(k)}$ generated by the fixed-point iteration (16) is guaranteed to converge to the unique solution $x_0$ of (1), i.
Loosely speaking, this exponential decrease implies that the number of additional iterations required to have one more correct digit in $x^{(k)}$ is constant.
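The "constant number of iterations per digit" statement can be made concrete with a short calculation. The sketch below assumes an error bound of the form kappa^k times the initial error, with a contraction factor kappa strictly between 0 and 1; the function names and the example value kappa = 0.6 are hypothetical. Gaining one decimal digit means shrinking the bound by a factor of 10, which takes log(10) / log(1/kappa) iterations regardless of the current iteration index k.

```python
import math


def iterations_per_digit(kappa):
    """Iterations needed to shrink a kappa^k error bound by a factor of 10.

    Assumes 0 < kappa < 1 (hypothetical contraction factor).
    """
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie strictly between 0 and 1")
    return math.log(10.0) / math.log(1.0 / kappa)


def error_bound(kappa, k, initial_error):
    """Assumed bound on the error after k iterations: kappa^k * initial_error."""
    return (kappa ** k) * initial_error


kappa = 0.6  # example value, not taken from the paper
per_digit = math.ceil(iterations_per_digit(kappa))
print("iterations per additional digit:", per_digit)
print("bound after that many iterations:", error_bound(kappa, per_digit, 1.0))
```

The count depends only on kappa, so a smaller contraction factor directly translates into fewer iterations per correct digit.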