In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let ƒ : U → R be a real analytic function on an open set U in Rn, and let Z be the zero locus of ƒ. Assume that Z is not empty. Then for any compact set K in U, there exist positive constants α and C such that, for all x in K

Here,
can be small.
The following form of this inequality is often seen in more analytic contexts: with the same assumptions on f, for every p ∈ U there is a possibly smaller open neighborhood W of p and constants θ ∈ (0,1) and c > 0 such that

Polyak inequality
A special case of the Łojasiewicz inequality, due to Polyak, is commonly used to prove linear convergence of gradient descent algorithms. This section is based on Karimi, Nutini & Schmidt (2016) and Liu, Zhu & Belkin (2022).
Definitions
is a function of type
, and has a continuous derivative
.
is the subset of
on which
achieves its global minimum (if one exists). Throughout this section we assume such a global minimum value
exists, unless otherwise stated. The optimization objective is to find some point
in
.
are constants.
is
-Lipschitz continuous iff
is
-strongly convex iff
is
-PL (where "PL" means "Polyak-Łojasiewicz") iff
Basic properties
Theorem—1. If
is
-PL, then it is invex.
2. If
is L-Lipschitz continuous, then
3. If
is
-strongly convex then it is
-PL.
4. If
is
-strongly convex, and
is linear, then
is
-PL, where
is the smallest nonzero singular value of
.
5. (quadratic growth) If
is
-PL,
is a point, and
is the point on the optimum set that is closest to
in L2-norm, then
Proof
Proof
1. By definition, every stationary point is a global minimum.
2. Set
for
and use the
-Lipschitz continuity to show that
.
3. By definition,
. Now, minimize the left side, we have then minimize the right side, we have Combining the two, we have the
-PL inequality.
4.
Now, since
, we have
Set
to be the projection of
to the optimum subspace, then we have
. Thus, we have Vary the
on the right side to minimize the right side, we have the desired result.
5. Let
. For any
, we have so by
-PL,
In particular, we see that
is a vector field on
with size at least
. Define a gradient flow along
with constant unit velocity, starting at
:
Because
is bounded below by
, and
, the gradient flow terminates on the zero set
at a finite time The path length is
, since the velocity is constantly 1.
Since
is on the zero set, and
is the point closest to
, we have which is the desired result.
Gradient descent
Proof
Proof
Since
is
-Lipschitz, we have the parabolic upper bound
Plugging in the gradient descent step,
Thus,
Coordinate descent
The coordinate descent algorithm first samples a random coordinate
uniformly, then perform gradient descent by
Proof
Proof
By the same argument,
Take expectation with respect to
, we have
Plug in the
-PL inequality, we have Iterating the process, we have the desired result.
Stochastic gradient descent
In stochastic gradient descent, we have a function to minimize
, but we cannot sample its gradient directly. Instead, we sample a random gradient
, where
are such that For example, in typical machine learning,
are the parameters of the neural network, and
is the loss incurred on the
-th training data point, while
is the average loss over all training data points.
The gradient update step is where
are a sequence of learning rates (the learning rate schedule).
As it is, the proposition is difficult to use. We can make it easier to use by some further assumptions.
The second-moment on the right can be removed by assuming a uniform upper bound. That is, if there exists some
such that during the SG process, we have for all
, thenSimilarly, if then
Learning rate schedules
For constant learning rate schedule, with
, we haveBy induction, we have We see that the loss decreases in expectation first exponentially, but then stops decreasing, which is caused by the
term. In short, because the gradient descent steps are too large, the variance in the stochastic gradient starts to dominate, and
starts doing a random walk in the vicinity of
.
For decreasing learning rate schedule with
, we have
.
References
- Bierstone, Edward; Milman, Pierre D. (1988), "Semianalytic and subanalytic sets", Publications Mathématiques de l'IHÉS, 67 (67): 5–42, doi:10.1007/BF02699126, ISSN 1618-1913, MR 0972342, S2CID 56006439
- Ji, Shanyu; Kollár, János; Shiffman, Bernard (1992), "A global Łojasiewicz inequality for algebraic varieties", Transactions of the American Mathematical Society, 329 (2): 813–818, doi:10.2307/2153965, ISSN 0002-9947, JSTOR 2153965, MR 1046016
- Karimi, Hamed; Nutini, Julie; Schmidt, Mark (2016). "Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak–Łojasiewicz Condition". arXiv:1608.04636 [cs.LG].
- Liu, Chaoyue; Zhu, Libin; Belkin, Mikhail (2022-07-01). "Loss landscapes and optimization in over-parameterized non-linear systems and neural networks". Applied and Computational Harmonic Analysis. Special Issue on Harmonic Analysis and Machine Learning. 59: 85–116. arXiv:2003.00307. doi:10.1016/j.acha.2021.12.009. ISSN 1063-5203.
External links