Łojasiewicz inequality

In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let ƒ : U → R be a real analytic function on an open set U in Rⁿ, and let Z be the zero locus of ƒ. Assume that Z is not empty. Then for any compact set K in U, there exist positive constants α and C such that, for all x in K

$\operatorname {dist} (x,Z)^{\alpha }\leq C|f(x)|.$

Here, $\alpha$ can be large: for example, for $f(x)=x^{m}$ near the origin, $\alpha$ must be at least $m$.
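As a small numerical illustration of the inequality, the following sketch checks it for the example $f(x)=x^{m}$ on the compact set $K=[-1,1]$, where $Z=\{0\}$ and $\operatorname {dist} (x,Z)=|x|$. The choice of function, the grid, and the helper name `lojasiewicz_holds` are assumptions made here for illustration.

```python
def lojasiewicz_holds(m, alpha, C, xs):
    """Check dist(x, Z)**alpha <= C * |f(x)| for f(x) = x**m, where Z = {0}."""
    return all(abs(x) ** alpha <= C * abs(x ** m) + 1e-12 for x in xs)

# Sample points of the compact set K = [-1, 1].
K = [i / 1000 for i in range(-1000, 1001)]

# With alpha equal to the order of vanishing m, the constant C = 1 works.
print(lojasiewicz_holds(m=4, alpha=4, C=1.0, xs=K))

# With a smaller exponent and a fixed C, the inequality fails close to x = 0.
print(lojasiewicz_holds(m=4, alpha=2, C=100.0, xs=K))
```

The second check fails because $|x|^{2}\leq C|x|^{4}$ would require $C\geq 1/x^{2}$, which is unbounded as $x\to 0$.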

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on f, for every p ∈ U there is a possibly smaller open neighborhood W of p and constants θ ∈ (0,1) and c > 0 such that, for all x in W

$|f(x)-f(p)|^{\theta }\leq c|\nabla f(x)|.$

Polyak inequality

A special case of the Łojasiewicz inequality, due to Polyak, is commonly used to prove linear convergence of gradient descent algorithms. This section is based on Karimi, Nutini & Schmidt (2016) and Liu, Zhu & Belkin (2022).

Definitions

${\textstyle f}$ is a function of type ${\textstyle \mathbb {R} ^{d}\to \mathbb {R} }$ , and has a continuous derivative $\nabla f$ .

$X^{*}$ is the subset of $\mathbb {R} ^{d}$ on which $f$ achieves its global minimum (if one exists). Throughout this section we assume such a global minimum value $f^{*}$ exists, unless otherwise stated. The optimization objective is to find some point $x$ in $X^{*}$ .

${\textstyle \mu ,L>0}$ are constants.

${\textstyle \nabla f}$ is $L$ -Lipschitz continuous iff

$\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|,\quad \forall x,y$

${\textstyle f}$ is ${\textstyle \mu }$ -strongly convex iff $f(y)\geq f(x)+\nabla f(x)^{T}(y-x)+{\frac {\mu }{2}}\lVert y-x\rVert ^{2}\quad \forall x,y$

${\textstyle f}$ is ${\textstyle \mu }$ -PL (where "PL" means "Polyak-Łojasiewicz") iff ${\frac {1}{2}}\|\nabla f(x)\|^{2}\geq \mu \left(f(x)-f(x^{*})\right),\quad \forall x$
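A function can be PL without being convex. The sketch below checks the $\mu$-PL inequality numerically for $f(x)=x^{2}+3\sin ^{2}(x)$, a standard example from Karimi, Nutini & Schmidt (2016). The constant $\mu =1/32$ is taken from that paper and is only verified on a finite grid here; the grid and the helper names are assumptions for illustration.

```python
import math

MU = 1 / 32  # PL constant assumed for this example (only checked on a grid below)

def f(x):
    return x * x + 3 * math.sin(x) ** 2   # global minimum f* = 0 at x = 0

def df(x):
    return 2 * x + 3 * math.sin(2 * x)    # derivative of f

def d2f(x):
    return 2 + 6 * math.cos(2 * x)        # second derivative of f

def pl_holds(x, mu=MU, f_star=0.0):
    """mu-PL inequality at x: 0.5 * |f'(x)|^2 >= mu * (f(x) - f*)."""
    return 0.5 * df(x) ** 2 >= mu * (f(x) - f_star) - 1e-12

grid = [i / 100 for i in range(-1000, 1001)]
print(all(pl_holds(x) for x in grid))   # PL holds at every sampled point
print(d2f(math.pi / 2))                 # negative curvature, so f is not convex
```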

Basic properties

Theorem—1. If ${\textstyle f}$ is ${\textstyle \mu }$ -PL, then it is invex.

2. If ${\textstyle \nabla f}$ is L-Lipschitz continuous, then $f(y)\leq f(x)+\langle \nabla f(x),y-x\rangle +{\frac {L}{2}}\|y-x\|^{2}$

3. If ${\textstyle f}$ is ${\textstyle \mu }$ -strongly convex then it is ${\textstyle \mu }$ -PL.

4. If ${\textstyle g}$ is ${\textstyle \mu }$ -strongly convex, and ${\textstyle A}$ is linear, then ${\textstyle f:=g\circ A}$ is ${\textstyle (\mu \sigma ^{2})}$ -PL, where ${\textstyle \sigma }$ is the smallest nonzero singular value of ${\textstyle A}$ .

5. (quadratic growth) If ${\textstyle f}$ is ${\textstyle \mu }$ -PL, ${\textstyle x}$ is a point, and ${\textstyle x^{*}}$ is the point on the optimum set that is closest to ${\textstyle x}$ in L2-norm, then $f(x)\geq f\left(x^{*}\right)+{\frac {\mu }{2}}\left\|x-x^{*}\right\|_{2}^{2}$
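Property 4 can be checked on a small least-squares problem. In the sketch below, $g(z)={\frac {1}{2}}\|z-b\|^{2}$ is 1-strongly convex and $A$ is a rank-1 matrix whose only nonzero singular value is 5, so the property predicts that $f=g\circ A$ is 25-PL even though it is not strongly convex. The matrix, the vector $b$, and the test points are illustrative assumptions.

```python
# f(x) = g(Ax) with g(z) = 0.5 * ||z - b||^2, which is 1-strongly convex.
A = [[1.0, 2.0],
     [2.0, 4.0]]          # rank 1; its only nonzero singular value is 5
b = [1.0, 2.0]            # lies in the range of A, so the minimum value f* is 0
SIGMA = 5.0
MU_PL = 1.0 * SIGMA ** 2  # PL constant mu * sigma^2 predicted by property 4

def residual(x):
    return [sum(A[i][j] * x[j] for j in range(2)) - b[i] for i in range(2)]

def f(x):
    return 0.5 * sum(r * r for r in residual(x))

def grad_f(x):
    r = residual(x)       # the gradient is A^T (Ax - b)
    return [sum(A[i][j] * r[i] for i in range(2)) for j in range(2)]

def pl_gap(x):
    """0.5 * ||grad f(x)||^2 - MU_PL * (f(x) - f*); nonnegative iff PL holds at x."""
    g = grad_f(x)
    return 0.5 * sum(gi * gi for gi in g) - MU_PL * f(x)

points = [(-3.0, 1.0), (0.0, 0.0), (2.0, 5.0), (1.0, 0.0)]
print([pl_gap(p) for p in points])
```

In this particular example the PL inequality holds with equality, and the minimizers form a whole line (for instance both `(1, 0)` and `(3, -1)` give `f = 0`), which is why strong convexity fails.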

Proof

1. By definition, every stationary point is a global minimum.

2. Set ${\textstyle g(t)=f(x+t(y-x))}$ for ${\textstyle t\in [0,1]}$ and use the ${\textstyle L}$ -Lipschitz continuity to show that ${\textstyle f(y)-f(x)=g(1)-g(0)=\int _{0}^{1}g'(t)\,dt=\langle \int _{0}^{1}\nabla f(x+t(y-x))\,dt,y-x\rangle \leq \langle \nabla f(x),y-x\rangle +{\frac {L}{2}}\|y-x\|^{2}}$ .

3. By definition, ${\textstyle f(y)\geq f(x)+\nabla f(x)^{T}(y-x)+{\frac {\mu }{2}}\lVert y-x\rVert ^{2}}$ . Setting ${\textstyle y=x^{*}}$ , a global minimizer, we have $f(x^{*})\geq f(x)+\nabla f(x)^{T}(x^{*}-x)+{\frac {\mu }{2}}\lVert x^{*}-x\rVert ^{2}$ The right side is a quadratic in ${\textstyle x^{*}-x}$ and is bounded below by its minimum value, so $f(x)+\nabla f(x)^{T}(x^{*}-x)+{\frac {\mu }{2}}\lVert x^{*}-x\rVert ^{2}\geq f(x)-{\frac {1}{2\mu }}\|\nabla f(x)\|^{2}$ Combining the two and rearranging, we have the ${\textstyle \mu }$ -PL inequality.
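The lower bound on the quadratic can be worked out explicitly. Writing ${\textstyle v}$ for the displacement from ${\textstyle x}$ (a symbol introduced here only for this step), the quadratic ${\textstyle q(v)=f(x)+\nabla f(x)^{T}v+{\frac {\mu }{2}}\lVert v\rVert ^{2}}$ is convex, so its minimizer is found by setting its gradient to zero:

$\nabla f(x)+\mu v=0\quad \Longrightarrow \quad v=-{\frac {1}{\mu }}\nabla f(x)$

Substituting back gives the minimum value

$q\left(-{\frac {1}{\mu }}\nabla f(x)\right)=f(x)-{\frac {1}{\mu }}\|\nabla f(x)\|^{2}+{\frac {1}{2\mu }}\|\nabla f(x)\|^{2}=f(x)-{\frac {1}{2\mu }}\|\nabla f(x)\|^{2}$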

4. $g(Ay)\geq g(Ax)+\langle \nabla g(Ax),Ay-Ax\rangle +{\frac {\mu }{2}}\|Ay-Ax\|^{2}$

Now, since ${\textstyle \nabla f(x)=A^{T}\nabla g(Ax)}$ , we have $f(y)\geq f(x)+\langle \nabla f(x),y-x\rangle +{\frac {\mu }{2}}\|A(y-x)\|^{2}$

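Set ${\textstyle y}$ to be the projection of ${\textstyle x}$ onto the optimum set. The optimum set is a translate of the null space of ${\textstyle A}$ , so ${\textstyle y-x}$ is orthogonal to that null space and ${\textstyle \|A(y-x)\|\geq \sigma \|y-x\|}$ . Thus, we have $f(y)-f(x)\geq \langle \nabla f(x),y-x\rangle +{\frac {\mu \sigma ^{2}}{2}}\|y-x\|^{2}$ The right side is bounded below by its minimum over ${\textstyle y}$ , which is ${\textstyle -{\frac {1}{2\mu \sigma ^{2}}}\|\nabla f(x)\|^{2}}$ , and this gives the desired result.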

5. Let ${\textstyle g(x):={\sqrt {f(x)-f^{*}}}}$ . For any ${\textstyle x\not \in X^{*}}$ , we have $\nabla g(x)={\frac {\nabla f(x)}{2{\sqrt {f(x)-f^{*}}}}}$ so by ${\textstyle \mu }$ -PL,
$\|\nabla g(x)\|^{2}\geq \mu /2$

In particular, we see that ${\textstyle \nabla g}$ is a vector field on ${\textstyle \mathbb {R} ^{d}\setminus X^{*}}$ with size at least ${\textstyle {\sqrt {\mu /2}}}$ . Define a flow along ${\textstyle -\nabla g}$ with constant unit speed, starting at ${\textstyle x(0)=x}$ : $x(0)=x,\quad {\dot {x}}(t)=-{\frac {\nabla g(x(t))}{\|\nabla g(x(t))\|}}$

Along this flow, ${\textstyle {\frac {d}{dt}}g(x(t))=-\|\nabla g(x(t))\|\leq -{\sqrt {\mu /2}}}$ . Because ${\textstyle g}$ is bounded below by ${\textstyle 0}$ , the flow reaches the zero set ${\textstyle X^{*}}$ at a finite time $T\leq g(x)/{\sqrt {\mu /2}}$ The path length is ${\textstyle T}$ , since the speed is constantly 1.

Since ${\textstyle x(T)}$ is on the zero set, and ${\textstyle x^{*}}$ is the point closest to ${\textstyle x}$ , we have $\|x^{*}-x\|\leq T\leq g(x)/{\sqrt {\mu /2}}$ which is the desired result.

Gradient descent

Theorem (linear convergence of gradient descent)—If ${\textstyle f}$ is ${\textstyle \mu }$ -PL and ${\textstyle \nabla f}$ is ${\textstyle L}$ -Lipschitz, then gradient descent with constant step size ${\textstyle \eta }$ $x_{k+1}=x_{k}-\eta \nabla f(x_{k})$ converges linearly as $f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-2\mu \eta (1-L\eta /2)\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right),\quad \eta \in (0,2/L)$

The convergence is the fastest when ${\textstyle \eta =1/L}$ , at which point $f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-\mu /L\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)$
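The rate at ${\textstyle \eta =1/L}$ can be observed numerically. The sketch below runs gradient descent on the non-convex example $f(x)=x^{2}+3\sin ^{2}(x)$ and compares the optimality gap with the bound ${\textstyle (1-\mu /L)^{k}}$ times the initial gap. The values $L=8$ (from $|f''(x)|=|2+6\cos(2x)|\leq 8$) and $\mu =1/32$ (from Karimi, Nutini & Schmidt), as well as the starting point, are assumptions of this illustration.

```python
import math

L_SMOOTH = 8.0   # |f''(x)| = |2 + 6 cos(2x)| <= 8, so f' is 8-Lipschitz
MU = 1 / 32      # PL constant assumed for this f

def f(x):
    return x * x + 3 * math.sin(x) ** 2   # f* = 0, attained at x = 0

def df(x):
    return 2 * x + 3 * math.sin(2 * x)

def gradient_descent(x0, eta, steps):
    """Return the list of gaps f(x_k) - f* for k = 0..steps."""
    x, gaps = x0, [f(x0)]
    for _ in range(steps):
        x = x - eta * df(x)
        gaps.append(f(x))
    return gaps

eta = 1 / L_SMOOTH
gaps = gradient_descent(x0=4.0, eta=eta, steps=50)
rate = 1 - MU / L_SMOOTH
bounds = [rate ** k * gaps[0] for k in range(len(gaps))]
print(all(g <= bd + 1e-12 for g, bd in zip(gaps, bounds)))
```

The bound is a worst-case guarantee; on this example the observed decrease is much faster than ${\textstyle 1-\mu /L}$ per step.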

Proof

Since ${\textstyle \nabla f}$ is ${\textstyle L}$ -Lipschitz, we have the parabolic upper bound $f(x_{k+1})\leq f(x_{k})+\langle \nabla f(x_{k}),x_{k+1}-x_{k}\rangle +{\frac {L}{2}}\|x_{k+1}-x_{k}\|^{2}$

Plugging in the gradient descent step, ${\begin{aligned}f(x_{k+1})-f(x_{k})&\leq \langle \nabla f(x_{k}),-\eta \nabla f(x_{k})\rangle +{\frac {L}{2}}\|-\eta \nabla f(x_{k})\|^{2}\\&=(L\eta ^{2}/2-\eta )\|\nabla f(x_{k})\|^{2}\\&\leq 2\mu (L\eta ^{2}/2-\eta )\left(f(x_{k})-f(x^{*})\right)\end{aligned}}$

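The last step uses the ${\textstyle \mu }$ -PL inequality together with ${\textstyle L\eta ^{2}/2-\eta <0}$ for ${\textstyle \eta \in (0,2/L)}$ . Adding ${\textstyle f(x_{k})-f(x^{*})}$ to both sides and iterating over ${\textstyle k}$ gives $f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-2\mu \eta (1-L\eta /2)\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)$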

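Corollary—1. ${\textstyle x_{k}}$ converges to the optimum set ${\textstyle X^{*}}$ at a rate of ${\textstyle {\sqrt {1-\mu \eta (2-L\eta )}}}$ , since by quadratic growth the squared distance from ${\textstyle x_{k}}$ to ${\textstyle X^{*}}$ is at most ${\textstyle 2/\mu }$ times ${\textstyle f(x_{k})-f(x^{*})}$ .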

2. If ${\textstyle f}$ is ${\textstyle \mu }$ -PL, not constant, and ${\textstyle \nabla f}$ is ${\textstyle L}$ -Lipschitz, then ${\textstyle L\geq \mu }$ .

3. Under the same conditions, gradient descent with optimal step size (which might be found by line-searching) satisfies

$f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-\mu /L\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)$

Coordinate descent

The coordinate descent algorithm first samples a random coordinate ${\textstyle i_{k}}$ uniformly, then performs a gradient step along that coordinate: $x_{k+1}=x_{k}-\eta \partial _{i_{k}}f(x_{k})e_{i_{k}}$

Theorem—Assume that ${\textstyle f}$ is ${\textstyle \mu }$ -PL, and that ${\textstyle \nabla f}$ is ${\textstyle L}$ -Lipschitz at each coordinate, meaning that $|\partial _{i}f(x+te_{i})-\partial _{i}f(x)|\leq L|t|$ Then, ${\textstyle \mathbb {E} [f(x_{k})-f(x^{*})]}$ converges linearly for all ${\textstyle \eta \in (0,2/L)}$ by $\mathbb {E} [f(x_{k})-f(x^{*})]\leq \left(1-{\frac {\mu \eta (2-L\eta )}{d}}\right)^{k}(f(x_{0})-f(x^{*}))$
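The one-step version of this bound can be checked exactly on a small example by averaging over all coordinates instead of sampling. The sketch below uses the diagonal quadratic $f(x)={\frac {1}{2}}\sum _{i}a_{i}x_{i}^{2}$ with $a=(1,4)$, so that $\mu =1$, $L=4$ and $d=2$; the function, the test points, and the helper names are assumptions for illustration.

```python
A_DIAG = [1.0, 4.0]    # f(x) = 0.5 * sum(a_i * x_i^2), with f* = 0
MU = min(A_DIAG)       # f is MU-strongly convex, hence MU-PL
L_COORD = max(A_DIAG)  # each partial derivative a_i * x_i is L_COORD-Lipschitz in x_i
D = len(A_DIAG)

def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A_DIAG, x))

def coordinate_step(x, i, eta):
    """One coordinate-descent update along coordinate i."""
    y = list(x)
    y[i] = x[i] - eta * A_DIAG[i] * x[i]
    return y

def expected_gap_after_step(x, eta):
    """Exact E[f(x_{k+1}) - f*] over a uniformly random coordinate."""
    return sum(f(coordinate_step(x, i, eta)) for i in range(D)) / D

eta = 1 / L_COORD
factor = 1 - MU * eta * (2 - L_COORD * eta) / D   # contraction factor from the theorem
for x in [(1.0, 1.0), (-2.0, 0.5), (0.0, 3.0)]:
    print(expected_gap_after_step(x, eta), "<=", factor * f(x))
```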

Proof

By the same argument, $f(x_{k+1})\leq f(x_{k})+(L\eta ^{2}/2-\eta )(\partial _{i_{k}}f(x_{k}))^{2}$

Taking the expectation with respect to ${\textstyle i_{k}}$ , we have $\mathbb {E} [f(x_{k+1})]\leq f(x_{k})+{\frac {L\eta ^{2}/2-\eta }{d}}\|\nabla f(x_{k})\|^{2}$

Plugging in the ${\textstyle \mu }$ -PL inequality, we have $\mathbb {E} [f(x_{k+1})-f(x^{*})]\leq \left(1-{\frac {\mu \eta (2-L\eta )}{d}}\right)(f(x_{k})-f(x^{*}))$ Iterating the process, we have the desired result.

Stochastic gradient descent

In stochastic gradient descent, we have a function to minimize ${\textstyle f(x)}$ , but we cannot sample its gradient directly. Instead, we sample a random gradient ${\textstyle \nabla f_{i}(x)}$ , where ${\textstyle f_{i}}$ are such that $f(x)=\mathbb {E} _{i}[f_{i}(x)]$ For example, in typical machine learning, ${\textstyle x}$ are the parameters of the neural network, and ${\textstyle f_{i}(x)}$ is the loss incurred on the ${\textstyle i}$ -th training data point, while ${\textstyle f(x)}$ is the average loss over all training data points.

The gradient update step is $x_{k+1}=x_{k}-\eta _{k}\nabla f_{i_{k}}(x_{k})$ where ${\textstyle \eta _{k}>0}$ are a sequence of learning rates (the learning rate schedule).

Theorem—If each ${\textstyle \nabla f_{i}}$ is ${\textstyle L}$ -Lipschitz, ${\textstyle f}$ is ${\textstyle \mu }$ -PL, and ${\textstyle f}$ has global minimum ${\textstyle f^{*}}$ , then $\mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-2\eta _{k}\mu \right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {L\eta _{k}^{2}}{2}}\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})\|^{2}]$ We can also write it using the variance of the stochastic gradient, for ${\textstyle \eta _{k}\in (0,2/L]}$ : $\mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-\mu (2\eta _{k}-L\eta _{k}^{2})\right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {L\eta _{k}^{2}}{2}}\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})-\nabla f(x_{k})\|^{2}]$
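Both bounds can be checked exactly on a toy problem by averaging over every possible sample instead of drawing one. The sketch below uses two components $f_{1}(x)={\frac {1}{2}}(x-1)^{2}$ and $f_{2}(x)={\frac {1}{2}}(x+1)^{2}$, so that $f(x)={\frac {1}{2}}x^{2}+{\frac {1}{2}}$, $f^{*}={\frac {1}{2}}$ and $\mu =L=1$. The problem and the helper names are assumptions made for illustration.

```python
TARGETS = [1.0, -1.0]    # f_i(x) = 0.5 * (x - TARGETS[i])^2, sampled uniformly
MU, L_SMOOTH = 1.0, 1.0  # f(x) = 0.5 * x^2 + 0.5 is 1-PL; each f_i' is 1-Lipschitz
F_STAR = 0.5             # minimum of f, attained at x = 0
N = len(TARGETS)

def f(x):
    return sum(0.5 * (x - t) ** 2 for t in TARGETS) / N

def grad_fi(x, i):
    return x - TARGETS[i]

def grad_f(x):
    return sum(grad_fi(x, i) for i in range(N)) / N

def expected_gap_after_step(x, eta):
    """Exact E[f(x_{k+1}) - f*], averaging over both choices of i_k."""
    return sum(f(x - eta * grad_fi(x, i)) for i in range(N)) / N - F_STAR

def first_bound(x, eta):
    second_moment = sum(grad_fi(x, i) ** 2 for i in range(N)) / N
    return (1 - 2 * eta * MU) * (f(x) - F_STAR) + 0.5 * L_SMOOTH * eta ** 2 * second_moment

def second_bound(x, eta):
    variance = sum((grad_fi(x, i) - grad_f(x)) ** 2 for i in range(N)) / N
    factor = 1 - MU * (2 * eta - L_SMOOTH * eta ** 2)
    return factor * (f(x) - F_STAR) + 0.5 * L_SMOOTH * eta ** 2 * variance

for x, eta in [(2.0, 0.1), (-1.5, 0.5), (0.3, 0.25)]:
    print(expected_gap_after_step(x, eta), first_bound(x, eta), second_bound(x, eta))
```

For this quadratic example both bounds are attained with equality, so the three printed numbers in each row agree up to rounding.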

Proof

Because all ${\textstyle \nabla f_{i}}$ are ${\textstyle L}$ -Lipschitz, so is ${\textstyle \nabla f}$ . We thus have $f(x_{k+1})\leq f(x_{k})-\eta _{k}\langle \nabla f(x_{k}),\nabla f_{i_{k}}(x_{k})\rangle +{\frac {L\eta _{k}^{2}}{2}}\|\nabla f_{i_{k}}(x_{k})\|^{2}$

Now, take the expectation over ${\textstyle i_{k}}$ , using ${\textstyle \mathbb {E} _{i}[\nabla f_{i}(x_{k})]=\nabla f(x_{k})}$ , and use the fact that ${\textstyle f}$ is ${\textstyle \mu }$ -PL. This gives the first inequality.

The second inequality is shown similarly by noting that $\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})\|^{2}]=\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})-\nabla f(x_{k})\|^{2}]+\|\nabla f(x_{k})\|^{2}$

As it stands, the theorem is difficult to use, because the right side still depends on the stochastic gradients at ${\textstyle x_{k}}$ . It becomes easier to use under some further assumptions.

The second-moment on the right can be removed by assuming a uniform upper bound. That is, if there exists some ${\textstyle C>0}$ such that during the SG process, we have $\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})\|^{2}]\leq C$ for all ${\textstyle k=0,1,\dots }$ , then $\mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-2\eta _{k}\mu \right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {LC\eta _{k}^{2}}{2}}$ Similarly, if $\forall k,\quad \mathbb {E} _{i}[\|\nabla f_{i}(x_{k})-\nabla f(x_{k})\|^{2}]\leq C$ then $\mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-\mu (2\eta _{k}-L\eta _{k}^{2})\right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {LC\eta _{k}^{2}}{2}}$

Learning rate schedules

For a constant learning rate schedule with ${\textstyle \eta _{k}=\eta =1/L}$ , the variance form of the bound gives $\mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-\mu /L\right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {C}{2L}}$ By induction, we have $\mathbb {E} \left[f\left(x_{k}\right)-f^{*}\right]\leq \left(1-\mu /L\right)^{k}\left[f\left(x_{0}\right)-f^{*}\right]+{\frac {C}{2\mu }}$ We see that the expected loss first decreases exponentially, but then stops decreasing because of the ${\textstyle C/(2L)}$ term, which accumulates to the floor ${\textstyle C/(2\mu )}$ . In short, because the step size does not shrink, the variance in the stochastic gradient eventually dominates, and ${\textstyle x_{k}}$ starts doing a random walk in the vicinity of ${\textstyle X^{*}}$ .
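The plateau can be seen by iterating the one-step bound as if it held with equality, which is the worst case it allows. In the sketch below, the values of $\mu$, $L$, $C$ and the initial gap are arbitrary illustrative assumptions, and `iterate_bound` is a helper name introduced here.

```python
def iterate_bound(e0, mu, L, C, steps):
    """Iterate e_{k+1} = (1 - mu/L) * e_k + C/(2L), the worst case allowed by the bound."""
    e, history = e0, [e0]
    for _ in range(steps):
        e = (1 - mu / L) * e + C / (2 * L)
        history.append(e)
    return history

mu, L, C, e0 = 1.0, 10.0, 1.0, 100.0   # illustrative values only
hist = iterate_bound(e0, mu, L, C, steps=200)

# Closed-form bound obtained by induction: geometric decay plus the floor C / (2 * mu).
closed_form = [(1 - mu / L) ** k * e0 + C / (2 * mu) for k in range(len(hist))]

print(hist[0], hist[50], hist[200])    # decays, then levels off near C / (2 * mu)
```

The fixed point of the recursion is exactly ${\textstyle C/(2\mu )}$ , which is why the sequence levels off there instead of reaching zero.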

For a suitably chosen decreasing learning rate schedule with ${\textstyle \eta _{k}=O(1/k)}$ , we have $\mathbb {E} \left[f\left(x_{k}\right)-f^{*}\right]=O(1/k)$ .
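One concrete schedule of this kind is ${\textstyle \eta _{k}=(2k+1)/(2\mu (k+1)^{2})}$ , for which ${\textstyle 1-2\eta _{k}\mu =k^{2}/(k+1)^{2}}$ and the second-moment form of the bound telescopes to ${\textstyle LC/(2k\mu ^{2})}$ . A schedule of this form is used in Karimi, Nutini & Schmidt (2016); here it is treated as an assumption, and the sketch below only iterates the one-step bound with equality for arbitrary illustrative constants.

```python
def decreasing_schedule(k, mu):
    """Step size eta_k = (2k + 1) / (2 * mu * (k + 1)^2), which decays like 1/k."""
    return (2 * k + 1) / (2 * mu * (k + 1) ** 2)

def iterate_bound(e0, mu, L, C, steps):
    """Iterate e_{k+1} = (1 - 2*eta_k*mu) * e_k + L*C*eta_k^2 / 2 with equality."""
    e, history = e0, [e0]
    for k in range(steps):
        eta = decreasing_schedule(k, mu)
        e = (1 - 2 * eta * mu) * e + L * C * eta ** 2 / 2
        history.append(e)
    return history

mu, L, C, e0 = 1.0, 10.0, 1.0, 100.0   # illustrative values only
hist = iterate_bound(e0, mu, L, C, steps=1000)
for k in (1, 10, 100, 1000):
    print(k, hist[k], L * C / (2 * k * mu ** 2))   # value vs. the O(1/k) envelope
```

Unlike the constant schedule, the gap keeps shrinking, but only at the sublinear rate ${\textstyle O(1/k)}$ .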

References

Bierstone, Edward; Milman, Pierre D. (1988), "Semianalytic and subanalytic sets", Publications Mathématiques de l'IHÉS, 67 (67): 5–42, doi:10.1007/BF02699126, ISSN 1618-1913, MR 0972342, S2CID 56006439
Ji, Shanyu; Kollár, János; Shiffman, Bernard (1992), "A global Łojasiewicz inequality for algebraic varieties", Transactions of the American Mathematical Society, 329 (2): 813–818, doi:10.2307/2153965, ISSN 0002-9947, JSTOR 2153965, MR 1046016
Karimi, Hamed; Nutini, Julie; Schmidt, Mark (2016). "Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak–Łojasiewicz Condition". arXiv:1608.04636 [cs.LG].
Liu, Chaoyue; Zhu, Libin; Belkin, Mikhail (2022-07-01). "Loss landscapes and optimization in over-parameterized non-linear systems and neural networks". Applied and Computational Harmonic Analysis. Special Issue on Harmonic Analysis and Machine Learning. 59: 85–116. arXiv:2003.00307. doi:10.1016/j.acha.2021.12.009. ISSN 1063-5203.

External links

"Lojasiewicz inequality", Encyclopedia of Mathematics, EMS Press, 2001 [1994]