
Chapter 3: Convex Functions

Convex functions are the fundamental objects of study in convex optimization. This chapter develops the definition of convex functions, their properties, various characterizations (first-order, second-order, and gradient monotonicity), as well as operations that preserve convexity and related geometric concepts.


3.1 Operations That Preserve Convexity of Sets

Before studying convex functions per se, it is useful to review how convexity of sets can be established through operations that preserve it. Given a set $C \subseteq \mathbb{R}^n$, there are three practical approaches to verifying convexity:

  1. Apply the definition directly:

    $$\forall\, x_1, x_2 \in C,\ \forall\, \theta \in [0,1]: \quad \theta x_1 + (1-\theta) x_2 \in C$$
  2. Show that C is obtained from simple convex sets (hyperplanes, halfspaces, norm balls, etc.) by operations that preserve convexity:

    • Intersection
    • Affine function
    • Perspective function
    • Linear-fractional function
  3. Show that C is a sublevel set of a convex function (discussed later).

3.1.1 Intersection

Property (Intersection of Convex Sets). The intersection of any number (including infinitely many) of convex sets is convex: if each $C_i$ is convex, then $\bigcap_{i \in I} C_i$ is convex.

Examples:

  • Affine space: An affine space defined by linear equalities can be expressed as an intersection of hyperplanes:

    $$\{x : Ax = b\} = \bigcap_{i=1}^{m} \{x : a_i^\top x = b_i\}$$

    Each equality constraint is the intersection of the two halfspaces $\{x : a_i^\top x \le b_i\}$ and $\{x : a_i^\top x \ge b_i\}$.

  • $\ell_\infty$-ball:

    $$B_\infty = \{x : \|x\|_\infty \le 1\} = \bigcap_{i=1}^{n} \{x : -1 \le x_i \le 1\}$$

    Each constraint $-1 \le x_i \le 1$ is an intersection of two halfspaces, so the whole set is convex.

  • PSD cone: The cone of positive semidefinite matrices

    $$S_+^n = \{X : X \succeq 0\} = \bigcap_{v \in \mathbb{R}^n} \{X : v^\top X v \ge 0\}$$

    is the intersection of infinitely many halfspaces (one for each $v$).

3.1.2 Affine Mapping

Property (Affine Image). If $C \subseteq \mathbb{R}^n$ is convex and $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, then the image of $C$ under the affine map $x \mapsto Ax + b$,

$$\{Ax + b : x \in C\},$$

is convex.

Example (Scaling and Translation): Scaling and translating a convex set preserves convexity:

$$C = \{x : \|x\|_2 \le 1\} \quad\longmapsto\quad AC + b = \{y : y = Ax + b,\ \|x\|_2 \le 1\}$$

Interpretation: Linear transformations do not "bend" the set.

3.1.3 Inverse Affine Mapping

Property (Affine Preimage). If $S \subseteq \mathbb{R}^n$ is convex and $f : \mathbb{R}^m \to \mathbb{R}^n$ is affine, the preimage

$$f^{-1}(S) = \{x \in \mathbb{R}^m : f(x) \in S\}$$

is also convex.

Application: Linear Matrix Inequality (LMI). Consider

$$F(x) = F_0 + \sum_{i=1}^{n} x_i F_i \succeq 0$$

The set of positive semidefinite matrices $S_+^m = \{X \in S^m : X \succeq 0\}$ is convex, and the mapping $x \mapsto F(x)$ is affine. Therefore, the solution set

$$\{x : F(x) \succeq 0\} = F^{-1}(S_+^m)$$

is convex by the affine preimage property.
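A small numerical sanity check of this convexity (the matrices `F0`, `F1`, `F2` below are arbitrary illustrative data, not from the text): any convex combination of two LMI-feasible points should remain feasible.

```python
import numpy as np

# Sanity check (not a proof): the LMI solution set
# {x : F(x) = F0 + x1*F1 + x2*F2 >= 0} is convex, so a convex
# combination of two feasible points stays feasible.
F0 = np.eye(2)                               # illustrative example data
F1 = np.array([[1.0, 0.0], [0.0, -1.0]])
F2 = np.array([[0.0, 1.0], [1.0, 0.0]])

def F(x):
    return F0 + x[0] * F1 + x[1] * F2

def feasible(x, tol=1e-9):
    # x is feasible iff F(x) is positive semidefinite
    return np.linalg.eigvalsh(F(x)).min() >= -tol

xa, xb = np.array([0.3, 0.2]), np.array([-0.4, 0.1])
assert feasible(xa) and feasible(xb)
theta = 0.7
assert feasible(theta * xa + (1 - theta) * xb)   # midpoint-type check
```

Testing a single convex combination of course proves nothing; the affine-preimage property is what guarantees it for all of them.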


3.2 Fundamental Concepts in Set Topology

We briefly review essential topological concepts that are foundational in convex analysis.

  • A point $x \in \mathbb{R}^n$ is an $n$-tuple $(x_1, \ldots, x_n)$.
  • A set $S \subseteq \mathbb{R}^n$ is a collection of points.
  • The open ball (neighborhood) centered at $x$ with radius $\epsilon$: $B_\epsilon(x) = \{y \in \mathbb{R}^n : \|y - x\| < \epsilon\}$

3.2.1 Open and Closed Sets

Open Set. Every point has a neighborhood contained in the set:

$$\forall x \in S,\ \exists \epsilon > 0 : B_\epsilon(x) \subseteq S$$

Example: (0,1).

Closed Set. Contains all its limit points; its complement is open. Example: [0,1].

3.2.2 Interior and Relative Interior

Interior. $\operatorname{int}(S) = \{x \in S : \exists \epsilon > 0,\ B_\epsilon(x) \subseteq S\}$, the set of points with a whole neighborhood contained in $S$. Example: $\operatorname{int}([0,1]) = (0,1)$.

Relative Interior (relint). If $S$ lies in an affine subspace $A$, then $x \in S$ is a relative interior point if

$$B_\epsilon(x) \cap A \subseteq S$$

for some $\epsilon > 0$. The set of all such points is $\operatorname{relint}(S)$.

Example: The segment $\{(t, 0) : t \in [0,1]\}$ in $\mathbb{R}^2$ has empty interior, but its relative interior within the line it spans is $\{(t, 0) : t \in (0,1)\}$.

3.2.3 Limit Points and Closure

Limit Point. Every neighborhood of $x$ contains other points of the set:

$$\forall \epsilon > 0 : B_\epsilon(x) \cap (S \setminus \{x\}) \ne \emptyset$$

That is, points of $S$ other than $x$ lie arbitrarily close to $x$.

Closure. $\operatorname{cl}(S)$ is $S$ together with all its limit points; it is the smallest closed set containing $S$.

$$\operatorname{cl}((0,1)) = [0,1]$$

3.2.4 Boundary

Boundary. $\partial S = \operatorname{cl}(S) \setminus \operatorname{int}(S)$

Every neighborhood of a boundary point contains points from both $S$ and its complement.

Example: $S = (0,1]$

  • $\operatorname{cl}(S) = [0,1]$, $\operatorname{int}(S) = (0,1)$, $\partial S = \{0,1\}$

Key Relationships:

  • $\operatorname{cl}(S) = \operatorname{int}(S) \cup \partial S$
  • $\operatorname{int}(S) \cap \partial S = \emptyset$
  • $\operatorname{int}(S) \subseteq S \subseteq \operatorname{cl}(S)$

3.3 Definition of Convex Functions

Definition (Convex Function). A function $f : C \to \mathbb{R}$ defined on a convex set $C \subseteq \mathbb{R}^n$ is called convex if

$$f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y) \tag{3.1}$$

for all $x, y \in C$ and all $\theta \in [0,1]$.

Definition (Strictly Convex). $f$ is strictly convex if the inequality is strict whenever $x \ne y$ and $\theta \in (0,1)$:

$$f(\theta x + (1-\theta) y) < \theta f(x) + (1-\theta) f(y) \tag{3.2}$$

Definition (Concave Function). $f$ is concave if $-f$ is convex, i.e.,

$$f(\theta x + (1-\theta) y) \ge \theta f(x) + (1-\theta) f(y) \tag{3.3}$$

Geometric meaning: For a convex function f, the graph lies below the chord connecting any two points. For a concave function, the graph lies above the chord.

Note: Linear (affine) functions are both convex and concave, but neither strictly convex nor strictly concave.


3.4 Common Examples of Convex Functions

Univariate examples:

  • $f(x) = ax + b$ (affine)
  • $f(x) = x^2$, $f(x) = e^{ax}$, $f(x) = |x|^p$ for $p \ge 1$
  • $f(x) = -\log x$ on $(0, \infty)$

Multivariate examples:

  • $f(x) = a^\top x + b$ (affine)
  • $f(x) = x^\top P x$ with $P \succeq 0$ (quadratic)
  • $f(x) = \|x\|$ (any norm)
  • $f(x) = \max\{x_1, \ldots, x_n\}$
  • $f(x) = \log\big(\sum_i e^{x_i}\big)$ (log-sum-exp)
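As a quick numerical sanity check (illustrative only; the `logsumexp` helper below is written by hand), the defining inequality (3.1) can be tested for log-sum-exp at many random point pairs:

```python
import numpy as np

# Spot-check of definition (3.1) for f(x) = log(sum_i exp(x_i)).
rng = np.random.default_rng(1)

def logsumexp(x):
    m = x.max()                       # shift for numerical stability
    return m + np.log(np.exp(x - m).sum())

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    theta = rng.uniform()
    lhs = logsumexp(theta * x + (1 - theta) * y)
    rhs = theta * logsumexp(x) + (1 - theta) * logsumexp(y)
    assert lhs <= rhs + 1e-12         # convexity inequality holds
```

Random testing can only refute convexity, never establish it, but it is a useful first filter.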

3.5 Convexity via Restriction to Lines

Theorem (Restriction to Lines). A function $f : C \to \mathbb{R}$ with $\operatorname{dom} f = C \subseteq \mathbb{R}^n$ is convex if and only if for every $z \in C$ and $v \in \mathbb{R}^n$, the one-dimensional function

$$\phi(t) = f(z + tv)$$

is convex on its domain $\{t : z + tv \in C\}$.

Intuition: This theorem reduces multivariate convexity to one-dimensional verification.

Examples:

  • $f(x, y) = x^2 + y^2 \ \Rightarrow\ \phi(t) = (z_1 + t v_1)^2 + (z_2 + t v_2)^2$ (quadratic in $t$)
  • $f(X) = -\log\det X$ with $\operatorname{dom} f = S_{++}^n$

Proposition. Let $C \subseteq \mathbb{R}^n$ be convex and $f : C \to \mathbb{R}$. Then $f$ is convex if and only if for every line $L = \{x + tv : t \in \mathbb{R}\}$ meeting $C$, the restriction $\phi(t) = f(x + tv)$ is convex on $\{t : x + tv \in C\}$.

Proof.

($\Rightarrow$) If $f$ is convex, then for any $t_1, t_2$ with $x + t_1 v,\ x + t_2 v \in C$ and $\lambda \in [0,1]$,

$$f\big(\lambda (x + t_1 v) + (1-\lambda)(x + t_2 v)\big) \le \lambda f(x + t_1 v) + (1-\lambda) f(x + t_2 v),$$

which is the same as

$$\phi(\lambda t_1 + (1-\lambda) t_2) \le \lambda \phi(t_1) + (1-\lambda) \phi(t_2),$$

so $\phi$ is convex.

($\Leftarrow$) Conversely, let $x, y \in C$ and $\theta \in [0,1]$. Define $\phi(t) = f(y + t(x - y))$. Then

$$f(\theta x + (1-\theta) y) = \phi(\theta) \le \theta \phi(1) + (1-\theta) \phi(0) = \theta f(x) + (1-\theta) f(y),$$

so $f$ is convex.
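Assuming the convex form $f(X) = -\log\det X$ listed above, the restriction-to-lines idea can be tried numerically: fix a base point $Z \in S_{++}^n$ and a symmetric direction $V$ (both chosen arbitrarily here), and test midpoint convexity of $\phi(t) = f(Z + tV)$.

```python
import numpy as np

# Restriction to a line: phi(t) = -log det(Z + t V) should be convex
# in t as long as Z + t V stays positive definite.
rng = np.random.default_rng(2)

def f(X):
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0                   # X must be positive definite
    return -logdet

A = rng.normal(size=(4, 4))
Z = A @ A.T + 6 * np.eye(4)           # a point well inside S++
B = rng.normal(size=(4, 4))
V = (B + B.T) / 2                     # an arbitrary symmetric direction

phi = lambda t: f(Z + t * V)
for t1, t2 in [(-0.2, 0.3), (0.0, 0.5), (-0.1, 0.1)]:
    mid = phi(0.5 * (t1 + t2))
    assert mid <= 0.5 * phi(t1) + 0.5 * phi(t2) + 1e-12
```

The shift `+ 6 * np.eye(4)` keeps the whole tested segment inside $S_{++}^n$, so the domain restriction in the theorem is respected.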


3.6 Epigraph and Sublevel Sets

3.6.1 Epigraph

Definition (Epigraph). For a function $f : \mathbb{R}^n \to \mathbb{R}$, the epigraph of $f$ is the set

$$\operatorname{epi} f = \{(x, t) \in \mathbb{R}^{n+1} : f(x) \le t\}$$

Geometric meaning: The epigraph is the region on or above the graph of $f$.

Key Property. $f$ is convex $\iff$ $\operatorname{epi} f$ is a convex set.

Proposition. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ and $\operatorname{epi} f = \{(x, t) \in \mathbb{R}^{n+1} : f(x) \le t\}$. Then $f$ is convex on its domain if and only if $\operatorname{epi} f$ is a convex set.

Proof.

($\Rightarrow$) Assume $f$ is convex. Take $(x_1, t_1), (x_2, t_2) \in \operatorname{epi} f$, so $f(x_i) \le t_i$. For any $\theta \in [0,1]$, set $\bar{x} = \theta x_1 + (1-\theta) x_2$ and $\bar{t} = \theta t_1 + (1-\theta) t_2$. By convexity of $f$,

$$f(\bar{x}) \le \theta f(x_1) + (1-\theta) f(x_2) \le \theta t_1 + (1-\theta) t_2 = \bar{t},$$

hence $(\bar{x}, \bar{t}) \in \operatorname{epi} f$. Thus $\operatorname{epi} f$ is convex.

($\Leftarrow$) Conversely, assume $\operatorname{epi} f$ is convex. For any $x_1, x_2 \in \operatorname{dom} f$ and $\theta \in [0,1]$, the points $(x_i, f(x_i)) \in \operatorname{epi} f$, so their convex combination $(\bar{x}, \bar{t})$ with $\bar{x} = \theta x_1 + (1-\theta) x_2$ and $\bar{t} = \theta f(x_1) + (1-\theta) f(x_2)$ lies in $\operatorname{epi} f$. Therefore $f(\bar{x}) \le \bar{t}$, i.e.

$$f(\theta x_1 + (1-\theta) x_2) \le \theta f(x_1) + (1-\theta) f(x_2),$$

so $f$ is convex.

3.6.2 Sublevel Sets

Definition (Sublevel Set). For $f : \mathbb{R}^n \to \mathbb{R}$ and $\alpha \in \mathbb{R}$, the $\alpha$-sublevel set is

$$S_\alpha = \{x \in \operatorname{dom} f : f(x) \le \alpha\},$$

the set of points where $f$ takes values at most $\alpha$.

Key Property. If $f$ is convex, then all sublevel sets $S_\alpha$ are convex.

Warning: The converse is not true. For example, $f(x) = x^3$ has convex sublevel sets but is not convex.

Examples:

  • $f(x) = x^2$: $S_\alpha = [-\sqrt{\alpha}, \sqrt{\alpha}]$ for $\alpha \ge 0$ (convex intervals)
  • $f(x) = x^3$: $S_\alpha = (-\infty, \alpha^{1/3}]$ (convex intervals, but $f$ is not convex)
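The asymmetry between the two properties shows up concretely for $f(x) = x^3$ (a small illustrative check):

```python
import numpy as np

# f(x) = x^3: every sublevel set is an interval (hence convex),
# yet f itself violates the convexity inequality.
f = lambda x: x**3

# Points in the sublevel set S_1 = (-inf, 1]; averages stay in S_1.
pts = np.array([-5.0, -1.0, 0.0, 0.9])
assert all(f(0.5 * (a + b)) <= 1.0 for a in pts for b in pts)

# But f is not convex: take x = -2, y = 0, theta = 1/2.
x, y = -2.0, 0.0
lhs = f(0.5 * (x + y))                # f(-1) = -1
rhs = 0.5 * f(x) + 0.5 * f(y)         # (-8 + 0)/2 = -4
assert lhs > rhs                      # convexity inequality fails
```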

3.7 Jensen's Inequality

Theorem (Jensen's Inequality — Discrete Form). If $f$ is convex and $\theta_i \ge 0$ with $\sum_{i=1}^{k} \theta_i = 1$, then

$$f\Big(\sum_{i=1}^{k} \theta_i x_i\Big) \le \sum_{i=1}^{k} \theta_i f(x_i) \tag{3.4}$$

Integral Form. For convex $f$ and a probability density $p$:

$$f\Big(\int x\, p(x)\, dx\Big) \le \int f(x)\, p(x)\, dx$$

Expectation Form. For convex $f$ and a random variable $X$:

$$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$$
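A quick Monte Carlo illustration of the expectation form with the convex function $f(x) = x^2$ (distribution and sample size chosen arbitrarily); here the inequality reduces to $(\mathbb{E}[X])^2 \le \mathbb{E}[X^2]$, i.e. nonnegativity of the variance:

```python
import numpy as np

# Jensen in expectation form for f(x) = x^2: f(E[X]) <= E[f(X)].
rng = np.random.default_rng(3)
X = rng.exponential(scale=2.0, size=100_000)

lhs = X.mean() ** 2                   # f(E[X])  (empirical)
rhs = (X ** 2).mean()                 # E[f(X)]  (empirical)
assert lhs <= rhs                     # Var(X) >= 0 in disguise
```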

3.8 First-Order Condition for Convexity

3.8.1 Gradient and Differentiability

Precondition: $f$ is differentiable if the gradient

$$\nabla f(x) = \Big(\frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_n}\Big)^\top$$

exists at each $x \in \operatorname{dom} f$.

Theorem (First-Order Condition). A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if for all $x, y \in \operatorname{dom} f$:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) \tag{3.5}$$

Implication: If $\nabla f(x) = 0$, then $x$ is a global minimum.

Example: $f(x) = x^2$, $f'(x) = 2x$:

$$f(y) = y^2 \ge x^2 + 2x(y - x)$$

3.8.2 Proof of Equivalence

Direction 1: Basic convexity $\Rightarrow$ first-order condition.

Assume $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable. For any $x, y$ and $\theta \in (0, 1]$:

$$f(x + \theta(y - x)) \le (1-\theta) f(x) + \theta f(y).$$

Subtract $f(x)$ and divide by $\theta > 0$:

$$\frac{f(x + \theta(y - x)) - f(x)}{\theta} \le f(y) - f(x).$$

Take $\theta \downarrow 0$. Differentiability implies:

$$\lim_{\theta \downarrow 0} \frac{f(x + \theta(y - x)) - f(x)}{\theta} = \nabla f(x)^\top (y - x).$$

Hence:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x).$$

Direction 2: First-order condition $\Rightarrow$ basic convexity.

Assume for all $x, y$:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x).$$

Let $z = \theta x + (1-\theta) y$ with $\theta \in [0,1]$. Apply the condition twice, at the point $z$:

$$f(x) \ge f(z) + \nabla f(z)^\top (x - z), \qquad f(y) \ge f(z) + \nabla f(z)^\top (y - z).$$

Multiply the two inequalities by $\theta$ and $1-\theta$ respectively and add:

$$\theta f(x) + (1-\theta) f(y) \ge f(z) + \nabla f(z)^\top \big(\theta x + (1-\theta) y - z\big) = f(z).$$

Since $z = \theta x + (1-\theta) y$, this is exactly:

$$f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y).$$

3.9 Gradient Monotonicity

Definition (Monotone Gradient). A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ has a monotone gradient if for all $x, y \in \operatorname{dom} f$:

$$\langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge 0 \tag{3.6}$$

Connection to Convexity:

  • If $f$ is convex, its gradient is monotone.
  • Conversely, if $\nabla f$ is monotone, $f$ is convex.

Example: $f(x) = x^\top Q x$ with symmetric $Q \succeq 0$:

$$\nabla f(x) = 2Qx, \qquad \langle \nabla f(x) - \nabla f(y),\ x - y \rangle = 2(x - y)^\top Q (x - y) \ge 0$$

3.9.1 Proof of Equivalence between First-Order Condition and Gradient Monotonicity

Direction 1: First-order condition $\Rightarrow$ gradient monotonicity.

Assume $f : D \to \mathbb{R}$ is differentiable and satisfies the first-order condition:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x), \qquad \forall x, y \in D.$$

Write this inequality twice, swapping $x$ and $y$:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x), \qquad f(x) \ge f(y) + \nabla f(y)^\top (x - y).$$

Adding the two inequalities and cancelling $f(x) + f(y)$ yields:

$$0 \ge \nabla f(x)^\top (y - x) + \nabla f(y)^\top (x - y),$$

which rearranges to

$$\langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge 0.$$

Direction 2: Gradient monotonicity $\Rightarrow$ first-order condition.

Assume $f$ is differentiable and satisfies gradient monotonicity:

$$(\nabla f(y) - \nabla f(x))^\top (y - x) \ge 0.$$

For any $x, y \in D$, by the Fundamental Theorem of Calculus along the line segment:

$$f(y) - f(x) = \int_0^1 \frac{d}{dt} f(x + t(y - x))\, dt = \int_0^1 \nabla f(x + t(y - x))^\top (y - x)\, dt.$$

Subtract the tangent term at $x$:

$$f(y) - f(x) - \nabla f(x)^\top (y - x) = \int_0^1 \big(\nabla f(x + t(y - x)) - \nabla f(x)\big)^\top (y - x)\, dt \ge 0.$$

The integrand is nonnegative: applying monotonicity to the pair $x$ and $z_t = x + t(y - x)$ gives $(\nabla f(z_t) - \nabla f(x))^\top (z_t - x) \ge 0$, and $z_t - x = t(y - x)$ with $t \ge 0$. This gives the first-order condition:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x).$$

3.10 Second-Order Condition for Convexity

Precondition: $f$ is twice differentiable if $\operatorname{dom} f$ is open and the Hessian

$$\nabla^2 f(x)_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}, \qquad i, j = 1, \ldots, n,$$

exists at each $x \in \operatorname{dom} f$.

Theorem (Second-Order Condition). A twice-differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if its Hessian is positive semidefinite for all $x \in \operatorname{dom} f$:

$$\nabla^2 f(x) \succeq 0 \tag{3.7}$$

Example: $f(x) = x^\top Q x$ with $Q \succeq 0$:

$$\nabla f(x) = 2Qx, \qquad \nabla^2 f(x) = 2Q \succeq 0$$
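The second-order condition also lends itself to a finite-difference check when no closed-form Hessian is at hand. The sketch below (step size `h` and test point chosen arbitrarily) approximates the Hessian of log-sum-exp by central differences and verifies it is positive semidefinite up to numerical noise:

```python
import numpy as np

def f(x):
    # log-sum-exp, shifted for stability
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def numerical_hessian(f, x, h=1e-4):
    # Central-difference approximation of the Hessian of f at x.
    n = len(x)
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + h*I[i] + h*I[j]) - f(x + h*I[i] - h*I[j])
                       - f(x - h*I[i] + h*I[j]) + f(x - h*I[i] - h*I[j])) / (4*h*h)
    return H

x0 = np.array([0.5, -1.0, 2.0])
H = numerical_hessian(f, x0)
# PSD up to discretization/round-off noise:
assert np.linalg.eigvalsh((H + H.T) / 2).min() >= -1e-6
```

For log-sum-exp the exact Hessian is $\operatorname{diag}(p) - pp^\top$ with $p_i = e^{x_i} / \sum_j e^{x_j}$, which annihilates the all-ones vector, so the smallest eigenvalue above should be (numerically) zero.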

3.10.1 Proof of Equivalence between Gradient Monotonicity and Second-Order Condition

Direction 1: Gradient monotonicity $\Rightarrow$ Hessian PSD.

Assume $(\nabla f(y) - \nabla f(x))^\top (y - x) \ge 0$ for all $x, y \in D$. Consider an infinitesimal perturbation along a direction $d$: $y = x + \epsilon d$, $\epsilon > 0$. Gradient monotonicity implies (dividing by $\epsilon^2$):

$$(\nabla f(x + \epsilon d) - \nabla f(x))^\top (\epsilon d) \ge 0 \quad\Longrightarrow\quad \frac{(\nabla f(x + \epsilon d) - \nabla f(x))^\top d}{\epsilon} \ge 0.$$

Take the limit as $\epsilon \downarrow 0$:

$$d^\top \nabla^2 f(x)\, d \ge 0 \qquad \forall d \in \mathbb{R}^n.$$

Direction 2: Hessian PSD $\Rightarrow$ gradient monotonicity.

Assume $f : D \to \mathbb{R}$ is twice differentiable and $\nabla^2 f(x) \succeq 0$ for all $x \in D$. For any $x, y \in D$, consider the line segment $x + t(y - x)$, $t \in [0,1]$:

$$\nabla f(y) - \nabla f(x) = \int_0^1 \frac{d}{dt} \nabla f(x + t(y - x))\, dt = \int_0^1 \nabla^2 f(x + t(y - x)) (y - x)\, dt.$$

Take the inner product with $y - x$:

$$(\nabla f(y) - \nabla f(x))^\top (y - x) = \int_0^1 (y - x)^\top \nabla^2 f(x + t(y - x)) (y - x)\, dt \ge 0.$$

3.11 Chain of Equivalences

For twice-differentiable functions, the following four statements are equivalent characterizations of convexity:

   Convexity
f(θx + (1−θ)y) ≤ θf(x) + (1−θ)f(y)

 First-order Condition                  Gradient Monotonicity
f(y) ≥ f(x) + ∇f(x)ᵀ(y−x)    ⇔    ⟨∇f(x)−∇f(y), x−y⟩ ≥ 0

 Second-order Condition
    ∇²f(x) ⪰ 0

In summary:

  1. Convexity (Definition): $f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y)$
  2. First-order Condition: $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$
  3. Gradient Monotonicity: $\langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge 0$
  4. Second-order Condition: $\nabla^2 f(x) \succeq 0$

Each provides a different practical tool for verifying or exploiting convexity in analysis and algorithm design.
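All four characterizations can be spot-checked together on a random convex quadratic $f(x) = x^\top Q x$ with $Q \succeq 0$, where every quantity is available in closed form (a numerical illustration, not a proof):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
Q = A @ A.T                           # PSD by construction

f = lambda x: x @ Q @ x
grad = lambda x: 2 * Q @ x            # exact gradient; Hessian is 2Q

x, y = rng.normal(size=3), rng.normal(size=3)
theta = 0.3

# 1. Convexity (definition)
assert f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y) + 1e-12
# 2. First-order condition
assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-12
# 3. Gradient monotonicity
assert (grad(x) - grad(y)) @ (x - y) >= -1e-12
# 4. Second-order condition: the Hessian 2Q is PSD
assert np.linalg.eigvalsh(2 * Q).min() >= -1e-12
```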


3.12 Smooth and Strongly Convex Functions

Beyond mere convexity, many optimization algorithms rely on stronger structural assumptions—smoothness and strong convexity—to obtain quantitative convergence guarantees. This section develops L-smoothness (Lipschitz gradient) and m-strong convexity, their equivalent characterizations, and the convergence rates of gradient descent under these conditions.

3.12.1 L-Smooth Functions

Definition ($L$-smooth function). A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is called $L$-smooth (or has an $L$-Lipschitz continuous gradient) for some $L > 0$ if for all $x, y \in \mathbb{R}^n$,

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|. \tag{3.8}$$

Smoothness bounds how fast the gradient can change. Unlike convexity, it does not guarantee that stationary points are global minima; it only guarantees a quadratic upper bound on the function.


Chain of Equivalences. For a twice-differentiable function $f$, the following four statements are equivalent characterizations of $L$-smoothness:

$$\begin{aligned}
&\text{(I)} && \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| && \text{(Lipschitz gradient)} \\
&\text{(II)} && g(x) = \tfrac{L}{2}\|x\|^2 - f(x) \text{ is convex} \\
&\text{(III)} && f(y) \le f(x) + \langle \nabla f(x),\ y - x \rangle + \tfrac{L}{2}\|y - x\|^2 && \text{(Descent Lemma)} \\
&\text{(IV)} && \nabla^2 f(x) \preceq L I && \text{(Hessian bound)}
\end{aligned}$$

We now prove the full cycle of equivalences.


Proof: (I) $\Rightarrow$ (III) (Lipschitz gradient $\Rightarrow$ Descent Lemma).

Using the Fundamental Theorem of Calculus along the line segment $x + t(y - x)$:

$$f(y) - f(x) = \int_0^1 \langle \nabla f(x + t(y - x)),\ y - x \rangle\, dt.$$

Subtract the linear approximation at $x$:

$$\begin{aligned}
f(y) - f(x) - \langle \nabla f(x),\ y - x \rangle
&= \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x),\ y - x \rangle\, dt \\
&\le \int_0^1 \|\nabla f(x + t(y - x)) - \nabla f(x)\|\, \|y - x\|\, dt \\
&\le \int_0^1 L t \|y - x\| \cdot \|y - x\|\, dt = L \|y - x\|^2 \int_0^1 t\, dt = \frac{L}{2}\|y - x\|^2.
\end{aligned}$$

Thus $f(y) \le f(x) + \langle \nabla f(x),\ y - x \rangle + \frac{L}{2}\|y - x\|^2$.


Proof: (III) $\Rightarrow$ (II) (Descent Lemma $\Rightarrow$ $g$ convex).

Define $g(x) = \frac{L}{2}\|x\|^2 - f(x)$. To show $g$ is convex, we verify the first-order condition. For any $x, y$:

$$\begin{aligned}
g(y) - g(x) - \langle \nabla g(x),\ y - x \rangle
&= \Big[\tfrac{L}{2}\|y\|^2 - f(y)\Big] - \Big[\tfrac{L}{2}\|x\|^2 - f(x)\Big] - \langle Lx - \nabla f(x),\ y - x \rangle \\
&= \tfrac{L}{2}\|y\|^2 - \tfrac{L}{2}\|x\|^2 - L\langle x,\ y - x \rangle - \big[f(y) - f(x) - \langle \nabla f(x),\ y - x \rangle\big] \\
&= \tfrac{L}{2}\|y - x\|^2 - \big[f(y) - f(x) - \langle \nabla f(x),\ y - x \rangle\big] \\
&\ge \tfrac{L}{2}\|y - x\|^2 - \tfrac{L}{2}\|y - x\|^2 = 0,
\end{aligned}$$

where the inequality uses condition (III). Hence $g(y) \ge g(x) + \langle \nabla g(x),\ y - x \rangle$, so $g$ is convex.


Proof: (II) $\Rightarrow$ (IV) (Convexity of $g$ $\Rightarrow$ Hessian bound).

If $g(x) = \frac{L}{2}\|x\|^2 - f(x)$ is convex and twice differentiable, then $\nabla^2 g(x) \succeq 0$ for all $x$. Computing the Hessian:

$$\nabla^2 g(x) = L I - \nabla^2 f(x).$$

Thus $L I - \nabla^2 f(x) \succeq 0$, i.e., $\nabla^2 f(x) \preceq L I$.


Proof: (IV) $\Rightarrow$ (I) (Hessian bound $\Rightarrow$ Lipschitz gradient).

Assume $\nabla^2 f(x) \preceq L I$ for all $x$. For any $x, y$, use the integral representation:

$$\nabla f(y) - \nabla f(x) = \int_0^1 \nabla^2 f(x + t(y - x)) (y - x)\, dt.$$

Since for any vector $v$, $\|\nabla^2 f(z) v\| \le \|\nabla^2 f(z)\|_{\mathrm{op}} \|v\|$, and $\nabla^2 f(z) \preceq L I$ gives $\|\nabla^2 f(z)\|_{\mathrm{op}} \le L$, we have:

$$\|\nabla f(y) - \nabla f(x)\| \le \int_0^1 \|\nabla^2 f(x + t(y - x))\|_{\mathrm{op}}\, \|y - x\|\, dt \le L \|y - x\|.$$

Thus condition (I) holds.

This completes the chain of equivalences for L-smoothness.


3.12.2 Gradient Descent Convergence under Smoothness

Gradient descent updates:

$$x_{t+1} = x_t - \eta \nabla f(x_t). \tag{3.9}$$

Under L-smoothness alone, we can guarantee that the gradient norm converges to zero.

Lemma (Descent Lemma for gradient descent). If $f$ is $L$-smooth and we take $\eta = \frac{1}{L}$, then at each iteration:

$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2. \tag{3.10}$$

Proof. Applying the Descent Lemma (condition III) with $y = x_{t+1} = x_t - \frac{1}{L}\nabla f(x_t)$:

$$\begin{aligned}
f(x_{t+1}) &\le f(x_t) + \langle \nabla f(x_t),\ x_{t+1} - x_t \rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\
&= f(x_t) - \frac{1}{L}\|\nabla f(x_t)\|^2 + \frac{L}{2} \cdot \frac{1}{L^2}\|\nabla f(x_t)\|^2 \\
&= f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2,
\end{aligned}$$

which is the claimed bound.
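The per-step guarantee (3.10) is easy to verify on a quadratic, where the smoothness constant $L$ is exactly the largest Hessian eigenvalue (an illustrative sketch with arbitrary data):

```python
import numpy as np

# Gradient descent with eta = 1/L on an L-smooth quadratic; check
# f(x_{t+1}) <= f(x_t) - ||grad f(x_t)||^2 / (2L) at every step.
rng = np.random.default_rng(5)
A = rng.normal(size=(4, 4))
Q = A @ A.T + np.eye(4)               # positive definite
L = np.linalg.eigvalsh(2 * Q).max()   # smoothness constant of f(x)=x^T Q x

f = lambda x: x @ Q @ x
grad = lambda x: 2 * Q @ x

x = rng.normal(size=4)
f0 = f(x)
for _ in range(20):
    g = grad(x)
    x_next = x - g / L
    assert f(x_next) <= f(x) - g @ g / (2 * L) + 1e-9
    x = x_next
```

In particular the objective value is monotonically non-increasing along the iterates.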


Theorem (Convergence to a stationary point). Let $f$ be $L$-smooth and $f^\star = \inf_x f(x) > -\infty$. After $T$ iterations of gradient descent with $\eta = \frac{1}{L}$:

$$\min_{0 \le t \le T-1} \|\nabla f(x_t)\|^2 \le \frac{2L\big(f(x_0) - f^\star\big)}{T}. \tag{3.11}$$

Consequently, $\min_{t < T} \|\nabla f(x_t)\|^2 \to 0$ at a rate $O(1/T)$.

Proof. Summing the Descent Lemma over $t = 0, 1, \ldots, T-1$:

$$\sum_{t=0}^{T-1} \big(f(x_t) - f(x_{t+1})\big) \ge \frac{1}{2L} \sum_{t=0}^{T-1} \|\nabla f(x_t)\|^2.$$

The left-hand side telescopes:

$$f(x_0) - f(x_T) \ge \frac{1}{2L} \sum_{t=0}^{T-1} \|\nabla f(x_t)\|^2 \ge \frac{T}{2L} \min_{0 \le t \le T-1} \|\nabla f(x_t)\|^2.$$

Since $f(x_T) \ge f^\star$,

$$\frac{T}{2L} \min_{t < T} \|\nabla f(x_t)\|^2 \le f(x_0) - f^\star,$$

which rearranges to the stated bound.

This shows gradient descent converges to a stationary point (where $\nabla f \approx 0$), with the squared gradient norm decaying at rate $O(1/T)$.


3.12.3 Smooth + Convex Convergence

If $f$ is not only $L$-smooth but also convex, we obtain convergence to a global minimum (not just a stationary point).

Theorem (Convergence under smoothness + convexity). Let $f$ be $L$-smooth and convex, and let $x^\star$ be a global minimizer of $f$. After $T$ iterations of gradient descent with $\eta = \frac{1}{L}$:

$$f(x_T) - f(x^\star) \le \frac{L}{2T}\|x_0 - x^\star\|^2. \tag{3.12}$$

Proof. Write $\eta = \frac{1}{L}$. For each iteration:

$$\|x_{t+1} - x^\star\|^2 = \|x_t - \eta \nabla f(x_t) - x^\star\|^2 = \|x_t - x^\star\|^2 - 2\eta \langle \nabla f(x_t),\ x_t - x^\star \rangle + \eta^2 \|\nabla f(x_t)\|^2.$$

By convexity (first-order condition), $f(x_t) - f(x^\star) \le \langle \nabla f(x_t),\ x_t - x^\star \rangle$. Hence:

$$-2\eta \langle \nabla f(x_t),\ x_t - x^\star \rangle \le -2\eta \big(f(x_t) - f(x^\star)\big).$$

Using the Descent Lemma (3.10), $\eta^2 \|\nabla f(x_t)\|^2 = \frac{1}{L^2}\|\nabla f(x_t)\|^2 \le \frac{2}{L}\big(f(x_t) - f(x_{t+1})\big)$. Substituting:

$$\|x_{t+1} - x^\star\|^2 \le \|x_t - x^\star\|^2 - \frac{2}{L}\big(f(x_t) - f(x^\star)\big) + \frac{2}{L}\big(f(x_t) - f(x_{t+1})\big) = \|x_t - x^\star\|^2 - \frac{2}{L}\big(f(x_{t+1}) - f(x^\star)\big).$$

Let $\Delta_t = f(x_t) - f(x^\star)$. Then $\|x_{t+1} - x^\star\|^2 \le \|x_t - x^\star\|^2 - \frac{2}{L}\Delta_{t+1}$, which implies:

$$\Delta_{t+1} \le \frac{L}{2}\big(\|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2\big).$$

Summing this inequality from $t = 0$ to $T-1$ telescopes:

$$\sum_{t=1}^{T} \Delta_t \le \frac{L}{2}\big(\|x_0 - x^\star\|^2 - \|x_T - x^\star\|^2\big) \le \frac{L}{2}\|x_0 - x^\star\|^2.$$

Since $\Delta_t$ is non-increasing (the Descent Lemma guarantees $f(x_{t+1}) \le f(x_t)$), we have $T \Delta_T \le \sum_{t=1}^{T} \Delta_t \le \frac{L}{2}\|x_0 - x^\star\|^2$. Therefore:

$$f(x_T) - f(x^\star) \le \frac{L}{2T}\|x_0 - x^\star\|^2.$$

This is the O(1/T) rate for smooth convex optimization.
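The bound (3.12) can be confirmed numerically on a convex quadratic whose minimizer is the origin, so that $x^\star = 0$ and $f^\star = 0$ are known exactly (an illustrative sketch):

```python
import numpy as np

# Check f(x_T) - f* <= L/(2T) * ||x0 - x*||^2 for gradient descent
# on the convex quadratic f(x) = x^T Q x (minimizer x* = 0, f* = 0).
rng = np.random.default_rng(6)
A = rng.normal(size=(5, 3))
Q = A.T @ A                           # PSD by construction
L = np.linalg.eigvalsh(2 * Q).max()   # smoothness constant

f = lambda x: x @ Q @ x
grad = lambda x: 2 * Q @ x

x0 = rng.normal(size=3)
x, T = x0.copy(), 50
for _ in range(T):
    x = x - grad(x) / L
assert f(x) <= L / (2 * T) * x0 @ x0 + 1e-12   # bound (3.12)
```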


3.12.4 m-Strongly Convex Functions

Definition ($m$-strongly convex function). A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is called $m$-strongly convex for some $m > 0$ if for all $x, y \in \mathbb{R}^n$,

$$f(y) \ge f(x) + \langle \nabla f(x),\ y - x \rangle + \frac{m}{2}\|y - x\|^2. \tag{3.13}$$

Strong convexity is a strengthened form of convexity that guarantees the function grows at least quadratically away from any point, ensuring a unique global minimum.

Remark. $m$ is often denoted $\mu$ in other texts. We keep $m$ for consistency with the lecture notation.


Chain of Equivalences. For a twice-differentiable function $f$, the following four statements are equivalent:

$$\begin{aligned}
&\text{(I)} && f(y) \ge f(x) + \langle \nabla f(x),\ y - x \rangle + \tfrac{m}{2}\|y - x\|^2 && \text{(Strong FOC)} \\
&\text{(II)} && h(x) = f(x) - \tfrac{m}{2}\|x\|^2 \text{ is convex} \\
&\text{(III)} && \langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge m \|x - y\|^2 && \text{(Strong Gradient Monotonicity)} \\
&\text{(IV)} && \nabla^2 f(x) \succeq m I && \text{(Hessian bound)}
\end{aligned}$$

Proof: (I) $\Rightarrow$ (II) (Strong FOC $\Rightarrow$ $h$ convex).

Define $h(x) = f(x) - \frac{m}{2}\|x\|^2$. For any $x, y$,

$$\begin{aligned}
h(y) - h(x) - \langle \nabla h(x),\ y - x \rangle
&= \Big[f(y) - \tfrac{m}{2}\|y\|^2\Big] - \Big[f(x) - \tfrac{m}{2}\|x\|^2\Big] - \langle \nabla f(x) - m x,\ y - x \rangle \\
&= f(y) - f(x) - \langle \nabla f(x),\ y - x \rangle - \tfrac{m}{2}\big(\|y\|^2 - \|x\|^2 - 2\langle x,\ y - x \rangle\big) \\
&= f(y) - f(x) - \langle \nabla f(x),\ y - x \rangle - \tfrac{m}{2}\|y - x\|^2 \ \ge\ 0,
\end{aligned}$$

where the last line uses condition (I). Hence $h(y) \ge h(x) + \langle \nabla h(x),\ y - x \rangle$, so $h$ is convex by the first-order condition.


Proof: (II) $\Rightarrow$ (IV) (Convexity of $h$ $\Rightarrow$ Hessian bound).

If $h(x) = f(x) - \frac{m}{2}\|x\|^2$ is convex and twice differentiable, then $\nabla^2 h(x) \succeq 0$ for all $x$. Computing the Hessian:

$$\nabla^2 h(x) = \nabla^2 f(x) - m I.$$

Thus $\nabla^2 f(x) - m I \succeq 0$, i.e., $\nabla^2 f(x) \succeq m I$.


Proof: (IV) $\Rightarrow$ (III) (Hessian bound $\Rightarrow$ Strong Gradient Monotonicity).

Assume $\nabla^2 f(x) \succeq m I$ for all $x$. For any $x, y$, use the integral representation:

$$\begin{aligned}
\langle \nabla f(y) - \nabla f(x),\ y - x \rangle
&= \Big\langle \int_0^1 \nabla^2 f(x + t(y - x)) (y - x)\, dt,\ y - x \Big\rangle \\
&= \int_0^1 (y - x)^\top \nabla^2 f(x + t(y - x)) (y - x)\, dt \\
&\ge \int_0^1 m \|y - x\|^2\, dt = m \|y - x\|^2.
\end{aligned}$$

This establishes condition (III).


Proof: (III) $\Rightarrow$ (I) (Strong Gradient Monotonicity $\Rightarrow$ Strong FOC).

Assume $\langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge m \|x - y\|^2$. Using the Fundamental Theorem of Calculus:

$$f(y) - f(x) - \langle \nabla f(x),\ y - x \rangle = \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x),\ y - x \rangle\, dt.$$

Let $s = \|y - x\|$. Apply the strong monotonicity condition along the segment; for $z_t = x + t(y - x)$, evaluate with $x$ and $z_t$:

$$\langle \nabla f(z_t) - \nabla f(x),\ z_t - x \rangle \ge m \|z_t - x\|^2 = m t^2 s^2.$$

Note that $z_t - x = t(y - x)$, so:

$$t \langle \nabla f(z_t) - \nabla f(x),\ y - x \rangle \ge m t^2 s^2 \quad\Longrightarrow\quad \langle \nabla f(z_t) - \nabla f(x),\ y - x \rangle \ge m t s^2.$$

Integrating over $t \in [0, 1]$:

$$\int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x),\ y - x \rangle\, dt \ge \int_0^1 m t s^2\, dt = \frac{m}{2} s^2 = \frac{m}{2}\|y - x\|^2.$$

Thus $f(y) \ge f(x) + \langle \nabla f(x),\ y - x \rangle + \frac{m}{2}\|y - x\|^2$.

This completes the chain of equivalences for m-strong convexity.


3.12.5 Linear Convergence under Smooth + Strong Convexity

When a function is both L-smooth and m-strongly convex, gradient descent achieves linear (exponential) convergence to the unique global minimizer.

Theorem (Linear convergence). Let $f$ be $L$-smooth and $m$-strongly convex, and let $x^\star$ be the unique global minimizer. Gradient descent with step size $\eta = \frac{1}{L}$ satisfies:

$$f(x_T) - f(x^\star) \le \Big(1 - \frac{m}{L}\Big)^T \big(f(x_0) - f(x^\star)\big). \tag{3.14}$$

Proof. From the Descent Lemma with $\eta = \frac{1}{L}$:

$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2.$$

By $m$-strong convexity, we have a lower bound relating the gradient norm to the optimality gap. For any $x$ and $y$,

$$f(y) \ge f(x) + \langle \nabla f(x),\ y - x \rangle + \frac{m}{2}\|y - x\|^2.$$

Minimizing both sides over $y$ (the right-hand side is a quadratic in $y$, minimized at $y = x - \frac{1}{m}\nabla f(x)$) gives:

$$f(x^\star) \ge f(x) - \frac{1}{2m}\|\nabla f(x)\|^2,$$

since $\min_y \big\{f(x) + \langle \nabla f(x),\ y - x \rangle + \frac{m}{2}\|y - x\|^2\big\} = f(x) - \frac{1}{2m}\|\nabla f(x)\|^2$.

Rearranging yields the Polyak–Łojasiewicz (PL) inequality:

$$\|\nabla f(x)\|^2 \ge 2m \big(f(x) - f(x^\star)\big). \tag{3.15}$$

Plugging this into the Descent Lemma:

$$f(x_{t+1}) - f(x^\star) \le f(x_t) - f(x^\star) - \frac{1}{2L} \cdot 2m \big(f(x_t) - f(x^\star)\big) = \Big(1 - \frac{m}{L}\Big)\big(f(x_t) - f(x^\star)\big).$$

Unrolling this inequality over $T$ iterations gives:

$$f(x_T) - f(x^\star) \le \Big(1 - \frac{m}{L}\Big)^T \big(f(x_0) - f(x^\star)\big).$$

Definition (Condition number). The ratio

$$\kappa = \frac{L}{m} \ge 1$$

is called the condition number of $f$. It measures how "elongated" the sublevel sets of $f$ are.

Since $0 < m \le L$, we have $\kappa \ge 1$. The convergence rate can be rewritten as:

$$f(x_T) - f(x^\star) \le \Big(1 - \frac{1}{\kappa}\Big)^T \big(f(x_0) - f(x^\star)\big).$$

Using the inequality $1 - x \le e^{-x}$, we obtain the more intuitive bound:

$$f(x_T) - f(x^\star) \le \exp\Big(-\frac{T}{\kappa}\Big)\big(f(x_0) - f(x^\star)\big).$$

To achieve $f(x_T) - f(x^\star) \le \varepsilon$, it suffices to take

$$T \ge \kappa \log \frac{f(x_0) - f(x^\star)}{\varepsilon},$$

which is linear in the condition number κ. When the function is well-conditioned (small κ), gradient descent converges rapidly. For ill-conditioned problems (large κ), convergence can be slow.
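A minimal sketch of the linear rate on a diagonal quadratic, where $m$, $L$, and $\kappa$ are known exactly (the eigenvalues below are chosen purely for illustration):

```python
import numpy as np

# Linear convergence (3.14): on an m-strongly convex, L-smooth
# quadratic, the optimality gap contracts by at least (1 - m/L)/step.
eigs = np.array([1.0, 2.0, 10.0])      # Hessian of f(x)=x^T Q x is 2*diag(eigs)
Q = np.diag(eigs)
m, L = 2 * eigs.min(), 2 * eigs.max()  # m = 2, L = 20, so kappa = 10

f = lambda x: x @ Q @ x                # minimizer x* = 0, f* = 0
grad = lambda x: 2 * Q @ x

x = np.array([1.0, 1.0, 1.0])
gap = f(x)
for _ in range(30):
    x = x - grad(x) / L                # step size eta = 1/L
    assert f(x) <= (1 - m / L) * gap + 1e-12   # per-step contraction
    gap = f(x)
```

Note the ill-conditioned coordinate (eigenvalue $1$, against $L/2 = 10$) is the one that shrinks slowest, exactly as the condition-number discussion predicts.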


Summary. The interplay between smoothness and strong convexity is fundamental to first-order optimization:

| Property | Guarantee | Rate |
| --- | --- | --- |
| $L$-smooth only | $\min_{t<T} \|\nabla f(x_t)\|^2 \le O(1/T)$ | Sublinear, to a stationary point |
| $L$-smooth + convex | $f(x_T) - f(x^\star) \le O(1/T)$ | Sublinear, to the global minimum |
| $L$-smooth + $m$-strongly convex | $f(x_T) - f(x^\star) \le (1 - m/L)^T \Delta_0$ | Linear (exponential) |

(Here $\Delta_0 = f(x_0) - f(x^\star)$.)

Strong convexity is what transforms the sublinear rate of gradient descent into a linear (geometric) rate, with the condition number $\kappa = L/m$ governing the constant.