
Chapter 3: Convex Functions

Convex functions are the fundamental objects of study in convex optimization. This chapter develops the definition of convex functions, their properties, various characterizations (first-order, second-order, and gradient monotonicity), as well as operations that preserve convexity and related geometric concepts.


3.1 Operations That Preserve Convexity of Sets

Before studying convex functions per se, it is useful to review how convexity of sets can be established through operations that preserve it. Given a set $C \subseteq \mathbb{R}^n$, there are three practical approaches to verifying convexity:

  1. Apply the definition directly:

    $$\forall\, x_1, x_2 \in C,\ \forall\, \theta \in [0,1]: \quad \theta x_1 + (1-\theta) x_2 \in C$$
  2. Show that C is obtained from simple convex sets (hyperplanes, halfspaces, norm balls, etc.) by operations that preserve convexity:

    • Intersection
    • Affine function
    • Perspective function
    • Linear-fractional function
  3. Show that C is a sublevel set of a convex function (discussed later).

3.1.1 Intersection

Property (Intersection of Convex Sets). The intersection of any number (including infinitely many) of convex sets is convex: if each $C_i$ is convex, then $\bigcap_{i \in I} C_i$ is convex.

Examples:

  • Affine space: An affine space defined by linear equalities can be expressed as an intersection of hyperplanes:

    $$\{x : Ax = b\} = \bigcap_{i=1}^{m} \{x : a_i^\top x = b_i\}$$

    Each equality constraint is the intersection of the two halfspaces $\{x : a_i^\top x \le b_i\}$ and $\{x : a_i^\top x \ge b_i\}$.

  • $\ell_\infty$-ball:

    $$B_\infty = \{x : \|x\|_\infty \le 1\} = \bigcap_{i=1}^{n} \{x : -1 \le x_i \le 1\}$$

    Each constraint $-1 \le x_i \le 1$ is an intersection of two halfspaces, so the whole set is convex.

  • PSD cone: The cone of positive semidefinite matrices

    $$S_+^n = \{X : X \succeq 0\} = \bigcap_{v \in \mathbb{R}^n} \{X : v^\top X v \ge 0\}$$

    is the intersection of infinitely many halfspaces (one for each $v$).

3.1.2 Affine Mapping

Property (Affine Image). If $C \subseteq \mathbb{R}^n$ is convex and $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, then the image of $C$ under the affine map $x \mapsto Ax + b$,

$$\{Ax + b : x \in C\},$$

is convex.

Example (Scaling and Translation): Scaling and translating a convex set preserves convexity:

$$C = \{x : \|x\|_2 \le 1\} \quad\longmapsto\quad AC + b = \{y : y = Ax + b,\ \|x\|_2 \le 1\}$$

Interpretation: Linear transformations do not "bend" the set.

3.1.3 Inverse Affine Mapping

Property (Affine Preimage). If $S \subseteq \mathbb{R}^n$ is convex and $f : \mathbb{R}^m \to \mathbb{R}^n$ is affine, the preimage

$$f^{-1}(S) = \{x \in \mathbb{R}^m : f(x) \in S\}$$

is also convex.

Application: Linear Matrix Inequality (LMI). Consider

$$F(x) = F_0 + \sum_{i=1}^{n} x_i F_i \succeq 0$$

The set of positive semidefinite matrices $S_+^m = \{X \in S^m : X \succeq 0\}$ is convex, and the mapping $x \mapsto F(x)$ is affine. Therefore, the solution set

$$\{x : F(x) \succeq 0\} = F^{-1}(S_+^m)$$

is convex by the affine preimage property.
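A small numerical sanity check of this convexity (the matrices `F0`, `F1`, `F2` below are arbitrary illustrative data, not from the text): any convex combination of two LMI-feasible points should remain feasible.

```python
import numpy as np

# Sanity check (not a proof): the LMI solution set
# {x : F(x) = F0 + x1*F1 + x2*F2 >= 0} is convex, so a convex
# combination of two feasible points stays feasible.
F0 = np.eye(2)                               # illustrative example data
F1 = np.array([[1.0, 0.0], [0.0, -1.0]])
F2 = np.array([[0.0, 1.0], [1.0, 0.0]])

def F(x):
    return F0 + x[0] * F1 + x[1] * F2

def feasible(x, tol=1e-9):
    # x is feasible iff F(x) is positive semidefinite
    return np.linalg.eigvalsh(F(x)).min() >= -tol

xa, xb = np.array([0.3, 0.2]), np.array([-0.4, 0.1])
assert feasible(xa) and feasible(xb)
theta = 0.7
assert feasible(theta * xa + (1 - theta) * xb)   # midpoint-type check
```

Testing a single convex combination of course proves nothing; the affine-preimage property is what guarantees it for all of them.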


3.2 Fundamental Concepts in Set Topology

We briefly review essential topological concepts that are foundational in convex analysis.

  • A point $x \in \mathbb{R}^n$ is an $n$-tuple $(x_1, \ldots, x_n)$.
  • A set $S \subseteq \mathbb{R}^n$ is a collection of points.
  • The open ball (neighborhood) centered at $x$ with radius $\epsilon$: $B_\epsilon(x) = \{y \in \mathbb{R}^n : \|y - x\| < \epsilon\}$

3.2.1 Open and Closed Sets

Open Set. Every point has a neighborhood contained in the set:

$$\forall x \in S,\ \exists \epsilon > 0 : B_\epsilon(x) \subseteq S$$

Example: (0,1).

Closed Set. Contains all its limit points; its complement is open. Example: [0,1].

3.2.2 Interior and Relative Interior

Interior. $\operatorname{int}(S) = \{x \in S : \exists \epsilon > 0,\ B_\epsilon(x) \subseteq S\}$, the set of points with a whole neighborhood contained in $S$. Example: $\operatorname{int}([0,1]) = (0,1)$.

Relative Interior (relint). If $S$ lies in an affine subspace $A$, then $x \in S$ is a relative interior point if

$$B_\epsilon(x) \cap A \subseteq S$$

for some $\epsilon > 0$. The set of all such points is $\operatorname{relint}(S)$.

Example: The segment $\{(t, 0) : t \in [0,1]\}$ in $\mathbb{R}^2$ has empty interior, but its relative interior within the line it spans is $\{(t, 0) : t \in (0,1)\}$.

3.2.3 Limit Points and Closure

Limit Point. Every neighborhood of $x$ contains other points of the set:

$$\forall \epsilon > 0 : B_\epsilon(x) \cap (S \setminus \{x\}) \ne \emptyset$$

That is, points of $S$ other than $x$ lie arbitrarily close to $x$.

Closure. $\operatorname{cl}(S)$ is $S$ together with all its limit points; it is the smallest closed set containing $S$.

$$\operatorname{cl}((0,1)) = [0,1]$$

3.2.4 Boundary

Boundary. $\partial S = \operatorname{cl}(S) \setminus \operatorname{int}(S)$

Every neighborhood of a boundary point contains points from both $S$ and its complement.

Example: $S = (0,1]$

  • $\operatorname{cl}(S) = [0,1]$, $\operatorname{int}(S) = (0,1)$, $\partial S = \{0,1\}$

Key Relationships:

  • $\operatorname{cl}(S) = \operatorname{int}(S) \cup \partial S$
  • $\operatorname{int}(S) \cap \partial S = \emptyset$
  • $\operatorname{int}(S) \subseteq S \subseteq \operatorname{cl}(S)$

3.3 Definition of Convex Functions

Definition (Convex Function). A function $f : C \to \mathbb{R}$ defined on a convex set $C \subseteq \mathbb{R}^n$ is called convex if

$$f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y) \tag{3.1}$$

for all $x, y \in C$ and all $\theta \in [0,1]$.

Definition (Strictly Convex). $f$ is strictly convex if the inequality is strict whenever $x \ne y$ and $\theta \in (0,1)$:

$$f(\theta x + (1-\theta) y) < \theta f(x) + (1-\theta) f(y) \tag{3.2}$$

Definition (Concave Function). $f$ is concave if $-f$ is convex, i.e.,

$$f(\theta x + (1-\theta) y) \ge \theta f(x) + (1-\theta) f(y) \tag{3.3}$$

Geometric meaning: For a convex function f, the graph lies below the chord connecting any two points. For a concave function, the graph lies above the chord.

Note: Linear (affine) functions are both convex and concave, but neither strictly convex nor strictly concave.


3.4 Common Examples of Convex Functions

Univariate examples:

  • $f(x) = ax + b$ (affine)
  • $f(x) = x^2$, $f(x) = e^{ax}$, $f(x) = |x|^p$ for $p \ge 1$
  • $f(x) = -\log x$ on $(0, \infty)$

Multivariate examples:

  • $f(x) = a^\top x + b$ (affine)
  • $f(x) = x^\top P x$ with $P \succeq 0$ (quadratic)
  • $f(x) = \|x\|$ (any norm)
  • $f(x) = \max\{x_1, \ldots, x_n\}$
  • $f(x) = \log\big(\sum_i e^{x_i}\big)$ (log-sum-exp)
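As a quick numerical sanity check (illustrative only; the `logsumexp` helper below is written by hand), the defining inequality (3.1) can be tested for log-sum-exp at many random point pairs:

```python
import numpy as np

# Spot-check of definition (3.1) for f(x) = log(sum_i exp(x_i)).
rng = np.random.default_rng(1)

def logsumexp(x):
    m = x.max()                       # shift for numerical stability
    return m + np.log(np.exp(x - m).sum())

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    theta = rng.uniform()
    lhs = logsumexp(theta * x + (1 - theta) * y)
    rhs = theta * logsumexp(x) + (1 - theta) * logsumexp(y)
    assert lhs <= rhs + 1e-12         # convexity inequality holds
```

Random testing can only refute convexity, never establish it, but it is a useful first filter.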

3.5 Convexity via Restriction to Lines

Theorem (Restriction to Lines). A function $f : C \to \mathbb{R}$ with $\operatorname{dom} f = C \subseteq \mathbb{R}^n$ is convex if and only if for every $z \in C$ and $v \in \mathbb{R}^n$, the one-dimensional function

$$\phi(t) = f(z + tv)$$

is convex on its domain $\{t : z + tv \in C\}$.

Intuition: This theorem reduces multivariate convexity to one-dimensional verification.

Examples:

  • $f(x, y) = x^2 + y^2 \ \Rightarrow\ \phi(t) = (z_1 + t v_1)^2 + (z_2 + t v_2)^2$ (quadratic in $t$)
  • $f(X) = -\log\det X$ with $\operatorname{dom} f = S_{++}^n$

Proposition. Let $C \subseteq \mathbb{R}^n$ be convex and $f : C \to \mathbb{R}$. Then $f$ is convex if and only if for every line $L = \{x + tv : t \in \mathbb{R}\}$ meeting $C$, the restriction $\phi(t) = f(x + tv)$ is convex on $\{t : x + tv \in C\}$.

Proof.

($\Rightarrow$) If $f$ is convex, then for any $t_1, t_2$ with $x + t_1 v,\ x + t_2 v \in C$ and $\lambda \in [0,1]$,

$$f\big(\lambda (x + t_1 v) + (1-\lambda)(x + t_2 v)\big) \le \lambda f(x + t_1 v) + (1-\lambda) f(x + t_2 v),$$

which is the same as

$$\phi(\lambda t_1 + (1-\lambda) t_2) \le \lambda \phi(t_1) + (1-\lambda) \phi(t_2),$$

so $\phi$ is convex.

($\Leftarrow$) Conversely, let $x, y \in C$ and $\theta \in [0,1]$. Define $\phi(t) = f(y + t(x - y))$. Then

$$f(\theta x + (1-\theta) y) = \phi(\theta) \le \theta \phi(1) + (1-\theta) \phi(0) = \theta f(x) + (1-\theta) f(y),$$

so $f$ is convex.
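Assuming the convex form $f(X) = -\log\det X$ listed above, the restriction-to-lines idea can be tried numerically: fix a base point $Z \in S_{++}^n$ and a symmetric direction $V$ (both chosen arbitrarily here), and test midpoint convexity of $\phi(t) = f(Z + tV)$.

```python
import numpy as np

# Restriction to a line: phi(t) = -log det(Z + t V) should be convex
# in t as long as Z + t V stays positive definite.
rng = np.random.default_rng(2)

def f(X):
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0                   # X must be positive definite
    return -logdet

A = rng.normal(size=(4, 4))
Z = A @ A.T + 6 * np.eye(4)           # a point well inside S++
B = rng.normal(size=(4, 4))
V = (B + B.T) / 2                     # an arbitrary symmetric direction

phi = lambda t: f(Z + t * V)
for t1, t2 in [(-0.2, 0.3), (0.0, 0.5), (-0.1, 0.1)]:
    mid = phi(0.5 * (t1 + t2))
    assert mid <= 0.5 * phi(t1) + 0.5 * phi(t2) + 1e-12
```

The shift `+ 6 * np.eye(4)` keeps the whole tested segment inside $S_{++}^n$, so the domain restriction in the theorem is respected.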


3.6 Epigraph and Sublevel Sets

3.6.1 Epigraph

Definition (Epigraph). For a function $f : \mathbb{R}^n \to \mathbb{R}$, the epigraph of $f$ is the set

$$\operatorname{epi} f = \{(x, t) \in \mathbb{R}^{n+1} : f(x) \le t\}$$

Geometric meaning: The epigraph is the region on or above the graph of $f$.

Key Property. $f$ is convex $\iff$ $\operatorname{epi} f$ is a convex set.

Proposition. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ and $\operatorname{epi} f = \{(x, t) \in \mathbb{R}^{n+1} : f(x) \le t\}$. Then $f$ is convex on its domain if and only if $\operatorname{epi} f$ is a convex set.

Proof.

($\Rightarrow$) Assume $f$ is convex. Take $(x_1, t_1), (x_2, t_2) \in \operatorname{epi} f$, so $f(x_i) \le t_i$. For any $\theta \in [0,1]$, set $\bar{x} = \theta x_1 + (1-\theta) x_2$ and $\bar{t} = \theta t_1 + (1-\theta) t_2$. By convexity of $f$,

$$f(\bar{x}) \le \theta f(x_1) + (1-\theta) f(x_2) \le \theta t_1 + (1-\theta) t_2 = \bar{t},$$

hence $(\bar{x}, \bar{t}) \in \operatorname{epi} f$. Thus $\operatorname{epi} f$ is convex.

($\Leftarrow$) Conversely, assume $\operatorname{epi} f$ is convex. For any $x_1, x_2 \in \operatorname{dom} f$ and $\theta \in [0,1]$, the points $(x_i, f(x_i)) \in \operatorname{epi} f$, so their convex combination $(\bar{x}, \bar{t})$ with $\bar{x} = \theta x_1 + (1-\theta) x_2$ and $\bar{t} = \theta f(x_1) + (1-\theta) f(x_2)$ lies in $\operatorname{epi} f$. Therefore $f(\bar{x}) \le \bar{t}$, i.e.

$$f(\theta x_1 + (1-\theta) x_2) \le \theta f(x_1) + (1-\theta) f(x_2),$$

so $f$ is convex.

3.6.2 Sublevel Sets

Definition (Sublevel Set). For $f : \mathbb{R}^n \to \mathbb{R}$ and $\alpha \in \mathbb{R}$, the $\alpha$-sublevel set is

$$S_\alpha = \{x \in \operatorname{dom} f : f(x) \le \alpha\},$$

the set of points where $f$ takes values at most $\alpha$.

Key Property. If $f$ is convex, then all sublevel sets $S_\alpha$ are convex.

Warning: The converse is not true. For example, $f(x) = x^3$ has convex sublevel sets but is not convex.

Examples:

  • $f(x) = x^2$: $S_\alpha = [-\sqrt{\alpha}, \sqrt{\alpha}]$ for $\alpha \ge 0$ (convex intervals)
  • $f(x) = x^3$: $S_\alpha = (-\infty, \alpha^{1/3}]$ (convex intervals, but $f$ is not convex)
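The asymmetry between the two properties shows up concretely for $f(x) = x^3$ (a small illustrative check):

```python
import numpy as np

# f(x) = x^3: every sublevel set is an interval (hence convex),
# yet f itself violates the convexity inequality.
f = lambda x: x**3

# Points in the sublevel set S_1 = (-inf, 1]; averages stay in S_1.
pts = np.array([-5.0, -1.0, 0.0, 0.9])
assert all(f(0.5 * (a + b)) <= 1.0 for a in pts for b in pts)

# But f is not convex: take x = -2, y = 0, theta = 1/2.
x, y = -2.0, 0.0
lhs = f(0.5 * (x + y))                # f(-1) = -1
rhs = 0.5 * f(x) + 0.5 * f(y)         # (-8 + 0)/2 = -4
assert lhs > rhs                      # convexity inequality fails
```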

3.7 Jensen's Inequality

Theorem (Jensen's Inequality — Discrete Form). If $f$ is convex and $\theta_i \ge 0$ with $\sum_{i=1}^{k} \theta_i = 1$, then

$$f\Big(\sum_{i=1}^{k} \theta_i x_i\Big) \le \sum_{i=1}^{k} \theta_i f(x_i) \tag{3.4}$$

Integral Form. For convex $f$ and a probability density $p$:

$$f\Big(\int x\, p(x)\, dx\Big) \le \int f(x)\, p(x)\, dx$$

Expectation Form. For convex $f$ and a random variable $X$:

$$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$$
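A quick Monte Carlo illustration of the expectation form with the convex function $f(x) = x^2$ (distribution and sample size chosen arbitrarily); here the inequality reduces to $(\mathbb{E}[X])^2 \le \mathbb{E}[X^2]$, i.e. nonnegativity of the variance:

```python
import numpy as np

# Jensen in expectation form for f(x) = x^2: f(E[X]) <= E[f(X)].
rng = np.random.default_rng(3)
X = rng.exponential(scale=2.0, size=100_000)

lhs = X.mean() ** 2                   # f(E[X])  (empirical)
rhs = (X ** 2).mean()                 # E[f(X)]  (empirical)
assert lhs <= rhs                     # Var(X) >= 0 in disguise
```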

3.8 First-Order Condition for Convexity

3.8.1 Gradient and Differentiability

Precondition: $f$ is differentiable if the gradient

$$\nabla f(x) = \Big(\frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_n}\Big)^\top$$

exists at each $x \in \operatorname{dom} f$.

Theorem (First-Order Condition). A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if for all $x, y \in \operatorname{dom} f$:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) \tag{3.5}$$

Implication: If $\nabla f(x) = 0$, then $x$ is a global minimum.

Example: $f(x) = x^2$, $f'(x) = 2x$:

$$f(y) = y^2 \ge x^2 + 2x(y - x)$$

3.8.2 Proof of Equivalence

Direction 1: Basic convexity $\Rightarrow$ first-order condition.

Assume $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable. For any $x, y$ and $\theta \in (0, 1]$:

$$f(x + \theta(y - x)) \le (1-\theta) f(x) + \theta f(y).$$

Subtract $f(x)$ and divide by $\theta > 0$:

$$\frac{f(x + \theta(y - x)) - f(x)}{\theta} \le f(y) - f(x).$$

Take $\theta \downarrow 0$. Differentiability implies:

$$\lim_{\theta \downarrow 0} \frac{f(x + \theta(y - x)) - f(x)}{\theta} = \nabla f(x)^\top (y - x).$$

Hence:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x).$$

Direction 2: First-order condition $\Rightarrow$ basic convexity.

Assume for all $x, y$:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x).$$

Let $z = \theta x + (1-\theta) y$ with $\theta \in [0,1]$. Apply the condition twice, at the point $z$:

$$f(x) \ge f(z) + \nabla f(z)^\top (x - z), \qquad f(y) \ge f(z) + \nabla f(z)^\top (y - z).$$

Multiply the two inequalities by $\theta$ and $1-\theta$ respectively and add:

$$\theta f(x) + (1-\theta) f(y) \ge f(z) + \nabla f(z)^\top \big(\theta x + (1-\theta) y - z\big) = f(z).$$

Since $z = \theta x + (1-\theta) y$, this is exactly:

$$f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y).$$

3.9 Gradient Monotonicity

Definition (Monotone Gradient). A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ has a monotone gradient if for all $x, y \in \operatorname{dom} f$:

$$\langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge 0 \tag{3.6}$$

Connection to Convexity:

  • If $f$ is convex, its gradient is monotone.
  • Conversely, if $\nabla f$ is monotone, $f$ is convex.

Example: $f(x) = x^\top Q x$ with symmetric $Q \succeq 0$:

$$\nabla f(x) = 2Qx, \qquad \langle \nabla f(x) - \nabla f(y),\ x - y \rangle = 2(x - y)^\top Q (x - y) \ge 0$$

3.9.1 Proof of Equivalence between First-Order Condition and Gradient Monotonicity

Direction 1: First-order condition $\Rightarrow$ gradient monotonicity.

Assume $f : D \to \mathbb{R}$ is differentiable and satisfies the first-order condition:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x), \qquad \forall x, y \in D.$$

Write this inequality twice, swapping $x$ and $y$:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x), \qquad f(x) \ge f(y) + \nabla f(y)^\top (x - y).$$

Adding the two inequalities and cancelling $f(x) + f(y)$ yields:

$$0 \ge \nabla f(x)^\top (y - x) + \nabla f(y)^\top (x - y),$$

which rearranges to

$$\langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge 0.$$

Direction 2: Gradient monotonicity $\Rightarrow$ first-order condition.

Assume $f$ is differentiable and satisfies gradient monotonicity:

$$(\nabla f(y) - \nabla f(x))^\top (y - x) \ge 0.$$

For any $x, y \in D$, by the Fundamental Theorem of Calculus along the line segment:

$$f(y) - f(x) = \int_0^1 \frac{d}{dt} f(x + t(y - x))\, dt = \int_0^1 \nabla f(x + t(y - x))^\top (y - x)\, dt.$$

Subtract the tangent term at $x$:

$$f(y) - f(x) - \nabla f(x)^\top (y - x) = \int_0^1 \big(\nabla f(x + t(y - x)) - \nabla f(x)\big)^\top (y - x)\, dt \ge 0.$$

The integrand is nonnegative: applying monotonicity to the pair $x$ and $z_t = x + t(y - x)$ gives $(\nabla f(z_t) - \nabla f(x))^\top (z_t - x) \ge 0$, and $z_t - x = t(y - x)$ with $t \ge 0$. This gives the first-order condition:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x).$$

3.10 Second-Order Condition for Convexity

Precondition: $f$ is twice differentiable if $\operatorname{dom} f$ is open and the Hessian

$$\nabla^2 f(x)_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}, \qquad i, j = 1, \ldots, n,$$

exists at each $x \in \operatorname{dom} f$.

Theorem (Second-Order Condition). A twice-differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if its Hessian is positive semidefinite for all $x \in \operatorname{dom} f$:

$$\nabla^2 f(x) \succeq 0 \tag{3.7}$$

Example: $f(x) = x^\top Q x$ with $Q \succeq 0$:

$$\nabla f(x) = 2Qx, \qquad \nabla^2 f(x) = 2Q \succeq 0$$
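The second-order condition also lends itself to a finite-difference check when no closed-form Hessian is at hand. The sketch below (step size `h` and test point chosen arbitrarily) approximates the Hessian of log-sum-exp by central differences and verifies it is positive semidefinite up to numerical noise:

```python
import numpy as np

def f(x):
    # log-sum-exp, shifted for stability
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def numerical_hessian(f, x, h=1e-4):
    # Central-difference approximation of the Hessian of f at x.
    n = len(x)
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + h*I[i] + h*I[j]) - f(x + h*I[i] - h*I[j])
                       - f(x - h*I[i] + h*I[j]) + f(x - h*I[i] - h*I[j])) / (4*h*h)
    return H

x0 = np.array([0.5, -1.0, 2.0])
H = numerical_hessian(f, x0)
# PSD up to discretization/round-off noise:
assert np.linalg.eigvalsh((H + H.T) / 2).min() >= -1e-6
```

For log-sum-exp the exact Hessian is $\operatorname{diag}(p) - pp^\top$ with $p_i = e^{x_i} / \sum_j e^{x_j}$, which annihilates the all-ones vector, so the smallest eigenvalue above should be (numerically) zero.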

3.10.1 Proof of Equivalence between Gradient Monotonicity and Second-Order Condition

Direction 1: Gradient monotonicity $\Rightarrow$ Hessian PSD.

Assume $(\nabla f(y) - \nabla f(x))^\top (y - x) \ge 0$ for all $x, y \in D$. Consider an infinitesimal perturbation along a direction $d$: $y = x + \epsilon d$, $\epsilon > 0$. Gradient monotonicity implies (dividing by $\epsilon^2$):

$$(\nabla f(x + \epsilon d) - \nabla f(x))^\top (\epsilon d) \ge 0 \quad\Longrightarrow\quad \frac{(\nabla f(x + \epsilon d) - \nabla f(x))^\top d}{\epsilon} \ge 0.$$

Take the limit as $\epsilon \downarrow 0$:

$$d^\top \nabla^2 f(x)\, d \ge 0 \qquad \forall d \in \mathbb{R}^n.$$

Direction 2: Hessian PSD $\Rightarrow$ gradient monotonicity.

Assume $f : D \to \mathbb{R}$ is twice differentiable and $\nabla^2 f(x) \succeq 0$ for all $x \in D$. For any $x, y \in D$, consider the line segment $x + t(y - x)$, $t \in [0,1]$:

$$\nabla f(y) - \nabla f(x) = \int_0^1 \frac{d}{dt} \nabla f(x + t(y - x))\, dt = \int_0^1 \nabla^2 f(x + t(y - x)) (y - x)\, dt.$$

Take the inner product with $y - x$:

$$(\nabla f(y) - \nabla f(x))^\top (y - x) = \int_0^1 (y - x)^\top \nabla^2 f(x + t(y - x)) (y - x)\, dt \ge 0.$$

3.11 Chain of Equivalences

For twice-differentiable functions, the following four statements are equivalent characterizations of convexity:

   Convexity
f(θx + (1−θ)y) ≤ θf(x) + (1−θ)f(y)

 First-order Condition                  Gradient Monotonicity
f(y) ≥ f(x) + ∇f(x)ᵀ(y−x)    ⇔    ⟨∇f(x)−∇f(y), x−y⟩ ≥ 0

 Second-order Condition
    ∇²f(x) ⪰ 0

In summary:

  1. Convexity (Definition): $f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y)$
  2. First-order Condition: $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$
  3. Gradient Monotonicity: $\langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge 0$
  4. Second-order Condition: $\nabla^2 f(x) \succeq 0$

Each provides a different practical tool for verifying or exploiting convexity in analysis and algorithm design.
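All four characterizations can be spot-checked together on a random convex quadratic $f(x) = x^\top Q x$ with $Q \succeq 0$, where every quantity is available in closed form (a numerical illustration, not a proof):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
Q = A @ A.T                           # PSD by construction

f = lambda x: x @ Q @ x
grad = lambda x: 2 * Q @ x            # exact gradient; Hessian is 2Q

x, y = rng.normal(size=3), rng.normal(size=3)
theta = 0.3

# 1. Convexity (definition)
assert f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y) + 1e-12
# 2. First-order condition
assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-12
# 3. Gradient monotonicity
assert (grad(x) - grad(y)) @ (x - y) >= -1e-12
# 4. Second-order condition: the Hessian 2Q is PSD
assert np.linalg.eigvalsh(2 * Q).min() >= -1e-12
```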


3.12 Smooth and Strongly Convex Functions

Beyond mere convexity, many optimization algorithms rely on stronger structural assumptions—smoothness and strong convexity—to obtain quantitative convergence guarantees. This section develops L-smoothness (Lipschitz gradient) and m-strong convexity, their equivalent characterizations, and the convergence rates of gradient descent under these conditions.

3.12.1 L-Smooth Functions

Definition ($L$-smooth function). A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is called $L$-smooth (or has an $L$-Lipschitz continuous gradient) for some $L > 0$ if for all $x, y \in \mathbb{R}^n$,

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|. \tag{3.8}$$

Smoothness bounds how fast the gradient can change. Unlike convexity, it does not guarantee that stationary points are global minima; it only guarantees a quadratic upper bound on the function.


Chain of Equivalences. For a twice-differentiable function $f$, the following four statements are equivalent characterizations of $L$-smoothness:

$$\begin{aligned}
&\text{(I)} && \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| && \text{(Lipschitz gradient)} \\
&\text{(II)} && g(x) = \tfrac{L}{2}\|x\|^2 - f(x) \text{ is convex} \\
&\text{(III)} && f(y) \le f(x) + \langle \nabla f(x),\ y - x \rangle + \tfrac{L}{2}\|y - x\|^2 && \text{(Descent Lemma)} \\
&\text{(IV)} && \nabla^2 f(x) \preceq L I && \text{(Hessian bound)}
\end{aligned}$$

We now prove the full cycle of equivalences.


Proof: (I) $\Rightarrow$ (III) (Lipschitz gradient $\Rightarrow$ Descent Lemma).

Using the Fundamental Theorem of Calculus along the line segment $x + t(y - x)$:

$$f(y) - f(x) = \int_0^1 \langle \nabla f(x + t(y - x)),\ y - x \rangle\, dt.$$

Subtract the linear approximation at $x$:

$$\begin{aligned}
f(y) - f(x) - \langle \nabla f(x),\ y - x \rangle
&= \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x),\ y - x \rangle\, dt \\
&\le \int_0^1 \|\nabla f(x + t(y - x)) - \nabla f(x)\|\, \|y - x\|\, dt \\
&\le \int_0^1 L t \|y - x\| \cdot \|y - x\|\, dt = L \|y - x\|^2 \int_0^1 t\, dt = \frac{L}{2}\|y - x\|^2.
\end{aligned}$$

Thus $f(y) \le f(x) + \langle \nabla f(x),\ y - x \rangle + \frac{L}{2}\|y - x\|^2$.


Proof: (III) $\Rightarrow$ (II) (Descent Lemma $\Rightarrow$ $g$ convex).

Define $g(x) = \frac{L}{2}\|x\|^2 - f(x)$. To show $g$ is convex, we verify the first-order condition. For any $x, y$:

$$\begin{aligned}
g(y) - g(x) - \langle \nabla g(x),\ y - x \rangle
&= \Big[\tfrac{L}{2}\|y\|^2 - f(y)\Big] - \Big[\tfrac{L}{2}\|x\|^2 - f(x)\Big] - \langle Lx - \nabla f(x),\ y - x \rangle \\
&= \tfrac{L}{2}\|y\|^2 - \tfrac{L}{2}\|x\|^2 - L\langle x,\ y - x \rangle - \big[f(y) - f(x) - \langle \nabla f(x),\ y - x \rangle\big] \\
&= \tfrac{L}{2}\|y - x\|^2 - \big[f(y) - f(x) - \langle \nabla f(x),\ y - x \rangle\big] \\
&\ge \tfrac{L}{2}\|y - x\|^2 - \tfrac{L}{2}\|y - x\|^2 = 0,
\end{aligned}$$

where the inequality uses condition (III). Hence $g(y) \ge g(x) + \langle \nabla g(x),\ y - x \rangle$, so $g$ is convex.


Proof: (II) $\Rightarrow$ (IV) (Convexity of $g$ $\Rightarrow$ Hessian bound).

If $g(x) = \frac{L}{2}\|x\|^2 - f(x)$ is convex and twice differentiable, then $\nabla^2 g(x) \succeq 0$ for all $x$. Computing the Hessian:

$$\nabla^2 g(x) = L I - \nabla^2 f(x).$$

Thus $L I - \nabla^2 f(x) \succeq 0$, i.e., $\nabla^2 f(x) \preceq L I$.


Proof: (IV) $\Rightarrow$ (I) (Hessian bound $\Rightarrow$ Lipschitz gradient).

Assume $\nabla^2 f(x) \preceq L I$ for all $x$. For any $x, y$, use the integral representation:

$$\nabla f(y) - \nabla f(x) = \int_0^1 \nabla^2 f(x + t(y - x)) (y - x)\, dt.$$

Since for any vector $v$, $\|\nabla^2 f(z) v\| \le \|\nabla^2 f(z)\|_{\mathrm{op}} \|v\|$, and $\nabla^2 f(z) \preceq L I$ gives $\|\nabla^2 f(z)\|_{\mathrm{op}} \le L$, we have:

$$\|\nabla f(y) - \nabla f(x)\| \le \int_0^1 \|\nabla^2 f(x + t(y - x))\|_{\mathrm{op}}\, \|y - x\|\, dt \le L \|y - x\|.$$

Thus condition (I) holds.

This completes the chain of equivalences for L-smoothness.


3.12.2 Gradient Descent Convergence under Smoothness

Gradient descent updates:

$$x_{t+1} = x_t - \eta \nabla f(x_t). \tag{3.9}$$

Under L-smoothness alone, we can guarantee that the gradient norm converges to zero.

Lemma (Descent Lemma for gradient descent). If $f$ is $L$-smooth and we take $\eta = \frac{1}{L}$, then at each iteration:

$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2. \tag{3.10}$$

Proof. Applying the Descent Lemma (condition III) with $y = x_{t+1} = x_t - \frac{1}{L}\nabla f(x_t)$:

$$\begin{aligned}
f(x_{t+1}) &\le f(x_t) + \langle \nabla f(x_t),\ x_{t+1} - x_t \rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\
&= f(x_t) - \frac{1}{L}\|\nabla f(x_t)\|^2 + \frac{L}{2} \cdot \frac{1}{L^2}\|\nabla f(x_t)\|^2 \\
&= f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2,
\end{aligned}$$

which is the claimed bound.
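The per-step guarantee (3.10) is easy to verify on a quadratic, where the smoothness constant $L$ is exactly the largest Hessian eigenvalue (an illustrative sketch with arbitrary data):

```python
import numpy as np

# Gradient descent with eta = 1/L on an L-smooth quadratic; check
# f(x_{t+1}) <= f(x_t) - ||grad f(x_t)||^2 / (2L) at every step.
rng = np.random.default_rng(5)
A = rng.normal(size=(4, 4))
Q = A @ A.T + np.eye(4)               # positive definite
L = np.linalg.eigvalsh(2 * Q).max()   # smoothness constant of f(x)=x^T Q x

f = lambda x: x @ Q @ x
grad = lambda x: 2 * Q @ x

x = rng.normal(size=4)
f0 = f(x)
for _ in range(20):
    g = grad(x)
    x_next = x - g / L
    assert f(x_next) <= f(x) - g @ g / (2 * L) + 1e-9
    x = x_next
```

In particular the objective value is monotonically non-increasing along the iterates.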


Theorem (Convergence to a stationary point). Let $f$ be $L$-smooth and $f^\star = \inf_x f(x) > -\infty$. After $T$ iterations of gradient descent with $\eta = \frac{1}{L}$:

$$\min_{0 \le t \le T-1} \|\nabla f(x_t)\|^2 \le \frac{2L\big(f(x_0) - f^\star\big)}{T}. \tag{3.11}$$

Consequently, $\min_{t < T} \|\nabla f(x_t)\|^2 \to 0$ at a rate $O(1/T)$.

Proof. Summing the Descent Lemma over $t = 0, 1, \ldots, T-1$:

$$\sum_{t=0}^{T-1} \big(f(x_t) - f(x_{t+1})\big) \ge \frac{1}{2L} \sum_{t=0}^{T-1} \|\nabla f(x_t)\|^2.$$

The left-hand side telescopes:

$$f(x_0) - f(x_T) \ge \frac{1}{2L} \sum_{t=0}^{T-1} \|\nabla f(x_t)\|^2 \ge \frac{T}{2L} \min_{0 \le t \le T-1} \|\nabla f(x_t)\|^2.$$

Since $f(x_T) \ge f^\star$,

$$\frac{T}{2L} \min_{t < T} \|\nabla f(x_t)\|^2 \le f(x_0) - f^\star,$$

which rearranges to the stated bound.

This shows gradient descent converges to a stationary point (where $\nabla f \approx 0$), with the squared gradient norm decaying at rate $O(1/T)$.


3.12.3 Smooth + Convex Convergence

If $f$ is not only $L$-smooth but also convex, we obtain convergence to a global minimum (not just a stationary point).

Theorem (Convergence under smoothness + convexity). Let $f$ be $L$-smooth and convex, and let $x^\star$ be a global minimizer of $f$. After $T$ iterations of gradient descent with $\eta = \frac{1}{L}$:

$$f(x_T) - f(x^\star) \le \frac{L}{2T}\|x_0 - x^\star\|^2. \tag{3.12}$$

Proof. Write $\eta = \frac{1}{L}$. For each iteration:

$$\|x_{t+1} - x^\star\|^2 = \|x_t - \eta \nabla f(x_t) - x^\star\|^2 = \|x_t - x^\star\|^2 - 2\eta \langle \nabla f(x_t),\ x_t - x^\star \rangle + \eta^2 \|\nabla f(x_t)\|^2.$$

By convexity (first-order condition), $f(x_t) - f(x^\star) \le \langle \nabla f(x_t),\ x_t - x^\star \rangle$. Hence:

$$-2\eta \langle \nabla f(x_t),\ x_t - x^\star \rangle \le -2\eta \big(f(x_t) - f(x^\star)\big).$$

Using the Descent Lemma (3.10), $\eta^2 \|\nabla f(x_t)\|^2 = \frac{1}{L^2}\|\nabla f(x_t)\|^2 \le \frac{2}{L}\big(f(x_t) - f(x_{t+1})\big)$. Substituting:

$$\|x_{t+1} - x^\star\|^2 \le \|x_t - x^\star\|^2 - \frac{2}{L}\big(f(x_t) - f(x^\star)\big) + \frac{2}{L}\big(f(x_t) - f(x_{t+1})\big) = \|x_t - x^\star\|^2 - \frac{2}{L}\big(f(x_{t+1}) - f(x^\star)\big).$$

Let $\Delta_t = f(x_t) - f(x^\star)$. Then $\|x_{t+1} - x^\star\|^2 \le \|x_t - x^\star\|^2 - \frac{2}{L}\Delta_{t+1}$, which implies:

$$\Delta_{t+1} \le \frac{L}{2}\big(\|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2\big).$$

Summing this inequality from $t = 0$ to $T-1$ telescopes:

$$\sum_{t=1}^{T} \Delta_t \le \frac{L}{2}\big(\|x_0 - x^\star\|^2 - \|x_T - x^\star\|^2\big) \le \frac{L}{2}\|x_0 - x^\star\|^2.$$

Since $\Delta_t$ is non-increasing (the Descent Lemma guarantees $f(x_{t+1}) \le f(x_t)$), we have $T \Delta_T \le \sum_{t=1}^{T} \Delta_t \le \frac{L}{2}\|x_0 - x^\star\|^2$. Therefore:

$$f(x_T) - f(x^\star) \le \frac{L}{2T}\|x_0 - x^\star\|^2.$$

This is the O(1/T) rate for smooth convex optimization.
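The bound (3.12) can be confirmed numerically on a convex quadratic whose minimizer is the origin, so that $x^\star = 0$ and $f^\star = 0$ are known exactly (an illustrative sketch):

```python
import numpy as np

# Check f(x_T) - f* <= L/(2T) * ||x0 - x*||^2 for gradient descent
# on the convex quadratic f(x) = x^T Q x (minimizer x* = 0, f* = 0).
rng = np.random.default_rng(6)
A = rng.normal(size=(5, 3))
Q = A.T @ A                           # PSD by construction
L = np.linalg.eigvalsh(2 * Q).max()   # smoothness constant

f = lambda x: x @ Q @ x
grad = lambda x: 2 * Q @ x

x0 = rng.normal(size=3)
x, T = x0.copy(), 50
for _ in range(T):
    x = x - grad(x) / L
assert f(x) <= L / (2 * T) * x0 @ x0 + 1e-12   # bound (3.12)
```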


3.12.4 m-Strongly Convex Functions

Definition ($m$-strongly convex function). A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is called $m$-strongly convex for some $m > 0$ if for all $x, y \in \mathbb{R}^n$,

$$f(y) \ge f(x) + \langle \nabla f(x),\ y - x \rangle + \frac{m}{2}\|y - x\|^2. \tag{3.13}$$

Strong convexity is a strengthened form of convexity that guarantees the function grows at least quadratically away from any point, ensuring a unique global minimum.

Remark. $m$ is often denoted $\mu$ in other texts. We keep $m$ for consistency with the lecture notation.


Chain of Equivalences. For a twice-differentiable function $f$, the following four statements are equivalent:

$$\begin{aligned}
&\text{(I)} && f(y) \ge f(x) + \langle \nabla f(x),\ y - x \rangle + \tfrac{m}{2}\|y - x\|^2 && \text{(Strong FOC)} \\
&\text{(II)} && h(x) = f(x) - \tfrac{m}{2}\|x\|^2 \text{ is convex} \\
&\text{(III)} && \langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge m \|x - y\|^2 && \text{(Strong Gradient Monotonicity)} \\
&\text{(IV)} && \nabla^2 f(x) \succeq m I && \text{(Hessian bound)}
\end{aligned}$$

Proof: (I) $\Rightarrow$ (II) (Strong FOC $\Rightarrow$ $h$ convex).

Define $h(x) = f(x) - \frac{m}{2}\|x\|^2$. For any $x, y$,

$$\begin{aligned}
h(y) - h(x) - \langle \nabla h(x),\ y - x \rangle
&= \Big[f(y) - \tfrac{m}{2}\|y\|^2\Big] - \Big[f(x) - \tfrac{m}{2}\|x\|^2\Big] - \langle \nabla f(x) - m x,\ y - x \rangle \\
&= f(y) - f(x) - \langle \nabla f(x),\ y - x \rangle - \tfrac{m}{2}\big(\|y\|^2 - \|x\|^2 - 2\langle x,\ y - x \rangle\big) \\
&= f(y) - f(x) - \langle \nabla f(x),\ y - x \rangle - \tfrac{m}{2}\|y - x\|^2 \ \ge\ 0,
\end{aligned}$$

where the last line uses condition (I). Hence $h(y) \ge h(x) + \langle \nabla h(x),\ y - x \rangle$, so $h$ is convex by the first-order condition.


Proof: (II) $\Rightarrow$ (IV) (Convexity of $h$ $\Rightarrow$ Hessian bound).

If $h(x) = f(x) - \frac{m}{2}\|x\|^2$ is convex and twice differentiable, then $\nabla^2 h(x) \succeq 0$ for all $x$. Computing the Hessian:

$$\nabla^2 h(x) = \nabla^2 f(x) - m I.$$

Thus $\nabla^2 f(x) - m I \succeq 0$, i.e., $\nabla^2 f(x) \succeq m I$.


Proof: (IV) $\Rightarrow$ (III) (Hessian bound $\Rightarrow$ Strong Gradient Monotonicity).

Assume $\nabla^2 f(x) \succeq m I$ for all $x$. For any $x, y$, use the integral representation:

$$\begin{aligned}
\langle \nabla f(y) - \nabla f(x),\ y - x \rangle
&= \Big\langle \int_0^1 \nabla^2 f(x + t(y - x)) (y - x)\, dt,\ y - x \Big\rangle \\
&= \int_0^1 (y - x)^\top \nabla^2 f(x + t(y - x)) (y - x)\, dt \\
&\ge \int_0^1 m \|y - x\|^2\, dt = m \|y - x\|^2.
\end{aligned}$$

This establishes condition (III).


Proof: (III) $\Rightarrow$ (I) (Strong Gradient Monotonicity $\Rightarrow$ Strong FOC).

Assume $\langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge m \|x - y\|^2$. Using the Fundamental Theorem of Calculus:

$$f(y) - f(x) - \langle \nabla f(x),\ y - x \rangle = \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x),\ y - x \rangle\, dt.$$

Let $s = \|y - x\|$. Apply the strong monotonicity condition along the segment; for $z_t = x + t(y - x)$, evaluate with $x$ and $z_t$:

$$\langle \nabla f(z_t) - \nabla f(x),\ z_t - x \rangle \ge m \|z_t - x\|^2 = m t^2 s^2.$$

Note that $z_t - x = t(y - x)$, so:

$$t \langle \nabla f(z_t) - \nabla f(x),\ y - x \rangle \ge m t^2 s^2 \quad\Longrightarrow\quad \langle \nabla f(z_t) - \nabla f(x),\ y - x \rangle \ge m t s^2.$$

Integrating over $t \in [0, 1]$:

$$\int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x),\ y - x \rangle\, dt \ge \int_0^1 m t s^2\, dt = \frac{m}{2} s^2 = \frac{m}{2}\|y - x\|^2.$$

Thus $f(y) \ge f(x) + \langle \nabla f(x),\ y - x \rangle + \frac{m}{2}\|y - x\|^2$.

This completes the chain of equivalences for m-strong convexity.


3.12.5 Linear Convergence under Smooth + Strong Convexity

When a function is both L-smooth and m-strongly convex, gradient descent achieves linear (exponential) convergence to the unique global minimizer.

Theorem (Linear convergence). Let $f$ be $L$-smooth and $m$-strongly convex, and let $x^\star$ be the unique global minimizer. Gradient descent with step size $\eta = \frac{1}{L}$ satisfies:

$$f(x_T) - f(x^\star) \le \Big(1 - \frac{m}{L}\Big)^T \big(f(x_0) - f(x^\star)\big). \tag{3.14}$$

Proof. From the Descent Lemma with $\eta = \frac{1}{L}$:

$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2.$$

By $m$-strong convexity, we have a lower bound relating the gradient norm to the optimality gap. For any $x$ and $y$,

$$f(y) \ge f(x) + \langle \nabla f(x),\ y - x \rangle + \frac{m}{2}\|y - x\|^2.$$

Minimizing both sides over $y$ (the right-hand side is a quadratic in $y$, minimized at $y = x - \frac{1}{m}\nabla f(x)$) gives:

$$f(x^\star) \ge f(x) - \frac{1}{2m}\|\nabla f(x)\|^2,$$

since $\min_y \big\{f(x) + \langle \nabla f(x),\ y - x \rangle + \frac{m}{2}\|y - x\|^2\big\} = f(x) - \frac{1}{2m}\|\nabla f(x)\|^2$.

Rearranging yields the Polyak–Łojasiewicz (PL) inequality:

$$\|\nabla f(x)\|^2 \ge 2m \big(f(x) - f(x^\star)\big). \tag{3.15}$$

Plugging this into the Descent Lemma:

$$f(x_{t+1}) - f(x^\star) \le f(x_t) - f(x^\star) - \frac{1}{2L} \cdot 2m \big(f(x_t) - f(x^\star)\big) = \Big(1 - \frac{m}{L}\Big)\big(f(x_t) - f(x^\star)\big).$$

Unrolling this inequality over $T$ iterations gives:

$$f(x_T) - f(x^\star) \le \Big(1 - \frac{m}{L}\Big)^T \big(f(x_0) - f(x^\star)\big).$$

Definition (Condition number). The ratio

$$\kappa = \frac{L}{m} \ge 1$$

is called the condition number of $f$. It measures how "elongated" the sublevel sets of $f$ are.

Since $0 < m \le L$, we have $\kappa \ge 1$. The convergence rate can be rewritten as:

$$f(x_T) - f(x^\star) \le \Big(1 - \frac{1}{\kappa}\Big)^T \big(f(x_0) - f(x^\star)\big).$$

Using the inequality $1 - x \le e^{-x}$, we obtain the more intuitive bound:

$$f(x_T) - f(x^\star) \le \exp\Big(-\frac{T}{\kappa}\Big)\big(f(x_0) - f(x^\star)\big).$$

To achieve $f(x_T) - f(x^\star) \le \varepsilon$, it suffices to take

$$T \ge \kappa \log \frac{f(x_0) - f(x^\star)}{\varepsilon},$$

which is linear in the condition number κ. When the function is well-conditioned (small κ), gradient descent converges rapidly. For ill-conditioned problems (large κ), convergence can be slow.
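A minimal sketch of the linear rate on a diagonal quadratic, where $m$, $L$, and $\kappa$ are known exactly (the eigenvalues below are chosen purely for illustration):

```python
import numpy as np

# Linear convergence (3.14): on an m-strongly convex, L-smooth
# quadratic, the optimality gap contracts by at least (1 - m/L)/step.
eigs = np.array([1.0, 2.0, 10.0])      # Hessian of f(x)=x^T Q x is 2*diag(eigs)
Q = np.diag(eigs)
m, L = 2 * eigs.min(), 2 * eigs.max()  # m = 2, L = 20, so kappa = 10

f = lambda x: x @ Q @ x                # minimizer x* = 0, f* = 0
grad = lambda x: 2 * Q @ x

x = np.array([1.0, 1.0, 1.0])
gap = f(x)
for _ in range(30):
    x = x - grad(x) / L                # step size eta = 1/L
    assert f(x) <= (1 - m / L) * gap + 1e-12   # per-step contraction
    gap = f(x)
```

Note the ill-conditioned coordinate (eigenvalue $1$, against $L/2 = 10$) is the one that shrinks slowest, exactly as the condition-number discussion predicts.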


Summary. The interplay between smoothness and strong convexity is fundamental to first-order optimization:

| Property | Guarantee | Rate |
| --- | --- | --- |
| $L$-smooth only | $\min_{t<T} \|\nabla f(x_t)\|^2 \le O(1/T)$ | Sublinear, to a stationary point |
| $L$-smooth + convex | $f(x_T) - f(x^\star) \le O(1/T)$ | Sublinear, to the global minimum |
| $L$-smooth + $m$-strongly convex | $f(x_T) - f(x^\star) \le (1 - m/L)^T \Delta_0$ | Linear (exponential) |

(Here $\Delta_0 = f(x_0) - f(x^\star)$.)

Strong convexity is what transforms the sublinear rate of gradient descent into a linear (geometric) rate, with the condition number $\kappa = L/m$ governing the constant.