
Gradient Descent

1 Convex and Non-Convex Optimization

Which value of θ leads to the minimum of f(θ)?

1.1 Convex Functions

A convex function is a continuous function whose value at the midpoint of every interval in its domain does not exceed the arithmetic mean of its values at the ends of the interval.

Definition

A function f is convex if

$$f[\lambda x_1 + (1-\lambda) x_2] \le \lambda f(x_1) + (1-\lambda) f(x_2)$$

for any λ ∈ [0, 1].

Retrieved from Purdue University

This means the line segment between any two points on the curve of the function lies above or on the curve itself.

A twice-differentiable 1D function f is convex if

$$f''(x) \ge 0.$$
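
As a quick numerical illustration, a minimal sketch can spot-check the defining inequality on sample points; the function f(x) = x² and the sample grid below are toy choices for illustration, not from the text:

```python
import numpy as np

def is_convex_on_samples(f, xs, n_lambdas=11):
    """Spot-check f[l*x1 + (1-l)*x2] <= l*f(x1) + (1-l)*f(x2) on sample pairs."""
    lambdas = np.linspace(0.0, 1.0, n_lambdas)
    for x1 in xs:
        for x2 in xs:
            for lam in lambdas:
                lhs = f(lam * x1 + (1 - lam) * x2)
                rhs = lam * f(x1) + (1 - lam) * f(x2)
                if lhs > rhs + 1e-12:  # small tolerance for float error
                    return False
    return True

print(is_convex_on_samples(lambda x: x**2, np.linspace(-3, 3, 13)))      # True: f''(x) = 2 >= 0
print(is_convex_on_samples(lambda x: np.sin(x), np.linspace(-3, 3, 13))) # False: sin is not convex
```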

For a convex function f(θ), any local minimum is also a global minimum.

If we suppose θ* is a local minimum, then for any other point θ in the domain of f, we have

$$f(\theta^*) \le f(\theta).$$

Note that θ can be high-dimensional.

2 Argmin and Argmax

Definition

The input to a function that yields the minimum is called the argmin, since it is the argument to the function that gives the minimum.

Similarly, the argmax of a function is the input that gives the function's maximum.

The value of θ that leads to the minimum of f(θ) is denoted argmin_θ f(θ). To find it analytically, we need:

$$\begin{cases} \dfrac{d f(\theta)}{d\theta} = 0, \\[6pt] \dfrac{d^2 f(\theta)}{d\theta^2} > 0. \end{cases}$$
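
To make this concrete, here is a minimal SymPy sketch applying both conditions; the quadratic f(θ) = (θ − 2)² + 1 is a toy example chosen for illustration:

```python
import sympy as sp

theta = sp.symbols('theta')
f = (theta - 2)**2 + 1                     # toy convex function

df = sp.diff(f, theta)                     # first derivative: 2*(theta - 2)
critical = sp.solve(sp.Eq(df, 0), theta)   # stationary points: [2]

d2f = sp.diff(f, theta, 2)                 # second derivative: 2
argmin = [c for c in critical if d2f.subs(theta, c) > 0]
print(argmin)                              # [2] -> argmin_theta f(theta) = 2
```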

But if the function is non-convex, we turn to gradient descent, the golden method of non-convex optimization, widely applied in machine learning and deep learning.

3 The Golden Method - Gradient Descent

3.1 Definition

First, let's recap basic descent methods, which take the form

$$x_{k+1} = x_k + \alpha_k p_k, \qquad k = 0, 1, \dots$$

where α_k is the step size and p_k is the search direction.

The gradient is just the first-order derivative df(θ)/dθ. Given a position θ_0, we can calculate

$$\frac{d f(\theta_0)}{d\theta} = \lim_{\Delta\theta \to 0} \frac{f(\theta_0 + \Delta\theta) - f(\theta_0)}{\Delta\theta}.$$
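
The same limit also motivates approximating the gradient numerically when no closed form is at hand; in this minimal sketch (f(θ) = θ² and the step sizes are toy choices) the difference quotient approaches the analytic derivative:

```python
f = lambda t: t**2           # toy function with known derivative f'(t) = 2t
theta0 = 1.5                 # f'(theta0) = 3 exactly

for dtheta in [1e-1, 1e-3, 1e-5]:
    quotient = (f(theta0 + dtheta) - f(theta0)) / dtheta
    print(dtheta, quotient)  # 3.1, 3.001, 3.00001 -> approaches 3 as dtheta -> 0
```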

Hence gradient descent, which follows the negative direction of the gradient of the curve, is defined as follows:

$$\theta_{t+1} \leftarrow \theta_t - \alpha \frac{d f(\theta_t)}{d\theta}.$$

where:

  • θ_{t+1} is the new, updated variable after a single step of gradient descent;
  • α is the step size, or learning rate, which controls the stride;
  • Δθ = −α · df(θ_t)/dθ is the resulting change in the variable's value.

In this case, we move in the negative direction of the gradient. With the subscript t omitted,

$$\theta \leftarrow \theta - \alpha \frac{d f(\theta)}{d\theta}.$$

Then look at how the function value changes after updating the variable, i.e. Δf(θ) after the step Δθ = −α · df(θ)/dθ. Since

$$\Delta f(\theta) \approx \Delta\theta \cdot \frac{d f(\theta)}{d\theta},$$

substituting the step gives

$$\Delta f(\theta) \approx -\alpha \left( \frac{d f(\theta)}{d\theta} \right)^2 \le 0,$$

so each step decreases the function value (or leaves it unchanged at a stationary point). When the change satisfies the accuracy threshold ε, e.g. |Δf(θ)| < ε, the algorithm terminates.
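
Putting the update rule and the stopping criterion together gives the basic algorithm. The following is a minimal sketch; the example function, learning rate, and tolerance are illustrative choices:

```python
def gradient_descent(df, theta0, alpha=0.1, eps=1e-8, max_iters=10_000):
    """Repeat theta <- theta - alpha * df(theta) until the step is below eps."""
    theta = theta0
    for _ in range(max_iters):
        step = alpha * df(theta)
        if abs(step) < eps:      # accuracy threshold reached
            break
        theta = theta - step
    return theta

# f(theta) = (theta - 3)^2, so df/dtheta = 2*(theta - 3); minimum at theta = 3
print(gradient_descent(lambda t: 2 * (t - 3), theta0=0.0))  # ~3.0
```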

Question

  1. However, the choice of learning rate α is crucial. What happens if the learning rate is too small, or too large? (The sketch after this list illustrates both regimes.)

    1. When the Learning Rate α is too large:
    • Overshooting. A large learning rate may push θ far past the minimum, so the algorithm may fail to converge.
    • Oscillations or Divergence. An excessively large learning rate may push θ farther and farther from the minimum, leading to oscillations or outright divergence.
    • Loss of Critical Information. In a non-convex landscape, the algorithm might skip over local minima or saddle points because of excessively large steps.
    2. When the Learning Rate α is too small:
    • Slow Convergence. A small learning rate leads to slow convergence, as the algorithm takes tiny steps towards the minimum.
    • Stuck in Local Minima. A small learning rate may leave the algorithm stuck in a local minimum, unable to escape its current basin.
    • Sensitivity to Initial Conditions. With tiny steps, the result depends heavily on the initial position of θ, since the algorithm settles into whichever minimum is nearest.
    • Vanishing Updates. A small learning rate may make the updates vanishingly small, leading to stagnation rather than progress.
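
The following minimal sketch makes both failure modes visible on the toy function f(θ) = θ², whose gradient is 2θ; all the numbers are illustrative:

```python
def run_gd(alpha, theta0=1.0, steps=20):
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # df/dtheta = 2*theta for f = theta^2
    return theta

print(run_gd(alpha=1.1))    # |theta| grows each step: oscillating divergence
print(run_gd(alpha=0.001))  # barely moves from 1.0: slow convergence
print(run_gd(alpha=0.5))    # jumps straight to 0.0: a well-chosen step
```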

Question

  1. Δf(θ) ≈ Δθ · df(θ)/dθ ≤ 0. Why do we use an approximation, and under what conditions can the approximation become an equality?

    Why Use Approximation

    We use an approximation because in gradient descent the change in the parameter θ, denoted Δθ, is typically small. For small changes, the change in the objective function f(θ) can be approximated by the product of the gradient df(θ)/dθ and the parameter change Δθ. This is a first-order Taylor series approximation.

    Condition for Approximation to Become Equality

    1. Infinitesimally Small Step (Δθ → 0). As Δθ becomes infinitesimally small, the approximation becomes exact: this is just the definition of the derivative, so the error vanishes in the limit.
    2. Smoothness of the Function. Smoothness alone does not make the first-order approximation exact, but when f(θ) is smooth (twice differentiable), the approximation error shrinks like Δθ², so it is increasingly accurate for small steps.
    3. Linear Function. If f(θ) is a linear function, the approximation is an exact equality for any Δθ, because the gradient is constant and the change in the function is exactly proportional to the change in the parameter.
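
A quick numeric check (f(θ) = θ³ and the step sizes are toy choices) shows the error of the first-order approximation shrinking like Δθ², which is why it becomes exact only in the limit:

```python
f = lambda t: t**3
df = lambda t: 3 * t**2
theta = 1.0

for dtheta in [1e-1, 1e-2, 1e-3]:
    actual = f(theta + dtheta) - f(theta)   # true change in f
    approx = df(theta) * dtheta             # first-order Taylor estimate
    print(dtheta, abs(actual - approx))     # error ~ 3*dtheta^2: 3.1e-2, 3.0e-4, 3.0e-6
```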

3.2 FBGD

FBGD is short for Full-Batch Gradient Descent. It calculates the gradient over the entire dataset to update the parameters; a minimal sketch follows the list below.

  1. Process:

    1. Calculate the gradient of the entire dataset by averaging the gradients of the individual data points.
    2. Rewrite the update rule as θ ← θ − α · (1/N) Σ_{i=1}^{N} df_i(θ)/dθ, where f_i is the loss on the i-th data point.
    3. Update the parameters θ using the averaged gradient.
    4. Repeat the process until convergence.
    5. The learning rate α is a hyperparameter that needs to be tuned.
  2. The true gradient is the average of the per-data-point gradients:

$$\frac{1}{N} \sum_{i=1}^{N} \frac{d f_i(\theta)}{d\theta}.$$
  3. Advantages:

    • Convergence: FBGD can converge to the global minimum of the loss function, given the right learning rate and other hyperparameters.
    • Stability: FBGD is stable and can provide consistent updates to the parameters.
    • Optimal Solution: FBGD can find the optimal solution for convex functions.
  4. Disadvantages:

    • Computational Cost: FBGD can be computationally expensive, especially for large datasets, as it requires calculating the gradients for all data points.
    • Memory Usage: FBGD requires storing the entire dataset in memory to calculate the gradients, which can be memory-intensive.
    • Slow Convergence: FBGD may converge slowly for large datasets, as it updates the parameters based on the average gradients of all data points.
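
Here is a minimal FBGD sketch on a toy 1-D least-squares problem; the data, loss, and hyperparameters are illustrative assumptions, not from the text:

```python
import numpy as np

# Toy 1-D least squares: f_i(theta) = 0.5 * (theta * x_i - y_i)^2
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + np.array([0.1, -0.1, 0.05, -0.05])   # roughly y = 2x

def full_batch_grad(theta):
    # True gradient: average of per-point gradients (theta*x_i - y_i) * x_i
    return np.mean((theta * x - y) * x)

theta, alpha = 0.0, 0.05
for _ in range(200):
    theta -= alpha * full_batch_grad(theta)
print(theta)   # ~2.0
```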

3.3 SGD

SGD is short for Stochastic Gradient Descent. It uses the gradient of a single data point at a time to update the parameters; a minimal sketch follows the list below.

  1. Process:

    1. Calculate the gradient of a single data point.
    2. Update the parameters θ using the gradient of the single data point.
    3. Repeat the process for all data points in the dataset.
    4. The learning rate α is a hyperparameter that needs to be tuned.
  2. Advantages:

    • Efficiency: SGD is computationally efficient, as it updates the parameters based on a single data point at a time.
    • Memory Usage: SGD requires less memory compared to FBGD, as it only needs to store a single data point at a time.
    • Fast Convergence: SGD can converge faster than FBGD for large datasets, as it updates the parameters more frequently.
  3. Disadvantages:

    • Convergence: SGD may not converge to the global minimum, as it updates the parameters based on a single data point at a time.
    • Stability: SGD may provide noisy updates to the parameters, as it updates the parameters based on individual data points.
    • Optimal Solution: SGD may not find the optimal solution for convex functions, as it updates the parameters based on individual data points.
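
The corresponding SGD sketch uses the same toy least-squares problem (the data is repeated so the snippet stays self-contained); the shuffling and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + np.array([0.1, -0.1, 0.05, -0.05])

theta, alpha = 0.0, 0.02
for epoch in range(100):
    for i in rng.permutation(len(x)):          # visit data points in random order
        grad_i = (theta * x[i] - y[i]) * x[i]  # gradient of a single data point
        theta -= alpha * grad_i
print(theta)   # close to 2.0, but the updates are noisier than FBGD
```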

3.4 MBGD

MBGD is short for Mini-Batch Gradient Descent. It calculates the gradients of a mini-batch of data points to update the parameters; a minimal sketch follows the list below.

  1. Process:

    1. Calculate the gradients of a mini-batch of data points.
    2. Update the parameters θ using the gradients of the mini-batch.
    3. Repeat the process for all mini-batches in the dataset.
    4. The learning rate α is a hyperparameter that needs to be tuned.
  2. Advantages:

    • Efficiency: MBGD is computationally efficient, as it updates the parameters based on a mini-batch of data points.
    • Memory Usage: MBGD requires less memory compared to FBGD, as it only needs to store a mini-batch of data points at a time.
    • Fast Convergence: MBGD can converge faster than FBGD for large datasets, as it updates the parameters more frequently.
  3. Disadvantages:

    • Choice of Mini-Batch Size: The mini-batch size affects both the convergence and the stability of MBGD. A small mini-batch yields noisy updates, while a large mini-batch approaches the cost of FBGD and slows each update. In practice the size is chosen from empirical results and computational resources, and is typically a power of 2 (e.g. 32, 64, or 128).
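
And a minimal MBGD sketch on the same toy problem, with an illustrative batch size of 2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + np.array([0.1, -0.1, 0.05, -0.05])

theta, alpha, batch_size = 0.0, 0.05, 2
for epoch in range(100):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]               # one mini-batch
        grad = np.mean((theta * x[idx] - y[idx]) * x[idx])  # gradient averaged over the batch
        theta -= alpha * grad
print(theta)   # ~2.0
```

Note how averaging the gradient over each batch smooths the noise of SGD while still updating far more often per epoch than FBGD.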

4 References

  1. https://engineering.purdue.edu/ChanGroup/ECE302/files/Slide_6_03.pdf
  2. https://courses.grainger.illinois.edu/bioe298b/sp2018/Course%20Notes%20(Text)/Chapter07.pdf