Gradient Descent
1 Convex and Non-Convex Optimization
Which value of $x$ minimizes a function $f(x)$? This is the central question of optimization, and how hard it is to answer depends on whether $f$ is convex.
1.1 Convex Functions
A convex function is a continuous function whose value at the midpoint of every interval in its domain does not exceed the arithmetic mean of its values at the ends of the interval.
Definition
A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if for any $x_1, x_2$ in its domain and any $\lambda \in [0, 1]$,
$$f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2).$$
This means the line segment between any two points on the curve of the function lies above or on the curve itself.
For 1D functions, they are convex if $f''(x) \geq 0$ everywhere in the domain; for example, $f(x) = x^2$ is convex since $f''(x) = 2 > 0$. And for a convex function, any local minimum is also a global minimum, so if we suppose $f$ is differentiable, the minimizer $x^\star$ satisfies $\nabla f(x^\star) = 0$. Note that $x$ can be high-dimensional, i.e. $x \in \mathbb{R}^n$.
2 Argmin and Argmax
Definition
The input to a function that yields the minimum is called the argmin, since it is the argument to the function that gives the minimum. Similarly, the argmax of a function is the input that gives the function's maximum.
In notation, $x^\star = \arg\min_x f(x)$, and the value of the function at $x^\star$ is the minimum itself, $\min_x f(x) = f(x^\star)$. For a convex function, $x^\star$ can often be found analytically by solving $\nabla f(x) = 0$.
But if the function is non-convex, we resort to gradient descent, the golden method of non-convex optimization, widely applied in machine learning and deep learning.
3 The Golden Method - Gradient Descent
3.1 Definition
First, let's recap some basic descent methods. Iterative descent methods take the form
$$x_{t+1} = x_t + \Delta x,$$
where $\Delta x$ is a small step intended to decrease the objective. The gradients are just first-order derivatives,
$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)^\top.$$
Hence, gradient descent, namely, following the negative direction of the gradients of the curve, is defined as follows:
$$x_{t+1} = x_t - \alpha \nabla f(x_t)$$
Where,
- $x_{t+1}$ is the new / updated variable after a single step of gradient descent;
- $\alpha$ is the step size or the learning rate that controls the stride;
- $\Delta x = -\alpha \nabla f(x_t)$ is the resulting change of the variable's value.
In this case, we are moving in the negative direction of the gradients. So if the step index $t$ is omitted,
$$x \leftarrow x - \alpha \nabla f(x).$$
Then looking at the function change after changing the variable, i.e.
$$\Delta f = f(x + \Delta x) - f(x) \approx \nabla f(x)^\top \Delta x = -\alpha \, \| \nabla f(x) \|^2 \leq 0.$$
Since $\alpha > 0$ and $\| \nabla f(x) \|^2 \geq 0$, the function value (approximately) decreases at every step. If the result satisfies the accuracy requirement, e.g. $\| \nabla f(x) \| < \epsilon$ for a chosen tolerance $\epsilon$, the iteration stops.
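To make the update rule concrete, here is a minimal sketch in Python; the toy objective $f(x) = x^2$ (gradient $2x$), the learning rate, and the tolerance are illustrative choices, not fixed by the notes:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, eps=1e-6, max_steps=10_000):
    """Follow the negative gradient until the gradient norm drops below eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_steps):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:   # accuracy requirement met: stop
            break
        x = x - alpha * g             # x <- x - alpha * grad f(x)
    return x

# Toy objective f(x) = x^2 with gradient 2x; the minimum is at x = 0.
print(gradient_descent(lambda x: 2 * x, x0=[5.0]))  # ~[0.]
```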
Question
However, the choice of the learning rate $\alpha$ is crucial. What if the learning rate is too small, or too large? (A numerical illustration follows this list.)
- When the learning rate $\alpha$ is too large:
  - Overshooting. A large learning rate may push $x$ far past the minimum, and the algorithm may fail to converge.
  - Oscillations or Divergence. If the learning rate is excessively large, the algorithm may push $x$ away from the minimum. This can lead to oscillations or divergence.
  - Loss of Critical Information. In a non-convex landscape, the algorithm might skip over local minima or saddle points because of excessively large steps.
- When the learning rate $\alpha$ is too small:
  - Slow Convergence. A small learning rate may lead to slow convergence, as the algorithm takes tiny steps towards the minimum.
  - Stuck in Local Minima. A small learning rate may cause the algorithm to get stuck in local minima, as it is unable to escape the current position.
  - Sensitive to Initial Conditions. A small learning rate may make the algorithm sensitive to initial conditions, as it may settle into a local minimum near the initial position $x_0$.
  - Vanishing Updates. A small learning rate may cause the updates to become vanishingly small, leading to slow convergence or stagnation.
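As a quick numerical illustration of both failure modes, the sketch below runs the update $x \leftarrow x - \alpha \cdot 2x$ on $f(x) = x^2$ with three learning rates; the specific values are assumptions chosen for demonstration:

```python
# Effect of the learning rate on f(x) = x^2 (gradient 2x), starting at x = 5.
for alpha in (0.01, 0.5, 1.1):
    x = 5.0
    for _ in range(20):
        x = x - alpha * 2 * x
    print(f"alpha={alpha}: x after 20 steps = {x:.6f}")
# alpha=0.01 -> slow convergence (x is still far from 0 after 20 steps)
# alpha=0.5  -> converges immediately (x = 0 after one step)
# alpha=1.1  -> oscillating divergence (|x| grows and the sign flips each step)
```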
Question
Recall that we used $\Delta f \approx \nabla f(x)^\top \Delta x$. Why do we use an approximation, and under what conditions can the approximation be changed to an equality?
Why Use an Approximation
We use an approximation because in gradient descent, the change in the parameter $x$, denoted as $\Delta x$, is typically small. For small changes, the change in the objective function can be approximated by the product of the gradient and the change in the parameter, $\Delta f \approx \nabla f(x)^\top \Delta x$. This is a first-order Taylor series approximation.
Conditions for the Approximation to Become an Equality
- Infinitesimally Small Step ($\Delta x \to 0$). If the change in the parameter becomes infinitesimally small, the approximation becomes an equality, because in the limit as $\Delta x$ approaches zero the higher-order terms vanish and the first-order expansion is exact.
- Smoothness of the Function. When the function $f$ is smooth and differentiable, the error of the first-order Taylor approximation shrinks faster than $\| \Delta x \|$, so the approximation becomes arbitrarily accurate for small steps.
- Linear Function. If the function $f$ is linear, the approximation is an exact equality for any step size, because the gradient of a linear function is constant and the change in the function is directly proportional to the change in the parameter.
A numeric check of this behavior follows below.
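Here is a quick numeric check, using the assumed toy function $f(x) = x^3$ at $x = 1$, so the true derivative is $3$:

```python
# First-order Taylor check on f(x) = x^3: f'(1) = 3.
f = lambda x: x ** 3
for dx in (0.1, 0.01, 0.001):
    exact = f(1 + dx) - f(1)   # true change in f
    approx = 3 * dx            # gradient * step
    print(dx, exact, approx, exact - approx)
# The gap (exact - approx) = 3*dx^2 + dx^3 shrinks quadratically as dx -> 0,
# so the approximation approaches equality for infinitesimally small steps.
```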
3.2 FBGD
FBGD is short for Full-Batch Gradient Descent. It calculates the gradients of the entire dataset to update the parameters.
Process:
- Calculate the gradients of the entire dataset, by averaging the gradients of each data point.
- Rewrite the update rule as
$$x_{t+1} = x_t - \alpha \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(x_t),$$
where $f_i$ is the loss on the $i$-th of $N$ data points.
- Update the parameters $x$ using the averaged gradients.
- Repeat the process until convergence.
- The learning rate $\alpha$ is a hyperparameter that needs to be tuned.
The true gradient is the average of the gradients of each data point:
$$\nabla f(x) = \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(x).$$
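A minimal sketch of FBGD, assuming a toy dataset where each example contributes $f_i(x) = (x - c_i)^2$; the losses and values are illustrative:

```python
import numpy as np

def fbgd_step(x, grads, alpha):
    """One full-batch step: average all per-example gradients, then update."""
    true_grad = np.mean([g(x) for g in grads], axis=0)  # (1/N) * sum grad_i(x)
    return x - alpha * true_grad

# Toy dataset: f_i(x) = (x - c_i)^2, so grad_i(x) = 2 * (x - c_i).
cs = [1.0, 2.0, 3.0]
grads = [lambda x, c=c: 2 * (x - c) for c in cs]
x = 0.0
for _ in range(100):
    x = fbgd_step(x, grads, alpha=0.1)
print(x)  # converges to the mean of cs, i.e. 2.0
```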
Advantages:
- Convergence: FBGD can converge to the global minimum of the loss function, given the right learning rate and other hyperparameters.
- Stability: FBGD is stable and can provide consistent updates to the parameters.
- Optimal Solution: FBGD can find the optimal solution for convex functions.
Disadvantages:
- Computational Cost: FBGD can be computationally expensive, especially for large datasets, as it requires calculating the gradients for all data points.
- Memory Usage: FBGD requires storing the entire dataset in memory to calculate the gradients, which can be memory-intensive.
- Slow Convergence: FBGD may converge slowly for large datasets, as it updates the parameters based on the average gradients of all data points.
3.3 SGD
SGD is short for Stochastic Gradient Descent. It calculates the gradients of a single data point to update the parameters.
Process:
- Calculate the gradient of a single data point.
- Update the parameters $x$ using the gradient of the single data point: $x_{t+1} = x_t - \alpha \nabla f_i(x_t)$ (see the sketch below).
- Repeat the process for all data points in the dataset.
- The learning rate $\alpha$ is a hyperparameter that needs to be tuned.
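A minimal sketch of one SGD pass, reusing the assumed toy losses $f_i(x) = (x - c_i)^2$; the shuffling and values are illustrative:

```python
import random

def sgd_epoch(x, cs, alpha):
    """One pass over the dataset, stepping on a single example's gradient."""
    random.shuffle(cs)               # visit data points in random order
    for c in cs:
        x = x - alpha * 2 * (x - c)  # grad_i(x) = 2 * (x - c_i)
    return x

x, cs = 0.0, [1.0, 2.0, 3.0]
for _ in range(50):
    x = sgd_epoch(x, cs, alpha=0.05)
print(x)  # hovers near 2.0, with noise from the single-example updates
```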
Advantages:
- Efficiency: SGD is computationally efficient, as it updates the parameters based on a single data point at a time.
- Memory Usage: SGD requires less memory compared to FBGD, as it only needs to store a single data point at a time.
- Fast Convergence: SGD can converge faster than FBGD for large datasets, as it updates the parameters more frequently.
Disadvantages:
- Convergence: SGD may not converge to the global minimum, as it updates the parameters based on a single data point at a time.
- Stability: SGD may provide noisy updates to the parameters, as it updates the parameters based on individual data points.
- Optimal Solution: SGD may not find the optimal solution for convex functions, as it updates the parameters based on individual data points.
3.4 MBGD
MBGD is short for Mini-Batch Gradient Descent. It calculates the gradients of a mini-batch of data points to update the parameters.
Process:
- Calculate the gradients of a mini-batch of data points.
- Update the parameters $x$ using the averaged gradient of the mini-batch $B$: $x_{t+1} = x_t - \alpha \cdot \frac{1}{|B|} \sum_{i \in B} \nabla f_i(x_t)$ (see the sketch below).
- Repeat the process for all mini-batches in the dataset.
- The learning rate $\alpha$ is a hyperparameter that needs to be tuned.
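A minimal sketch of one MBGD pass on the same assumed toy losses $f_i(x) = (x - c_i)^2$; the batch size and values are illustrative:

```python
import numpy as np

def mbgd_epoch(x, cs, alpha, batch_size):
    """One pass over the dataset in shuffled mini-batches."""
    cs = np.random.permutation(cs)          # shuffle, then split into batches
    for i in range(0, len(cs), batch_size):
        batch = cs[i:i + batch_size]
        grad = np.mean(2 * (x - batch))     # average gradient over the batch
        x = x - alpha * grad
    return x

x = 0.0
cs = np.arange(1.0, 9.0)                    # eight data points, mean 4.5
for _ in range(100):
    x = mbgd_epoch(x, cs, alpha=0.05, batch_size=4)
print(x)  # near 4.5, with less noise than pure SGD
```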
Advantages:
- Efficiency: MBGD is computationally efficient, as it updates the parameters based on a mini-batch of data points.
- Memory Usage: MBGD requires less memory compared to FBGD, as it only needs to store a mini-batch of data points at a time.
- Fast Convergence: MBGD can converge faster than FBGD for large datasets, as it updates the parameters more frequently.
Disadvantages:
- Choice of Mini-Batch Size: The choice of mini-batch size can affect the convergence and stability of MBGD. A small mini-batch size may lead to noisy updates, while a large mini-batch size may slow down convergence. Usually, the mini-batch size is chosen based on empirical results and computational resources, typically a power of 2 (e.g. 32, 64, or 128).