Optimization Methods
Disclaimer: These are my personal notes compiled for my own reference and learning. They may contain errors, incomplete information, or personal interpretations. While I strive for accuracy, these notes are not peer-reviewed and should not be considered authoritative sources. Please consult official textbooks, research papers, or other reliable sources for academic or professional purposes.
1. Introduction to Optimization
Optimization is the process of finding the best solution from a set of available alternatives. In machine learning, we typically minimize a loss function over the model parameters $\theta$:
$$\min_{\theta} L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i; \theta), y_i)$$
where $f$ is the model, $\ell$ is a per-sample loss, and $(x_i, y_i)$ are the training examples.
2. Gradient Descent
The fundamental optimization algorithm for machine learning:
$$\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$$
where $\alpha$ is the learning rate and $\nabla L(\theta_t)$ is the gradient.
2.1 Batch Gradient Descent
Uses the entire dataset to compute each gradient:
$$\theta_{t+1} = \theta_t - \alpha \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(\theta_t)$$
Pros: Stable convergence; for convex functions with a suitable learning rate it converges to the global minimum.
Cons: Slow for large datasets; every update requires a full pass over the data, typically held in memory.
2.2 Stochastic Gradient Descent (SGD)
Uses a single randomly chosen sample per update:
$$\theta_{t+1} = \theta_t - \alpha \nabla L_i(\theta_t), \quad i \sim \text{Uniform}\{1, \dots, N\}$$
Pros: Cheap, frequent updates; the gradient noise can help escape shallow local minima.
Cons: Noisy convergence; without a decaying learning rate it oscillates around the minimum rather than converging to it exactly.
2.3 Mini-batch Gradient Descent
Uses small batches of samples:
$$\theta_{t+1} = \theta_t - \alpha \frac{1}{|B|} \sum_{i \in B} \nabla L_i(\theta_t)$$
where $B$ is a mini-batch. This balances the efficiency of SGD with the stability of batch gradient descent and is the default choice in practice.
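A minimal sketch of the three variants on a toy linear least-squares problem (the data `X`, `y` and the loss are my own illustrative choices); the only difference between them is how many samples enter each gradient estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))                                  # toy design matrix
y = X @ np.array([1., -2., 0.5, 3., 0.]) + 0.1 * rng.normal(size=256)

def grad(theta, Xb, yb):
    """Gradient of the mean squared error 0.5 * ||Xb @ theta - yb||^2 / len(yb)."""
    return Xb.T @ (Xb @ theta - yb) / len(yb)

theta = np.zeros(5)
alpha, batch_size = 0.1, 32
for epoch in range(100):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):   # batch_size = len(y) -> batch GD, 1 -> SGD
        b = idx[start:start + batch_size]
        theta -= alpha * grad(theta, X[b], y[b])

print(theta)   # should approach the true coefficients [1, -2, 0.5, 3, 0]
```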
3. Momentum Methods
3.1 Classical Momentum
Accelerates SGD by accumulating a velocity of past gradients:
$$v_{t+1} = \beta v_t - \alpha \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$$
where $\beta \in [0, 1)$ is the momentum coefficient (typically 0.9).
3.2 Nesterov Accelerated Gradient (NAG)
Evaluates the gradient at a look-ahead point before taking the step:
$$v_{t+1} = \beta v_t - \alpha \nabla L(\theta_t + \beta v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$$
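A minimal NumPy sketch of classical momentum versus Nesterov's look-ahead, written against an arbitrary `grad(theta)` callable (the quadratic used below is my own illustrative example); the only difference is the point at which the gradient is evaluated:

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    """Classical momentum: gradient evaluated at the current point."""
    v = beta * v - alpha * grad(theta)
    return theta + v, v

def nesterov_step(theta, v, grad, alpha=0.01, beta=0.9):
    """Nesterov: gradient evaluated at the look-ahead point theta + beta * v."""
    v = beta * v - alpha * grad(theta + beta * v)
    return theta + v, v

# Usage on a simple ill-conditioned quadratic f(x) = 0.5 * x.T @ A @ x
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad)
print(theta)   # close to the minimum at the origin
```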
4. Adaptive Learning Rate Methods
4.1 AdaGrad
Adapts the learning rate per parameter based on the history of squared gradients:
$$G_t = G_{t-1} + g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t} + \epsilon} \odot g_t$$
where $g_t = \nabla L(\theta_t)$, $\odot$ denotes element-wise multiplication, and $\epsilon$ prevents division by zero. Because $G_t$ only grows, the effective learning rate shrinks monotonically.
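A minimal AdaGrad sketch in the same functional style as the later code example (function and variable names are my own):

```python
import numpy as np

def adagrad_update(theta, grad, G, alpha=0.1, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, scale the step per coordinate."""
    G = G + grad * grad                                # element-wise accumulation
    theta = theta - alpha * grad / (np.sqrt(G) + eps)
    return theta, G

# Usage on f(x) = 0.5 * ||x||^2, whose gradient at theta is simply theta
theta, G = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    theta, G = adagrad_update(theta, grad=theta, G=G)
print(theta)   # shrinks toward zero, but progress slows as G keeps growing
```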
4.2 RMSprop
Uses an exponential moving average of squared gradients instead of a running sum, preventing the aggressive decay of AdaGrad's learning rates:
$$E[g^2]_t = \rho\, E[g^2]_{t-1} + (1 - \rho)\, g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t} + \epsilon} \odot g_t$$
where $\rho$ is the decay rate (typically 0.9).
4.3 Adam (Adaptive Moment Estimation)
Combines momentum (first moment) and adaptive learning rates (second moment), with bias correction for the zero-initialized averages:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t \odot g_t$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$ are typical values.
5. Second-Order Methods
5.1 Newton's Method
Uses second-order information (the Hessian matrix) to rescale the step:
$$\theta_{t+1} = \theta_t - H(\theta_t)^{-1} \nabla L(\theta_t)$$
where $H(\theta) = \nabla^2 L(\theta)$ is the Hessian matrix. Near a minimum convergence is quadratic, but forming and solving with $H$ costs $O(d^2)$ memory and $O(d^3)$ time, which is prohibitive for large models.
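A minimal Newton's-method sketch on the Rosenbrock function used in the code example of Section 12; the analytic Hessian is derived here and is not part of the original notes:

```python
import numpy as np

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])

def rosenbrock_hessian(p):
    x, y = p
    return np.array([[2 - 400 * (y - x**2) + 800 * x**2, -400 * x],
                     [-400 * x, 200.0]])

p = np.array([-1.5, 2.0])
for _ in range(20):
    # Newton step: solve H d = -grad rather than forming the inverse explicitly
    d = np.linalg.solve(rosenbrock_hessian(p), -rosenbrock_grad(p))
    p = p + d
print(p)   # from this start, pure Newton reaches the minimum (1, 1) well within 20 iterations
```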
5.2 Quasi-Newton Methods
BFGS: Builds an approximation to the (inverse) Hessian from successive gradient differences, avoiding explicit second derivatives.
L-BFGS: Limited-memory version of BFGS that keeps only the most recent gradient/step pairs; more practical for large problems.
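In practice one rarely hand-codes these; as a quick sketch, SciPy's `minimize` exposes a limited-memory BFGS solver (`L-BFGS-B`), shown here on SciPy's built-in Rosenbrock function (assuming SciPy is available):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS-B is SciPy's limited-memory BFGS implementation (with optional bounds);
# rosen / rosen_der are SciPy's built-in Rosenbrock function and its gradient.
res = minimize(rosen, x0=np.array([-1.5, 2.0]), jac=rosen_der, method="L-BFGS-B")
print(res.x, res.nit)   # expected: approximately [1, 1] after a few dozen iterations
```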
6. Learning Rate Scheduling
6.1 Step Decay
Drops the learning rate by a constant factor every $s$ steps:
$$\alpha_t = \alpha_0\, \gamma^{\lfloor t/s \rfloor}$$
where $\gamma < 1$ is the decay factor and $s$ is the step size.
6.2 Exponential Decay
$$\alpha_t = \alpha_0\, e^{-k t}$$
where $k$ controls the decay speed.
6.3 Cosine Annealing
$$\alpha_t = \alpha_{\min} + \tfrac{1}{2}\left(\alpha_{\max} - \alpha_{\min}\right)\left(1 + \cos\frac{\pi t}{T}\right)$$
where $T$ is the length of the annealing cycle.
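A small sketch of the three schedules as plain functions of the step $t$ (symbols follow the formulas above; the values of $\alpha_0$, $\gamma$, $s$, $k$, and $T$ are illustrative):

```python
import numpy as np

def step_decay(t, alpha0=0.1, gamma=0.5, s=30):
    """alpha_t = alpha0 * gamma^floor(t / s)"""
    return alpha0 * gamma ** (t // s)

def exponential_decay(t, alpha0=0.1, k=0.01):
    """alpha_t = alpha0 * exp(-k * t)"""
    return alpha0 * np.exp(-k * t)

def cosine_annealing(t, alpha_max=0.1, alpha_min=0.0, T=100):
    """alpha_t = alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + cos(pi * t / T))"""
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + np.cos(np.pi * t / T))

for t in (0, 50, 100):
    print(t, step_decay(t), exponential_decay(t), cosine_annealing(t))
```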
7. Regularization in Optimization
7.1 L1 Regularization (Lasso)
$$L_{\text{reg}}(\theta) = L(\theta) + \lambda \|\theta\|_1$$
Promotes sparsity in the parameters.
7.2 L2 Regularization (Ridge)
$$L_{\text{reg}}(\theta) = L(\theta) + \frac{\lambda}{2} \|\theta\|_2^2$$
Penalizes large weights; equivalent to weight decay under plain SGD (the equivalence does not hold for adaptive methods such as Adam).
7.3 Elastic Net
Combines both penalties:
$$L_{\text{reg}}(\theta) = L(\theta) + \lambda_1 \|\theta\|_1 + \frac{\lambda_2}{2} \|\theta\|_2^2$$
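A one-line illustration of why the L2 penalty acts as weight decay under plain SGD: the regularized gradient adds $\lambda\theta$, so each update multiplicatively shrinks the weights (an elastic-net version would add a $\lambda_1\,\text{sign}(\theta)$ term as well). The helper name is my own:

```python
import numpy as np

def sgd_step_l2(theta, grad, alpha=0.01, lam=0.1):
    """SGD on L(theta) + (lam/2)*||theta||^2: same as decaying theta by (1 - alpha*lam)."""
    return theta - alpha * (grad + lam * theta)    # == (1 - alpha*lam) * theta - alpha * grad

theta = np.array([1.0, -2.0])
print(sgd_step_l2(theta, grad=np.zeros(2)))        # pure decay: [0.999, -1.998]
```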
8. Constrained Optimization
8.1 Lagrange Multipliers
For equality constraints $g(x) = 0$, form the Lagrangian
$$\mathcal{L}(x, \lambda) = f(x) + \lambda\, g(x)$$
and solve $\nabla_x \mathcal{L} = 0$ together with $g(x) = 0$.
8.2 KKT Conditions
For inequality constraints $g(x) \leq 0$:
- Stationarity: $\nabla f(x^*) + \sum_i \lambda_i \nabla g_i(x^*) = 0$
- Primal feasibility: $g_i(x^*) \leq 0$
- Dual feasibility: $\lambda_i \geq 0$
- Complementary slackness: $\lambda_i g_i(x^*) = 0$
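As a sanity check, the equality-constrained toy problem "minimize $x^2 + y^2$ subject to $x + y = 1$" (my own illustrative example; the Lagrange conditions give $x = y = 1/2$) can be solved numerically with SciPy's SLSQP solver, assuming SciPy is available:

```python
import numpy as np
from scipy.optimize import minimize

objective = lambda p: p[0]**2 + p[1]**2
constraint = {"type": "eq", "fun": lambda p: p[0] + p[1] - 1}   # written in g(x) = 0 form

res = minimize(objective, x0=np.array([0.0, 0.0]),
               constraints=[constraint], method="SLSQP")
print(res.x)   # expected: approximately [0.5, 0.5]
```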
9. Optimization for Deep Learning
9.1 Gradient Clipping
Prevents exploding gradients by rescaling the gradient when its norm exceeds a threshold $c$:
$$g \leftarrow g \cdot \min\!\left(1, \frac{c}{\|g\|}\right)$$
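A minimal clip-by-norm helper matching the formula above (the function name and threshold are my own; deep learning frameworks ship their own versions):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

print(clip_by_norm(np.array([3.0, 4.0]), max_norm=1.0))   # [0.6, 0.8], norm 1
```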
9.2 Batch Normalization Impact
Smooths the loss landscape, allowing higher learning rates.
9.3 Learning Rate Warmup
Gradually increase the learning rate over the first iterations of training, which stabilizes early updates (especially with adaptive optimizers and large batch sizes).
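A common concrete choice (my own illustrative version) is a linear ramp over the first few hundred steps, after which the base schedule takes over:

```python
def warmup_lr(t, base_lr=0.1, warmup_steps=500):
    """Linearly ramp the learning rate from ~0 to base_lr over warmup_steps."""
    if t < warmup_steps:
        return base_lr * (t + 1) / warmup_steps
    return base_lr

print(warmup_lr(0), warmup_lr(250), warmup_lr(1000))   # 0.0002, 0.0502, 0.1
```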
10. Non-convex Optimization
Most deep learning problems are non-convex. Key challenges:
- Local minima: may not be global optima
- Saddle points: the gradient is zero but the point is neither a minimum nor a maximum (the Hessian has both positive and negative eigenvalues)
- Plateau regions: flat areas where gradients are very small and progress stalls
11. Hyperparameter Optimization
11.1 Grid Search
Exhaustive search over parameter grid.
11.2 Random Search
Random sampling from parameter distributions.
11.3 Bayesian Optimization
Uses a probabilistic surrogate model of the objective (e.g. a Gaussian process) to decide which configuration to evaluate next.
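A minimal random-search sketch over a log-uniform learning-rate range (the objective below is a stand-in of my own; in practice it would be a validation score):

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_loss(lr):
    """Stand-in objective with a minimum near lr = 1e-2 (purely illustrative)."""
    return (np.log10(lr) + 2) ** 2

# Sample learning rates log-uniformly in [1e-5, 1e-1] and keep the best one
candidates = 10 ** rng.uniform(-5, -1, size=20)
best = min(candidates, key=validation_loss)
print(f"best lr ~ {best:.2e}")
```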
12. Code Example
```python
# Python implementation of optimization algorithms
import numpy as np
import matplotlib.pyplot as plt


class Optimizer:
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def update(self, params, gradients):
        raise NotImplementedError


class SGD(Optimizer):
    def __init__(self, learning_rate=0.01, momentum=0.0):
        super().__init__(learning_rate)
        self.momentum = momentum
        self.velocity = None

    def update(self, params, gradients):
        if self.velocity is None:
            self.velocity = np.zeros_like(params)
        self.velocity = self.momentum * self.velocity - self.learning_rate * gradients
        params += self.velocity
        return params


class Adam(Optimizer):
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        super().__init__(learning_rate)
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None
        self.v = None
        self.t = 0

    def update(self, params, gradients):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        self.t += 1
        # Update biased first and second moment estimates
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        self.v = self.beta2 * self.v + (1 - self.beta2) * (gradients ** 2)
        # Bias correction
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # Update parameters
        params -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
        return params


class RMSprop(Optimizer):
    def __init__(self, learning_rate=0.001, decay_rate=0.9, epsilon=1e-8):
        super().__init__(learning_rate)
        self.decay_rate = decay_rate
        self.epsilon = epsilon
        self.cache = None

    def update(self, params, gradients):
        if self.cache is None:
            self.cache = np.zeros_like(params)
        self.cache = self.decay_rate * self.cache + (1 - self.decay_rate) * (gradients ** 2)
        params -= self.learning_rate * gradients / (np.sqrt(self.cache) + self.epsilon)
        return params


def rosenbrock(x, y):
    """Rosenbrock function - classic optimization test function"""
    return (1 - x)**2 + 100 * (y - x**2)**2


def rosenbrock_gradient(x, y):
    """Gradient of the Rosenbrock function"""
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])


def optimize_function(optimizer, start_point, num_iterations=1000):
    """Optimize the Rosenbrock function using the given optimizer"""
    point = np.array(start_point, dtype=float)
    history = [point.copy()]
    for _ in range(num_iterations):
        grad = rosenbrock_gradient(point[0], point[1])
        point = optimizer.update(point, grad)
        history.append(point.copy())
    return np.array(history)


# Example usage
if __name__ == "__main__":
    start_point = [-1.5, 2.0]
    num_iterations = 1000

    # Test different optimizers (each instance is freshly constructed, so no state reset is needed)
    optimizers = {
        'SGD': SGD(learning_rate=0.001),
        'SGD with Momentum': SGD(learning_rate=0.001, momentum=0.9),
        'Adam': Adam(learning_rate=0.01),
        'RMSprop': RMSprop(learning_rate=0.01)
    }

    results = {}
    for name, optimizer in optimizers.items():
        history = optimize_function(optimizer, start_point, num_iterations)
        results[name] = history
        final_point = history[-1]
        final_value = rosenbrock(final_point[0], final_point[1])
        print(f"{name}: Final point = ({final_point[0]:.4f}, {final_point[1]:.4f}), "
              f"Final value = {final_value:.6f}")

    # Plot optimization paths
    plt.figure(figsize=(12, 8))

    # Create contour plot
    x = np.linspace(-2, 2, 100)
    y = np.linspace(-1, 3, 100)
    X, Y = np.meshgrid(x, y)
    Z = rosenbrock(X, Y)
    plt.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.6)
    plt.colorbar(label='Function value')

    # Plot optimization paths
    colors = ['red', 'blue', 'green', 'orange']
    for i, (name, history) in enumerate(results.items()):
        plt.plot(history[:, 0], history[:, 1], color=colors[i],
                 label=name, linewidth=2, alpha=0.7)
        plt.plot(history[0, 0], history[0, 1], 'ko', markersize=8)                    # start
        plt.plot(history[-1, 0], history[-1, 1], 'o', color=colors[i], markersize=8)  # end

    plt.plot(1, 1, 'r*', markersize=15, label='Global minimum')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('Optimization Paths on Rosenbrock Function')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
```