Optimizer¶
An optimizer is an algorithm or method used to update the parameters of a machine learning model (e.g., weights of a neural network) in order to minimize the loss function. It implements a specific variant of gradient‑based optimization, determining how the model learns from the data.
Core Idea¶
During training, we compute the gradient of the loss with respect to each parameter. The optimizer then uses these gradients to adjust the parameters in the direction that reduces the loss. The way this adjustment is performed—how large a step to take, how to adapt the step size over time, how to incorporate past gradients—defines the optimizer.
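Regardless of the specific algorithm, the optimizer sits at the same point in the training loop: after gradients have been computed, it applies its update rule to the parameters. A minimal PyTorch-style sketch of that loop; the tiny linear model, random data, and hyperparameters are placeholders, not recommendations:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                   # stand-in for a real network
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()                      # clear gradients from the previous step
loss = loss_fn(model(inputs), targets)     # forward pass and loss
loss.backward()                            # compute gradient of the loss w.r.t. every parameter
optimizer.step()                           # apply the optimizer's update rule to the parameters
```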
Key Properties¶
- Learning rate (η): A hyperparameter that controls the step size. Too high can cause divergence; too low slows convergence.
- Convergence guarantees: Some optimizers guarantee convergence to a local minimum under certain conditions.
- Computational efficiency: Memory and time per iteration.
- Adaptability: Whether the optimizer adapts the learning rate per parameter.
Common Optimizers¶
1. Basic Stochastic Gradient Descent (SGD)¶
Updates parameters using the current (mini‑batch) gradient (a short code sketch follows the list below):
\[ \theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t) \]
- Pros: Simple, well‑understood.
- Cons: Can be slow, sensitive to learning rate choice, and may oscillate in ravines.
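A minimal NumPy sketch of the plain SGD rule; the function name `sgd_step` and the toy objective \( \theta^2 \) are illustrative choices, not a standard API:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """One plain SGD update: theta <- theta - lr * grad."""
    return theta - lr * grad

# Toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
for _ in range(100):
    theta = sgd_step(theta, 2 * theta, lr=0.1)
# theta is now close to 0, the minimizer of f
```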
2. SGD with Momentum¶
Accumulates a velocity vector to smooth updates and dampen oscillations (see the sketch after the list below):
\[ v_{t+1} = \beta v_t + \nabla_\theta L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta \, v_{t+1} \]
- Typical β = 0.9.
- Helps accelerate convergence in relevant directions.
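A minimal NumPy sketch of the momentum update in the formulation above; the names and hyperparameters are illustrative:

```python
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.1, beta=0.9):
    """SGD with momentum: accumulate a velocity, then step along it."""
    velocity = beta * velocity + grad
    return theta - lr * velocity, velocity

# Toy objective f(theta) = theta^2, gradient 2 * theta.
theta, velocity = np.array([1.0]), np.zeros(1)
for _ in range(100):
    theta, velocity = momentum_step(theta, 2 * theta, velocity)
```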
3. Nesterov Accelerated Gradient (NAG)¶
A variation of momentum that looks ahead, evaluating the gradient at the anticipated position \( \theta_t - \eta \beta v_t \) rather than at the current parameters:
\[ v_{t+1} = \beta v_t + \nabla_\theta L(\theta_t - \eta \beta v_t), \qquad \theta_{t+1} = \theta_t - \eta \, v_{t+1} \]
- Often converges faster than plain momentum, because the look‑ahead gradient makes the update more responsive to changes in the loss surface.
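In practice NAG is usually enabled as a flag on an existing SGD implementation; for example, PyTorch's `torch.optim.SGD` accepts `nesterov=True` (the model below is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
# nesterov=True switches the momentum update to the look-ahead (NAG) form
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
```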
4. AdaGrad¶
Adapts the learning rate per parameter based on the sum of past squared gradients (see the sketch after the list below):
\[ G_t = G_{t-1} + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \, g_t \]
where \( G_t \) accumulates squared gradients element‑wise and \( \epsilon \) is a small constant for numerical stability.
- Pros: Good for sparse features (e.g., NLP).
- Cons: Learning rate may shrink too aggressively, stopping learning.
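A minimal NumPy sketch of the AdaGrad rule above; the names are illustrative:

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.1, eps=1e-8):
    """AdaGrad: scale each coordinate's step by the root of its accumulated squared gradients."""
    G = G + grad ** 2                              # G_t: per-parameter sum of squared gradients
    return theta - lr * grad / (np.sqrt(G) + eps), G

# A rarely-updated (sparse) coordinate keeps a comparatively large effective learning rate.
theta, G = np.array([1.0, -2.0]), np.zeros(2)
theta, G = adagrad_step(theta, np.array([0.5, 0.0]), G)
```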
5. RMSProp¶
Modifies AdaGrad by using an exponential moving average of squared gradients instead of a running sum:
\[ E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \, g_t \]
- Works well in non‑stationary settings (e.g., online learning, RNNs).
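A minimal NumPy sketch of the RMSProp rule above, with illustrative names and hyperparameters:

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients instead of AdaGrad's running sum."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    return theta - lr * grad / (np.sqrt(avg_sq) + eps), avg_sq

theta, avg_sq = np.array([1.0]), np.zeros(1)
theta, avg_sq = rmsprop_step(theta, 2 * theta, avg_sq)
```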
6. Adam (Adaptive Moment Estimation)¶
Combines momentum and RMSProp. Maintains both the first moment (mean) and the second moment (uncentered variance) of the gradients, with bias correction for their zero initialization (see the sketch after the list below):
\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \]
\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t \]
- Pros: Often works well out‑of‑the‑box, handles sparse gradients, adaptive.
- Cons: Can sometimes generalize slightly worse than well‑tuned SGD.
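A minimal NumPy sketch of one Adam step, including the bias correction; the names and the toy usage are illustrative:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                 # correct the bias from zero initialization
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy run on f(theta) = theta^2; the large lr is only to make the toy problem move quickly.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 101):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.1)
```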
7. AdamW¶
A variant of Adam that decouples weight decay from the gradient update, improving generalization.
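In frameworks this is typically a drop‑in replacement for Adam; for example, PyTorch exposes it as `torch.optim.AdamW`, where `weight_decay` is applied directly to the weights rather than folded into the gradient (the model below is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
# weight_decay is applied to the weights themselves each step, not added to the gradient
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```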
8. Other Optimizers¶
- Nadam: Adam with Nesterov momentum.
- AMSGrad: Fixes a convergence issue in Adam by taking the maximum of past second‑moment estimates instead of their exponential average.
- LAMB: Layer‑wise Adaptive Moments for large‑batch training.
Choosing an Optimizer¶
- Default choice: Adam is often a good starting point because it works well across many tasks with minimal tuning.
- When to use SGD with Momentum: For computer vision tasks (CNNs) and when you want to carefully tune learning rate schedules; can yield better generalization.
- For sparse data: AdaGrad or Adam.
- For large‑batch training: LAMB, or AdamW with appropriate learning‑rate scaling and warm‑up.
- For RNNs/Transformers: Adam/AdamW is commonly used.
Learning Rate Schedules¶
Optimizers often work in tandem with learning rate schedules (step decay, exponential decay, cosine annealing, warm‑up) to improve convergence.
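A minimal PyTorch sketch pairing an optimizer with cosine annealing; the model, epoch count, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Decay the learning rate from 0.1 toward ~0 over 100 epochs following a cosine curve
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... run one epoch of forward/backward passes and optimizer.step() calls ...
    scheduler.step()  # then advance the schedule once per epoch
```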
Summary¶
The optimizer determines how a model learns from gradients. Selecting the right optimizer and tuning its hyperparameters (learning rate, momentum, etc.) is crucial for efficient training and good final performance. Modern deep learning frameworks (TensorFlow, PyTorch) provide these optimizers and allow easy customization.