
Optimizer Scaling Law Part 1: AdamW #725

@WhenWen

Description


To choose the best optimizer for pretraining experiments, we need to select both (1) the optimizer algorithm (e.g., AdamW, RMSProp) and (2) the hyperparameters for that optimizer, conditioned on the model and data scale. Different optimizers should be compared with hyperparameters tuned to near-optimal for each optimizer.

However, a brute-force grid search over hyperparameters is infeasible for large-scale models or data. We therefore need a principled way to set hyperparameters near-optimally across model and data scales. In this set of experiments, we start with AdamW.

Methodology

Local Optima Seeking: 1D Grid Search + Coordinate Descent

Fixing a model scale and a data scale, we use the following procedure to determine a locally optimal hyperparameter set:

  1. Select Hyperparameters

Choose a set of hyperparameters for the optimizer. For AdamW, the hyperparameters include the learning rate ($\eta$), final learning rate ratio ($\eta_{\min}/\eta$), warmup steps ($s$), momentum/decay coefficients ($\beta_1$, $\beta_2$), weight decay ($\lambda$), $\epsilon$, and the gradient clipping constant $c$.

  2. Define Search Space

Use a discrete grid for each hyperparameter. For AdamW, an example table is:

| Hyperparameter | Example Grid |
| --- | --- |
| Learning Rate ($\eta$) | $\{1 \times 10^{-3}, 2 \times 10^{-3}, 4 \times 10^{-3}, 8 \times 10^{-3}\}$ |
| Final Learning Rate Ratio ($\eta_{\min}/\eta$) | $\{0, 0.05, 0.1\}$ |
| Warmup Steps ($s$) | $\{1\text{k}, 2\text{k}, 4\text{k}, 8\text{k}\}$ |
| $\beta_1$ | $\{0.9, 0.95, 0.98, 0.99\}$ |
| Weight Decay ($\lambda$) | $\{0, 5 \times 10^{-2}, 1 \times 10^{-1}, 2 \times 10^{-1}\}$ |
| $\beta_2$ | $\{0.9, 0.95, 0.98, 0.99\}$ |
| $\epsilon$ | $\{1 \times 10^{-10}, 1 \times 10^{-20}, 1 \times 10^{-30}\}$ |
| $c$ | $\{0\ (\text{no clipping}), 0.5, 1\}$ |

Keep grids small enough for feasible training runs but wide enough to capture important variations.

  3. Establish a Baseline

Set an initial hyperparameter configuration as a baseline, and use this baseline to begin coordinate descent.

  4. Coordinate Descent in the hyperparameter space
     1. Fix all hyperparameters except one.
     2. Sweep over the chosen hyperparameter (using its grid).
     3. Train (for a set number of steps or epochs) and evaluate performance.
     4. Select the best value based on validation metrics.
     5. Update the baseline with that best value.
     6. Repeat for the next hyperparameter.
     7. Do multiple passes until convergence.
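The coordinate-descent loop above can be sketched in a few lines. The helper `train_and_eval`, which would run a full training job and return validation loss, is hypothetical and stands in for the real pipeline:

```python
def coordinate_descent(grids, baseline, train_and_eval, max_passes=3):
    """Greedy 1D sweeps: optimize one hyperparameter at a time,
    holding all others at the current baseline values."""
    best = dict(baseline)
    best_loss = train_and_eval(best)
    for _ in range(max_passes):
        improved = False
        for name, grid in grids.items():
            for value in grid:
                if value == best[name]:
                    continue  # already at this value
                candidate = {**best, name: value}
                loss = train_and_eval(candidate)  # one full training run
                if loss < best_loss:
                    best, best_loss = candidate, loss
                    improved = True
        if not improved:  # a full pass changed nothing: converged
            break
    return best, best_loss
```

Each inner sweep is a 1D grid search, so the total cost per pass is the sum of the grid sizes rather than their product.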

Determine Sensitive Hyperparameters

A useful byproduct of the coordinate descent procedure is that we learn how the loss varies along each hyperparameter dimension. Suppose we perform the experiments at three scales: (0.1B params, 10B tokens), (0.1B params, 50B tokens), and (0.5B params, 10B tokens). We can then group the hyperparameters by the following two criteria:

  1. Whether the optimum shifts when we scale the data or the model (for the learning rate, we can use the theoretical insight that $\eta \sqrt{\text{width}}$ should remain approximately constant)
  2. Whether the loss is sensitive to this hyperparameter

If both criteria hold, we identify the hyperparameter as sensitive (to scale). The others we fix to constants. We can also see whether the optimum shift is driven by scaling the data or by scaling the model.
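As a sketch, the two criteria could be operationalized like this; the thresholds `shift_tol` and `loss_tol` are illustrative assumptions, not values from this issue:

```python
def classify_hyperparameter(optima_by_scale, loss_ranges,
                            shift_tol=0.5, loss_tol=0.01):
    """Flag a hyperparameter as scale-sensitive if (1) its optimum moves
    across (model, data) scales by more than shift_tol in relative terms,
    and (2) the loss varies by more than loss_tol across its 1D sweep.

    optima_by_scale: {scale_name: optimal value at that scale}
    loss_ranges:     {scale_name: max loss - min loss over the sweep}
    """
    values = list(optima_by_scale.values())
    lo, hi = min(values), max(values)
    optimum_shifts = (hi - lo) / max(abs(lo), 1e-12) > shift_tol
    loss_sensitive = max(loss_ranges.values()) > loss_tol
    return optimum_shifts and loss_sensitive
```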

Scaling Law for Sensitive Hyperparameters

After identifying the sensitive hyperparameters, we perform coordinate descent restricted to that set (holding the insensitive hyperparameters fixed) over more model and data scales, to find parameterized scaling rules for the important hyperparameters.
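One plausible parameterized form is a power law in parameter count $N$ and token count $D$, fit by least squares in log space. The sketch below uses synthetic optima; the functional form is an assumption for illustration, not a result from this issue:

```python
import numpy as np

def fit_power_law(N, D, h_opt):
    """Fit h_opt ~ a * N**p * D**q via linear least squares in log space:
    log h = log a + p log N + q log D."""
    X = np.column_stack([np.ones_like(N), np.log(N), np.log(D)])
    coef, *_ = np.linalg.lstsq(X, np.log(h_opt), rcond=None)
    return np.exp(coef[0]), coef[1], coef[2]

# Synthetic example: pretend the measured optima follow 0.1 * N^-0.5
N = np.array([1e8, 1e8, 5e8, 5e8])
D = np.array([1e10, 5e10, 1e10, 5e10])
h = 0.1 * N ** -0.5
a, p, q = fit_power_law(N, D, h)  # recovers a ~ 0.1, p ~ -0.5, q ~ 0
```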

Related Works

  1. https://arxiv.org/abs/2407.05872
  2. https://arxiv.org/abs/2407.07972
  3. https://arxiv.org/abs/2410.21676

Hypothesis or Goal

For AdamW, we likely already know how to scale most of the hyperparameters. More precisely, the conjecture is:
Non-sensitive hyperparameters:

  1. the learning rate $\eta$ should scale inversely with $\text{width}$
  2. weight decay should be 0.1, and gradient clipping should be 1
  3. $\epsilon$ should be insensitive (barring numerical issues)

Potentially sensitive hyperparameters:

  1. Reportedly, higher $\beta_1$ and $\beta_2$ are beneficial for long-duration training. However, I don't think a proper scaling law exists.
  2. The warmup step count is mysterious. Whether it should scale with training duration or stay constant (I lean towards the latter) is unknown.
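To make the warmup question concrete: a common schedule consistent with the hyperparameters above is linear warmup for $s$ steps followed by cosine decay to $\eta_{\min}$. This schedule shape is an assumption for illustration; the issue does not specify it:

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_steps, min_lr_ratio):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay
    down to peak_lr * min_lr_ratio at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    min_lr = peak_lr * min_lr_ratio
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Under the "warmup stays constant" hypothesis, `warmup_steps` would be fixed while `total_steps` grows with the token budget.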

This should serve as a strong baseline for testing future optimizers.

Links

Results

After sweeping, we found that the (near-)optimal set of hyperparameters for AdamW remains surprisingly stable across all three settings.

| Parameter | Value |
| --- | --- |
| learning rate × width | 1.6e-2 × 512 |
| weight decay | 0.1 |
| min LR ratio | 0 |
| warmup | 2000 |
| $\beta_1$ | 0.9 |
| $\beta_2$ | 0.95 |
| $\epsilon$ | 1e-15 |
| max_grad_norm | 1.0 |
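Since the table keeps the product learning rate × width constant, the peak learning rate for any other width follows directly. A minimal sketch using the table's values:

```python
def peak_lr(width, lr_width_product=1.6e-2 * 512):
    """Peak AdamW LR under the observed rule lr * width = const."""
    return lr_width_product / width

# The reference width recovers the swept optimum: peak_lr(512) == 1.6e-2
```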
