Description
To choose the best optimizer for pretraining experiments, we need to select both (1) the optimizer algorithm (e.g., AdamW, RMSProp) and (2) the hyperparameters for the selected optimizer, conditioned on the model and data scale. Optimizers should only be compared once the hyperparameters of each have been tuned to be near optimal.
However, a brute-force grid search over hyperparameters is infeasible at large model or data scales, so we need a principled way to compute near-optimal hyperparameters across model and data scales. In this set of experiments, we start with AdamW.
Methodology
Local Optima Seeking: 1D Grid Search + Coordinate Descent
Fixing a model scale and a data scale, we use the following procedure to determine a locally optimal hyperparameter set:
- Select Hyperparameters
Choose a set of hyperparameters for the optimizer. For AdamW, these include the learning rate ($\eta$), the final learning rate ratio ($\eta_{\min}/\eta$), the warmup steps ($s$), the momentum/decay coefficients ($\beta_1$, $\beta_2$), the weight decay ($\lambda$), $\epsilon$, and the gradient clipping constant $c$.
- Define Search Space
Use a discrete grid for each hyperparameter. For AdamW, an example table is:
| Hyperparameter | Example Grid |
| --- | --- |
| Learning Rate ($\eta$) | $\{1 \times 10^{-3}, 2 \times 10^{-3}, 4 \times 10^{-3}, 8 \times 10^{-3}\}$ |
| Final Learning Rate Ratio ($\eta_{\min}/\eta$) | $\{0, 0.05, 0.1\}$ |
| Warmup Steps ($s$) | $\{1\mathrm{k}, 2\mathrm{k}, 4\mathrm{k}, 8\mathrm{k}\}$ |
| $\beta_1$ | $\{0.9, 0.95, 0.98, 0.99\}$ |
| Weight Decay ($\lambda$) | $\{0, 5 \times 10^{-2}, 1 \times 10^{-1}, 2 \times 10^{-1}\}$ |
| $\beta_2$ | $\{0.9, 0.95, 0.98, 0.99\}$ |
| $\epsilon$ | $\{1 \times 10^{-10}, 1 \times 10^{-20}, 1 \times 10^{-30}\}$ |
| $c$ | $\{0 \ (\text{no clipping}), 0.5, 1\}$ |
Keep grids small enough for feasible training runs but wide enough to capture important variations.
- Establish a Baseline
Set an initial hyperparameter configuration as a baseline, and use this baseline to begin coordinate descent.
- Coordinate Descent in the hyperparameter space (a code sketch follows this list)
- Fix all hyperparameters except one.
- Sweep over the chosen hyperparameter (using your grid).
- Train (for a set number of steps or epochs) and evaluate performance.
- Select the best value based on validation metrics.
- Update the baseline with that best value.
- Repeat for the next hyperparameter.
- Do multiple passes until convergence.
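A minimal sketch of this loop, assuming a hypothetical `train_and_eval(config)` function that launches one training run at the fixed model/data scale and returns the validation loss (all names and grids here are illustrative, mirroring the table above):

```python
# Sketch: 1D grid search + coordinate descent over AdamW hyperparameters.
# `train_and_eval(config) -> float` is a hypothetical function that trains a model
# at the fixed (model, data) scale and returns validation loss.

grids = {
    "lr": [1e-3, 2e-3, 4e-3, 8e-3],
    "min_lr_ratio": [0.0, 0.05, 0.1],
    "warmup_steps": [1000, 2000, 4000, 8000],
    "beta1": [0.9, 0.95, 0.98, 0.99],
    "beta2": [0.9, 0.95, 0.98, 0.99],
    "weight_decay": [0.0, 5e-2, 1e-1, 2e-1],
    "eps": [1e-10, 1e-20, 1e-30],
    "max_grad_norm": [0.0, 0.5, 1.0],  # 0.0 = no clipping
}

def coordinate_descent(baseline, grids, train_and_eval, max_passes=3):
    best_cfg = dict(baseline)
    best_loss = train_and_eval(best_cfg)
    records = []  # (pass, hyperparameter, value, loss): raw sweep data, reused later
    for p in range(max_passes):
        improved = False
        for name, grid in grids.items():
            # Sweep one hyperparameter with all others fixed at the current baseline.
            sweep = {v: train_and_eval({**best_cfg, name: v}) for v in grid}
            records.extend((p, name, v, loss) for v, loss in sweep.items())
            best_value = min(sweep, key=sweep.get)
            if sweep[best_value] < best_loss:
                best_loss, improved = sweep[best_value], True
                best_cfg[name] = best_value
        if not improved:
            break  # a full pass changed nothing: converged
    return best_cfg, best_loss, records
```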
Determine Sensitive Hyperparameters
A useful byproduct of the coordinate descent procedure is that we see how the loss varies along each hyperparameter dimension. If we run the experiments at three scales, (0.1B params, 10B tokens), (0.1B params, 50B tokens), and (0.5B params, 10B tokens), we can then group the hyperparameters by the following two criteria:
- Whether the optimum shifts when we scale the data or the model (for the learning rate, we can use the theoretical insight that $\eta \sqrt{\text{width}}$ should remain approximately constant)
- Whether the loss is sensitive to this hyperparameter
If both criteria hold, we identify the hyperparameter as sensitive (to scale); the others we fix to constants. We can also see whether an optimum shift is driven by scaling the data or by scaling the model.
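A rough sketch of this grouping step, assuming the sweep records are arranged as `results[scale][hyperparameter][value] = loss` (the layout and the loss tolerance are placeholders, not a fixed recipe):

```python
# Classify one hyperparameter as sensitive-to-scale or safe to hold constant.
# results[scale][name][value] = validation loss; scale could be ("0.1B", "10B"), etc.
def classify(results, name, loss_tol=5e-3):
    per_scale = {scale: sweeps[name] for scale, sweeps in results.items()}
    # Criterion 1: does the best value move when the model/data scale changes?
    optima = {scale: min(sweep, key=sweep.get) for scale, sweep in per_scale.items()}
    optimum_shifts = len(set(optima.values())) > 1
    # Criterion 2: does the loss actually respond to this hyperparameter?
    loss_range = max(max(s.values()) - min(s.values()) for s in per_scale.values())
    loss_sensitive = loss_range > loss_tol
    return "sensitive to scale" if (optimum_shifts and loss_sensitive) else "hold constant"
```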
Scaling Law for Sensitive Hyperparameters
After identifying the sensitive hyperparameters, we perform coordinate descent over additional model and data scales, with the insensitive hyperparameters held at their fixed values, to fit parameterized scaling rules for the sensitive ones.
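One way to parameterize such a rule is a power law in parameter count $N$ and token count $D$, fitted by least squares in log space. The sketch below assumes numpy; the $(N, D, h^*)$ triples are placeholders, not measured results:

```python
# Fit  h*(N, D) ≈ a * N^b * D^c  for one sensitive hyperparameter via log-linear regression.
import numpy as np

# (params N, tokens D, locally optimal value h*) from the constrained coordinate descents.
# The numbers below are placeholders only.
observations = [
    (0.1e9, 10e9, 8e-3),
    (0.1e9, 50e9, 6e-3),
    (0.5e9, 10e9, 4e-3),
    (0.5e9, 50e9, 3e-3),
]

N, D, h = (np.array(col, dtype=float) for col in zip(*observations))
X = np.column_stack([np.ones_like(N), np.log(N), np.log(D)])
(log_a, b, c), *_ = np.linalg.lstsq(X, np.log(h), rcond=None)
print(f"h*(N, D) ≈ {np.exp(log_a):.3g} * N^{b:.3f} * D^{c:.3f}")
```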
Related Works
- https://arxiv.org/abs/2407.05872
- https://arxiv.org/abs/2407.07972
- https://arxiv.org/abs/2410.21676
Hypothesis or Goal
For AdamW, we likely know how to scale most of the hyperparameters already. To be more precise, the conjecture is that:
Non-sensitive hyperparameters:
- the learning rate $\eta$ should scale inversely with width (a worked example follows this list)
- weight decay should be 0.1 and gradient clipping should be 1
- $\epsilon$ should be insensitive (if there is no numerical issue)
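For concreteness, a hypothetical illustration of the inverse-width rule (numbers are made up, not measured): if $\eta = 8 \times 10^{-3}$ is optimal at width 512, the conjecture predicts the optimum at width 2048 to be

$$\eta_{2048} = \eta_{512} \cdot \frac{512}{2048} = 8 \times 10^{-3} \cdot \frac{1}{4} = 2 \times 10^{-3}.$$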
Potentially sensitive hyperparameters:
- Reportedly, higher $\beta_1$ and $\beta_2$ are beneficial for long training runs; however, I don't think a proper scaling law exists.
- The warmup length is mysterious: whether it should scale with training duration or stay constant (leaning towards the latter) is unknown.
This should serve as a strong baseline for testing future optimizers.
Links
Results
After sweeping, we found that the near-optimal set of hyperparameters for AdamW remains surprisingly stable across the three settings (a usage sketch follows the table).
| Parameter | Value |
| --- | --- |
| learning rate × width | 1.6e-2 × 512 |
| weight decay | 0.1 |
| min lr ratio | 0 |
| warmup steps | 2000 |
| beta1 | 0.9 |
| beta2 | 0.95 |
| epsilon | 1e-15 |
| max_grad_norm | 1.0 |
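For reference, a minimal sketch of plugging these values into a PyTorch setup (assuming `torch.optim.AdamW` with a linear-warmup / cosine-decay schedule; the model, width, and total step count are placeholders):

```python
# Sketch: wiring the table's values into PyTorch AdamW (model and step count are placeholders).
import math
import torch

width = 512
peak_lr = 1.6e-2 * 512 / width            # keep learning rate × width constant
warmup_steps, total_steps = 2000, 50_000  # total_steps is a placeholder
min_lr_ratio = 0.0

model = torch.nn.Linear(width, width)     # stand-in for the real transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.95),
    eps=1e-15,
    weight_decay=0.1,
)

def lr_lambda(step):
    # Linear warmup to the peak LR, then cosine decay to min_lr_ratio * peak.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr_ratio + (1 - min_lr_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```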