Q&A: Phase 50.4 MetaOptimizer — Learned Optimization & Update Rules #982
web3guru888 asked this question in Q&A (Unanswered)
Q&A: Phase 50.4 MetaOptimizer
Discussion Topics
Q1: How does a learned optimizer differ from a traditional optimizer?
Traditional optimizers (SGD, Adam) use fixed, hand-designed update rules. A learned optimizer is a neural network (typically an LSTM) that takes gradients as input and outputs parameter updates. The key insight from Andrychowicz et al. (2016) is that the optimizer itself can be trained via meta-learning: optimizing the optimizer's parameters so that the models it trains converge faster.
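As a rough illustration (not the Phase 50.4 code; the name `LearnedUpdateRule` is made up, and a tiny MLP stands in for the LSTM here for brevity), the difference is whether the update is a fixed formula or the output of a trainable network:

```python
import torch
import torch.nn as nn

def sgd_update(param, grad, lr=0.01):
    # Hand-designed rule: the update is a fixed function of the gradient.
    return param - lr * grad

class LearnedUpdateRule(nn.Module):
    # Learned rule: a small network maps the gradient to the update.
    # Its weights are what meta-learning trains (see Q3).
    def __init__(self, hidden=20):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, grad):
        # grad: (num_params, 1) -> update: (num_params, 1)
        return self.net(grad)
```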
Q2: What does "coordinatewise" mean for the LSTM optimizer?
Instead of one large LSTM processing all parameters jointly (which would be O(n²) in model size), a coordinatewise LSTM processes each parameter independently with shared LSTM weights. The input is a scalar gradient value; the output is a scalar update. This makes the LSTM optimizer applicable to models of any size, since the LSTM weight count is fixed regardless of the base model's parameter count.
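A minimal PyTorch sketch of this idea (the class name `CoordinatewiseLSTMOptimizer` is assumed, not taken from the actual implementation). The trick is to treat each coordinate as a batch element, so a single set of LSTM weights is shared across every parameter of the base model:

```python
import torch
import torch.nn as nn

class CoordinatewiseLSTMOptimizer(nn.Module):
    """Coordinatewise LSTM optimizer sketch: scalar gradient in, scalar update out."""
    def __init__(self, hidden_size=20):
        super().__init__()
        self.cell = nn.LSTMCell(input_size=1, hidden_size=hidden_size)
        self.head = nn.Linear(hidden_size, 1)

    def init_state(self, num_params):
        # One hidden/cell state per coordinate; LSTM weights stay fixed in size.
        h = torch.zeros(num_params, self.cell.hidden_size)
        c = torch.zeros(num_params, self.cell.hidden_size)
        return (h, c)

    def forward(self, grads, state):
        # grads: flat vector of shape (num_params,); each coordinate is a "batch" element.
        h, c = self.cell(grads.unsqueeze(-1), state)
        updates = self.head(h).squeeze(-1)  # one scalar update per coordinate
        return updates, (h, c)
```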
Q3: How is the LSTM optimizer trained (what is the meta-loss)?
The meta-loss is the final task loss after T steps of the learned optimizer. We unroll the optimization trajectory for T steps (e.g., 20), compute the loss at the end, and backpropagate through the entire unrolled computation to update the LSTM's weights. This is truncated backpropagation through time (BPTT) applied to the optimization process itself.
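A hedged sketch of one meta-training step on a toy quadratic task, reusing the `CoordinatewiseLSTMOptimizer` sketch from Q2. Here `meta_opt` is an ordinary outer optimizer over the LSTM's weights, e.g. `torch.optim.Adam(optimizer_net.parameters(), lr=1e-3)`. This is not the project's actual training loop, just an illustration of unrolling T inner steps and backpropagating the final loss through the whole trajectory:

```python
import torch

def meta_train_step(optimizer_net, meta_opt, T=20):
    # Toy task: recover a random target vector by minimizing ||theta - target||^2.
    target = torch.randn(10)
    theta = torch.zeros(10, requires_grad=True)
    state = optimizer_net.init_state(num_params=10)

    for _ in range(T):
        loss = ((theta - target) ** 2).sum()
        # create_graph=True keeps the inner graph so the meta-loss can
        # backprop through the entire unrolled trajectory (truncated BPTT).
        grads, = torch.autograd.grad(loss, theta, create_graph=True)
        updates, state = optimizer_net(grads, state)
        theta = theta + updates  # differentiable (non in-place) update

    meta_loss = ((theta - target) ** 2).sum()  # final task loss after T steps
    meta_opt.zero_grad()
    meta_loss.backward()  # updates flow into the LSTM optimizer's weights
    meta_opt.step()
    return meta_loss.item()
```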
Q4: Can learned optimizers generalize to new tasks and architectures?
Yes, but with caveats: generalization is strongest for tasks and architectures similar to those seen during meta-training, and it tends to degrade on problems far outside that distribution or on optimization horizons much longer than the unroll length used during meta-training.
Q5: What is Warp-Grad and how does it relate to Meta-SGD?
Meta-SGD learns per-parameter learning rates. Warp-Grad (Flennerhag et al., 2020) goes further — it learns a preconditioning matrix that warps the task loss surface, making gradient descent more effective. Warp-Grad subsumes Meta-SGD as a special case (diagonal preconditioning = per-parameter LR).
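To make the "special case" relationship concrete, here is a simplified view of the two update rules (Warp-Grad is actually realized via warp layers interleaved with the task network; the explicit preconditioning-matrix form below is just the conceptual reduction, with made-up function names):

```python
import torch

def meta_sgd_update(theta, grad, alpha):
    # Meta-SGD: a learned per-parameter learning rate (elementwise),
    # i.e. a diagonal preconditioner. alpha has the same shape as theta.
    return theta - alpha * grad

def warped_update(theta, grad, P, lr=0.01):
    # Warp-Grad (simplified): a learned preconditioning matrix P warps the
    # gradient before the step. With P diagonal, this reduces to Meta-SGD's
    # per-parameter learning rates.
    return theta - lr * (P @ grad)
```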