Q&A: Phase 50.4 MetaOptimizer — Learned Optimization & Update Rules #982
web3guru888 asked this question in Q&A (Unanswered)
Q&A: Phase 50.4 MetaOptimizer
Discussion Topics
Q1: How does a learned optimizer differ from a traditional optimizer?
Traditional optimizers (SGD, Adam) use fixed, hand-designed update rules. A learned optimizer is a neural network (typically an LSTM) that takes gradients as input and outputs parameter updates. The key insight from Andrychowicz et al. (2016) is that the optimizer itself can be trained via meta-learning: optimizing the optimizer's parameters so that the models it trains converge faster.
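As a rough illustration (not the Phase 50.4 code; the name `LearnedUpdateRule` is made up, and a tiny MLP stands in for the LSTM here for brevity), the difference is whether the update is a fixed formula or the output of a trainable network:

```python
import torch
import torch.nn as nn

def sgd_update(param, grad, lr=0.01):
    # Hand-designed rule: the update is a fixed function of the gradient.
    return param - lr * grad

class LearnedUpdateRule(nn.Module):
    # Learned rule: a small network maps the gradient to the update.
    # Its weights are what meta-learning trains (see Q3).
    def __init__(self, hidden=20):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, grad):
        # grad: (num_params, 1) -> update: (num_params, 1)
        return self.net(grad)
```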
Q2: What does "coordinatewise" mean for the LSTM optimizer?
Instead of one large LSTM processing all parameters jointly (which would be O(n²) in model size), a coordinatewise LSTM processes each parameter independently with shared LSTM weights. The input is a scalar gradient value; the output is a scalar update. This makes the LSTM optimizer applicable to models of any size, since the LSTM weight count is fixed regardless of the base model's parameter count.
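A minimal PyTorch sketch of this idea (the class name `CoordinatewiseLSTMOptimizer` is assumed, not taken from the actual implementation). The trick is to treat each coordinate as a batch element, so a single set of LSTM weights is shared across every parameter of the base model:

```python
import torch
import torch.nn as nn

class CoordinatewiseLSTMOptimizer(nn.Module):
    """Coordinatewise LSTM optimizer sketch: scalar gradient in, scalar update out."""
    def __init__(self, hidden_size=20):
        super().__init__()
        self.cell = nn.LSTMCell(input_size=1, hidden_size=hidden_size)
        self.head = nn.Linear(hidden_size, 1)

    def init_state(self, num_params):
        # One hidden/cell state per coordinate; LSTM weights stay fixed in size.
        h = torch.zeros(num_params, self.cell.hidden_size)
        c = torch.zeros(num_params, self.cell.hidden_size)
        return (h, c)

    def forward(self, grads, state):
        # grads: flat vector of shape (num_params,); each coordinate is a "batch" element.
        h, c = self.cell(grads.unsqueeze(-1), state)
        updates = self.head(h).squeeze(-1)  # one scalar update per coordinate
        return updates, (h, c)
```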
Q3: How is the LSTM optimizer trained (what is the meta-loss)?
The meta-loss is the final task loss after T steps of the learned optimizer. We unroll the optimization trajectory for T steps (e.g., 20), compute the loss at the end, and backpropagate through the entire unrolled computation to update the LSTM's weights. This is truncated backpropagation through time (BPTT) applied to the optimization process itself.
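A hedged sketch of one meta-training step on a toy quadratic task, reusing the `CoordinatewiseLSTMOptimizer` sketch from Q2. Here `meta_opt` is an ordinary outer optimizer over the LSTM's weights, e.g. `torch.optim.Adam(optimizer_net.parameters(), lr=1e-3)`. This is not the project's actual training loop, just an illustration of unrolling T inner steps and backpropagating the final loss through the whole trajectory:

```python
import torch

def meta_train_step(optimizer_net, meta_opt, T=20):
    # Toy task: recover a random target vector by minimizing ||theta - target||^2.
    target = torch.randn(10)
    theta = torch.zeros(10, requires_grad=True)
    state = optimizer_net.init_state(num_params=10)

    for _ in range(T):
        loss = ((theta - target) ** 2).sum()
        # create_graph=True keeps the inner graph so the meta-loss can
        # backprop through the entire unrolled trajectory (truncated BPTT).
        grads, = torch.autograd.grad(loss, theta, create_graph=True)
        updates, state = optimizer_net(grads, state)
        theta = theta + updates  # differentiable (non in-place) update

    meta_loss = ((theta - target) ** 2).sum()  # final task loss after T steps
    meta_opt.zero_grad()
    meta_loss.backward()  # updates flow into the LSTM optimizer's weights
    meta_opt.step()
    return meta_loss.item()
```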
Q4: Can learned optimizers generalize to new tasks and architectures?
Yes, but with caveats: generalization is strongest for tasks and architectures similar to those seen during meta-training, and it tends to degrade on problems far outside that distribution or on optimization horizons much longer than the unroll length used during meta-training.
Q5: What is Warp-Grad and how does it relate to Meta-SGD?
Meta-SGD learns per-parameter learning rates. Warp-Grad (Flennerhag et al., 2020) goes further — it learns a preconditioning matrix that warps the task loss surface, making gradient descent more effective. Warp-Grad subsumes Meta-SGD as a special case (diagonal preconditioning = per-parameter LR).
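To make the "special case" relationship concrete, here is a simplified view of the two update rules (Warp-Grad is actually realized via warp layers interleaved with the task network; the explicit preconditioning-matrix form below is just the conceptual reduction, with made-up function names):

```python
import torch

def meta_sgd_update(theta, grad, alpha):
    # Meta-SGD: a learned per-parameter learning rate (elementwise),
    # i.e. a diagonal preconditioner. alpha has the same shape as theta.
    return theta - alpha * grad

def warped_update(theta, grad, P, lr=0.01):
    # Warp-Grad (simplified): a learned preconditioning matrix P warps the
    # gradient before the step. With P diagonal, this reduces to Meta-SGD's
    # per-parameter learning rates.
    return theta - lr * (P @ grad)
```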