The pytorch/fairseq team improved the memory efficiency of their FP16 optimizer by converting the FP16 parameters to FP32 on the fly instead of keeping a static FP32 copy, see facebookresearch/fairseq#404.
Are there any plans to implement this optimization here?
Thanks!
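For context, the idea (as I understand it) can be sketched roughly like this, as a toy NumPy SGD step with loss scaling. The function name and update rule are made up for illustration and are not fairseq's actual implementation:

```python
import numpy as np

def fp16_step_on_the_fly(params_fp16, grads_fp16, lr=0.1, loss_scale=128.0):
    """Instead of holding a persistent FP32 master copy of every parameter,
    cast FP16 params and grads to FP32 only for the duration of the update,
    then store the result back as FP16. (Illustrative sketch only.)"""
    updated = []
    for p16, g16 in zip(params_fp16, grads_fp16):
        p32 = p16.astype(np.float32)                # temporary FP32 copy
        g32 = g16.astype(np.float32) / loss_scale   # unscale the gradient
        p32 -= lr * g32                             # update in FP32 precision
        updated.append(p32.astype(np.float16))      # cast back; FP32 copy freed
    return updated
```

The trade-off is that the transient FP32 buffers exist only during the step, so peak memory drops compared to keeping a static FP32 copy of all parameters alive for the whole run.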