DeepScaleR demonstrates that incrementally increasing the context length (8K -> 16K -> 24K) during RL training allows even a small 1.5B-parameter LLM to outperform O1 models on mathematical tasks while significantly reducing FLOPs compared to full-context RL. Might their multi-stage reinforcement learning training approach be worth experimenting with for our project?
https://github.com/agentica-project/deepscaler
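To make the idea concrete, here's a rough sketch of what a staged context-length schedule could look like. The `trainer` object, its methods, and the step counts are placeholders I made up for illustration, not code from the DeepScaleR repo:

```python
# Hypothetical sketch of a staged context-length curriculum for RL training.
# Stage boundaries, step counts, and the trainer API are assumptions, not
# taken from the DeepScaleR codebase.

from dataclasses import dataclass


@dataclass
class Stage:
    max_context_tokens: int  # response-length cap for this stage
    num_steps: int           # RL update steps to run before moving on


# Incrementally raise the context cap instead of training at 24K from the start.
SCHEDULE = [
    Stage(max_context_tokens=8_192,  num_steps=1_000),
    Stage(max_context_tokens=16_384, num_steps=500),
    Stage(max_context_tokens=24_576, num_steps=500),
]


def train(trainer):
    """Run the RL trainer stage by stage, widening the context each time."""
    for stage in SCHEDULE:
        # Cap generation length for this stage; shorter caps keep early
        # rollouts cheap, later stages allow longer reasoning chains.
        trainer.set_max_response_length(stage.max_context_tokens)
        for _ in range(stage.num_steps):
            trainer.step()  # one RL update at the current context cap
```

The appeal is that most of the early updates happen at the short cap, so the expensive long-context rollouts are only paid for once the policy is already reasonably strong.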