Hi, thanks for sharing the datasets! I'm trying to train a flan model using t5 and other backbone models. However i'm not confident enough on how well I reproduced your results. Specifically I got much lower MMLU scores. Could you please share the training loss curve (or simply the loss at convergence?) Below is mine:

I was using similar settings (batch size = 80, max_seq_len = 2300)
The final loss is around 0.6 after smoothing. What about the official values?
Hi, thanks for sharing the datasets! I'm trying to train a flan model using t5 and other backbone models. However i'm not confident enough on how well I reproduced your results. Specifically I got much lower MMLU scores. Could you please share the training loss curve (or simply the loss at convergence?) Below is mine:

I was using similar settings (batch size = 80, max_seq_len = 2300)
The final loss is around 0.6 after smoothing. What about the official values?