TLDR
Central issue for curating datasets and strategies for multilingual training, while maintaining strong English performance.
Goals
- Improve multilingual performance
- Avoid degrading English-only performance
- Support scalable data loading (iterable/streaming)
Tasks
1. Dataset Collection
- Share relevant multilingual datasets (QA, multimodal, instruction)
- Add links + brief notes
2. Data Processing
- Standardize formats for training
- Convert to iterable/streaming datasets
3. Data Mixing
- Explore mixing strategies (e.g., proportional, balanced sampling)
- Share insights/resources
4. Evaluation
- Evaluate on both multilingual + English benchmarks
- Track performance trade-offs
Dependencies
- Baseline evaluations completed
- Evaluation pipelines ready
Action Items
TLDR
Central issue for curating datasets and strategies for multilingual training, while maintaining strong English performance.
Goals
Tasks
1. Dataset Collection
2. Data Processing
3. Data Mixing
4. Evaluation
Dependencies
Action Items