Multilingual Data Curation #48

@engichang1467

TLDR

Central tracking issue for curating datasets and developing strategies for multilingual training while maintaining strong English performance.


Goals

  • Improve multilingual performance
  • Avoid degrading English-only performance
  • Support scalable data loading (iterable/streaming)

Tasks

1. Dataset Collection

  • Share relevant multilingual datasets (QA, multimodal, instruction)
  • Add links + brief notes

2. Data Processing

  • Standardize formats for training
  • Convert to iterable/streaming datasets
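
As a starting point for discussion, here is a minimal sketch of what "standardize, then stream" could look like using only the Python standard library. The shared schema (`prompt`, `response`, `lang`, `source`) and the raw field names (`question`, `instruction`, `answer`, `output`) are assumptions for illustration, not something this issue has agreed on; real datasets will need their own mappings.

```python
import json
from typing import Iterable, Iterator


def standardize(record: dict, lang: str, source: str) -> dict:
    """Map one raw record onto a hypothetical shared schema."""
    # Assumed raw field names; adjust per dataset.
    prompt = record.get("question") or record.get("instruction") or ""
    response = record.get("answer") or record.get("output") or ""
    return {
        "prompt": prompt.strip(),
        "response": response.strip(),
        "lang": lang,
        "source": source,
    }


def stream_jsonl(lines: Iterable[str], lang: str, source: str) -> Iterator[dict]:
    """Lazily yield standardized examples, one JSONL line at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        yield standardize(json.loads(line), lang, source)
```

Because `stream_jsonl` is a generator, nothing is loaded until it is iterated, and an open file handle can be passed directly as `lines`, so a large file never has to fit in memory.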

3. Data Mixing

  • Explore mixing strategies (e.g., proportional, balanced sampling)
  • Share insights/resources
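
To make the two named strategies concrete, here is a hedged sketch that treats them as endpoints of one knob: sampling each language with probability proportional to `size ** alpha`. With `alpha=1.0` this is proportional sampling, with `alpha=0.0` it is balanced sampling, and values in between are a common compromise. The function names, the `alpha` parameter, and the choice to drop a stream once it is exhausted are assumptions for illustration, not a proposal that has been evaluated here.

```python
import random
from typing import Dict, Iterable, Iterator


def mixing_weights(sizes: Dict[str, int], alpha: float = 1.0) -> Dict[str, float]:
    """Return per-language sampling probabilities proportional to size ** alpha.

    Assumes every size is positive.
    """
    scaled = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}


def mix_streams(
    streams: Dict[str, Iterable],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator:
    """Interleave several streams, picking the next source by weight."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    active = {lang: iter(s) for lang, s in streams.items()}
    while active:
        langs = list(active)
        lang = rng.choices(langs, weights=[weights[l] for l in langs], k=1)[0]
        try:
            yield next(active[lang])
        except StopIteration:
            # Hypothetical policy: drop exhausted streams rather than repeat them.
            del active[lang]
```

Dropping exhausted streams means the effective mix drifts toward the larger languages late in an epoch; restarting small streams (upsampling) is the alternative, and which one is preferable is one of the questions this task is meant to explore.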

4. Evaluation

  • Evaluate on both multilingual and English benchmarks
  • Track performance trade-offs
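
One possible way to track trade-offs is to compare each run against the baseline and flag any English benchmark that drops by more than a tolerance. This is only a sketch: the function name, the `tolerance` parameter, and the assumption that every benchmark yields a single higher-is-better score are all hypothetical, and the benchmark names would come from whatever eval protocol gets defined.

```python
from typing import Dict, List, Set


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def tradeoff_report(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    english: Set[str],
    tolerance: float = 0.0,
) -> dict:
    """Summarize score changes of a candidate run relative to the baseline.

    `english` names the benchmarks counted as English-only; every other
    benchmark present in both runs is treated as multilingual.
    """
    deltas = {
        name: candidate[name] - baseline[name]
        for name in baseline
        if name in candidate
    }
    en = [d for name, d in deltas.items() if name in english]
    ml = [d for name, d in deltas.items() if name not in english]
    regressions = sorted(
        name for name, d in deltas.items()
        if name in english and d < -tolerance
    )
    return {
        "deltas": deltas,
        "mean_english_delta": _mean(en),
        "mean_multilingual_delta": _mean(ml),
        "english_regressions": regressions,
    }
```

A non-empty `english_regressions` list is the signal that a mixing strategy is violating the "avoid degrading English-only performance" goal, even if the multilingual mean improved.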

Dependencies

  • Baseline evaluations completed
  • Evaluation pipelines ready

Action Items

  • Add datasets
  • Propose mixing strategies
  • Implement iterable dataset pipeline
  • Define eval protocol
  • Log results
