OpenWebMath Crawl Annealing Experiments #1167

@krypticmouse

Description

We crawled the outlinks found in OpenWebMath HTML records to build a newly crawled dataset. We now want to run @Helw150's quality ablation on it to assess the quality of the crawled data.

As part of the experiments we created three datasets: OWT Control, which is the original OWT dataset; Crawled OWT, which contains only the newly crawled pages; and a combined Control + Crawl dataset.

Alongside these we also do a "token proportional" run: in the Control + Crawl runs, the data mixture weights for control and crawl are set according to the ratio of their respective token counts.
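As a rough illustration of the token proportional weighting, here is a minimal sketch that normalizes per-dataset token counts into mixture weights. The function name and the token counts are hypothetical placeholders, not the values or code used in the actual runs.

```python
def token_proportional_weights(token_counts):
    """Turn per-dataset token counts into mixture weights that sum to 1.

    `token_counts` maps a dataset name to its token count. The names and
    numbers used in the example below are placeholders (assumptions).
    """
    total = sum(token_counts.values())
    if total <= 0:
        raise ValueError("total token count must be positive")
    # Each dataset is weighted by its share of the total tokens.
    return {name: count / total for name, count in token_counts.items()}


# Hypothetical counts: control has three times as many tokens as crawl.
example_counts = {"control": 3_000_000, "crawl": 1_000_000}
weights = token_proportional_weights(example_counts)
print(weights)
```

With weights set this way, each dataset is sampled in proportion to its size, so neither is upsampled relative to the other.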

The crawled dataset doesn't contain any documents from the original OWT dataset, but is created with the same filtration and processing pipeline as OWT. For pipeline overview see: #868.

Hypothesis or Goal

We hope to show that adding the newly crawled OpenWebMath data improves model quality. To validate this, the crawl + control runs should show improvements on math evaluations.

Links

  • WandB Report: link
  • WandB Report - Token Proportional: link
  • Data Browser: link
  • Experiment JSON: link
