
Commit d8d5f33

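sync(zh): sync the HuggingFace dataset canonical path fix (upstream #180)

HuggingFace changed its dataset naming convention and deprecated the old short IDs. Upstream updated the dataset paths in 09-data-management to the new canonical paths; this commit syncs them 1:1:
- code/data_utils.py: rotten_tomatoes → cornell-movie-review-data/rotten_tomatoes
- outputs/prompt-data-helper.md: the 9 dataset paths in the table are updated (English-only skill document with no Chinese translation, so it is aligned wholesale with upstream)

Note: the remaining upstream changes in this batch (interactive figure blocks in the docs of 4 lessons, the figures.js chart engine, SEO infrastructure, and the About page) are tied to site features or need Chinese adaptation, so they are split out into follow-up PRs.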
1 parent 3c368c3 commit d8d5f33

2 files changed

Lines changed: 11 additions & 12 deletions

phases/00-setup-and-tooling/09-data-management/code/data_utils.py

Lines changed: 2 additions & 3 deletions
@@ -1,4 +1,3 @@
-import os
 import sys
 import json
 import hashlib
@@ -159,10 +158,10 @@ def fingerprint(ds, num_rows: int = 100):
 print("=" * 60)
 
 print("\n--- 1. Load and inspect a dataset ---")
-ds = load_and_inspect("rotten_tomatoes", split="train")
+ds = load_and_inspect("cornell-movie-review-data/rotten_tomatoes", split="train")
 
 print("\n--- 2. Stream a dataset ---")
-rows = stream_dataset("rotten_tomatoes", max_rows=3)
+rows = stream_dataset("cornell-movie-review-data/rotten_tomatoes", max_rows=3)
 for row in rows:
     print(f" {row['text'][:80]}...")

phases/00-setup-and-tooling/09-data-management/outputs/prompt-data-helper.md

Lines changed: 9 additions & 9 deletions
@@ -33,15 +33,15 @@ Common task-to-dataset mapping:
 
 | Task | Starter Dataset | HF ID |
 |------|----------------|-------|
-| Text classification | Rotten Tomatoes | `rotten_tomatoes` |
-| Sentiment analysis | IMDB | `imdb` |
-| Natural language inference | MNLI | `glue/mnli` |
-| Question answering | SQuAD | `squad` |
-| Summarization | CNN/DailyMail | `cnn_dailymail` |
-| Translation | WMT | `wmt16` |
-| Language modeling | WikiText | `wikitext` |
-| Token classification | CoNLL-2003 | `conll2003` |
-| Image classification | MNIST / CIFAR-10 | `mnist` / `cifar10` |
+| Text classification | Rotten Tomatoes | `cornell-movie-review-data/rotten_tomatoes` |
+| Sentiment analysis | IMDB | `stanfordnlp/imdb` |
+| Natural language inference | MNLI | `nyu-mll/glue` (config:`mnli`) |
+| Question answering | SQuAD | `rajpurkar/squad` |
+| Summarization | CNN/DailyMail | `abisee/cnn_dailymail`(config: `3.0.0`) |
+| Translation | WMT | `wmt/wmt16`(config: `cs-en`) |
+| Language modeling | WikiText | `Salesforce/wikitext` |
+| Token classification | CoNLL-2003 | `lhoestq/conll2003` |
+| Image classification | MNIST / CIFAR-10 | `ylecun/mnist` / `uoft-cs/cifar10` |
 | Object detection | COCO | `detection-datasets/coco` |
 
 When recommending, prefer smaller datasets for learning and prototyping. Suggest larger datasets only when the user is ready to train at scale.
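
For readers applying the same migration elsewhere, the renames in the table above can be sketched as a small lookup. This is an illustrative helper, not part of the commit: `CANONICAL_IDS`, `load_dataset_kwargs`, and `load_canonical` are hypothetical names, and the mapping only restates the old and new IDs shown in the diff. The config values are assumed to be passed as the `name` argument of `datasets.load_dataset`.

```python
from typing import Dict, Optional, Tuple

# Legacy short ID -> (canonical repo path, config name or None).
# Values restate the "-" and "+" rows of the table in this commit.
CANONICAL_IDS: Dict[str, Tuple[str, Optional[str]]] = {
    "rotten_tomatoes": ("cornell-movie-review-data/rotten_tomatoes", None),
    "imdb": ("stanfordnlp/imdb", None),
    "glue/mnli": ("nyu-mll/glue", "mnli"),
    "squad": ("rajpurkar/squad", None),
    "cnn_dailymail": ("abisee/cnn_dailymail", "3.0.0"),
    "wmt16": ("wmt/wmt16", "cs-en"),
    "wikitext": ("Salesforce/wikitext", None),  # the table lists no config here
    "conll2003": ("lhoestq/conll2003", None),
    "mnist": ("ylecun/mnist", None),
    "cifar10": ("uoft-cs/cifar10", None),
}


def load_dataset_kwargs(legacy_id: str, split: str = "train") -> dict:
    """Build keyword arguments for datasets.load_dataset from a legacy ID.

    IDs that are not in the mapping are passed through unchanged, on the
    assumption that they are already canonical.
    """
    path, config = CANONICAL_IDS.get(legacy_id, (legacy_id, None))
    kwargs = {"path": path, "split": split}
    if config is not None:
        kwargs["name"] = config  # config goes in load_dataset's `name` parameter
    return kwargs


def load_canonical(legacy_id: str, split: str = "train"):
    """Load a dataset by legacy ID (hypothetical wrapper; needs network access)."""
    from datasets import load_dataset  # imported lazily so the mapping works offline

    return load_dataset(**load_dataset_kwargs(legacy_id, split))
```

Calling `load_dataset_kwargs("glue/mnli", split="validation_matched")` yields the path `nyu-mll/glue` with `name="mnli"`, which mirrors how the updated table separates the repository path from its config.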
