Commit 8a74728

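sync(zh): full translation sync of Phase 19 lessons 58-69 (multimodal + advanced RAG)

Upstream added 12 capstone-project lessons (58-69), split into two tracks:
- E. Multimodal (58-63): vision encoder, ViT, modality alignment, cross-attention, VL pretraining, multimodal evaluation
- F. Advanced RAG (64-69): chunking strategies, hybrid retrieval, reranker, query rewriting, RAG evaluation, end-to-end RAG

This sync:
- Category A: docs/en.md for all 12 lessons translated to docs/zh.md; the English originals were deleted
- Category B: code/tests and quiz.json synced 1:1 with upstream; all quiz.json content translated to Chinese
- README.md / ROADMAP.md: added the table rows for lessons 58-69 (67 projects / 447 lessons badge)
- Rebuilt site/data.js; all 480 lessons are recognized (Phase 19 = 67 lessons)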
1 parent 583cb89 commit 8a74728

51 files changed

Lines changed: 8609 additions & 4 deletions


README.md

Lines changed: 14 additions & 2 deletions
@@ -4,7 +4,7 @@
 
 <p align="center">
 <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-1a1a1a?style=flat-square&labelColor=fafaf5" alt="MIT License"></a>
-<a href="ROADMAP.md"><img src="https://img.shields.io/badge/lessons-435-3553ff?style=flat-square&labelColor=fafaf5" alt="435 lessons"></a>
+<a href="ROADMAP.md"><img src="https://img.shields.io/badge/lessons-447-3553ff?style=flat-square&labelColor=fafaf5" alt="447 lessons"></a>
 <a href="#contents"><img src="https://img.shields.io/badge/phases-20-3553ff?style=flat-square&labelColor=fafaf5" alt="20 phases"></a>
 <a href="https://github.com/fancyboi999/ai-engineering-from-scratch-zh/stargazers"><img src="https://img.shields.io/github/stars/fancyboi999/ai-engineering-from-scratch-zh?style=flat-square&labelColor=fafaf5&color=3553ff" alt="GitHub stars"></a>
 <a href="https://aieng-zh.cn"><img src="https://img.shields.io/badge/website-live-3553ff?style=flat-square&labelColor=fafaf5" alt="Website"></a>
@@ -819,7 +819,7 @@ the agent went wrong and explain why...
 </details>
 
 <details id="phase-19">
-<summary><b>Phase 19 — Capstone Projects</b> &nbsp;<code>55 projects</code>&nbsp; <em>End-to-end shippable products for 2026, 20-40 hours each.</em></summary>
+<summary><b>Phase 19 — Capstone Projects</b> &nbsp;<code>67 projects</code>&nbsp; <em>End-to-end shippable products for 2026, 20-40 hours each.</em></summary>
 <br/>
 
 | # | Project | Combines | Lang |
@@ -879,6 +879,18 @@ the agent went wrong and explain why...
 | 55 | [Critic Loop](phases/19-capstone-projects/55-critic-loop/) | D. Automated Research | Python |
 | 56 | [Iteration Scheduler](phases/19-capstone-projects/56-iteration-scheduler/) | D. Automated Research | Python |
 | 57 | [End-to-End Research Demo](phases/19-capstone-projects/57-end-to-end-research-demo/) | D. Automated Research | Python |
+| 58 | [Vision Encoder Patch Splitting](phases/19-capstone-projects/58-vision-encoder-patches/) | E. Multimodal | Python |
+| 59 | [Vision Transformer Encoder (ViT)](phases/19-capstone-projects/59-vit-transformer/) | E. Multimodal | Python |
+| 60 | [Modality Alignment with a Projection Layer](phases/19-capstone-projects/60-projection-layer-modality-align/) | E. Multimodal | Python |
+| 61 | [Cross-Attention Fusion](phases/19-capstone-projects/61-cross-attention-fusion/) | E. Multimodal | Python |
+| 62 | [Vision-Language Pretraining](phases/19-capstone-projects/62-vision-language-pretraining/) | E. Multimodal | Python |
+| 63 | [Multimodal Evaluation](phases/19-capstone-projects/63-multimodal-eval/) | E. Multimodal | Python |
+| 64 | [Chunking Strategies Compared](phases/19-capstone-projects/64-chunking-strategies-advanced/) | F. Advanced RAG | Python |
+| 65 | [Hybrid Retrieval with BM25 and Dense Embeddings](phases/19-capstone-projects/65-hybrid-retrieval-bm25-dense/) | F. Advanced RAG | Python |
+| 66 | [Cross-Encoder Reranker](phases/19-capstone-projects/66-reranker-cross-encoder/) | F. Advanced RAG | Python |
+| 67 | [Query Rewriting: HyDE, Multi-Query, and Decomposition](phases/19-capstone-projects/67-query-rewriting-hyde/) | F. Advanced RAG | Python |
+| 68 | [RAG Evaluation: Precision, Recall, MRR, nDCG, and More](phases/19-capstone-projects/68-rag-eval-precision-recall/) | F. Advanced RAG | Python |
+| 69 | [End-to-End RAG System](phases/19-capstone-projects/69-end-to-end-rag-system/) | F. Advanced RAG | Python |
 
 </details>
 
ROADMAP.md

Lines changed: 13 additions & 1 deletion
@@ -571,9 +571,21 @@
 | 55 | [Critic Loop](phases/19-capstone-projects/55-critic-loop) || ~90 min |
 | 56 | [Iteration Scheduler](phases/19-capstone-projects/56-iteration-scheduler) || ~90 min |
 | 57 | [End-to-End Research Demo](phases/19-capstone-projects/57-end-to-end-research-demo) || ~90 min |
+| 58 | [Vision Encoder Patch Splitting](phases/19-capstone-projects/58-vision-encoder-patches) || ~90 min |
+| 59 | [Vision Transformer Encoder (ViT)](phases/19-capstone-projects/59-vit-transformer) || ~90 min |
+| 60 | [Modality Alignment with a Projection Layer](phases/19-capstone-projects/60-projection-layer-modality-align) || ~90 min |
+| 61 | [Cross-Attention Fusion](phases/19-capstone-projects/61-cross-attention-fusion) || ~90 min |
+| 62 | [Vision-Language Pretraining](phases/19-capstone-projects/62-vision-language-pretraining) || ~90 min |
+| 63 | [Multimodal Evaluation](phases/19-capstone-projects/63-multimodal-eval) || ~90 min |
+| 64 | [Chunking Strategies Compared](phases/19-capstone-projects/64-chunking-strategies-advanced) || ~90 min |
+| 65 | [Hybrid Retrieval with BM25 and Dense Embeddings](phases/19-capstone-projects/65-hybrid-retrieval-bm25-dense) || ~90 min |
+| 66 | [Cross-Encoder Reranker](phases/19-capstone-projects/66-reranker-cross-encoder) || ~90 min |
+| 67 | [Query Rewriting: HyDE, Multi-Query, and Decomposition](phases/19-capstone-projects/67-query-rewriting-hyde) || ~90 min |
+| 68 | [RAG Evaluation: Precision, Recall, MRR, nDCG, and More](phases/19-capstone-projects/68-rag-eval-precision-recall) || ~90 min |
+| 69 | [End-to-End RAG System](phases/19-capstone-projects/69-end-to-end-rag-system) || ~90 min |
 
 ---
 
-**Total: 20 phases, 430+ lessons | 430+ completed | about 1000 hours estimated**
+**Total: 20 phases, 442+ lessons | 442+ completed | about 1050 hours estimated**
 
 Want to contribute? Pick any ⬚ lesson and open a PR. See [CONTRIBUTING.md](CONTRIBUTING.md)
Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
"""Vision encoder front end: patch embedding plus 2D sinusoidal position.

Tokenizes a 224x224x3 image into a sequence of 196 patch tokens plus a CLS
token. The patch projection is a Conv2d with kernel and stride equal to the
patch size, which matches flatten-then-linear up to floating-point tolerance.
The position signal is a fixed 2D sinusoidal table; half the embedding dim
encodes row position, the other half encodes column position, at multiple
frequencies.

Run with: python3 main.py
"""

from __future__ import annotations

import math
from dataclasses import dataclass

import numpy as np
import torch
import torch.nn as nn


@dataclass(frozen=True)
class FrontEndConfig:
    image_size: int = 224
    patch_size: int = 16
    in_channels: int = 3
    hidden: int = 768

    @property
    def grid_size(self) -> int:
        if self.image_size % self.patch_size != 0:
            raise ValueError(
                f"patch_size {self.patch_size} must divide image_size {self.image_size}"
            )
        return self.image_size // self.patch_size

    @property
    def num_patches(self) -> int:
        return self.grid_size * self.grid_size


def sinusoidal_2d(grid_h: int, grid_w: int, dim: int) -> torch.Tensor:
    """Build a deterministic 2D sinusoidal position table of shape (grid_h * grid_w, dim).

    Half of dim encodes row position, half encodes column position. Within each
    half, frequencies span the standard Transformer sin/cos band. Identical
    inputs always produce identical outputs, with no learned state.
    """
    if dim % 4 != 0:
        raise ValueError(f"sinusoidal_2d dim must be divisible by 4, got {dim}")
    half = dim // 2
    quarter = half // 2

    freq = torch.arange(quarter, dtype=torch.float32)
    inv = torch.exp(-math.log(10000.0) * freq / max(1, quarter))

    rows = torch.arange(grid_h, dtype=torch.float32).unsqueeze(1) * inv.unsqueeze(0)
    cols = torch.arange(grid_w, dtype=torch.float32).unsqueeze(1) * inv.unsqueeze(0)

    row_emb = torch.cat([torch.sin(rows), torch.cos(rows)], dim=1)
    col_emb = torch.cat([torch.sin(cols), torch.cos(cols)], dim=1)

    table = torch.zeros(grid_h, grid_w, dim)
    table[:, :, :half] = row_emb.unsqueeze(1).expand(-1, grid_w, -1)
    table[:, :, half:] = col_emb.unsqueeze(0).expand(grid_h, -1, -1)
    return table.reshape(grid_h * grid_w, dim)


class PatchEmbed(nn.Module):
70+
"""Patch projection as a strided Conv2d.
71+
72+
Output shape on a (B, C, H, W) input is (B, N, hidden) where
73+
N = (H / patch_size) * (W / patch_size).
74+
"""
75+
76+
def __init__(self, cfg: FrontEndConfig) -> None:
77+
super().__init__()
78+
self.cfg = cfg
79+
self.proj = nn.Conv2d(
80+
cfg.in_channels,
81+
cfg.hidden,
82+
kernel_size=cfg.patch_size,
83+
stride=cfg.patch_size,
84+
bias=True,
85+
)
86+
87+
def forward(self, x: torch.Tensor) -> torch.Tensor:
88+
if x.dim() != 4:
89+
raise ValueError(f"expected 4D input (B,C,H,W), got shape {tuple(x.shape)}")
90+
if x.shape[1] != self.cfg.in_channels:
91+
raise ValueError(
92+
f"channel mismatch: got {x.shape[1]}, expected {self.cfg.in_channels}"
93+
)
94+
if x.shape[2] != self.cfg.image_size or x.shape[3] != self.cfg.image_size:
95+
raise ValueError(
96+
f"spatial mismatch: got {tuple(x.shape[2:])}, expected "
97+
f"({self.cfg.image_size}, {self.cfg.image_size})"
98+
)
99+
out = self.proj(x)
100+
b = out.shape[0]
101+
out = out.flatten(2).transpose(1, 2)
102+
return out
103+
104+
class VisionFrontEnd(nn.Module):
    """Patch embed + CLS prepend + 2D sinusoidal position.

    Output shape: (B, num_patches + 1, hidden).
    """

    def __init__(self, cfg: FrontEndConfig) -> None:
        super().__init__()
        self.cfg = cfg
        self.patch = PatchEmbed(cfg)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, cfg.hidden))
        nn.init.trunc_normal_(self.cls_token, std=0.02)

        pos = sinusoidal_2d(cfg.grid_size, cfg.grid_size, cfg.hidden)
        cls_pos = torch.zeros(1, cfg.hidden)
        full = torch.cat([cls_pos, pos], dim=0).unsqueeze(0)
        self.register_buffer("pos_embed", full, persistent=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.patch(x)
        b = tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)
        tokens = tokens + self.pos_embed
        return tokens


def synthesize_image(seed: int, image_size: int = 224, channels: int = 3) -> torch.Tensor:
    """Build a deterministic (1, channels, image_size, image_size) fixture from a seeded NumPy generator.

    With the defaults the shape is 1x3x224x224. Values are in [0, 1] float32.
    Adding a smooth gradient on top of noise gives the patch projection
    something with both high and low frequency content to summarize.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((channels, image_size, image_size)).astype("float32") * 0.1
    y_coords = np.linspace(0.0, 1.0, image_size, dtype="float32")
    x_coords = np.linspace(0.0, 1.0, image_size, dtype="float32")
    gx, gy = np.meshgrid(x_coords, y_coords, indexing="xy")
    gradient = np.stack([gx, gy, (gx + gy) * 0.5], axis=0).astype("float32")
    img = np.clip(gradient + noise + 0.5, 0.0, 1.0)
    return torch.from_numpy(img).unsqueeze(0)


def unfold_then_linear(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Reference implementation of patch projection via unfold + matmul.

    Used by the tests to assert that the Conv2d projection matches the
    flatten-then-linear math.
    """
    if x.dim() != 4:
        raise ValueError(f"expected 4D input, got {tuple(x.shape)}")
    patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    b, c, gh, gw, ph, pw = patches.shape
    flat = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, gh * gw, c * ph * pw)
    w_flat = weight.reshape(weight.shape[0], -1)
    return flat @ w_flat.T + bias


def describe_token_norms(tokens: torch.Tensor, max_show: int = 8) -> str:
    """Format the L2 norms of the first few tokens of batch item 0 for sanity inspection."""
    norms = tokens.detach().norm(dim=-1)[0].tolist()
    head = norms[:max_show]
    return ", ".join(f"{v:.3f}" for v in head)


def main() -> None:
    print("=" * 60)
    print("VISION ENCODER PATCHES")
    print("=" * 60)

    cfg = FrontEndConfig()
    print(f" image size : {cfg.image_size}")
    print(f" patch size : {cfg.patch_size}")
    print(f" grid size : {cfg.grid_size}x{cfg.grid_size}")
    print(f" num patches: {cfg.num_patches}")
    print(f" hidden : {cfg.hidden}")
    print(f" seq length : {cfg.num_patches + 1} (includes CLS)")

    torch.manual_seed(0)
    img = synthesize_image(seed=0)
    print(f"\nfixture image shape : {tuple(img.shape)}")
    print(f"fixture image dtype : {img.dtype}")
    print(f"fixture pixel range : [{img.min().item():.3f}, {img.max().item():.3f}]")

    model = VisionFrontEnd(cfg).eval()
    n_params = sum(p.numel() for p in model.parameters())
    print(f"\nfront-end params : {n_params:,}")

    with torch.no_grad():
        tokens = model(img)

    print(f"output token shape : {tuple(tokens.shape)}")
    print(f"CLS token norm : {tokens[0, 0].norm().item():.3f}")
    print(f"first 8 token norms : {describe_token_norms(tokens)}")

    print("\nposition embedding row signature:")
    pos_row = model.pos_embed[0, 1, :8].tolist()
    print(" pos[1, :8] =", ", ".join(f"{v:+.3f}" for v in pos_row))

    print("\nbatch consistency check:")
    img_b4 = synthesize_image(seed=1).repeat(4, 1, 1, 1)
    with torch.no_grad():
        out_b4 = model(img_b4)
    print(f" batch=4 output shape: {tuple(out_b4.shape)}")
    drift = (out_b4 - out_b4[0:1]).abs().max().item()
    print(f" max drift across identical batch rows: {drift:.6f}")

    print("\nunfold reference vs Conv2d projection:")
    weight = model.patch.proj.weight.detach()
    bias = model.patch.proj.bias.detach()
    ref = unfold_then_linear(img, weight, bias, cfg.patch_size)
    conv = model.patch(img)
    diff = (ref - conv).abs().max().item()
    print(f" max abs diff : {diff:.6e}")
    if diff < 1e-4:
        print(" ok: unfold reference matches Conv2d to float tolerance")
    else:
        print(" FAIL: projection drifts from reference")

    print("\ndone.")


if __name__ == "__main__":
    main()
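Reviewer note: the numbers that `main()` prints for the default config can be checked by hand. The sketch below recomputes them with the standard library only. `Shape` and `front_end_counts` are hypothetical names introduced for this note, not part of the commit. The parameter count assumes that only the Conv2d weight, the Conv2d bias, and the CLS token are trainable, which matches the file above: `pos_embed` is registered as a buffer, so `model.parameters()` does not include it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Hypothetical stand-in for FrontEndConfig; it holds only the four sizes."""

    image_size: int = 224
    patch_size: int = 16
    in_channels: int = 3
    hidden: int = 768


def front_end_counts(s: Shape) -> dict[str, int]:
    """Recompute grid size, token counts, and trainable parameter count."""
    if s.image_size % s.patch_size != 0:
        raise ValueError("patch_size must divide image_size")
    grid = s.image_size // s.patch_size
    num_patches = grid * grid
    # Conv2d weight has shape (hidden, in_channels, patch_size, patch_size).
    conv_weight = s.hidden * s.in_channels * s.patch_size * s.patch_size
    conv_bias = s.hidden
    cls_token = s.hidden  # nn.Parameter of shape (1, 1, hidden)
    return {
        "grid": grid,
        "num_patches": num_patches,
        "seq_len": num_patches + 1,  # patches plus the CLS token
        "params": conv_weight + conv_bias + cls_token,
    }


counts = front_end_counts(Shape())
print(counts)
# {'grid': 14, 'num_patches': 196, 'seq_len': 197, 'params': 591360}
```

For the defaults this gives a 14x14 grid, 196 patches, a sequence length of 197, and 768 * 3 * 16 * 16 + 768 + 768 = 591,360 parameters, which is what the `front-end params` line should report if the assumption above holds.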
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
"""Unit tests for the vision encoder front end."""

from __future__ import annotations

import unittest

import torch

from main import (
    FrontEndConfig,
    PatchEmbed,
    VisionFrontEnd,
    sinusoidal_2d,
    synthesize_image,
    unfold_then_linear,
)


class TestPatchEmbed(unittest.TestCase):
    def test_patch_count_matches_grid(self) -> None:
        cfg = FrontEndConfig(image_size=224, patch_size=16, hidden=64)
        self.assertEqual(cfg.num_patches, 14 * 14)
        cfg2 = FrontEndConfig(image_size=96, patch_size=16, hidden=64)
        self.assertEqual(cfg2.num_patches, 6 * 6)

    def test_output_shape_includes_cls(self) -> None:
        cfg = FrontEndConfig(image_size=64, patch_size=16, hidden=32)
        model = VisionFrontEnd(cfg).eval()
        img = torch.randn(2, 3, 64, 64)
        with torch.no_grad():
            out = model(img)
        self.assertEqual(out.shape, (2, cfg.num_patches + 1, cfg.hidden))

    def test_conv2d_matches_unfold_reference(self) -> None:
        cfg = FrontEndConfig(image_size=64, patch_size=16, hidden=32)
        torch.manual_seed(11)
        patch = PatchEmbed(cfg).eval()
        img = torch.randn(1, 3, 64, 64)
        weight = patch.proj.weight.detach()
        bias = patch.proj.bias.detach()
        with torch.no_grad():
            ref = unfold_then_linear(img, weight, bias, cfg.patch_size)
            conv = patch(img)
        self.assertTrue(torch.allclose(ref, conv, atol=1e-5))


class TestPositionEmbedding(unittest.TestCase):
    def test_sinusoidal_deterministic(self) -> None:
        a = sinusoidal_2d(7, 7, 64)
        b = sinusoidal_2d(7, 7, 64)
        self.assertTrue(torch.equal(a, b))

    def test_sinusoidal_shape(self) -> None:
        table = sinusoidal_2d(14, 14, 64)
        self.assertEqual(table.shape, (196, 64))

    def test_sinusoidal_requires_div_by_four(self) -> None:
        with self.assertRaises(ValueError):
            sinusoidal_2d(4, 4, 30)


class TestVisionFrontEnd(unittest.TestCase):
    def test_cls_token_broadcasts_without_leakage(self) -> None:
        cfg = FrontEndConfig(image_size=32, patch_size=16, hidden=32)
        model = VisionFrontEnd(cfg).eval()
        img = torch.randn(3, 3, 32, 32)
        with torch.no_grad():
            out = model(img)
        cls_norms = out[:, 0].norm(dim=-1)
        self.assertTrue(torch.all(cls_norms > 0))
        diffs = (out[:, 0] - out[0:1, 0]).abs()
        self.assertTrue(diffs.max().item() < 1e-3)

    def test_rejects_wrong_spatial_size(self) -> None:
        cfg = FrontEndConfig(image_size=32, patch_size=16, hidden=32)
        model = VisionFrontEnd(cfg).eval()
        with self.assertRaises(ValueError):
            model(torch.randn(1, 3, 48, 48))

    def test_synthesize_image_is_deterministic(self) -> None:
        a = synthesize_image(seed=7)
        b = synthesize_image(seed=7)
        self.assertTrue(torch.equal(a, b))
        self.assertEqual(a.shape, (1, 3, 224, 224))


if __name__ == "__main__":
    unittest.main()
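Reviewer note: the position tests above check shape, determinism, and the divisible-by-four rule, but not the layout of the table itself. As a rough illustration of that layout, the sketch below rebuilds a single row of the table in plain Python, following the same frequency formula as `sinusoidal_2d`. The function name `sinusoidal_2d_vector` is hypothetical and introduced only for this note; it is a sketch of the scheme, not the commit's implementation.

```python
from __future__ import annotations

import math


def sinusoidal_2d_vector(row: int, col: int, dim: int) -> list[float]:
    """Position vector for one grid cell, laid out as
    [sin(row), cos(row), sin(col), cos(col)], each block dim // 4 wide."""
    if dim % 4 != 0:
        raise ValueError(f"dim must be divisible by 4, got {dim}")
    quarter = dim // 4
    # Same frequency band as sinusoidal_2d: 10000 ** (-k / quarter).
    inv = [math.exp(-math.log(10000.0) * k / max(1, quarter)) for k in range(quarter)]

    def axis(pos: int) -> list[float]:
        angles = [pos * f for f in inv]
        return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

    return axis(row) + axis(col)


vec = sinusoidal_2d_vector(row=3, col=5, dim=64)
norm = math.sqrt(sum(v * v for v in vec))
print(len(vec), norm, math.sqrt(64 / 2))
```

Each sine slot is paired with a cosine slot at the same angle, so every position vector has squared norm `dim / 2` regardless of the cell. In the flattened table, cell `(row, col)` sits at index `row * grid_w + col`, and `pos_embed` shifts it by one for the CLS row. The first patch is therefore cell `(0, 0)`, whose sine slots are all zero, so the `pos[1, :8]` line in `main()` should print eight zeros with the default `hidden = 768`.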
