fancyboi999
diff --git a/README.md b/README.md
Lines changed: 14 additions & 2 deletions
diff --git a/ROADMAP.md b/ROADMAP.md
Lines changed: 13 additions & 1 deletion
diff --git a/phases/19-capstone-projects/58-vision-encoder-patches/code/main.py b/phases/19-capstone-projects/58-vision-encoder-patches/code/main.py
Lines changed: 229 additions & 0 deletions
diff --git a/phases/19-capstone-projects/58-vision-encoder-patches/code/test_main.py b/phases/19-capstone-projects/58-vision-encoder-patches/code/test_main.py
Lines changed: 88 additions & 0 deletions
@@ -4,7 +4,7 @@
 
 <p align="center">
   <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-1a1a1a?style=flat-square&labelColor=fafaf5" alt="MIT License"></a>
-  <a href="ROADMAP.md"><img src="https://img.shields.io/badge/lessons-435-3553ff?style=flat-square&labelColor=fafaf5" alt="435 lessons"></a>
+  <a href="ROADMAP.md"><img src="https://img.shields.io/badge/lessons-447-3553ff?style=flat-square&labelColor=fafaf5" alt="447 lessons"></a>
   <a href="#contents"><img src="https://img.shields.io/badge/phases-20-3553ff?style=flat-square&labelColor=fafaf5" alt="20 phases"></a>
   <a href="https://github.com/fancyboi999/ai-engineering-from-scratch-zh/stargazers"><img src="https://img.shields.io/github/stars/fancyboi999/ai-engineering-from-scratch-zh?style=flat-square&labelColor=fafaf5&color=3553ff" alt="GitHub stars"></a>
   <a href="https://aieng-zh.cn"><img src="https://img.shields.io/badge/website-live-3553ff?style=flat-square&labelColor=fafaf5" alt="Website"></a>
@@ -819,7 +819,7 @@ the agent went wrong and explain why...
 </details>
 
 <details id="phase-19">
-<summary><b>Phase 19 — Capstone Projects</b> &nbsp;<code>55 projects</code>&nbsp; <em>End-to-end shippable products for 2026, 20-40 hours each.</em></summary>
+<summary><b>Phase 19 — Capstone Projects</b> &nbsp;<code>67 projects</code>&nbsp; <em>End-to-end shippable products for 2026, 20-40 hours each.</em></summary>
 <br/>
 
 | # | Project | Combines | Lang |
@@ -879,6 +879,18 @@ the agent went wrong and explain why...
 | 55 | [Critic Loop](phases/19-capstone-projects/55-critic-loop/) | D. Automated Research | Python |
 | 56 | [Iteration Scheduler](phases/19-capstone-projects/56-iteration-scheduler/) | D. Automated Research | Python |
 | 57 | [End-to-End Research Demo](phases/19-capstone-projects/57-end-to-end-research-demo/) | D. Automated Research | Python |
+| 58 | [Patch Splitting for the Vision Encoder](phases/19-capstone-projects/58-vision-encoder-patches/) | E. Multimodal | Python |
+| 59 | [Vision Transformer Encoder (ViT)](phases/19-capstone-projects/59-vit-transformer/) | E. Multimodal | Python |
+| 60 | [Modality Alignment with a Projection Layer](phases/19-capstone-projects/60-projection-layer-modality-align/) | E. Multimodal | Python |
+| 61 | [Cross-Attention Fusion](phases/19-capstone-projects/61-cross-attention-fusion/) | E. Multimodal | Python |
+| 62 | [Vision-Language Pretraining](phases/19-capstone-projects/62-vision-language-pretraining/) | E. Multimodal | Python |
+| 63 | [Multimodal Evaluation](phases/19-capstone-projects/63-multimodal-eval/) | E. Multimodal | Python |
+| 64 | [Chunking Strategies Compared](phases/19-capstone-projects/64-chunking-strategies-advanced/) | F. Advanced RAG | Python |
+| 65 | [Hybrid Retrieval with BM25 and Dense Embeddings](phases/19-capstone-projects/65-hybrid-retrieval-bm25-dense/) | F. Advanced RAG | Python |
+| 66 | [Cross-Encoder Reranker](phases/19-capstone-projects/66-reranker-cross-encoder/) | F. Advanced RAG | Python |
+| 67 | [Query Rewriting: HyDE, Multi-Query, and Decomposition](phases/19-capstone-projects/67-query-rewriting-hyde/) | F. Advanced RAG | Python |
+| 68 | [RAG Evaluation: Precision, Recall, MRR, nDCG, and More](phases/19-capstone-projects/68-rag-eval-precision-recall/) | F. Advanced RAG | Python |
+| 69 | [End-to-End RAG System](phases/19-capstone-projects/69-end-to-end-rag-system/) | F. Advanced RAG | Python |
 
 </details>
 
 
@@ -571,9 +571,21 @@
 | 55 | [Critic Loop](phases/19-capstone-projects/55-critic-loop) | ✅ | ~90 min |
 | 56 | [Iteration Scheduler](phases/19-capstone-projects/56-iteration-scheduler) | ✅ | ~90 min |
 | 57 | [End-to-End Research Demo](phases/19-capstone-projects/57-end-to-end-research-demo) | ✅ | ~90 min |
+| 58 | [Patch Splitting for the Vision Encoder](phases/19-capstone-projects/58-vision-encoder-patches) | ✅ | ~90 min |
+| 59 | [Vision Transformer Encoder (ViT)](phases/19-capstone-projects/59-vit-transformer) | ✅ | ~90 min |
+| 60 | [Modality Alignment with a Projection Layer](phases/19-capstone-projects/60-projection-layer-modality-align) | ✅ | ~90 min |
+| 61 | [Cross-Attention Fusion](phases/19-capstone-projects/61-cross-attention-fusion) | ✅ | ~90 min |
+| 62 | [Vision-Language Pretraining](phases/19-capstone-projects/62-vision-language-pretraining) | ✅ | ~90 min |
+| 63 | [Multimodal Evaluation](phases/19-capstone-projects/63-multimodal-eval) | ✅ | ~90 min |
+| 64 | [Chunking Strategies Compared](phases/19-capstone-projects/64-chunking-strategies-advanced) | ✅ | ~90 min |
+| 65 | [Hybrid Retrieval with BM25 and Dense Embeddings](phases/19-capstone-projects/65-hybrid-retrieval-bm25-dense) | ✅ | ~90 min |
+| 66 | [Cross-Encoder Reranker](phases/19-capstone-projects/66-reranker-cross-encoder) | ✅ | ~90 min |
+| 67 | [Query Rewriting: HyDE, Multi-Query, and Decomposition](phases/19-capstone-projects/67-query-rewriting-hyde) | ✅ | ~90 min |
+| 68 | [RAG Evaluation: Precision, Recall, MRR, nDCG, and More](phases/19-capstone-projects/68-rag-eval-precision-recall) | ✅ | ~90 min |
+| 69 | [End-to-End RAG System](phases/19-capstone-projects/69-end-to-end-rag-system) | ✅ | ~90 min |
 
 ---
 
-**Total: 20 phases, 430+ lessons | 430+ completed | about 1000 hours estimated**
+**Total: 20 phases, 442+ lessons | 442+ completed | about 1050 hours estimated**
 
 Want to help? Pick any lesson marked ⬚ and open a PR. See [CONTRIBUTING.md](CONTRIBUTING.md).
@@ -0,0 +1,229 @@
+"""Vision encoder front end: patch embedding plus 2D sinusoidal position.
+
+Tokenizes a 224x224x3 image into a sequence of 196 patch tokens plus a CLS
+token. The patch projection is a Conv2d with kernel and stride equal to the
+patch size, equivalent to flatten-then-linear up to float rounding. The
+position signal is a fixed 2D sinusoidal table; half the embedding dim encodes
+row position, the other half encodes column position, at multiple frequencies.
+
+Run with: python3 main.py
+"""
+
+from __future__ import annotations
+
+import math
+from dataclasses import dataclass
+
+import numpy as np
+import torch
+import torch.nn as nn
+
+
+@dataclass(frozen=True)
+class FrontEndConfig:
+    image_size: int = 224
+    patch_size: int = 16
+    in_channels: int = 3
+    hidden: int = 768
+
+    @property
+    def grid_size(self) -> int:
+        if self.image_size % self.patch_size != 0:
+            raise ValueError(
+                f"patch_size {self.patch_size} must divide image_size {self.image_size}"
+            )
+        return self.image_size // self.patch_size
+
+    @property
+    def num_patches(self) -> int:
+        return self.grid_size * self.grid_size
+
+
+def sinusoidal_2d(grid_h: int, grid_w: int, dim: int) -> torch.Tensor:
+    """Build a deterministic 2D sinusoidal position table of shape (grid_h * grid_w, dim).
+
+    Half of dim encodes row position, half encodes column position. Within each
+    half, frequencies span the standard Transformer sin/cos band. Identical
+    inputs always produce identical outputs, with no learned state.
+    """
+    if dim % 4 != 0:
+        raise ValueError(f"sinusoidal_2d dim must be divisible by 4, got {dim}")
+    half = dim // 2
+    quarter = half // 2
+
+    freq = torch.arange(quarter, dtype=torch.float32)
+    inv = torch.exp(-math.log(10000.0) * freq / max(1, quarter))
+
+    rows = torch.arange(grid_h, dtype=torch.float32).unsqueeze(1) * inv.unsqueeze(0)
+    cols = torch.arange(grid_w, dtype=torch.float32).unsqueeze(1) * inv.unsqueeze(0)
+
+    row_emb = torch.cat([torch.sin(rows), torch.cos(rows)], dim=1)
+    col_emb = torch.cat([torch.sin(cols), torch.cos(cols)], dim=1)
+
+    table = torch.zeros(grid_h, grid_w, dim)
+    table[:, :, :half] = row_emb.unsqueeze(1).expand(-1, grid_w, -1)
+    table[:, :, half:] = col_emb.unsqueeze(0).expand(grid_h, -1, -1)
+    return table.reshape(grid_h * grid_w, dim)
+
+
+class PatchEmbed(nn.Module):
+    """Patch projection as a strided Conv2d.
+
+    Output shape on a (B, C, H, W) input is (B, N, hidden) where
+    N = (H / patch_size) * (W / patch_size).
+    """
+
+    def __init__(self, cfg: FrontEndConfig) -> None:
+        super().__init__()
+        self.cfg = cfg
+        self.proj = nn.Conv2d(
+            cfg.in_channels,
+            cfg.hidden,
+            kernel_size=cfg.patch_size,
+            stride=cfg.patch_size,
+            bias=True,
+        )
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        if x.dim() != 4:
+            raise ValueError(f"expected 4D input (B,C,H,W), got shape {tuple(x.shape)}")
+        if x.shape[1] != self.cfg.in_channels:
+            raise ValueError(
+                f"channel mismatch: got {x.shape[1]}, expected {self.cfg.in_channels}"
+            )
+        if x.shape[2] != self.cfg.image_size or x.shape[3] != self.cfg.image_size:
+            raise ValueError(
+                f"spatial mismatch: got {tuple(x.shape[2:])}, expected "
+                f"({self.cfg.image_size}, {self.cfg.image_size})"
+            )
+        out = self.proj(x)
+        # (B, hidden, gh, gw) -> (B, hidden, N) -> (B, N, hidden)
+        out = out.flatten(2).transpose(1, 2)
+        return out
+
+
+class VisionFrontEnd(nn.Module):
+    """Patch embed + CLS prepend + 2D sinusoidal position.
+
+    Output shape: (B, num_patches + 1, hidden).
+    """
+
+    def __init__(self, cfg: FrontEndConfig) -> None:
+        super().__init__()
+        self.cfg = cfg
+        self.patch = PatchEmbed(cfg)
+        self.cls_token = nn.Parameter(torch.zeros(1, 1, cfg.hidden))
+        nn.init.trunc_normal_(self.cls_token, std=0.02)
+
+        pos = sinusoidal_2d(cfg.grid_size, cfg.grid_size, cfg.hidden)
+        cls_pos = torch.zeros(1, cfg.hidden)
+        full = torch.cat([cls_pos, pos], dim=0).unsqueeze(0)
+        self.register_buffer("pos_embed", full, persistent=False)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        tokens = self.patch(x)
+        b = tokens.shape[0]
+        cls = self.cls_token.expand(b, -1, -1)
+        tokens = torch.cat([cls, tokens], dim=1)
+        tokens = tokens + self.pos_embed
+        return tokens
+
+
+def synthesize_image(seed: int, image_size: int = 224, channels: int = 3) -> torch.Tensor:
+    """Build a deterministic (1, 3, image_size, image_size) fixture with NumPy.
+
+    Values are float32 in [0, 1]. A smooth gradient on top of noise gives the
+    patch projection both low and high frequency content to summarize. The
+    gradient has three planes, so only channels=3 gives the requested shape.
+    """
+    rng = np.random.default_rng(seed)
+    noise = rng.standard_normal((channels, image_size, image_size)).astype("float32") * 0.1
+    y_coords = np.linspace(0.0, 1.0, image_size, dtype="float32")
+    x_coords = np.linspace(0.0, 1.0, image_size, dtype="float32")
+    gx, gy = np.meshgrid(x_coords, y_coords, indexing="xy")
+    gradient = np.stack([gx, gy, (gx + gy) * 0.5], axis=0).astype("float32")
+    img = np.clip(gradient + noise + 0.5, 0.0, 1.0)
+    return torch.from_numpy(img).unsqueeze(0)
+
+
+def unfold_then_linear(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, patch_size: int) -> torch.Tensor:
+    """Reference implementation of patch projection via unfold + matmul.
+
+    Used by the tests to assert that the Conv2d projection matches the
+    flatten-then-linear math.
+    """
+    if x.dim() != 4:
+        raise ValueError(f"expected 4D input, got {tuple(x.shape)}")
+    patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
+    b, c, gh, gw, ph, pw = patches.shape
+    flat = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, gh * gw, c * ph * pw)
+    w_flat = weight.reshape(weight.shape[0], -1)
+    return flat @ w_flat.T + bias
+
+
+def describe_token_norms(tokens: torch.Tensor, max_show: int = 8) -> str:
+    """Format the L2 norms of the first few tokens (batch item 0) as a string."""
+    norms = tokens.detach().norm(dim=-1)[0].tolist()
+    head = norms[:max_show]
+    return ", ".join(f"{v:.3f}" for v in head)
+
+
+def main() -> None:
+    print("=" * 60)
+    print("VISION ENCODER PATCHES")
+    print("=" * 60)
+
+    cfg = FrontEndConfig()
+    print(f"  image size : {cfg.image_size}")
+    print(f"  patch size : {cfg.patch_size}")
+    print(f"  grid size  : {cfg.grid_size}x{cfg.grid_size}")
+    print(f"  num patches: {cfg.num_patches}")
+    print(f"  hidden     : {cfg.hidden}")
+    print(f"  seq length : {cfg.num_patches + 1} (includes CLS)")
+
+    torch.manual_seed(0)
+    img = synthesize_image(seed=0)
+    print(f"\nfixture image shape  : {tuple(img.shape)}")
+    print(f"fixture image dtype  : {img.dtype}")
+    print(f"fixture pixel range  : [{img.min().item():.3f}, {img.max().item():.3f}]")
+
+    model = VisionFrontEnd(cfg).eval()
+    n_params = sum(p.numel() for p in model.parameters())
+    print(f"\nfront-end params     : {n_params:,}")
+
+    with torch.no_grad():
+        tokens = model(img)
+
+    print(f"output token shape   : {tuple(tokens.shape)}")
+    print(f"CLS token norm       : {tokens[0, 0].norm().item():.3f}")
+    print(f"first 8 token norms  : {describe_token_norms(tokens)}")
+
+    print("\nposition embedding row signature:")
+    pos_row = model.pos_embed[0, 1, :8].tolist()
+    print("  pos[1, :8] =", ", ".join(f"{v:+.3f}" for v in pos_row))
+
+    print("\nbatch consistency check:")
+    img_b4 = synthesize_image(seed=1).repeat(4, 1, 1, 1)
+    with torch.no_grad():
+        out_b4 = model(img_b4)
+    print(f"  batch=4 output shape: {tuple(out_b4.shape)}")
+    drift = (out_b4 - out_b4[0:1]).abs().max().item()
+    print(f"  max drift across identical batch rows: {drift:.6f}")
+
+    print("\nunfold reference vs Conv2d projection:")
+    weight = model.patch.proj.weight.detach()
+    bias = model.patch.proj.bias.detach()
+    ref = unfold_then_linear(img, weight, bias, cfg.patch_size)
+    conv = model.patch(img)
+    diff = (ref - conv).abs().max().item()
+    print(f"  max abs diff : {diff:.6e}")
+    if diff < 1e-4:
+        print("  ok: unfold reference matches Conv2d to float tolerance")
+    else:
+        print("  FAIL: projection drifts from reference")
+
+    print("\ndone.")
+
+
+if __name__ == "__main__":
+    main()
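
The sizes that `main()` prints for the default `FrontEndConfig` can be checked by hand. The sketch below redoes that arithmetic in plain Python with no PyTorch dependency. `front_end_sizes` is a hypothetical helper that is not part of this PR. It assumes the trainable parameters are exactly the Conv2d weight, the Conv2d bias, and the CLS token; the position table is registered as a buffer, so it is not counted.

```python
def front_end_sizes(image_size=224, patch_size=16, in_channels=3, hidden=768):
    """Hypothetical helper: shape and parameter arithmetic for the front end."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size")
    grid = image_size // patch_size          # patches per side
    num_patches = grid * grid                # patch tokens per image
    # Conv2d weight has shape (hidden, in_channels, patch_size, patch_size).
    conv_weight = hidden * in_channels * patch_size * patch_size
    conv_bias = hidden                       # one bias per output channel
    cls_token = hidden                       # learned (1, 1, hidden) parameter
    return {
        "grid": grid,
        "num_patches": num_patches,
        "seq_len": num_patches + 1,          # CLS token is prepended
        "params": conv_weight + conv_bias + cls_token,
    }


sizes = front_end_sizes()
print(sizes)
```

With the defaults this gives a 14x14 grid, 196 patch tokens, and a sequence length of 197, matching the module docstring.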
@@ -0,0 +1,88 @@
+"""Unit tests for the vision encoder front end."""
+
+from __future__ import annotations
+
+import unittest
+
+import torch
+
+from main import (
+    FrontEndConfig,
+    PatchEmbed,
+    VisionFrontEnd,
+    sinusoidal_2d,
+    synthesize_image,
+    unfold_then_linear,
+)
+
+
+class TestPatchEmbed(unittest.TestCase):
+    def test_patch_count_matches_grid(self) -> None:
+        cfg = FrontEndConfig(image_size=224, patch_size=16, hidden=64)
+        self.assertEqual(cfg.num_patches, 14 * 14)
+        cfg2 = FrontEndConfig(image_size=96, patch_size=16, hidden=64)
+        self.assertEqual(cfg2.num_patches, 6 * 6)
+
+    def test_output_shape_includes_cls(self) -> None:
+        cfg = FrontEndConfig(image_size=64, patch_size=16, hidden=32)
+        model = VisionFrontEnd(cfg).eval()
+        img = torch.randn(2, 3, 64, 64)
+        with torch.no_grad():
+            out = model(img)
+        self.assertEqual(out.shape, (2, cfg.num_patches + 1, cfg.hidden))
+
+    def test_conv2d_matches_unfold_reference(self) -> None:
+        cfg = FrontEndConfig(image_size=64, patch_size=16, hidden=32)
+        torch.manual_seed(11)
+        patch = PatchEmbed(cfg).eval()
+        img = torch.randn(1, 3, 64, 64)
+        weight = patch.proj.weight.detach()
+        bias = patch.proj.bias.detach()
+        with torch.no_grad():
+            ref = unfold_then_linear(img, weight, bias, cfg.patch_size)
+            conv = patch(img)
+        self.assertTrue(torch.allclose(ref, conv, atol=1e-5))
+
+
+class TestPositionEmbedding(unittest.TestCase):
+    def test_sinusoidal_deterministic(self) -> None:
+        a = sinusoidal_2d(7, 7, 64)
+        b = sinusoidal_2d(7, 7, 64)
+        self.assertTrue(torch.equal(a, b))
+
+    def test_sinusoidal_shape(self) -> None:
+        table = sinusoidal_2d(14, 14, 64)
+        self.assertEqual(table.shape, (196, 64))
+
+    def test_sinusoidal_requires_div_by_four(self) -> None:
+        with self.assertRaises(ValueError):
+            sinusoidal_2d(4, 4, 30)
+
+
+class TestVisionFrontEnd(unittest.TestCase):
+    def test_cls_token_broadcasts_without_leakage(self) -> None:
+        cfg = FrontEndConfig(image_size=32, patch_size=16, hidden=32)
+        model = VisionFrontEnd(cfg).eval()
+        img = torch.randn(3, 3, 32, 32)
+        with torch.no_grad():
+            out = model(img)
+        cls_norms = out[:, 0].norm(dim=-1)
+        self.assertTrue(torch.all(cls_norms > 0))
+        diffs = (out[:, 0] - out[0:1, 0]).abs()
+        self.assertTrue(diffs.max().item() < 1e-3)
+
+    def test_rejects_wrong_spatial_size(self) -> None:
+        cfg = FrontEndConfig(image_size=32, patch_size=16, hidden=32)
+        model = VisionFrontEnd(cfg).eval()
+        with self.assertRaises(ValueError):
+            model(torch.randn(1, 3, 48, 48))
+
+    def test_synthesize_image_is_deterministic(self) -> None:
+        a = synthesize_image(seed=7)
+        b = synthesize_image(seed=7)
+        self.assertTrue(torch.equal(a, b))
+        self.assertEqual(a.shape, (1, 3, 224, 224))
+
+
+if __name__ == "__main__":
+    unittest.main()
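
The position tests above check determinism and shape but not the layout of the table. As a rough illustration of that layout, the sketch below computes the vector for a single grid cell in plain Python, following the same formula as `sinusoidal_2d`: the first half depends only on the row and the second half only on the column. `pos_vector` is a hypothetical name used only for this example and is not part of the PR.

```python
import math


def pos_vector(row, col, dim):
    """Hypothetical single-cell version of the 2D sinusoidal layout."""
    if dim % 4 != 0:
        raise ValueError(f"dim must be divisible by 4, got {dim}")
    quarter = dim // 4
    # Same frequency band as sinusoidal_2d: exp(-ln(10000) * k / quarter).
    inv = [math.exp(-math.log(10000.0) * k / quarter) for k in range(quarter)]
    # Row half: all sines first, then all cosines (matches torch.cat order).
    row_half = [math.sin(row * f) for f in inv] + [math.cos(row * f) for f in inv]
    # Column half uses the same frequencies, driven by the column index.
    col_half = [math.sin(col * f) for f in inv] + [math.cos(col * f) for f in inv]
    return row_half + col_half


origin = pos_vector(0, 0, 8)
a = pos_vector(2, 5, 16)
b = pos_vector(2, 9, 16)
print(origin)
print(a[:8] == b[:8], a[8:] == b[8:])
```

Two cells in the same row share the first half of their vectors and differ in the second half, which is the property a layout test would assert.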